Supporting additional formats#

emsarray allows developers to add support for additional formats by creating a new subclass of the emsarray.formats.Format class. These additional formats can be either packaged as part of your application or distributed as reusable and installable plugins.

Creating a subclass#

You’ve just made a new dataset format called “Gigantic Revolutionary Awesome Spatial System” - GRASS for short. You’re making a Python package with a bunch of utilities to help developers and scientists work with GRASS datasets, called grass.

To add support for GRASS to emsarray, make a new Format subclass. For this example, we will make a new module named grass.format. The complete implementation of the Grass class is available. The following is a guided walk through developing this class.

We will need the following imports:

import enum
from functools import cached_property
from typing import Optional, Tuple, Dict

import numpy as np
import xarray as xr
from shapely.geometry import Polygon
from shapely.geometry.base import BaseGeometry

from emsarray.formats import Format, Specificity
from emsarray.masking import blur_mask
from emsarray.types import Pathish
from emsarray.utils import linearise_dimensions

Grids and indexes#

A Format must specify an enum of the different grids that it supports. If it only supports one grid, make an enum with a single member.

class GrassGridKind(enum.Enum):
    blade = 'blade'
    meadow = 'meadow'

A Format must specify the format native index types it uses. GRASS has two-dimensional indexes for both the blade and meadow grids, making indexes like (kind, warp, weft):

GrassIndex = Tuple[GrassGridKind, int, int]

Format class#

Create a emsarray.formats.Format subclass named Grass, and implement all the methods below.

class Grass(Format[GrassGridKind, GrassIndex]):

    #: All the grid kinds this dataset has
    grid_kinds = frozenset(GrassGridKind)

    #: Indicates the grid kind of cells
    default_grid_kind = GrassGridKind.blade

Format.check_dataset() introspects a xarray.Dataset and returns a value indicating whether this format class can understand the dataset.

    @classmethod
    def check_dataset(cls, dataset: xr.Dataset) -> Optional[int]:
        # A Grass dataset is recognised by the 'Conventions' global attribute
        if dataset.attrs['Conventions'] == 'Grass 1.0':
            return Specificity.HIGH
        return None

The Format.ravel_index() and Format.unravel_index() methods convert between linear and native indexes. It is important that the functions are the inverse of one another — that is grass.unravel_index(grass.ravel_index(index)) == index.

    def ravel_index(self, index: GrassIndex) -> int:
        """Make a linear index from a native index"""
        kind, warp, weft = index
        # Meadows indexes are transposed from blade indexes
        if kind is GrassGridKind.meadow:
            return warp * self.dataset.dims['weft'] + weft
        else:
            return weft * self.dataset.dims['warp'] + warp
    def unravel_index(
        self,
        index: int,
        grid_kind: Optional[GrassGridKind] = None,
    ) -> GrassIndex:
        """Make a native index from a linear index"""
        grid_kind = grid_kind or self.default_grid_kind
        if grid_kind is GrassGridKind.meadow:
            warp, weft = divmod(index, self.dataset.dims['weft'])
        else:
            weft, warp = divmod(index, self.dataset.dims['warp'])
        return (grid_kind, warp, weft)

Format.get_grid_kind_and_size() determines what kind of grid a xarray.DataArray contains. This is usually done by examining the dimensions of the DataArray.

    def get_grid_kind_and_size(
        self, data_array: xr.DataArray,
    ) -> Tuple[GrassGridKind, int]:
        """
        For the given DataArray from this Dataset,
        find out what kind of grid it is, and the linear size of that grid.
        """
        size = self.dataset.dims['warp'] * self.dataset.dims['weft']
        if data_array.dims[-2:] == ('warp', 'weft'):
            return GrassGridKind.meadow, size
        if data_array.dims[-2:] == ('weft', 'warp'):
            return GrassGridKind.blade, size
        raise ValueError(
            "DataArray does not appear to be either a blade or meadow grid")

Format.make_linear() takes a DataArray and flattens the grid dimensions. The returned data array can be indexed using linear indexes. Non-grid dimensions such as time and depth should be left as-is. emsarray.utils.linearise_dimensions() is very useful here.

    def make_linear(self, data_array: xr.DataArray) -> xr.DataArray:
        """
        Make the given DataArray linear in its grid dimensions.
        """
        grid_kind, size = self.get_grid_kind_and_size(data_array)
        if grid_kind is GrassGridKind.meadow:
            dimensions = ['warp', 'weft']
        else:
            dimensions = ['weft', 'warp']
        return linearise_dimensions(data_array, dimensions)

Format.selector_for_index() takes a native index and returns a dict that can be passed to xarray.Dataset.isel(). This selector can be used to subset the dataset to a single grid index.

    def selector_for_index(self, index: GrassIndex) -> Dict[str, int]:
        """
        Make a selector for a particular index.
        This selector can be passed to Dataset.isel().
        """
        kind, warp, weft = index
        return {'warp': warp, 'weft': weft}

Format.polygons is an array of shapely Polygon instances, one for each face in the dataset. If a cell does not have a valid polygon — for example, if the coordinates for that polygon have been dropped or are outside of the valid region — that index must be None. It is strongly encouraged to use @cached_property for this property, as it is typically slow to run.

    @cached_property
    def polygons(self) -> np.ndarray:
        def make_polygon_for_cell(warp: int, weft: int) -> Polygon:
            # Implementation left as an exercise for the reader
            return Polygon(...)

        return np.array([
            make_polygon_for_cell(warp, weft)
            for warp in range(self.dataset.dimensions['warp'])
            for weft in range(self.dataset.dimensions['weft'])
        ])

The last thing to implement is clipping datasets, via the Format.make_clip_mask() and Format.apply_clip_mask() methods. Implementers are encouraged to look at existing Format implementations for concrete examples.

    def make_clip_mask(
        self,
        clip_geometry: BaseGeometry,
        buffer: int = 0,
    ) -> xr.Dataset:
        # Find all the blades that intersect the clip geometry
        intersecting_blades = [
            item
            for item, polygon in self.spatial_index.query(clip_geometry)
            if polygon.intersects(clip_geometry)
        ]
        # Get all the linear indexes of the intersecting blades
        blade_indexes = np.array([i.linear_index for i in intersecting_blades])
        # Find all the meadows associated with each intesecting blade
        meadow_indexes = np.unique([
            self.ravel_index(blade_index)
            for item in intersecting_blades
            for blade_index in self.get_meadows_for_blade(item.index)
        ])

        warp = self.dataset.dims['warp']
        weft = self.dataset.dims['weft']

        # Make a 2d array of which blades to keep
        keep_blades = np.zeros((weft, warp), dtype=bool)
        keep_blades.ravel()[blade_indexes] = True

        # Same for meadows
        keep_meadows = np.zeros((warp, weft), dtype=bool)
        keep_meadows.ravel()[meadow_indexes] = True

        # Blur the masks a bit if the clip region needs buffering
        if buffer > 0:
            keep_blades = blur_mask(keep_blades, size=buffer)
            keep_meadows = blur_mask(keep_meadows, size=buffer)

        # Make a dataset out of these masks
        return xr.Dataset(
            data_vars={
                'blades': xr.DataArray(data=keep_blades, dims=['weft', 'warp']),
                'meadows': xr.DataArray(data=keep_meadows, dims=['warp', 'weft']),
            },
        )
    def apply_clip_mask(self, clip_mask: xr.Dataset, work_dir: Pathish) -> xr.Dataset:
        # You're on your own, here.
        # This depends entirely on how the mask and datasets interact.
        pass

Registering as part of an application#

If you are making an application that needs to support GRASS, or just experimenting with a new format type, but don’t intend on distributing the new format as a plugin, you can use the register_format() function. This will add the format to the internal format registry. It can be used as a decorator or called directly:

from emsarray.formats import Format, Specificity, register_format

@register_format
class Grass(...):

The format class will not be automatically discovered by emsarray, so you must ensure that the Python file containing the Grass subclass is imported before you attempt to use it. This can be done in your applications __init__.py as import grass.format.

Distributing as a plugin#

emsarray uses entry points to find format classes distributed as plugins. Users can install your plugin and emsarray will automatically find the included subclass.

If you have created a Format subclass called Grass in the module grass.format include the following entry point in your setup.cfg:

[entry_points]
emsarray.formats =
    Grass = grass.format:Grass

The name portion before the = is not used, however we suggest using the same class name as your new format class. The value portion after the = is the import path to your class, then a :, then the name of your class. If your package contains multiple format classes, add one per line.

As a real world example, emsarray defines the following entry points:

[entry_points]
emsarray.formats =
    ArakawaC = emsarray.formats.arakawa_c:ArakawaC
    CFGrid1D = emsarray.formats.grid:CFGrid1D
    CFGrid2D = emsarray.formats.grid:CFGrid2D
    ShocSimple = emsarray.formats.shoc:ShocSimple
    ShocStandard = emsarray.formats.shoc:ShocStandard
    UGrid = emsarray.formats.ugrid:UGrid