Supporting additional formats
Contents
Supporting additional formats#
emsarray
allows developers to add support for additional formats
by creating a new subclass of the emsarray.formats.Format
class.
These additional formats can be either packaged as part of your application
or distributed as reusable and installable plugins.
Creating a subclass#
You’ve just made a new dataset format called
“Gigantic Revolutionary Awesome Spatial System” - GRASS for short.
You’re making a Python package with a bunch of utilities
to help developers and scientists work with GRASS datasets,
called grass
.
To add support for GRASS to emsarray
, make a new Format subclass.
For this example, we will make a new module named grass.format
.
The complete implementation of the Grass class is available.
The following is a guided walk through developing this class.
We will need the following imports:
import enum
from functools import cached_property
from typing import Optional, Tuple, Dict
import numpy as np
import xarray as xr
from shapely.geometry import Polygon
from shapely.geometry.base import BaseGeometry
from emsarray.formats import Format, Specificity
from emsarray.masking import blur_mask
from emsarray.types import Pathish
from emsarray.utils import linearise_dimensions
Grids and indexes#
A Format must specify an enum of the different grids that it supports. If it only supports one grid, make an enum with a single member.
class GrassGridKind(enum.Enum):
blade = 'blade'
meadow = 'meadow'
A Format must specify the format native index types it uses.
GRASS has two-dimensional indexes for both the blade and meadow grids,
making indexes like (kind, warp, weft)
:
GrassIndex = Tuple[GrassGridKind, int, int]
Format class#
Create a emsarray.formats.Format
subclass named Grass
,
and implement all the methods below.
class Grass(Format[GrassGridKind, GrassIndex]):
#: All the grid kinds this dataset has
grid_kinds = frozenset(GrassGridKind)
#: Indicates the grid kind of cells
default_grid_kind = GrassGridKind.blade
Format.check_dataset()
introspects a xarray.Dataset
and returns a value indicating whether this format class can understand the dataset.
@classmethod
def check_dataset(cls, dataset: xr.Dataset) -> Optional[int]:
# A Grass dataset is recognised by the 'Conventions' global attribute
if dataset.attrs['Conventions'] == 'Grass 1.0':
return Specificity.HIGH
return None
The Format.ravel_index()
and Format.unravel_index()
methods
convert between linear and native indexes.
It is important that the functions are the inverse of one another —
that is grass.unravel_index(grass.ravel_index(index)) == index
.
def ravel_index(self, index: GrassIndex) -> int:
"""Make a linear index from a native index"""
kind, warp, weft = index
# Meadows indexes are transposed from blade indexes
if kind is GrassGridKind.meadow:
return warp * self.dataset.dims['weft'] + weft
else:
return weft * self.dataset.dims['warp'] + warp
def unravel_index(
self,
index: int,
grid_kind: Optional[GrassGridKind] = None,
) -> GrassIndex:
"""Make a native index from a linear index"""
grid_kind = grid_kind or self.default_grid_kind
if grid_kind is GrassGridKind.meadow:
warp, weft = divmod(index, self.dataset.dims['weft'])
else:
weft, warp = divmod(index, self.dataset.dims['warp'])
return (grid_kind, warp, weft)
Format.get_grid_kind_and_size()
determines
what kind of grid a xarray.DataArray
contains.
This is usually done by examining the dimensions of the DataArray.
def get_grid_kind_and_size(
self, data_array: xr.DataArray,
) -> Tuple[GrassGridKind, int]:
"""
For the given DataArray from this Dataset,
find out what kind of grid it is, and the linear size of that grid.
"""
size = self.dataset.dims['warp'] * self.dataset.dims['weft']
if data_array.dims[-2:] == ('warp', 'weft'):
return GrassGridKind.meadow, size
if data_array.dims[-2:] == ('weft', 'warp'):
return GrassGridKind.blade, size
raise ValueError(
"DataArray does not appear to be either a blade or meadow grid")
Format.make_linear()
takes a DataArray
and flattens the grid dimensions.
The returned data array can be indexed using linear indexes.
Non-grid dimensions such as time and depth should be left as-is.
emsarray.utils.linearise_dimensions()
is very useful here.
def make_linear(self, data_array: xr.DataArray) -> xr.DataArray:
"""
Make the given DataArray linear in its grid dimensions.
"""
grid_kind, size = self.get_grid_kind_and_size(data_array)
if grid_kind is GrassGridKind.meadow:
dimensions = ['warp', 'weft']
else:
dimensions = ['weft', 'warp']
return linearise_dimensions(data_array, dimensions)
Format.selector_for_index()
takes a native index
and returns a dict that can be passed to xarray.Dataset.isel()
.
This selector can be used to subset the dataset to a single grid index.
def selector_for_index(self, index: GrassIndex) -> Dict[str, int]:
"""
Make a selector for a particular index.
This selector can be passed to Dataset.isel().
"""
kind, warp, weft = index
return {'warp': warp, 'weft': weft}
Format.polygons
is an array of shapely Polygon
instances,
one for each face in the dataset.
If a cell does not have a valid polygon
— for example, if the coordinates for that polygon have been dropped
or are outside of the valid region
— that index must be None
.
It is strongly encouraged to use @cached_property
for this property,
as it is typically slow to run.
@cached_property
def polygons(self) -> np.ndarray:
def make_polygon_for_cell(warp: int, weft: int) -> Polygon:
# Implementation left as an exercise for the reader
return Polygon(...)
return np.array([
make_polygon_for_cell(warp, weft)
for warp in range(self.dataset.dimensions['warp'])
for weft in range(self.dataset.dimensions['weft'])
])
The last thing to implement is clipping datasets,
via the Format.make_clip_mask()
and Format.apply_clip_mask()
methods.
Implementers are encouraged to look at existing Format implementations
for concrete examples.
def make_clip_mask(
self,
clip_geometry: BaseGeometry,
buffer: int = 0,
) -> xr.Dataset:
# Find all the blades that intersect the clip geometry
intersecting_blades = [
item
for item, polygon in self.spatial_index.query(clip_geometry)
if polygon.intersects(clip_geometry)
]
# Get all the linear indexes of the intersecting blades
blade_indexes = np.array([i.linear_index for i in intersecting_blades])
# Find all the meadows associated with each intesecting blade
meadow_indexes = np.unique([
self.ravel_index(blade_index)
for item in intersecting_blades
for blade_index in self.get_meadows_for_blade(item.index)
])
warp = self.dataset.dims['warp']
weft = self.dataset.dims['weft']
# Make a 2d array of which blades to keep
keep_blades = np.zeros((weft, warp), dtype=bool)
keep_blades.ravel()[blade_indexes] = True
# Same for meadows
keep_meadows = np.zeros((warp, weft), dtype=bool)
keep_meadows.ravel()[meadow_indexes] = True
# Blur the masks a bit if the clip region needs buffering
if buffer > 0:
keep_blades = blur_mask(keep_blades, size=buffer)
keep_meadows = blur_mask(keep_meadows, size=buffer)
# Make a dataset out of these masks
return xr.Dataset(
data_vars={
'blades': xr.DataArray(data=keep_blades, dims=['weft', 'warp']),
'meadows': xr.DataArray(data=keep_meadows, dims=['warp', 'weft']),
},
)
def apply_clip_mask(self, clip_mask: xr.Dataset, work_dir: Pathish) -> xr.Dataset:
# You're on your own, here.
# This depends entirely on how the mask and datasets interact.
pass
Registering as part of an application#
If you are making an application that needs to support GRASS,
or just experimenting with a new format type,
but don’t intend on distributing the new format as a plugin,
you can use the register_format()
function.
This will add the format to the internal format registry.
It can be used as a decorator or called directly:
from emsarray.formats import Format, Specificity, register_format
@register_format
class Grass(...):
The format class will not be automatically discovered by emsarray
,
so you must ensure that the Python file containing the Grass subclass is imported
before you attempt to use it.
This can be done in your applications __init__.py
as import grass.format
.
Distributing as a plugin#
emsarray
uses entry points
to find format classes distributed as plugins.
Users can install your plugin and emsarray
will automatically find the included subclass.
If you have created a Format subclass called Grass
in the module grass.format
include the following entry point in your setup.cfg
:
[entry_points]
emsarray.formats =
Grass = grass.format:Grass
The name
portion before the =
is not used,
however we suggest using the same class name as your new format class.
The value
portion after the =
is the import path to your class,
then a :
, then the name of your class.
If your package contains multiple format classes, add one per line.
As a real world example, emsarray
defines the following entry points:
[entry_points]
emsarray.formats =
ArakawaC = emsarray.formats.arakawa_c:ArakawaC
CFGrid1D = emsarray.formats.grid:CFGrid1D
CFGrid2D = emsarray.formats.grid:CFGrid2D
ShocSimple = emsarray.formats.shoc:ShocSimple
ShocStandard = emsarray.formats.shoc:ShocStandard
UGrid = emsarray.formats.ugrid:UGrid