@robmarkcole commented on Sep 22, 2025

Address #281 (& #239)

Zarr Data Source Support

  • Introduces rslearn.data_sources.zarr.ZarrDataSource/ZarrItem, enabling rslearn to treat chunked spatio‑temporal Zarr cubes as first-class raster sources. The implementation exposes both ingestion (writing chunks into the dataset tile store) and direct materialization (acting as a read-only TileStore) while reusing existing band-set selection and compositing logic.
  • Configuration contracts documented in docs/DatasetConfig.md, covering required keys (store_uri, axis mapping, pixel size/origin, dtype, bands, optional chunk hints) plus an example under docs/examples/ZarrDataSource.md. README now lists the dev extra install (uv pip install -e ".[dev]") so integration tests have their fixtures.
  • Added optional dependencies (xarray, zarr, rioxarray) to the "extra" extra, keeping the base install light while ensuring the Zarr source works when that extra is enabled.
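
One way to keep the base install light is to import the optional packages lazily and fail with an actionable message. The sketch below is illustrative only: `require_optional` is a hypothetical helper, not a function from this PR or from rslearn, and the install hint assumes the extra is named `extra` as described above.

```python
import importlib


def require_optional(module_name, extra="extra"):
    """Import an optional dependency, or raise an error naming the extra to install.

    Hypothetical helper for illustration; not part of rslearn.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this data source. "
            f"Install it with: pip install 'rslearn[{extra}]'"
        ) from exc
```

A data source could call `require_optional("xarray")` inside its constructor so that importing the package itself never requires the Zarr stack.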

Testing

  • Unit test (tests/unit/data_sources/test_zarr.py) builds an in-memory cube, exercises get_items → ingest → read_raster, and verifies serialization round-trips.
  • Also performed a smoke test using private data on S3.

Key Technical Notes

  • Zarr chunks are pre-indexed using GridIndex, with optional chunk_shape overrides; time slices default to one cube time-step per item while honoring the dataset QueryConfig.
  • Tile reads handle projection differences via rasterio.warp.reproject, but when the requested projection matches the cube’s native CRS the path short-circuits to avoid reprojection overhead.
  • Serialization persists chunk offsets (x_offset, y_offset) so pixel bounds reconstruct correctly across prepare → ingest → materialize cycles.
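
The chunk indexing and offset serialization described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `ChunkItem` and `enumerate_chunks` are hypothetical names, plain loops stand in for GridIndex, and the 256-pixel default mirrors the `chunk_shape` in the example config rather than any documented default.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkItem:
    """Hypothetical stand-in for ZarrItem: one spatial chunk at one time step."""

    time_index: int
    x_offset: int
    y_offset: int
    width: int
    height: int

    def pixel_bounds(self):
        # Bounds are rebuilt from the persisted offsets plus the chunk size.
        return (
            self.x_offset,
            self.y_offset,
            self.x_offset + self.width,
            self.y_offset + self.height,
        )

    def serialize(self):
        return asdict(self)

    @classmethod
    def deserialize(cls, d):
        return cls(**d)


def enumerate_chunks(cube_width, cube_height, num_times, chunk_x=256, chunk_y=256):
    """List one item per (time step, spatial chunk); edge chunks are clipped to the cube."""
    items = []
    for t in range(num_times):
        for y in range(0, cube_height, chunk_y):
            for x in range(0, cube_width, chunk_x):
                items.append(
                    ChunkItem(
                        time_index=t,
                        x_offset=x,
                        y_offset=y,
                        width=min(chunk_x, cube_width - x),
                        height=min(chunk_y, cube_height - y),
                    )
                )
    return items


def round_trip(item):
    """Serialize to JSON text and back, as would happen between pipeline stages."""
    return ChunkItem.deserialize(json.loads(json.dumps(item.serialize())))
```

Because the offsets travel with the item, `round_trip(item).pixel_bounds()` equals `item.pixel_bounds()` without needing to reopen the cube.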

Example config.json:

{
    "tile_store": {
        "name": "file",
        "root_dir": "tiles"
    },
    "layers": {
        "label": {
            "type": "vector"
        },
        "output": {
            "type": "vector"
        },
        "era5_precip": {
            "type": "raster",
            "band_sets": [
                {
                    "dtype": "float32",
                    "bands": ["precipitation"]
                }
            ],
            "data_source": {
                "name": "rslearn.data_sources.zarr.ZarrDataSource",
                "store_uri": "s3://..precipitation.zarr",
                "data_variable": "daily_precipitation",
                "crs": "EPSG:4326",
                "pixel_size": { "x": 0.1, "y": -0.1 },
                "origin": [-130.0, -60.0],
                "axis_names": { "x": "x", "y": "y", "time": "time" },
                "bands": ["precipitation"],
                "dtype": "float32",
                "chunk_shape": { "y": 256, "x": 256 },
                "query_config": {
                    "space_mode": "MOSAIC",
                    "time_mode": "WITHIN",
                    "min_matches": 0,
                    "max_matches": 1
                },
                "ingest": true
            }
        }
    }
}

Note: set "ingest": false to stream directly from the Zarr buckets without caching.
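
To show how the `origin` and `pixel_size` keys relate CRS coordinates to pixel indices, here is a minimal sketch. It assumes `origin` is the outer corner of pixel (0, 0) and that a negative `y` pixel size means row indices increase as the y coordinate decreases; the PR does not spell out either convention, and `geo_bounds_to_pixel_window` is a hypothetical helper, not part of rslearn.

```python
import math


def geo_bounds_to_pixel_window(bounds, origin, pixel_size, eps=1e-9):
    """Convert (min_x, min_y, max_x, max_y) in CRS units to a pixel window.

    Returns (col_min, row_min, col_max, row_max), with the max edges exclusive.
    eps absorbs floating-point noise when a bound lies exactly on a pixel edge.
    """
    min_x, min_y, max_x, max_y = bounds
    origin_x, origin_y = origin
    size_x, size_y = pixel_size

    # Fractional pixel coordinates; sorting handles a negative pixel size.
    cols = sorted(((min_x - origin_x) / size_x, (max_x - origin_x) / size_x))
    rows = sorted(((min_y - origin_y) / size_y, (max_y - origin_y) / size_y))

    return (
        math.floor(cols[0] + eps),
        math.floor(rows[0] + eps),
        math.ceil(cols[1] - eps),
        math.ceil(rows[1] - eps),
    )
```

With the example values, `geo_bounds_to_pixel_window((-130.0, -61.0, -129.0, -60.0), (-130.0, -60.0), (0.1, -0.1))` covers a 10 by 10 pixel window starting at the origin pixel.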

Visual validation of materialized data vs raw

image

-----

- rslearn requires Python 3.10+ (Python 3.12 is recommended).
+ rslearn requires Python 3.11+ (Python 3.12 is recommended).

Since requires-python = ">=3.11"

@robmarkcole changed the title from "Support zarr 281" to "[DRAFT] Support zarr 281" on Sep 25, 2025
@robmarkcole

I won't be merging this, but perhaps it will serve as a useful reference if anyone needs Zarr support.

@robmarkcole closed this on Oct 6, 2025