Improve dataset configuration #371

favyen2 · 2025-11-18T21:36:08Z

This change will require modifications to all existing dataset configs.

- Switch from manual parsing to pydantic models for the bulk of the dataset config.
- Switch from manual parsing to jsonargparse for initializing data sources. The dataset config now has class_path and init_args fields for the data source, where the latter is a dict handled by jsonargparse.
- Add DataSourceContext to pass the dataset path and LayerConfig to the data source. Most data sources use the context to do things like adjust file paths that are relative to the dataset root directory. The context is passed to the data source by injecting it into the init_args. This is not the cleanest solution, but it seemed better than passing it via a method after instantiation.
- Remove the RasterFormats and VectorFormats class registries, along with the manual argument parsing. These are now also initialized via jsonargparse, with the class_path set directly in the dataset config.
- Remove the Materializers registry. We have only been using RasterMaterializer and VectorMaterializer for quite some time, so now we initialize them directly depending on the layer type.
- Remove rslearn.data_sources.raster_source. This module provided is_raster_needed, but that was only still used in gcp_public_data, which I have changed to determine the needed assets upon initialization (similar to other data sources).
- Remove TileStore configuration backwards compatibility. Now only the jsonargparse format is accepted. This shouldn't break much because we rarely specify the tile_store in the dataset config, and the old format has been deprecated for a while (although it looks like it was still copied and pasted recently into a few places, mostly in tests, which I have updated).
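
For readers unfamiliar with the class_path / init_args pattern, here is a rough standard-library sketch of the idea, including the context injection. It is not the jsonargparse implementation and not the code in this PR: the `instantiate` helper, the `ToySource` class, the `toy_sources` module name, the `context` keyword, and the field names on this stand-in `DataSourceContext` are all assumptions made for illustration.

```python
import importlib
import sys
import types
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Optional


@dataclass
class DataSourceContext:
    """Hypothetical stand-in; the field names here are assumptions."""

    ds_path: Optional[Path] = None
    layer_config: Any = None


def instantiate(config: Dict[str, Any], context: DataSourceContext) -> Any:
    """Import class_path and call it with init_args plus the injected context."""
    module_name, _, class_name = config["class_path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    # Copy so the caller's config dict is not mutated by the injection.
    init_args = dict(config.get("init_args", {}))
    init_args["context"] = context
    return cls(**init_args)


class ToySource:
    """Toy data source that resolves cache_dir against the dataset root."""

    def __init__(self, cache_dir: str, context: DataSourceContext) -> None:
        root = context.ds_path if context.ds_path is not None else Path(".")
        self.cache_dir = root / cache_dir


# Register the toy class under a fake module name so that the class_path
# below is importable without any extra files (demo only).
toy_module = types.ModuleType("toy_sources")
toy_module.ToySource = ToySource
sys.modules["toy_sources"] = toy_module

config = {
    "class_path": "toy_sources.ToySource",
    "init_args": {"cache_dir": "cache/planetary_computer"},
}
source = instantiate(config, DataSourceContext(ds_path=Path("/data/my_dataset")))
print(source.cache_dir)
```

The sketch injects the context as an ordinary keyword argument, so the data source can resolve relative paths in its constructor rather than waiting for a second call after instantiation.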


cmwilhelm · 2025-11-18T21:45:54Z

@favyen2 can you elaborate on the scope of the changes you're referring to here?

> This change will require modifications to all existing dataset configs.

I haven't looked at this PR deeply; the summary notes seem positive. Still, I'm wondering if changes that break core API contracts should be presented in design docs to the broader group at this point.

favyen2 · 2025-11-18T22:11:27Z

Here is an example.

Old version:

    "sentinel2": {
      "band_sets": [{
          "bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
          "dtype": "uint16"
      }],
      "data_source": {
        "cache_dir": "cache/planetary_computer",
        "duration": "270d",
        "harmonize": true,
        "ingest": false,
        "name": "rslearn.data_sources.planetary_computer.Sentinel2",
        "query_config": {
          "max_matches": 6,
          "min_matches": 6,
          "period_duration": "30d",
          "space_mode": "PER_PERIOD_MOSAIC"
        },
        "sort_by": "eo:cloud_cover",
        "time_offset": "-90d"
      },
      "type": "raster"
    }

New version:

    "sentinel2": {
      "band_sets": [{
          "bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
          "dtype": "uint16"
      }],
      "data_source": {
        "class_path": "rslearn.data_sources.planetary_computer.Sentinel2",
        "init_args": {
          "cache_dir": "cache/planetary_computer",
          "harmonize": true,
          "sort_by": "eo:cloud_cover"
        },
        "duration": "270d",
        "time_offset": "-90d",
        "ingest": false,
        "query_config": {
          "max_matches": 6,
          "min_matches": 6,
          "period_duration": "30d",
          "space_mode": "PER_PERIOD_MOSAIC"
        }
      },
      "type": "raster"
    }

The main change is the separation of generic data source configuration options (like duration, time_offset, and query_config) from source-specific ones (like cache_dir, harmonize, and sort_by, which are arguments to rslearn.data_sources.planetary_computer.Sentinel2). The name key also becomes class_path. This is hard to avoid, since in some ways it is the point of the change: otherwise there isn't a good way to, e.g., throw an error if an unknown key is passed, because one part of the system can't know whether extra config options in the same config section will be read by other parts of the system.
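
To make the migration concrete, here is a small sketch of a converter from the old layout to the new one, plus an unknown-key check that the split makes possible. It is not part of this PR: `migrate_data_source`, `find_unknown_keys`, and `GENERIC_KEYS` are hypothetical names, and the set of generic keys is only what appears in the example above (the real set may be larger).

```python
from typing import Any, Dict, Iterable, List

# Generic keys seen in the example above; an assumption, not the full list.
GENERIC_KEYS = {"duration", "time_offset", "ingest", "query_config"}


def migrate_data_source(old: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an old-style data_source dict to the class_path/init_args layout."""
    new: Dict[str, Any] = {"class_path": old["name"], "init_args": {}}
    for key, value in old.items():
        if key == "name":
            continue
        if key in GENERIC_KEYS:
            # Generic options stay at the top level of data_source.
            new[key] = value
        else:
            # Everything else is treated as a constructor argument.
            new["init_args"][key] = value
    return new


def find_unknown_keys(new: Dict[str, Any], allowed_init_args: Iterable[str]) -> List[str]:
    """List keys that neither the generic config nor the data source accepts."""
    top_level = set(new) - GENERIC_KEYS - {"class_path", "init_args"}
    init_level = set(new.get("init_args", {})) - set(allowed_init_args)
    return sorted(top_level) + sorted(f"init_args.{key}" for key in init_level)


old_config = {
    "cache_dir": "cache/planetary_computer",
    "duration": "270d",
    "harmonize": True,
    "ingest": False,
    "name": "rslearn.data_sources.planetary_computer.Sentinel2",
    "query_config": {"max_matches": 6, "min_matches": 6},
    "sort_by": "eo:cloud_cover",
    "time_offset": "-90d",
}
new_config = migrate_data_source(old_config)
print(new_config)
print(find_unknown_keys(new_config, ["cache_dir", "harmonize", "sort_by"]))
```

With the old flat layout, a check like `find_unknown_keys` could not be written, because any unrecognized key might still be consumed by the data source.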

APatrickJ

Thanks for taking this on!

APatrickJ · 2025-11-20T07:00:24Z

rslearn/data_sources/planetary_computer.py

+        if assets is not None:
            asset_bands = {asset_key: self.BANDS[asset_key] for asset_key in assets}


This will overwrite asset_bands if it was previously set above. Is that the intention, or should this be an elif?

APatrickJ · 2025-11-20T07:03:18Z

rslearn/data_sources/planetary_computer.py

+                for band_set in context.layer_config.band_sets:
+                    if not set(band_set.bands).intersection(set(band_names)):
+                        continue
+                    asset_bands[asset_key] = band_names


Should we add a break here to exit the inner loop once a match is found?
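
A rough sketch of what the suggestion would look like. This is not the PR's code: the outer loop over a BANDS-style mapping is an assumption, the function name `select_asset_bands` is hypothetical, and the band sets are simplified to plain lists of band names.

```python
from typing import Dict, List


def select_asset_bands(
    all_bands: Dict[str, List[str]], band_sets: List[List[str]]
) -> Dict[str, List[str]]:
    """Keep each asset whose bands overlap at least one configured band set."""
    asset_bands: Dict[str, List[str]] = {}
    for asset_key, band_names in all_bands.items():
        for bands in band_sets:
            if not set(bands).intersection(band_names):
                continue
            asset_bands[asset_key] = band_names
            # One matching band set is enough; later sets would only
            # repeat the same assignment.
            break
    return asset_bands


needed = select_asset_bands(
    {"B04": ["B04"], "B08": ["B08"], "visual": ["R", "G", "B"]},
    [["B04", "B08"], ["B04"]],
)
print(needed)
```

Since the assignment is the same on every match, the break should not change the result; it only skips redundant iterations.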

APatrickJ · 2025-11-20T07:04:00Z

rslearn/data_sources/usgs_landsat.py

        """Initialize a new LandsatOliTirs instance.

        Args:
            config: the LayerConfig of the layer containing this data source


Remove this

APatrickJ · 2025-11-20T07:04:30Z

rslearn/data_sources/planet.py

        """Initialize a new Planet instance.

        Args:
            config: the LayerConfig of the layer containing this data source


Remove this

favyen2 requested a review from APatrickJ November 18, 2025 21:36

favyen2 closed this Nov 18, 2025

favyen2 reopened this Nov 18, 2025

favyen2 added 5 commits November 18, 2025 14:26

rename layer_type back to type and capitalize the enum values so they are parsed in same as before

0f25817

fix test_s3_dataset test

ca78f83

actually fix test

101950c

Merge remote-tracking branch 'origin/master' into favyen/20251118-dataset-config

59afec7

various test fixes

0d65f6d

This was referenced Nov 19, 2025

max_cloud_cover is ignored by Planetary Computer Sentinel‑2 data source #361

Open

Configuration overhaul #2

Closed

APatrickJ reviewed Nov 20, 2025
