Skip to content

Conversation

@favyen2
Copy link
Collaborator

@favyen2 favyen2 commented Nov 18, 2025

This change will require modifications to all existing dataset configs.

  • Switch from manual parsing to pydantic models for the bulk of the dataset config.
  • Switch from manual parsing to jsonargparse for initializing data sources. Now, the dataset config has class_path and init_args fields for the data source, where the latter is a dict handled by jsonargparse.
  • Add DataSourceContext to pass the dataset path and LayerConfig to the data source. Most data sources use the context to do things like adjust file paths that are relative to the dataset root directory. The context is passed to the data source by injecting it into the init_args. Not the cleanest solution, but it seemed better than passing it via a method after instantiation.
  • Remove the RasterFormats and VectorFormats class registries, and the manual argument parsing. Instead, these are now also initialized via jsonargparse, and the class_path is directly set from the dataset config.
  • Remove the Materializers registry. We have been only using RasterMaterializer and VectorMaterializer for quite some time, so now we just directly initialize them depending on the layer type.
  • Remove rslearn.data_sources.raster_source. This module provided is_raster_needed, but that was only still used in gcp_public_data, and I have changed gcp_public_data now to determine the needed assets upon initialization (similar to other data sources).
  • Remove TileStore configuration backwards compatibility. Now only the jsonargparse format is accepted. This shouldn't break much because we rarely specify the tile_store in the dataset config, and the old format has been deprecated for a while (although it looks like it was still copied and pasted recently into a few places, mostly in tests which I have updated).

This change will require modifications to all existing dataset configs.

- Switch from manual parsing to pydantic models for the bulk of the dataset config.
- Use jsonargparse for initializing data sources. The dataset config has a class_path
  and init_args for the data source, where the latter is arbitrary dict that is to be
  handled by jsonargparse.
- Add DataSourceContext to pass the dataset and LayerConfig to the data source. Most
  data sources use the context to do things like adjust file paths that are relative
  to the dataset root directory. The context is passed to the data source by injecting
  it into the init_args.
- Remove the RasterFormats and VectorFormats class registries. Instead, these are now
  also initialized via jsonargparse, and the class_path is directly set from the
  dataset config.
- Remove the Materializers registry. We have been only using RasterMaterializer and
  VectorMaterializer for quite some time, so now we just directly initialize them
  depending on the layer type.
- Remove rslearn.data_sources.raster_source, it provides is_raster_needed but this was
  only used in gcp_public_data and I changed that now to determine the needed assets
  upon initialization (similar to other data sources).
- Remove tile store backwards compatibility, now only the jsonargparse format is
  accepted. This shouldn't break much because we rarely specify the tile_store in the
  dataset config.
@favyen2 favyen2 requested a review from APatrickJ November 18, 2025 21:36
@cmwilhelm
Copy link
Contributor

@favyen2 can you elaborate on the scope of the changes you're referring to here?

This change will require modifications to all existing dataset configs.

I haven't looked at this PR deeply; the summary notes seem positive. Still, I'm wondering if changes that break core API contracts should be presented in design docs to the broader group at this point.

@favyen2
Copy link
Collaborator Author

favyen2 commented Nov 18, 2025

Here is an example.

Old version:

    "sentinel2": {
      "band_sets": [{
          "bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
          "dtype": "uint16"
      }],
      "data_source": {
        "cache_dir": "cache/planetary_computer",
        "duration": "270d",
        "harmonize": true,
        "ingest": false,
        "name": "rslearn.data_sources.planetary_computer.Sentinel2",
        "query_config": {
          "max_matches": 6,
          "min_matches": 6,
          "period_duration": "30d",
          "space_mode": "PER_PERIOD_MOSAIC"
        },
        "sort_by": "eo:cloud_cover",
        "time_offset": "-90d"
      },
      "type": "raster"
    }

New version:

    "sentinel2": {
      "band_sets": [{
          "bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
          "dtype": "uint16"
      }],
      "data_source": {
        "class_path": "rslearn.data_sources.planetary_computer.Sentinel2",
        "init_args": {
          "cache_dir": "cache/planetary_computer",
          "harmonize": true,
          "sort_by": "eo:cloud_cover",
        },
        "duration": "270d",
        "time_offset": "-90d",
        "ingest": false,
        "query_config": {
          "max_matches": 6,
          "min_matches": 6,
          "period_duration": "30d",
          "space_mode": "PER_PERIOD_MOSAIC"
        }
      },
      "type": "raster"
    }

The main change is the separation of generic data source configuration options (like duration, time_offset, and query_config) from source-specific ones (like cache_dir, harmonize, and sort_by that are arguments to rslearn.data_sources.planetary_computer.Sentinel2). It is hard to avoid since in some ways that is the point of this change, otherwise there isn't a good way to e.g. throw an error if an unknown key is passed, because different parts of the system won't know if there are extra config options that will be read from the same config section by other parts of the system.

@favyen2 favyen2 closed this Nov 18, 2025
@favyen2 favyen2 reopened this Nov 18, 2025
Copy link
Collaborator

@APatrickJ APatrickJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on!

Comment on lines +587 to 588
if assets is not None:
asset_bands = {asset_key: self.BANDS[asset_key] for asset_key in assets}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will overwrite asset_bands if it was previously set above, is that the intention or should this be an elif?

for band_set in context.layer_config.band_sets:
if not set(band_set.bands).intersection(set(band_names)):
continue
asset_bands[asset_key] = band_names
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a break here to exit the inner loop once a match is found?

"""Initialize a new LandsatOliTirs instance.
Args:
config: the LayerConfig of the layer containing this data source
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

"""Initialize a new Planet instance.
Args:
config: the LayerConfig of the layer containing this data source
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants