-
Notifications
You must be signed in to change notification settings - Fork 8
Improve dataset configuration #371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This change will require modifications to all existing dataset configs. - Switch from manual parsing to pydantic models for the bulk of the dataset config. - Use jsonargparse for initializing data sources. The dataset config has a class_path and init_args for the data source, where the latter is arbitrary dict that is to be handled by jsonargparse. - Add DataSourceContext to pass the dataset and LayerConfig to the data source. Most data sources use the context to do things like adjust file paths that are relative to the dataset root directory. The context is passed to the data source by injecting it into the init_args. - Remove the RasterFormats and VectorFormats class registries. Instead, these are now also initialized via jsonargparse, and the class_path is directly set from the dataset config. - Remove the Materializers registry. We have been only using RasterMaterializer and VectorMaterializer for quite some time, so now we just directly initialize them depending on the layer type. - Remove rslearn.data_sources.raster_source, it provides is_raster_needed but this was only used in gcp_public_data and I changed that now to determine the needed assets upon initialization (similar to other data sources). - Remove tile store backwards compatibility, now only the jsonargparse format is accepted. This shouldn't break much because we rarely specify the tile_store in the dataset config.
|
@favyen2 can you elaborate on the scope of the changes you're referring to here?
I haven't looked at this PR deeply; the summary notes seem positive. Still, I'm wondering if changes that break core API contracts should be presented in design docs to the broader group at this point. |
|
Here is an example. Old version: "sentinel2": {
"band_sets": [{
"bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
"dtype": "uint16"
}],
"data_source": {
"cache_dir": "cache/planetary_computer",
"duration": "270d",
"harmonize": true,
"ingest": false,
"name": "rslearn.data_sources.planetary_computer.Sentinel2",
"query_config": {
"max_matches": 6,
"min_matches": 6,
"period_duration": "30d",
"space_mode": "PER_PERIOD_MOSAIC"
},
"sort_by": "eo:cloud_cover",
"time_offset": "-90d"
},
"type": "raster"
}New version: "sentinel2": {
"band_sets": [{
"bands": ["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B09", "B11", "B12"],
"dtype": "uint16"
}],
"data_source": {
"class_path": "rslearn.data_sources.planetary_computer.Sentinel2",
"init_args": {
"cache_dir": "cache/planetary_computer",
"harmonize": true,
"sort_by": "eo:cloud_cover",
},
"duration": "270d",
"time_offset": "-90d",
"ingest": false,
"query_config": {
"max_matches": 6,
"min_matches": 6,
"period_duration": "30d",
"space_mode": "PER_PERIOD_MOSAIC"
}
},
"type": "raster"
}The main change is the separation of generic data source configuration options (like duration, time_offset, and query_config) from source-specific ones (like cache_dir, harmonize, and sort_by that are arguments to rslearn.data_sources.planetary_computer.Sentinel2). It is hard to avoid since in some ways that is the point of this change, otherwise there isn't a good way to e.g. throw an error if an unknown key is passed, because different parts of the system won't know if there are extra config options that will be read from the same config section by other parts of the system. |
APatrickJ
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this on!
| if assets is not None: | ||
| asset_bands = {asset_key: self.BANDS[asset_key] for asset_key in assets} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will overwrite asset_bands if it was previously set above, is that the intention or should this be an elif?
| for band_set in context.layer_config.band_sets: | ||
| if not set(band_set.bands).intersection(set(band_names)): | ||
| continue | ||
| asset_bands[asset_key] = band_names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a break here to exit the inner loop once a match is found?
| """Initialize a new LandsatOliTirs instance. | ||
| Args: | ||
| config: the LayerConfig of the layer containing this data source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this
| """Initialize a new Planet instance. | ||
| Args: | ||
| config: the LayerConfig of the layer containing this data source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this
This change will require modifications to all existing dataset configs.