Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mulitband rasters saved incorrectly in kedro_datasets_experimental.rioxarray.GeoTIFFDataset #980

Open
fgassert opened this issue Jan 9, 2025 · 4 comments
Labels
Community Issue/PR opened by the open-source community

Comments

@fgassert
Copy link

fgassert commented Jan 9, 2025

Description

Hi, thanks for creating this dataset plugin for us geospatial folks.

The GeoTIFFDataset uses rioxarray to open and write datasets, except in the case of multi-band rasters. In this case, a custom _save_multiband function will use rasterio to write the data.

if "band" in data.dims:
self._save_multiband(data, save_path)

There are three issues with this implementation, which result in the saved dataset to not matching the loaded dataset.

  1. The raster transform is calculated from the minimum and maximum lon/lat values, shrinking the extent by half a pixel on all sides.
  2. xarray attrs are not written to geotiff tags as expected.
  3. A nodata value is assigned when none exists.

In my pipelines, I've worked around the issue by writing a simple custom dataset that always uses the standard dataarray.rio.to_raster(), which appears to work as expected. However, given that effort was put into writing the _save_multiband in the first place, I feel like I'm missing something.

# custom dataset
class GeoTIFFDataset(geotiff_dataset.GeoTIFFDataset):
    def _save(self, data: xarray.DataArray) -> None:
        self._sanity_check(data)
        save_path = get_filepath_str(self._get_save_path(), self._protocol)
        data.rio.to_raster(save_path, **self._save_args)
        self._fs.invalidate_cache(save_path)

Context

This bug affects processing all raster datasets with a band dimension, e.g. landsat/sentinel data

Steps to Reproduce

Load and save a geotiff dataset with bands

# node
def copy_geotiff(input_geotiff: xarray.DataArray) -> xarray.DataArray:
    return input_geotiff

Expected Result

The output should be identical to the input

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Jan 9, 2025
@rashidakanchwala
Copy link
Contributor

Thank you for your query. I am looping in the creator of this dataset @tgoelles to assist with the above.

@tgoelles
Copy link
Contributor

Happy to see that geospatial folks are using it.

I had a reason to make a separate _save_multiband, but don't remember 🫣. I think something was missing in the rio.to_raster or it did not work back then.

Would be great if we could eliminate the need for _save_multiband.

I guess your issue 1 is the most critical one and a bit surprising. Maybe you can alaborate a bit on it.

I think number 2 can easily be implemented.

Number 3 is an opinionated move I guess, I added a default NAN value to be sure to have one. I had a horrible bug down the line related to pytorch and torchgeo which was hard to find because of it.

@fgassert
Copy link
Author

Thanks, for sure.

On why you might have written this. I see in the documentation for rioxarray, that xr.Dataset.rio.to_raster() only supports 2d arrays. Whereas xr.DataArray.rio.to_raster() doesn't have that restriction. The type annotations suggest you should always be getting DataArrays, but maybe somewhere along the way you were working with xr.Datasets?

For 1. Here, the data.x.min() will give you the center of the leftmost pixel, rather than the edge.

transform = from_bounds(
west=data.x.min(),
south=data.y.min(),
east=data.x.max(),
north=data.y.max(),
width=data[0].shape[1],
height=data[0].shape[0],
)

You can compare:

In [19]: rio.transform.from_bounds(arr.x.min(), arr.y.min(), arr.x.max(), arr.y.max(), arr[0].shape[1], arr[0].shape[0])
Out[19]: 
Affine(<xarray.DataArray 'x' ()> Size: 8B
array(9.9609375)
Coordinates:
    spatial_ref  int64 8B 0, <xarray.DataArray 'y' ()> Size: 8B
array(0.)
Coordinates:
    spatial_ref  int64 8B 0, <xarray.DataArray 'x' ()> Size: 8B
array(250162.63166085)
Coordinates:
    spatial_ref  int64 8B 0,
       <xarray.DataArray 'x' ()> Size: 8B
array(0.)
Coordinates:
    spatial_ref  int64 8B 0, <xarray.DataArray 'y' ()> Size: 8B
array(-9.9609375)
Coordinates:
    spatial_ref  int64 8B 0, <xarray.DataArray 'y' ()> Size: 8B
array(4317235.02328553)
Coordinates:
    spatial_ref  int64 8B 0)

In [20]: arr.rio.transform()
Out[20]: 
Affine(10.0, 0.0, 250157.63166085022,
       0.0, -10.0, 4317240.023285527)

@gionnid
Copy link

gionnid commented Jan 25, 2025

Hi, I also belong to the group of geo folks using kedro!
I want to get involved with the project and taking a look around I found this issue.

I'd like to help with this since I already had to make a rasterio-based GeoTiffDataset for our company.
I think tests could highlight all the issues you have been discussing and I'm willing to write them if it can be helpful.
Do experimental datasets have a test folder? I can't find one.

Also, I think it could be useful to have a rasterio-numpy implementation of this class, to be used in light environments.
Tests could be beneficial to confirm different implementations of the same class using different underlying libraries work in similar ways.

EDIT:
I forked the repo and tried to draft a couple of tests, and indeed I could replicate the first bug mentioned (wrong transform):

AssertionError: assert Affine(0.9, 0....0, -0.9, 9.0) == Affine(1.0, 0....0, 1.0, -0.5)

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
None yet
Development

No branches or pull requests

5 participants