Use geotrellis-contrib to perform windowed reads on remote rasters #689

Open
@echeipesh

A typical GeoPySpark (gps) workflow might start like this:

    import geopyspark as gps

    paths = ["s3://bucket/one.tiff"]
    raster_layer = gps.RasterLayer.from_numpy_rdd(gps.LayerType.SPATIAL, gps.rasterio.get(paths))
    tiled_layer = raster_layer.tile_to_layout(gps.GlobalLayout(256), target_crs=3857)
    tiled_layer.count()

This works well for various reasons, but it has the following costs:

  • Conversion from numpy arrays to GeoTrellis Tiles on read (low cost)
  • A pass over the full dataset to determine the data resolution and extent
    • Visible as the RasterSummary Spark job
  • A second pass over the full dataset to actually read it
  • A Spark shuffle to fit the input tiles to some layout
  • A Spark shuffle to generate BufferedTiles for the reproject operation
    • The reproject operation is implied by the inclusion of target_crs
  • A Spark shuffle to compute the result of the reproject operation on the BufferedTiles

An alternative workflow that avoids the high cost of Spark shuffles on pixel tiles in this process is prototyped here: https://github.com/geotrellis/geotrellis-contrib/blob/demo/wsel/wse/src/main/scala/Main.scala

This can be encapsulated in a GeoPySpark operation that combines the read and tile_to_layout (with optional reprojection) steps to produce a TiledRasterLayer.
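
To make the windowed-read idea concrete, here is a minimal pure-Python sketch of how layout keys and read windows could be derived from raster metadata alone, before any pixels are read. It is not taken from the linked prototype; `Extent`, `Layout`, `keys_for_extent`, and `window_for_key` are hypothetical names used only for illustration, and the layout is assumed to be a plain regular grid with row 0 at the top.

```python
import math
from typing import Iterator, NamedTuple, Tuple


class Extent(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


class Layout(NamedTuple):
    """Hypothetical regular tile grid covering `extent`."""
    extent: Extent
    cols: int  # number of tile columns
    rows: int  # number of tile rows


def keys_for_extent(layout: Layout, raster: Extent) -> Iterator[Tuple[int, int]]:
    """Yield the (col, row) keys of layout tiles that overlap `raster`."""
    tile_w = (layout.extent.xmax - layout.extent.xmin) / layout.cols
    tile_h = (layout.extent.ymax - layout.extent.ymin) / layout.rows
    col_min = max(0, math.floor((raster.xmin - layout.extent.xmin) / tile_w))
    col_max = min(layout.cols - 1,
                  math.ceil((raster.xmax - layout.extent.xmin) / tile_w) - 1)
    # Rows count downward from the top (ymax) edge of the layout.
    row_min = max(0, math.floor((layout.extent.ymax - raster.ymax) / tile_h))
    row_max = min(layout.rows - 1,
                  math.ceil((layout.extent.ymax - raster.ymin) / tile_h) - 1)
    for row in range(row_min, row_max + 1):
        for col in range(col_min, col_max + 1):
            yield (col, row)


def window_for_key(layout: Layout, raster: Extent, key: Tuple[int, int]) -> Extent:
    """Return the part of `raster` that falls inside the tile at `key`.

    This is the only region a task would need to read for that key.
    """
    col, row = key
    tile_w = (layout.extent.xmax - layout.extent.xmin) / layout.cols
    tile_h = (layout.extent.ymax - layout.extent.ymin) / layout.rows
    tile_xmin = layout.extent.xmin + col * tile_w
    tile_ymax = layout.extent.ymax - row * tile_h
    return Extent(
        max(raster.xmin, tile_xmin),
        max(raster.ymin, tile_ymax - tile_h),
        min(raster.xmax, tile_xmin + tile_w),
        min(raster.ymax, tile_ymax),
    )


layout = Layout(Extent(0.0, 0.0, 4.0, 4.0), cols=4, rows=4)
raster = Extent(0.5, 2.5, 1.5, 3.5)
keys = list(keys_for_extent(layout, raster))
```

Because the keys come from metadata only, the (path, key) pairs could be distributed first and each task could then read just its own window, which is what would remove the need to shuffle pixel tiles.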

Ideally, the API could be similar to tile_to_layout, perhaps:

    def read_to_layout(paths,
                       layout=LocalLayout(),
                       target_crs=None,
                       resample_method=ResampleMethod.NEAREST_NEIGHBOR,
                       partition_strategy=None,
                       reader=ReadMethod.GEOTRELLIS):
        ...  # proposed signature only; the body is not specified here

There is an outstanding question of what should happen to merge order. I think that currently any overlapping rasters get merged in arbitrary order as a result of the tile_to_layout operation, and that would be an easy default to start from here.
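
To illustrate why merge order matters, here is a small pure-Python sketch. It assumes a merge in which the first tile wins wherever both tiles have data and the second tile only fills NoData cells; that is an assumption about the merge semantics, not a statement of what tile_to_layout does. `merge` and `merge_in_order` are hypothetical helpers, and `None` stands in for NoData.

```python
from functools import reduce
from typing import List, Optional

Tile = List[List[Optional[int]]]  # None stands in for NoData


def merge(base: Tile, other: Tile) -> Tile:
    """Fill NoData cells of `base` from `other`; `base` wins where both have data."""
    return [
        [b if b is not None else o for b, o in zip(base_row, other_row)]
        for base_row, other_row in zip(base, other)
    ]


def merge_in_order(tiles: List[Tile]) -> Tile:
    """Merge tiles left to right, so earlier tiles take priority."""
    return reduce(merge, tiles)


# Two tiles for the same key that overlap in the top-right cell.
a: Tile = [[1, 1], [None, None]]
b: Tile = [[None, 2], [2, 2]]

a_then_b = merge_in_order([a, b])
b_then_a = merge_in_order([b, a])

# The two results differ only in the cell where both inputs have data.
differs = a_then_b != b_then_a
```

With an arbitrary order either result can come out; fixing the order of the input list (for example by the order of `paths`) would be one way to make the output deterministic later on.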
