Description
A typical GPS workflow might start like this:

```python
paths = ["s3://bucket/one.tiff"]
raster_layer = gps.RasterLayer.from_numpy_rdd(gps.LayerType.SPATIAL, gps.rasterio.get(paths))
tiled_layer = raster_layer.tile_to_layout(gps.GlobalLayout(256), target_crs=3857)
tiled_layer.count()
```
This is convenient for various reasons, but it carries the following costs:

- Conversion from `numpy` to GeoTrellis Tiles on read (low cost)
- Iterate over the full dataset to figure out the data resolution and extent
  - Seen as the `RasterSummary` spark job
- Iterate over the full dataset again to read it
- Spark shuffle to fit input tiles to some layout
- Spark shuffle to generate `BufferedTiles` for the reproject operation
  - The reproject operation is implied by the inclusion of `target_crs`
- Spark shuffle to compute the result of the reproject operation on `BufferedTiles`
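To make the layout-shuffle cost concrete, here is a minimal pure-Python sketch (no Spark; the grid dimensions and world bounds are illustrative, not the actual `GlobalLayout` math) of the keying step that forces a shuffle: each source tile's extent maps to the layout grid cells it covers, and tiles sharing a key must be co-located.

```python
from math import floor, ceil

# Illustrative world bounds: xmin, ymin, xmax, ymax
WORLD = (-180.0, -90.0, 180.0, 90.0)

def layout_keys(extent, cols=4, rows=2, world=WORLD):
    """Return the (col, row) layout-grid keys that `extent` intersects."""
    xmin, ymin, xmax, ymax = world
    tile_w = (xmax - xmin) / cols
    tile_h = (ymax - ymin) / rows
    exmin, eymin, exmax, eymax = extent
    c0 = max(0, floor((exmin - xmin) / tile_w))
    c1 = min(cols - 1, ceil((exmax - xmin) / tile_w) - 1)
    r0 = max(0, floor((eymin - ymin) / tile_h))
    r1 = min(rows - 1, ceil((eymax - ymin) / tile_h) - 1)
    return [(c, r) for c in range(c0, c1 + 1) for r in range(r0, r1 + 1)]

# A tile straddling a grid-column boundary lands under two shuffle keys.
print(layout_keys((-10.0, 0.0, 10.0, 40.0)))  # [(1, 1), (2, 1)]
```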
An alternative workflow that avoids the high cost of Spark shuffles on pixel tiles is prototyped here: https://github.com/geotrellis/geotrellis-contrib/blob/demo/wsel/wse/src/main/scala/Main.scala

This could be encapsulated in a GeoPySpark operation that combines the read and `tile_to_layout` (with optional reprojection) steps to produce a `TiledRasterLayer`.
Ideally the API can be similar to `tile_to_layout`, perhaps:

```python
def read_to_layout(paths,
                   layout=LocalLayout(),
                   target_crs=None,
                   resample_method=ResampleMethod.NEAREST_NEIGHBOR,
                   partition_strategy=None,
                   reader=ReadMethod.GEOTRELLIS):
```
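A pure-Python sketch of what such a combined operation might do internally (all names hypothetical; Spark is stand-in-free here): each input window is read, reprojected if needed, and keyed directly against the target layout in a single map step, so the only remaining shuffle is the final group-by-key.

```python
from collections import defaultdict

def read_to_layout_sketch(windows, key_fn, read_fn, merge_fn):
    """Hypothetical single-pass read + tile-to-layout, no intermediate shuffles."""
    grouped = defaultdict(list)
    for window in windows:            # one pass over the inputs
        tile = read_fn(window)        # read (and reproject) at the source
        for key in key_fn(window):    # key directly against the target layout
            grouped[key].append(tile)
    # In Spark this group-by-key would be the single remaining shuffle.
    return {k: merge_fn(tiles) for k, tiles in grouped.items()}

# Toy usage: windows are (key, value) pairs; merge sums overlapping values.
result = read_to_layout_sketch(
    windows=[("a", 1), ("a", 2), ("b", 5)],
    key_fn=lambda w: [w[0]],
    read_fn=lambda w: w[1],
    merge_fn=sum,
)
print(result)  # {'a': 3, 'b': 5}
```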
There is an outstanding question of what should happen to merge order. I think that currently any overlapping rasters get merged in arbitrary order as a result of the `tile_to_layout` operation, and that would be an easy default to start from here.
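A toy sketch (not GeoPySpark API; tiles modeled as dicts of pixels) of why arbitrary merge order matters: with a last-write-wins merge, the value of an overlapping pixel depends entirely on which source arrives last, so an order-independent default would need an explicit ordering such as sorting by input path.

```python
def merge_last_wins(tiles):
    """Merge dict-of-pixel tiles; later tiles overwrite earlier ones."""
    out = {}
    for tile in tiles:
        out.update(tile)
    return out

a = {"px0": "from_a"}
b = {"px0": "from_b"}

# Arbitrary order: the result flips depending on which source arrives last.
print(merge_last_wins([a, b]))  # {'px0': 'from_b'}
print(merge_last_wins([b, a]))  # {'px0': 'from_a'}
```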