Difficulty optimizing a simple query against parquet on object store #723

C-Loftus · 2025-12-04T21:20:27Z

Thank you for the excellent work everyone does on duckdb. I have an optimization question that I am struggling with, and I am wondering if folks have opinions. Thank you.

Context

I have a geoparquet file hosted on gcs
- It is public and not behind any auth since it's open data
- I prefer to keep it in gcs since I am trying to access it from an ephemeral container without a persistent disk
I can query it in <20 ms locally, but it takes >7 minutes against gcs
I am generating it from gdal using the command below
- I have explored both snappy and zstd
- I have tried tweaking the row group size (both the default and bumping it higher/lower)

ogr2ogr -f "Parquet" reference_flowline.parquet reference_flowline.gpkg -t_srs EPSG:4326  -lco COMPRESSION=ZSTD -lco ROW_GROUP_SIZE=122880

EXPLAIN SELECT fid
FROM read_parquet(
    'gcs://MY_FILE.parquet'
) AS catchments
WHERE ST_Intersects(geom, ST_Point(-108.50231860661755, 39.05108882481538))
LIMIT 1

Physical Plan

┌───────────────────────────┐
│      STREAMING_LIMIT      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│    ────────────────────   │
│         Function:         │
│        READ_PARQUET       │
│                           │
│      Projections: fid     │
│                           │
│          Filters:         │
│ ST_Intersects(geom, '\x00 │
│  \x00\x00\x00\x00\x00\x00 │
│  \x00\x00\x00\x00\x00\x01 │
│  \x00\x00\x00\x0F\xE6\xF0 │
│  \xFC% [\xC0\xF6\xE2\x1F  │
│ \x14\x8A\x86C@'::GEOMETRY)│
│                           │
│       ~533,551 rows       │
└───────────────────────────┘
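
A possible reading of the plan above: the `ST_Intersects` call is the only filter, and Parquet min/max statistics on a WKB blob presumably say nothing about location, so no row group can be skipped on that predicate alone. Below is a minimal sketch in plain Python (no DuckDB involved) of how pruning on per-row-group bounding-box statistics would behave. The `RowGroup` type and the toy extents are made up for illustration and are not read from the real file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    """Bounding box covering every geometry in one row group (hypothetical type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def row_groups_to_read(groups, lon, lat):
    """Return the indexes of row groups whose bounding box contains the point."""
    return [
        i for i, g in enumerate(groups)
        if g.xmin <= lon <= g.xmax and g.ymin <= lat <= g.ymax
    ]


# Toy extents, not taken from the real file.
# Unsorted file: every row group spans the whole dataset, so nothing is skipped.
unsorted_groups = [RowGroup(-125.0, 24.0, -66.0, 50.0)] * 4
# Spatially sorted file: each row group covers one longitude band.
sorted_groups = [
    RowGroup(-125.0, 24.0, -110.0, 50.0),
    RowGroup(-110.0, 24.0, -95.0, 50.0),
    RowGroup(-95.0, 24.0, -80.0, 50.0),
    RowGroup(-80.0, 24.0, -66.0, 50.0),
]

point = (-108.50231860661755, 39.05108882481538)
print(row_groups_to_read(unsorted_groups, *point))  # every group must be read
print(row_groups_to_read(sorted_groups, *point))    # only the matching band
```

If the rows were written in arbitrary order, each of the 22 row groups likely covers most of the country, which would put the file in the first case even when bounding-box statistics are available.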

Geoparquet Metadata

╭────────────────────┬─────────┬────────────┬────────────┬─────────────┬──────────┬─────────────────┬──────────────────────────────────────────┬──────────────────────────────────╮
│ COLUMN             │ TYPE    │ ANNOTATION │ REPETITION │ COMPRESSION │ ENCODING │ GEOMETRY TYPES  │ BOUNDS                                   │ DETAIL                           │
├────────────────────┼─────────┼────────────┼────────────┼─────────────┼──────────┼─────────────────┼──────────────────────────────────────────┼──────────────────────────────────┤
│ fid                │ int64   │            │ 1          │ zstd        │          │                 │                                          │                                  │
│ COMID              │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ FromNode           │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ ToNode             │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ StartFlag          │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ StreamCalc         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ Divergence         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ DnMinorHyd         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ toCOMID            │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ FCODE              │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ LENGTHKM           │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ REACHCODE          │ binary  │ string     │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ FromMeas           │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ ToMeas             │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ AreaSqKM           │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ ArbolateSu         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ TerminalPa         │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ Hydroseq           │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ LevelPathI         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ Pathlength         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ DnLevelPat         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ DnHydroseq         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ TotDASqKM          │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ TerminalFl         │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ streamleve         │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ StreamOrde         │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ vpuin              │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ vpuout             │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ wbareatype         │ binary  │ string     │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ slope              │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ slopelenkm         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ FTYPE              │ binary  │ string     │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ gnis_name          │ binary  │ string     │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ gnis_id            │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ WBAREACOMI         │ int32   │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ hwnodesqkm         │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ RPUID              │ binary  │ string     │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ VPUID              │ binary  │ string     │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ roughness          │ double  │            │ 0..1       │ zstd        │          │                 │                                          │                                  │
│ geom               │ binary  │            │ 0..1       │ zstd        │ WKB      │ MultiLineString │ [-124.72442320437919,                    │  orientation │ counterclockwise  │
│                    │         │            │            │             │          │                 │ 24.953495478277205, -66.98839544435954,  │                                  │
│                    │         │            │            │             │          │                 │ 49.37661780470554]                       │                                  │
│ geom_bbox          │         │ group      │ 0..1       │             │          │                 │                                          │                                  │
├────────────────────┼─────────┴────────────┴────────────┴─────────────┴──────────┴─────────────────┴──────────────────────────────────────────┴──────────────────────────────────┤
│ Rows               │ 2667756                                                                                                                                                    │
│ Row Groups         │ 22                                                                                                                                                         │
│ GeoParquet Version │ 1.1.0                                                                                                                                                      │
╰────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
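
The metadata lists a `geom_bbox` group next to `geom`. Assuming it follows the GeoParquet 1.1 bounding-box covering layout (a struct with `xmin`, `ymin`, `xmax`, `ymax` fields, which the table above does not confirm), one thing that may be worth trying is a plain numeric prefilter on that column ahead of `ST_Intersects`. The helper below is a hypothetical sketch that only builds the SQL text. Whether DuckDB can use the comparisons to skip row groups depends on the DuckDB version and on the file being spatially sorted.

```python
def bbox_point_query(url: str, lon: float, lat: float) -> str:
    """Build a DuckDB query that prefilters on geom_bbox before ST_Intersects.

    Assumption: geom_bbox is a struct with xmin/ymin/xmax/ymax fields.
    The url is interpolated as-is, so this is only suitable for trusted input.
    """
    # repr() yields a float literal that round-trips to the same value.
    x, y = repr(float(lon)), repr(float(lat))
    return (
        "SELECT fid\n"
        f"FROM read_parquet('{url}')\n"
        f"WHERE geom_bbox.xmin <= {x} AND geom_bbox.xmax >= {x}\n"
        f"  AND geom_bbox.ymin <= {y} AND geom_bbox.ymax >= {y}\n"
        f"  AND ST_Intersects(geom, ST_Point({x}, {y}))\n"
        "LIMIT 1"
    )


print(bbox_point_query("gcs://MY_FILE.parquet", -108.50231860661755, 39.05108882481538))
```

The exact `ST_Intersects` test stays in the query, so the numeric comparisons only narrow the candidate rows and do not change the result.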

What works

I am finding that I can use the following, but this seems to just download all the data and put it in memory. At that point, it seems to make more sense to just download the parquet file to disk.

CREATE TABLE flowlines AS SELECT fid, geom
FROM read_parquet(
    'gcs://national-hydrologic-geospatial-fabric-reference-hydrofabric/reference_flowlines.parquet'
)

This is the output of the macOS system monitor after running that and then building the rtree spatial index.

cboettig · 2025-12-04T21:25:39Z

@C-Loftus Pretty sure you just want to create a VIEW instead of a table. This allows duckdb to query the remote parquet via range requests instead of downloading the entire thing (and storing it in RAM if you are using an in-memory invocation).

1 reply

C-Loftus Dec 4, 2025
Author

Thank you very much for your response. I tried to do that but perhaps I am missing something. It seems quite slow since my understanding is that I cannot build the rtree index on the VIEW. As such, without the rtree index, I have to do a sequential scan, right?

I haven't timed this yet but the intersection is currently taking >4 minutes (EDIT: It appears to have completed in 4 minutes 30 seconds)

DROP VIEW IF EXISTS flowline;
CREATE VIEW flowline AS SELECT fid, geom
FROM read_parquet(
    'gcs://national-hydrologic-geospatial-fabric-reference-hydrofabric/reference_flowline.parquet'
);

SELECT * FROM flowline
WHERE ST_Intersects(geom, ST_Point(-108.50231860661755, 39.05108882481538))
LIMIT 1

C-Loftus · 2025-12-04T21:56:27Z

From discussing with a coworker, it seems like one way to solve this might be to partition the parquet dataset into 50 separate parquet files by geometry, i.e. a separate parquet file for each state, and have another table with the geometry of each state. Then a user could:

- find what state the point is in first
- use the state to find the appropriate parquet file in the partitioned dataset, thus reducing the number of rows to scan sequentially

This would likely work, although I was hoping to find a way to do this without needing an extra states table, which adds a bit of overhead.
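
The two-step lookup described above can be sketched without a database. In this illustration the partition keys, their rough extents, the bucket name, and the `state=.../part.parquet` layout are all placeholders, not real state boundaries or real paths. Bounding boxes stand in for the state geometries, so a point near a border may match more than one partition.

```python
# Hypothetical lookup table: partition key -> (xmin, ymin, xmax, ymax).
# Toy extents for illustration only, not real state boundaries.
PARTITION_BOUNDS = {
    "A": (-125.0, 24.0, -100.0, 50.0),
    "B": (-105.0, 24.0, -85.0, 50.0),
    "C": (-90.0, 24.0, -66.0, 50.0),
}


def candidate_files(lon, lat, bounds=PARTITION_BOUNDS,
                    prefix="gcs://MY_BUCKET/flowlines"):
    """Return the partition files whose extent contains the point."""
    return [
        f"{prefix}/state={key}/part.parquet"
        for key, (xmin, ymin, xmax, ymax) in bounds.items()
        if xmin <= lon <= xmax and ymin <= lat <= ymax
    ]


files = candidate_files(-108.50231860661755, 39.05108882481538)
print(files)
```

The resulting list could then be handed to `read_parquet`, which accepts a list of files, so the exact `ST_Intersects` test only runs over the matching partitions. A table of extents this small could also be kept inline on the client, which may reduce the overhead of maintaining a separate states table.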
