Skip to content

Investigate large parquet sort performance #802

@pgarrison

Description

@pgarrison

Investigation so far

Sorting 10M or 20M rows (even by an integer) produces quite poor performance. The following chart comes from the performance investigation described in #787.

Image

The performance is poor despite duckdb-wasm recognizing SORT ... LIMIT as a TOP_N query, as evidenced by the query plan:

┌───────────────────────────┐
│           TOP_N           │
│    ────────────────────   │
│          Top: 70          │
│                           │
│         Order By:         │
│ memory.main."synthetic-10m│
│ .parquet (5/18/2026 12:11 │
│ :43 PM)"."File Size" DESC │
│ memory.main."synthetic-10m│
│ .parquet (5/18/2026 12:11 │
│:43 PM)".hidden_bff_uid ASC│
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│    ────────────────────   │
│         Function:         │
│        PARQUET_SCAN       │
│                           │
│        Projections:       │
│          File ID          │
│         File Path         │
│         File Name         │
│         File Size         │
│          Uploaded         │
│         Thumbnail         │
│         cell_line         │
│          plate_id         │
│          channel          │
│          well_row         │
│        well_column        │
│          z_slice          │
│         timepoint         │
│       passage_number      │
│            ...            │
│        imaging_date       │
│       last_modified       │
│         created_at        │
│      processing_start     │
│       processing_end      │
│       last_accessed       │
│      export_timestamp     │
│      ingest_timestamp     │
│        qc_timestamp       │
│     publish_timestamp     │
│      time_to_flat_ms      │
│  acquisition_duration_ms  │
│   processing_duration_ms  │
│    transfer_duration_ms   │
│    analysis_duration_ms   │
│                           │
│          Filters:         │
│ optional: Dynamic Filter  │
│        (File Size)        │
│                           │
│      ~10,000,000 rows     │
└───────────────────────────┘
  • I did a manual test (N=1) with this query in the browser, reading these files from S3. These results suggest trouble with aggregate parquets.
  • I don't think it makes sense to focus on 10M+10M aggregate sort performance up to par, since "par" here is already unusably slow.
  • I canceled the 10M+10M query because it took so long.
  • The page size for this test was 52, similar to the small page size of 50 used by the benchmark harness.
  • In all three cases, the total data transferred is two orders of magnitude larger than I think an optimized query could be, suggesting substantial room for sort performance improvement.
    • The sort queries are done against the File Size column. One row group of this column is about 0.9MB on-disk.
    • Basic query optimization could look like: inspect all row group metadata to find the 52 row groups with the largest maxima, then scan all the rows in those groups to find the 52 largest, then sort those.
    • This would require reading 47MB, plus metadata for all 164 row groups.
Case Sort time Number of requests Data transferred
10M 5.44 min 1,677 4.73 GB
20M 7.65 min 1,657 4.69 GB
10M+10M 48+ min 47,794+ 5.4+ GB

Questions to investigate

  • Why does the sort load so much more of the parquet data than I expect? In the 10M case, the whole file is only 3GB on disk, but the TOP_N operation loads over 150% of that (as measured by network tab). In the 20M case, the whole file is 6GB, but I expect much less data.
  • What logic does duckdb do with the row group statistics? Which row groups are queried?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions