Investigate large parquet sort performance

# Investigation so far
Sorting 10M or 20M rows (even by an integer) produces quite poor performance. The following chart comes from the performance investigation described in #787.

<img width="599" height="372" alt="Image" src="https://github.com/user-attachments/assets/4ca121b1-b7a1-45f7-8ee3-5b1f785994d2" />

The performance is poor despite duckdb-wasm recognizing `SORT ... LIMIT` as a `TOP_N` query, as evidenced by the query plan:

```
┌───────────────────────────┐
│           TOP_N           │
│    ────────────────────   │
│          Top: 70          │
│                           │
│         Order By:         │
│ memory.main."synthetic-10m│
│ .parquet (5/18/2026 12:11 │
│ :43 PM)"."File Size" DESC │
│ memory.main."synthetic-10m│
│ .parquet (5/18/2026 12:11 │
│:43 PM)".hidden_bff_uid ASC│
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│    ────────────────────   │
│         Function:         │
│        PARQUET_SCAN       │
│                           │
│        Projections:       │
│          File ID          │
│         File Path         │
│         File Name         │
│         File Size         │
│          Uploaded         │
│         Thumbnail         │
│         cell_line         │
│          plate_id         │
│          channel          │
│          well_row         │
│        well_column        │
│          z_slice          │
│         timepoint         │
│       passage_number      │
│            ...            │
│        imaging_date       │
│       last_modified       │
│         created_at        │
│      processing_start     │
│       processing_end      │
│       last_accessed       │
│      export_timestamp     │
│      ingest_timestamp     │
│        qc_timestamp       │
│     publish_timestamp     │
│      time_to_flat_ms      │
│  acquisition_duration_ms  │
│   processing_duration_ms  │
│    transfer_duration_ms   │
│    analysis_duration_ms   │
│                           │
│          Filters:         │
│ optional: Dynamic Filter  │
│        (File Size)        │
│                           │
│      ~10,000,000 rows     │
└───────────────────────────┘
```

* I did a manual test (N=1) with this query in the browser, reading these files from S3. These results suggest trouble with aggregate parquets.
* I don't think it makes sense to focus on 10M+10M aggregate sort performance up to par, since "par" here is already unusably slow.
* I canceled the 10M+10M query because it took so long.
* The page size for this test was 52, similar to the small page size of 50 used by the benchmark harness.
* In all three cases, the total data transferred is two orders of magnitude larger than I think an optimized query could be, suggesting substantial room for sort performance improvement.
  * The sort queries are done against the File Size column. One row group of this column is about 0.9MB on-disk.
  * Basic query optimization could look like: inspect all row group metadata to find the 52 row groups with the largest maxima, then scan all the rows in those groups to find the 52 largest, then sort those.
  * This would require reading 47MB, plus metadata for all 164 row groups.

| Case | Sort time | Number of requests | Data transferred |  
|--|-|-|-|
| 10M | 5.44 min | 1,677 | 4.73 GB |
| 20M | 7.65 min | 1,657 | 4.69 GB |
| 10M+10M | 48+ min | 47,794+ | 5.4+ GB |

# Questions to investigate
* Why does the sort load so much more of the parquet data than I expect? In the 10M case, the whole file is only 3GB on disk, but the TOP_N operation loads over 150% of that (as measured by network tab). In the 20M case, the whole file is 6GB, but I expect much less data.
* What logic does duckdb do with the row group statistics? Which row groups are queried?


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate large parquet sort performance #802

Investigation so far

Questions to investigate

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Case	Sort time	Number of requests	Data transferred
10M	5.44 min	1,677	4.73 GB
20M	7.65 min	1,657	4.69 GB
10M+10M	48+ min	47,794+	5.4+ GB

Investigate large parquet sort performance #802

Description

Investigation so far

Questions to investigate

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions