Support aggregate parquet data sources by pgarrison · Pull Request #787 · AllenInstitute/biofile-finder

pgarrison · 2026-05-12T22:34:36Z

Context

In #638 we dropped support for aggregating parquet files temporarily in order to ship improved parquet read performance. This PR brings it back.

Changes

Aggregate parquet sources are implemented using duckdb's internal parquet aggregation feature, passing an array to read_parquet.
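As a rough illustration of what passing an array to read_parquet can look like, here is a minimal sketch. The helper names (`quoteSqlString`, `quoteIdentifier`, `buildAggregateParquetSql`), the view name, and the choice of CREATE VIEW are assumptions made for this example and are not taken from this PR's diff; only the `read_parquet([...])` list form is DuckDB syntax.

```typescript
// Hypothetical helpers: illustrative only, not the PR's actual implementation.

// Quote a value as a SQL string literal, doubling embedded single quotes.
function quoteSqlString(value: string): string {
    return `'${value.replace(/'/g, "''")}'`;
}

// Quote a name as a SQL identifier, doubling embedded double quotes.
function quoteIdentifier(name: string): string {
    return `"${name.replace(/"/g, '""')}"`;
}

// Build one statement that exposes several parquet files as a single relation
// by passing a list of paths to read_parquet.
function buildAggregateParquetSql(viewName: string, urls: string[]): string {
    if (urls.length === 0) {
        throw new Error("At least one parquet source is required");
    }
    const list = urls.map(quoteSqlString).join(", ");
    return `CREATE VIEW ${quoteIdentifier(viewName)} AS SELECT * FROM read_parquet([${list}])`;
}

const aggregateSql = buildAggregateParquetSql("aggregate_source", [
    "http://localhost:18765/fixtures/synthetic-10m.parquet",
    "http://localhost:18765/fixtures/synthetic-10m-copy.parquet",
]);
console.log(aggregateSql);
```

Because the reader sees every file in one call, duckdb can plan the scan across all inputs rather than over pre-materialized tables.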

This option should give duckdb full opportunity to optimize by pushing query work down into the parquet reader.
The column renaming semantics aren't as nice as the CREATE TABLE implementation for CSVs. Column names are matched exactly: inconsistent casing across inputs creates split columns.
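Since names that differ only by case end up as separate columns, one way to surface the problem early is to compare column names across inputs before aggregating. The sketch below is a hypothetical pre-flight check, not something this PR adds; `findCaseCollisions` and the example column names are made up for illustration.

```typescript
// Hypothetical check: given the column names of each input file, report
// groups of names that differ only by letter case.
function findCaseCollisions(columnsPerFile: string[][]): string[][] {
    const variantsByLower = new Map<string, Set<string>>();
    for (const columns of columnsPerFile) {
        for (const column of columns) {
            const key = column.toLowerCase();
            const variants = variantsByLower.get(key) ?? new Set<string>();
            variants.add(column);
            variantsByLower.set(key, variants);
        }
    }

    const collisions: string[][] = [];
    variantsByLower.forEach((variants) => {
        if (variants.size > 1) {
            const group: string[] = [];
            variants.forEach((variant) => group.push(variant));
            collisions.push(group.sort());
        }
    });
    return collisions;
}

// Illustrative column names only.
const caseCollisions = findCaseCollisions([
    ["File Path", "Cell Line"],
    ["File Path", "cell line"],
]);
console.log(caseCollisions); // one group containing "Cell Line" and "cell line"
```

A non-empty result could be shown to the user as a warning, or used to decide on a rename before the files are combined.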

Performance

I evaluated performance of the aggregate tables by using the benchmarking harness under review in #739.

I conducted 80 trials on each of three cases: a 10M row test file, a 20M row test file, and an aggregate source composed of two copies of the same 10M test file.

To ensure duckdb doesn't recognize the two copies of the input file as the same and just download one of them, I uploaded a copy of the input file to the test bucket. In the github actions test configuration, the file is loaded from http://localhost:18765/fixtures/synthetic-10m.parquet and http://localhost:18765/fixtures/synthetic-10m-copy.parquet.
I ran these trials in 4 batches on two github actions runners, each with the following structure.
- For each input file case, start a new chromium instance
  - 1 warmup round
  - 20 trials
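The batch structure above can be sketched as a small plan. This is only an illustration of the counting; `CasePlan`, `planBatch`, and `measuredTrialsPerCase` are hypothetical names and do not come from the benchmarking harness in #739.

```typescript
// Hypothetical description of one batch: every input case gets its own
// browser instance, one warmup round, and 20 measured trials.
interface CasePlan {
    caseName: string;
    warmupRounds: number;
    trials: number;
}

function planBatch(caseNames: string[], warmupRounds = 1, trials = 20): CasePlan[] {
    return caseNames.map((caseName) => ({ caseName, warmupRounds, trials }));
}

// Measured trials per case across all batches (warmup rounds are not counted).
function measuredTrialsPerCase(batches: number, trialsPerBatch: number): number {
    return batches * trialsPerBatch;
}

const batchPlan = planBatch(["10M", "20M", "10M+10M"]);
console.log(batchPlan.length); // 3 cases per batch
console.log(measuredTrialsPerCase(4, 20)); // 80, matching the 80 trials per case
```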
The input parquets have a row group size of 122,880.
The 20M test file is just the 10M file, concatenated with itself.
- Its row group layout looks like: Group 1, Group 2, ..., Group 82, Group 1, Group 2, ..., Group 82.
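The 82 row groups follow from the row group size. A quick check, assuming "10M" means exactly 10,000,000 rows (the function names are illustrative):

```typescript
// Number of row groups needed to hold totalRows at a fixed row group size.
function rowGroupCount(totalRows: number, rowGroupSize: number): number {
    return Math.ceil(totalRows / rowGroupSize);
}

// Rows in the final (possibly partial) row group.
function lastRowGroupRows(totalRows: number, rowGroupSize: number): number {
    const remainder = totalRows % rowGroupSize;
    return remainder === 0 ? rowGroupSize : remainder;
}

const ROW_GROUP_SIZE = 122880;
const TEN_MILLION = 10000000; // assumption: the 10M file has exactly this many rows

const groupsIn10m = rowGroupCount(TEN_MILLION, ROW_GROUP_SIZE);
const rowsInLastGroup = lastRowGroupRows(TEN_MILLION, ROW_GROUP_SIZE);
const groupsInConcatenated20m = 2 * groupsIn10m;
const groupsInFreshlyWritten20m = rowGroupCount(2 * TEN_MILLION, ROW_GROUP_SIZE);

console.log(groupsIn10m); // 82
console.log(rowsInLastGroup); // 46720
console.log(groupsInConcatenated20m); // 164
console.log(groupsInFreshlyWritten20m); // 163
```

Under that assumption, the concatenated 20M file keeps a short 46,720-row group in the middle of the file, so it has one more row group than a 20M file written in a single pass would.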

Naively, we might expect most queries to take about twice as long on the 20M row table or 10M+10M aggregate, compared to the 10M row table. For example, fast queries that just need row group metadata might have to look at metadata from twice as many row groups, and queries that need to scan the full table need to load twice as much data.
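One way to express the naive expectation is as a ratio of measured time to the linearly scaled baseline. The helper below is a sketch with a hypothetical name, and the timings in the usage line are placeholders, not results from this PR.

```typescript
// Ratio of a measured time to the naive linear-scaling expectation.
// 1.0 means exactly linear in row count; above 1.0 means slower than linear.
function relativeToLinear(baselineMs: number, measuredMs: number, rowFactor: number): number {
    if (baselineMs <= 0 || rowFactor <= 0) {
        throw new Error("baselineMs and rowFactor must be positive");
    }
    return measuredMs / (baselineMs * rowFactor);
}

// Placeholder timings for illustration only: a 1000 ms baseline query that
// takes 2500 ms on a source with twice as many rows.
const linearRatio = relativeToLinear(1000, 2500, 2);
console.log(linearRatio); // 1.25
```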

Comparing 10M+10M to 10M, the results indicate that most queries on the 10M+10M aggregate carry moderate overhead: they run slower than linear scaling from the 10M file would predict.

Comparing 10M+10M to 20M, the single 20M file is consistently, and sometimes substantially, faster than the aggregate.

Aside from the surprising large sort result, the greatest relative difference is in the "expand group" operation, where the 10M+10M aggregate takes on average 6.0s to the 20M row file's 2.6s (about 2.3x). This is a SELECT DISTINCT ... WHERE query, so it does require a full table scan. Examining results for this query alone shows low noise relative to the magnitude of the performance difference.
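To make "low noise relative to the magnitude of the difference" concrete, one simple measure is the gap between the two means divided by the larger standard deviation. This is an assumed way of framing it, not necessarily how the comparison was done here; `summarize` and `separation` are hypothetical helpers and the sample values are synthetic.

```typescript
interface Summary {
    mean: number;
    stdDev: number; // sample standard deviation (n - 1 in the denominator)
}

function summarize(samples: number[]): Summary {
    if (samples.length < 2) {
        throw new Error("Need at least two samples");
    }
    const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
    const variance =
        samples.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (samples.length - 1);
    return { mean, stdDev: Math.sqrt(variance) };
}

// Difference between means, in units of the noisier case's standard deviation.
function separation(a: Summary, b: Summary): number {
    const noise = Math.max(a.stdDev, b.stdDev);
    return noise === 0 ? Infinity : Math.abs(a.mean - b.mean) / noise;
}

// Synthetic values, chosen only to make the arithmetic easy to follow.
const slowCase = summarize([11, 12, 13]);
const fastCase = summarize([1, 2, 3]);
console.log(separation(slowCase, fastCase)); // 10
```

A large value means the gap between cases is many times wider than the trial-to-trial spread.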

Thoughts for future performance improvements

The sort performance is surprising. Sorting a 20M row table should take at least as much time as sorting a 10M row table. See more discussion in Investigate large parquet sort performance #802.

pgarrison added 2 commits April 29, 2026 15:58

638: Aggregate parquets with read_parquet()

c587186

Tests for aggregating parquets

1f36252

pgarrison marked this pull request as ready for review May 18, 2026 17:30

pgarrison requested review from BrianWhitneyAI, SeanDuHare and aswallace as code owners May 18, 2026 17:30

Merge branch 'main' into feature/638-aggregate-parquet-duckdb

a20b5f9

pgarrison mentioned this pull request May 18, 2026

Investigate large parquet sort performance #802

Open

Copilot AI mentioned this pull request May 18, 2026

Investigate DuckDB TOP_N + Parquet behavior behind large-sort I/O #803

Closed

BrianWhitneyAI approved these changes May 18, 2026

Comment thread packages/core/services/DatabaseService/index.ts

Comment thread packages/core/services/DatabaseService/index.ts Outdated

Comment thread packages/core/services/DatabaseService/index.ts

pgarrison added 2 commits May 18, 2026 14:05

Remove extra TODO

a185a75

Don't wrap DataSourcePreparationError twice

ae75299

aswallace approved these changes May 19, 2026

pgarrison mentioned this pull request May 20, 2026

Add test cases: aggregate 10M+10M, 20M #806

Merged

SeanDuHare approved these changes May 20, 2026

pgarrison added 2 commits May 21, 2026 10:32

Merge branch 'main' into feature/638-aggregate-parquet-duckdb

d9fb17f

Merge branch 'main' into feature/638-aggregate-parquet-duckdb

0dbc4a6

pgarrison merged commit 8474aae into main May 22, 2026
7 checks passed

pgarrison deleted the feature/638-aggregate-parquet-duckdb branch May 22, 2026 21:25

pgarrison merged 7 commits into main from feature/638-aggregate-parquet-duckdb
