
Support aggregate parquet data sources #787

Merged
pgarrison merged 7 commits into main from feature/638-aggregate-parquet-duckdb
May 22, 2026

@pgarrison pgarrison commented May 12, 2026

Context

In #638 we temporarily dropped support for aggregate parquet sources in order to ship improved parquet read performance. This PR brings that support back.

Changes

Aggregate parquet sources are implemented using duckdb's built-in support for reading multiple parquet files as one table, by passing an array of files to read_parquet.

  • This option should give duckdb the most room to optimize, since it can push filters and column selection down into the parquet reader.
  • The column renaming semantics aren't as nice as in the CREATE TABLE implementation for CSVs. Column names are matched exactly, so inconsistent casing across inputs creates split columns.
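As a rough illustration of the approach described above, the sketch below builds a SQL statement that passes a list of files to read_parquet. The helper names and the view name are hypothetical and are not taken from this PR's DatabaseService code; only the list form of read_parquet and the fixture URLs come from the description.

```typescript
// Hypothetical helpers for illustration only; not the PR's actual implementation.

/** Wrap a value as a single-quoted SQL string literal, doubling embedded quotes. */
function sqlStringLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

/** Wrap a name as a double-quoted SQL identifier, doubling embedded quotes. */
function sqlIdentifier(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

/**
 * Build a statement that exposes several parquet files as one view by
 * passing them to read_parquet as a list.
 */
function buildAggregateParquetQuery(viewName: string, urls: string[]): string {
  if (urls.length === 0) {
    throw new Error("At least one parquet URL is required");
  }
  const fileList = urls.map(sqlStringLiteral).join(", ");
  return `CREATE VIEW ${sqlIdentifier(viewName)} AS SELECT * FROM read_parquet([${fileList}])`;
}

// Assumed view name; the URLs are the fixtures used in the benchmark below.
const aggregateSql = buildAggregateParquetQuery("aggregate_source", [
  "http://localhost:18765/fixtures/synthetic-10m.parquet",
  "http://localhost:18765/fixtures/synthetic-10m-copy.parquet",
]);
console.log(aggregateSql);
```

Because the files are handed to a single read_parquet call rather than loaded one at a time, the query planner sees one scan over all inputs, which is what leaves room for pushdown into the reader.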

Performance

I evaluated performance of the aggregate tables by using the benchmarking harness under review in #739.

I conducted 80 trials on each of three cases: a 10M row test file, a 20M row test file, and an aggregate source composed of two copies of the same 10M test file.

  • To ensure duckdb doesn't recognize the two inputs as the same file and download only one of them, I uploaded a second copy of the input file to the test bucket under a different name. In the github actions test configuration, the files are loaded from http://localhost:18765/fixtures/synthetic-10m.parquet and http://localhost:18765/fixtures/synthetic-10m-copy.parquet.
  • I ran these trials in 4 batches on two github actions runners, each with the following structure.
    • For each input file case, start a new chromium instance
      • 1 warmup round
      • 20 trials
  • The input parquets have a row group size of 122,880 rows.
  • The 20M test file is just the 10M file, concatenated with itself.
    • Its row group layout looks like: Group 1, Group 2, ..., Group 82, Group 1, Group 2, ..., Group 82.
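The numbers in the setup above can be cross-checked with a little arithmetic. This sketch assumes every row group except the last is full, which is an assumption about the fixture rather than something stated in the PR; the function and constant names are made up for illustration.

```typescript
// Row group size reported for the input parquets.
const ROW_GROUP_SIZE = 122_880;

/** Number of row groups needed for totalRows, assuming all but the last are full. */
function rowGroupCount(totalRows: number, rowGroupSize: number): number {
  return Math.ceil(totalRows / rowGroupSize);
}

/** Number of rows left over for the final (possibly partial) row group. */
function lastRowGroupRows(totalRows: number, rowGroupSize: number): number {
  const remainder = totalRows % rowGroupSize;
  return remainder === 0 ? rowGroupSize : remainder;
}

const groupsIn10M = rowGroupCount(10_000_000, ROW_GROUP_SIZE);
const lastGroupRowsIn10M = lastRowGroupRows(10_000_000, ROW_GROUP_SIZE);

// The 20M file repeats the 10M layout, so it has twice as many row groups.
const groupsInConcatenated20M = 2 * groupsIn10M;

// 4 batches of 20 trials give the 80 trials per case.
const trialsPerCase = 4 * 20;

console.log(groupsIn10M);             // 82
console.log(lastGroupRowsIn10M);      // 46720
console.log(groupsInConcatenated20M); // 164
console.log(trialsPerCase);           // 80
```

The 82 groups match the "Group 1 ... Group 82" layout described above. Under the same assumption, the concatenated 20M file and the 10M+10M aggregate both expose 164 row groups, so metadata-only queries have the same number of groups to inspect in either case.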

Naively, we might expect most queries to take about twice as long on the 20M row table or 10M+10M aggregate, compared to the 10M row table. For example, fast queries that just need row group metadata might have to look at metadata from twice as many row groups, and queries that need to scan the full table need to load twice as much data.

Comparing 10M+10M to 10M, the results indicate that most queries on the 10M+10M aggregate take more than twice as long, so the aggregate carries moderate overhead beyond what linear scaling would suggest.

Comparing 10M+10M to 20M, the single 20M file is consistently, and sometimes substantially, faster than the aggregate.

[Image: Screenshot 2026-05-18 at 10 13 43 AM]

Aside from the surprising large sort result, the greatest relative difference is in the "expand group" operation, where the 10M+10M aggregate averages 6.0s versus 2.6s for the 20M row file. This is a SELECT DISTINCT ... WHERE query, so it does require a full table scan. Examining the results for this query alone shows low noise relative to the magnitude of the performance difference.

[Image: Screenshot 2026-05-18 at 10 13 55 AM]
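To put the "expand group" gap in relative terms, here is a small sketch that computes the slowdown factor from the two averages quoted above. The function name is hypothetical, and the inputs are only the rounded means reported in this description, not the raw trial data.

```typescript
/** Ratio of the aggregate's mean duration to the single file's mean duration. */
function slowdownFactor(aggregateSeconds: number, singleFileSeconds: number): number {
  if (singleFileSeconds <= 0) {
    throw new Error("singleFileSeconds must be positive");
  }
  return aggregateSeconds / singleFileSeconds;
}

// Reported means for "expand group": 6.0s (10M+10M aggregate) vs 2.6s (20M file).
const expandGroupSlowdown = slowdownFactor(6.0, 2.6);
console.log(expandGroupSlowdown.toFixed(2)); // "2.31"
```

Both inputs hold the same 20M rows, so a factor of roughly 2.3 reflects overhead of the aggregate path itself rather than additional data.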

Thoughts for future performance improvements

@pgarrison pgarrison marked this pull request as ready for review May 18, 2026 17:30
Comment thread packages/core/services/DatabaseService/index.ts
Comment thread packages/core/services/DatabaseService/index.ts Outdated
Comment thread packages/core/services/DatabaseService/index.ts
@pgarrison pgarrison merged commit 8474aae into main May 22, 2026
7 checks passed
@pgarrison pgarrison deleted the feature/638-aggregate-parquet-duckdb branch May 22, 2026 21:25
4 participants