Support aggregate parquet data sources#787
Merged
BrianWhitneyAI
approved these changes
May 18, 2026
aswallace
approved these changes
May 19, 2026
SeanDuHare
approved these changes
May 20, 2026
Context
In #638 we dropped support for aggregating parquet files temporarily in order to ship improved parquet read performance. This PR brings it back.
Changes
Aggregate parquet sources are implemented using DuckDB's internal parquet aggregation feature, passing an array to read_parquet.
CREATE TABLE implementation for CSVs. Column names are matched exactly: inconsistent casing across inputs creates split columns.
Performance
I evaluated performance of the aggregate tables by using the benchmarking harness under review in #739.
I conducted 80 trials on each of three cases: a 10M row test file, a 20M row test file, and an aggregate source composed of two copies of the same 10M test file.
http://localhost:18765/fixtures/synthetic-10m.parquet and http://localhost:18765/fixtures/synthetic-10m-copy.parquet.
Naively, we might expect most queries to take about twice as long on the 20M row table or 10M+10M aggregate, compared to the 10M row table. For example, fast queries that just need row group metadata might have to look at metadata from twice as many row groups, and queries that need to scan the full table need to load twice as much data.
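To make the 10M+10M case concrete, here is a minimal sketch of how an aggregate over the two fixture URLs might be expressed using the list form of DuckDB's read_parquet. The helper name buildAggregateParquetSql and the table name aggregate_10m_10m are hypothetical, chosen for illustration only; the statement this PR actually generates may differ (for example in options passed to read_parquet).

```typescript
// Hypothetical helper, not the code from this PR: builds a DuckDB statement
// that creates one table from several parquet files by passing a list of
// paths/URLs to read_parquet.
function buildAggregateParquetSql(tableName: string, urls: string[]): string {
    if (urls.length === 0) {
        throw new Error("At least one parquet source is required");
    }
    // Escape single quotes inside each URL so it is a valid SQL string literal.
    const fileList = urls.map((url) => `'${url.replace(/'/g, "''")}'`).join(", ");
    // Escape double quotes inside the table name so it is a valid quoted identifier.
    const quotedName = `"${tableName.replace(/"/g, '""')}"`;
    return `CREATE TABLE ${quotedName} AS SELECT * FROM read_parquet([${fileList}])`;
}

// The table name is an assumption; the URLs are the two benchmark fixtures.
const sql = buildAggregateParquetSql("aggregate_10m_10m", [
    "http://localhost:18765/fixtures/synthetic-10m.parquet",
    "http://localhost:18765/fixtures/synthetic-10m-copy.parquet",
]);
console.log(sql);
```

Building the statement as a plain string keeps the sketch independent of any particular DuckDB client; the resulting SQL would be handed to whatever connection the application already uses.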
Comparing 10M+10M to 10M, the results indicate that most queries on the 10M+10M aggregate come with moderate performance overhead, slower than linear scaling would suggest.
Comparing 10M+10M to 20M, the 20M file shows consistent, and sometimes substantial, performance improvements.
Aside from the surprising large sort result, the greatest relative difference is in the "expand group" operation, where the 10M+10M aggregate takes on average 6.0s to the 20M row file's 2.6s. This is a SELECT DISTINCT ... WHERE query, so it does require a full table scan. Examining the results for this query alone shows low noise relative to the magnitude of the performance difference.
Thoughts for future performance improvements