Skip to content

feat: query benchmark system with DuckDB-WASM#739

Merged
BrianWhitneyAI merged 36 commits into
mainfrom
feature/query-benchmark
May 21, 2026
Merged

feat: query benchmark system with DuckDB-WASM#739
BrianWhitneyAI merged 36 commits into
mainfrom
feature/query-benchmark

Conversation

@BrianWhitneyAI
Copy link
Copy Markdown
Contributor

@BrianWhitneyAI BrianWhitneyAI commented Apr 15, 2026

Summary

Adds three tools for measuring and monitoring DuckDB-WASM query performance.

  • Local benchmark (run-local.js) — runs the task suite in headless Chromium against local or S3 fixtures, prints a p50/p95 table, and writes a result JSON for comparison
  • CI regression workflow (benchmark.yml) — workflow_dispatch that benchmarks two branches sequentially on the same VM and posts a Markdown diff to the workflow summary
  • Dev console timing — enable per-query DuckDB timing in the running app via localStorage.setItem("bff_query_timing", "1")

See dev-docs/07-query-benchmarking.md for full usage.


Description of added code

CI / Workflow

  • benchmark.yml — new workflow_dispatch workflow; checks out two branches sequentially on the same VM and posts a Markdown comparison to the step summary

Browser-side benchmark bundle (packages/web/benchmark/)

  • src/index.ts — Playwright page entry; registers sources, runs warmup + timed iterations in shuffled order, signals results back
  • src/tasks.ts — 11 benchmark tasks defined at the service layer (same code paths as the real app)
  • src/types.ts — shared config/result types
  • index.html — minimal HTML host for the benchmark bundle
  • webpack.config.js — webpack config for the benchmark bundle

Node.js scripts (packages/web/scripts/)

  • run-local.js — developer tool; local or S3 fixtures, any scale, side-by-side cloud vs local (--full)
  • run-regression.js — CI tool; local fixtures only, writes branch-stamped JSON
  • compare-results.js — diffs two JSONs into a Markdown table with regression badges
  • summarize-results.js — pretty-prints a single result file
  • lib/run-benchmark-page.js — shared Playwright runner used by the two script tools above

Important changes to review

Core service layer (packages/core/)

  • DatabaseAnnotationService, DatabaseFileService, DatabaseService — SQL-building logic extracted into exported functions so the benchmark calls the exact same SQL paths as the app

DuckDB web worker (packages/web/src/services/DatabaseServiceWeb/)

  • duckdb-worker.worker.ts — adds query timing instrumentation: queryTimingEnabled flag, accumulatedTimings map, per-query console logging, enableQueryTiming() / clearTimings() / sumTimings() / clearAnnotationCache() for benchmark use
  • index.ts — wires localStorage flag to timing activation on init
  • types.ts — minor type updates for the INIT payload

BrianWhitneyAI and others added 2 commits April 15, 2026 13:19
Adds a Playwright-based benchmark that runs in real DuckDB-WASM (headless
Chromium) and posts a comparison table as a PR comment on every pull request.

What's included:
- benchmark/src/ — runner, synthetic data generator, query suite, types
- scripts/run-benchmark.js — builds bundle, serves with COOP/COEP + range
  request support, coordinates two-phase run (in-memory → cloud)
- scripts/compare-results.js — diffs two result JSONs, outputs Markdown with
  regression/improvement badges and a plain-English summary section
- scripts/summarize-results.js — pretty-prints a single result file to stdout
- .github/workflows/benchmark.yml — runs base + PR branches in parallel,
  uploads artifacts, posts/updates a PR comment
- npm scripts: benchmark, benchmark:compare, benchmark:summary

Key design decisions:
- Schema configs (narrow/wide) run independently to stay within the ~3 GB
  WASM heap; each table is dropped immediately after benchmarking
- Wide schema capped at 1M rows; narrow runs to 10M
- Cloud phase exports the 10k narrow table as parquet, serves it over
  localhost HTTP with byte-range support, and registers it via DuckDB's
  HTTP protocol — exercising the same code path as real S3 sources
- Results are written to benchmark-results-<branch>.json so runs on
  different branches never clobber each other

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@BrianWhitneyAI BrianWhitneyAI force-pushed the feature/query-benchmark branch from 8a646d6 to 34bc28f Compare April 15, 2026 20:20
BrianWhitneyAI and others added 6 commits April 15, 2026 13:28
Removes automatic PR trigger in favor of a manual run where the user
selects base_branch and compare_branch. Results are written to the
workflow run summary instead of a PR comment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract buildFetchAnnotationsSQL, buildGetFilesSQL, buildGetCountSQL,
and buildDistinctValuesSQL from their respective services. Services now
call these functions internally, and the benchmark imports them directly
so query changes are automatically reflected without manual sync.

Also adds hidden_bff_uid to the synthetic benchmark table so SQLBuilder's
deterministic-pagination ORDER BY clause resolves correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Playwright's waitForFunction signature is (fn, arg, options) — passing
{ timeout } as the second argument was treating it as the page function
arg, leaving the actual timeout at Playwright's 30s default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…een queries

Two structural problems caused inconsistent results:

1. Fixed query order — early queries always ran cold, later ones always
   ran warm. This made queries toward the end of the list look
   artificially faster regardless of branch.

2. Single per-query warmup — the warmup for the first query was truly
   cold (table just created) while later queries got a free warm buffer
   pool. Warmup quality was inconsistent across queries.

Fix: replace per-query serial timing with a round-robin approach:
  - WARMUP_ROUNDS (3) full passes through all queries before any timing,
    so every query starts from a consistently warm DuckDB state.
  - ITERATIONS (20) timed rounds where all queries run in a freshly
    shuffled order each round, distributing cache effects evenly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace generic "Base" / "PR" / "Benchmark (base)" labels with the
actual branch names so the workflow steps and result table are
self-describing without a base/compare framing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GITHUB_REF_NAME is a protected GitHub Actions env var set by the runner
to the workflow trigger ref — step-level env overrides are silently
ignored. Both jobs were getting the same trigger branch name.

Switch to a custom BENCHMARK_BRANCH var that each job sets to its
respective input branch, and read that in run-benchmark.js.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BrianWhitneyAI and others added 7 commits April 15, 2026 14:51
The previous two-job design ran base and compare on separate VMs. Since
GitHub Actions runners vary in CPU speed and load, the hardware variance
(observed at ~15%) swamped the signal from actual query changes — queries
we never touched showed "improvements" simply because the compare runner
was faster.

Fix: collapse to a single job that runs both benchmarks back-to-back on
the same machine. Identical hardware means differences in the comparison
table reflect genuine query performance changes.

Tradeoffs accepted:
- Sequential (takes ~2× longer) vs parallel
- Compare run starts with a warmer OS page cache, but this is consistent
  across all queries so it doesn't bias individual query comparisons

Also bumps the job timeout to 60 minutes to accommodate sequential runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the in-memory TABLE approach with the same pattern DatabaseService
uses in production:
  1. Create a staging table from range() (no hidden_bff_uid stored)
  2. COPY to parquet, register the buffer back into the WASM filesystem
  3. DROP the staging table, CREATE a VIEW over parquet_scan with
     file_row_number AS hidden_bff_uid — identical to createParquetDirectView

This means DuckDB now runs benchmark queries through its parquet reader,
enabling row group skipping, column projection, and predicate pushdown —
the same optimizations that apply to real S3/HTTP data sources. Previously
all data lived fully deserialized in the WASM heap with no parquet metadata.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
registerFileBuffer transfers the ArrayBuffer to the WASM worker via
postMessage, detaching it in the JS context. Save a .slice() copy of the
fixture buffer before the transfer so it remains readable when Playwright
reads it as __fixtureParquet.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
registerFileBuffer transfers ownership of the ArrayBuffer to the WASM
worker. Rather than trying to preserve a copy with .slice(), re-export
the fixture via a fresh COPY after the parquet_scan view is set up.
This produces a clean buffer that was never transferred and avoids any
lifecycle ambiguity with the WASM worker's memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove 10,000 rows from ALL_SCALES — too few rows produce noisy results
  and false positives from random variance
- Cloud benchmark now runs at 100k / 1M / 10M rows (narrow schema) to
  test HTTP range-request predicate pushdown at realistic data sizes
- CloudQueryResult gains a `scale` field; fixtures are keyed by scale
  so Playwright writes fixture-<scale>.parquet for each
- compare-results.js groups cloud results by scale with sub-headers and
  reports per-scale network baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Array.from() on a 10M row parquet buffer (~50-100MB) crashes V8 when
trying to allocate the intermediate plain JS array for CDP transfer via
page.evaluate. Cap CLOUD_FIXTURE_SCALES to ≤1M rows — sufficient to
test HTTP range-request predicate pushdown, and safely within CDP limits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@SeanDuHare SeanDuHare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't finish, but here are some thoughts so far

Comment thread packages/web/benchmark/src/queries.ts Outdated
Comment thread packages/web/scripts/run-benchmark.js Outdated
Comment thread packages/web/scripts/run-benchmark.js Outdated
Comment thread packages/web/scripts/run-benchmark.js Outdated
Comment thread packages/web/benchmark/src/index.ts Outdated
@SeanDuHare
Copy link
Copy Markdown
Contributor

Remembered I wasn't sure if I checked the following in this, can you add the following if it isn't yet (or just lmk your thoughts):

  • add to the test matrix testing for cardinality? "Metadata cardinality defines the number of unique values in a data field (column) or the numeric relationship between entities in a dataset"
  • add testing varying column amounts (I think it is always 100 right now?)

BrianWhitneyAI and others added 2 commits April 22, 2026 16:22
Replaces the raw-SQL benchmark with a task-based suite that calls the same
service layer methods the app calls at runtime (fetchValues, getFiles,
fetchAvailableAnnotationsForHierarchy, etc.), making benchmark times directly
comparable to real user-perceived latency.

Key changes:
- benchmark/src/tasks.ts: 11 tasks covering fetch_annotations, filter picker
  opens at 3 cardinality levels, browse/sort/filter file list, null group
  counts, change_grouping, and folder expansion
- benchmark/src/index.ts: times full task round-trips (service → IPC → DuckDB
  → deserialization) instead of raw queryWorker calls
- duckdb-worker.worker.ts: bff_query_timing flag logs every query with label
  and elapsed time; accumulates timings for __getQueryTimings() report
- DatabaseServiceWeb: exposes __getQueryTimings() on window when timing enabled,
  dumps p50/p95/max table from accumulated worker timings
- run-local.js: --scale, --iterations, --warmup, --local, --full flags
- run-regression.js: uses local fixtures (no S3 protocol issues), supports
  --iterations and --warmup flags
- benchmark.yml: downloads fixtures via curl, adds iterations/warmup inputs,
  raises timeout to 180 min, drops S3 env var dependency from run steps
- Removes one-time-use generate-fixtures.py and upload-synthetic-fixtures.js

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
git clean -ffdx (run by actions/checkout) was wiping the fixtures
directory before the benchmark could use them. Moving the download step
after checkout fixes this. Cache keyed on fixture version (v1) so
subsequent runs skip the ~3GB download entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@BrianWhitneyAI BrianWhitneyAI force-pushed the feature/query-benchmark branch from a865df9 to f49eb3d Compare April 23, 2026 03:33
BrianWhitneyAI and others added 3 commits April 23, 2026 12:37
Replace main-thread performance.now() wrapping with worker-side timing
that measures only connection.query() — excluding Arrow toArray() and
JSON serialization overhead that was inflating results 2-6x.

Key changes:
- Add enableQueryTiming(), clearTimings(), sumTimings(),
  clearAnnotationCache() to DatabaseServiceWebWorker
- benchmarkSource clears timings per-task and reads sumTimings() instead
  of wrapping task.run() with performance.now()
- fetch_annotations: clear annotation cache before each timed iteration
  so warmup cache hits don't report 0ms
- change_grouping: use wall-clock timing — parallel queries produce O(N²)
  worker-side sums due to cumulative wait time in DuckDB's serial executor
- Split sort_file_list into two tasks (limit 50 / limit 100) to cover
  both DuckDB execution plans: Top-N heap+pruning vs full materialized sort
- Pre-warm DuckDB's parquet scanner before timing registrations to avoid
  620ms cold-start on the first source
- Hardcode staging S3 fixture URLs as defaults in run-local.js
- Make forceFullHTTPReads a parameter on initializeDuckDB

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Keeps this.sourceMetadataName assignment added in main alongside our
buildFetchAnnotationsSQL refactor in DatabaseService.fetchAnnotations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@BrianWhitneyAI
Copy link
Copy Markdown
Contributor Author

BFF Query Benchmark — Local Run
feature/query-benchmark @ 4d209f5b · 2026-04-24 · DuckDB init: 822ms
Timings shown as p50 (1 iteration). Scales: local fixtures served over localhost; cloud via S3.

Query 100k-cloud 100k-local 1m-cloud 1m-local 10m-cloud 10m-local
registration 55.9ms 6.91ms 589.3ms 17.0ms 1023.2ms 118.6ms
fetch_annotations 1.82ms 2.12ms 1.64ms 2.02ms 1.63ms 2.00ms
open_filter_picker_low_cardinality 2.49ms 3.85ms 14.1ms 22.0ms 97.2ms 203.5ms
open_filter_picker_medium_cardinality 2.97ms 7.73ms 57.1ms 22.3ms ⚠️ 18218.9ms 185.3ms
open_filter_picker_high_cardinality 6.06ms 6.29ms 1679.8ms 65.3ms ⚠️ 40263.8ms 1029.3ms
browse_file_list 40.1ms 85.9ms 55.4ms 93.1ms ⚠️ 8676.3ms 201.2ms
sort_file_list 82.5ms 178.4ms ⚠️ 111064.3ms 1594.8ms ⚠️ 288693.8ms 5817.3ms
sort_file_list_large_page 85.5ms 178.7ms ⚠️ 120892.8ms 1914.3ms ⚠️ 1932389.7ms 18555.9ms
filter_count 11.4ms 9.67ms 65.1ms 74.9ms 611.1ms 729.9ms
filter_browse 41.2ms 75.7ms 54.1ms 91.3ms ⚠️ 7109.0ms 209.6ms
null_group_count 2.89ms 2.32ms 6.87ms 6.57ms 61.5ms 55.4ms
change_grouping 299.5ms 412.8ms 777.6ms 906.8ms 5549.5ms 5848.6ms
expand_folder 9.78ms 11.3ms 68.7ms 88.8ms 652.7ms 868.9ms

Observations:

  • Cloud vs. local gap widens sharply at 10M rows for sort/browse queries — expected since cloud reads require S3 range requests per query while local reads hit the in-memory buffer. The sort queries at 10m-cloud (288s, 1932s) are likely doing full-table downloads on each iteration with no warm cache.
  • Filter and count queries scale well — filter_count and null_group_count stay fast even at 10M because DuckDB can prune row groups.
  • open_filter_picker_high_cardinality at 1m-cloud (1679ms) and 10m-cloud (40s) confirms that high-cardinality DISTINCT scans are the most expensive user-facing operation at scale.

Comment thread packages/core/services/DatabaseService/index.ts Outdated
Comment thread packages/web/benchmark/src/tasks.ts
Comment thread packages/core/services/AnnotationService/DatabaseAnnotationService/index.ts Outdated
Comment thread packages/core/services/DatabaseService/index.ts Outdated
Comment thread packages/core/services/FileService/DatabaseFileService/index.ts Outdated
Comment thread packages/web/benchmark/src/tasks.ts
Comment thread packages/web/src/services/DatabaseServiceWeb/duckdb-worker.worker.ts Outdated
Comment thread packages/web/src/services/DatabaseServiceWeb/duckdb-worker.worker.ts Outdated
Comment thread packages/web/scripts/run-local.js
BrianWhitneyAI and others added 8 commits May 6, 2026 13:43
…ker.ts

Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>
…ker.ts

Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>
…ker.ts

Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>
Reverts exported SQL builder helpers (buildGetFilesSQL, buildGetCountSQL,
buildDistinctValuesSQL, buildFetchAnnotationsSQL) and createParquetDirectView
visibility change back to their original forms — these were added only to
support the benchmark but the benchmark now calls service methods directly.
Adds a design rationale comment at the top of tasks.ts explaining the
service-layer approach and maintenance expectations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace timingsMap.get(task.name)! with a ?? fallback pattern to satisfy
the no-non-null-assertion lint rule.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread dev-docs/07-query-benchmarking.md
Copy link
Copy Markdown
Contributor

@pgarrison pgarrison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been able to build on this successfully in #806 and #809, and it's been useful for benchmarking my code, so I think it's a good starting point that we should move forward with.

Some of my feedback are things that I've addressed already in #806, so I think we can just move forward from there.

At a high level, though, I have a concern about large LLM-generated PRs. They are difficult to review due to their length, and code readability is hampered by verbosity, poor use of DRY, vague or over-abbreviated variable names, and inconsistent comment quality. If we want to maintain high readability standards and thorough understanding of the codebase, there is a risk that future PRs like this shift effort from PR authors to reviewers.

For these reasons, I have opted not to fully read and understand the details of all code here. Instead, I've focused my reading on looking for security concerns.

Comment thread packages/web/scripts/summarize-results.js Outdated
Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>
@BrianWhitneyAI BrianWhitneyAI merged commit 507b94d into main May 21, 2026
7 checks passed
@BrianWhitneyAI BrianWhitneyAI deleted the feature/query-benchmark branch May 21, 2026 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants