feat: query benchmark system with DuckDB-WASM#739
Conversation
Adds a Playwright-based benchmark that runs in real DuckDB-WASM (headless Chromium) and posts a comparison table as a PR comment on every pull request. What's included: - benchmark/src/ — runner, synthetic data generator, query suite, types - scripts/run-benchmark.js — builds bundle, serves with COOP/COEP + range request support, coordinates two-phase run (in-memory → cloud) - scripts/compare-results.js — diffs two result JSONs, outputs Markdown with regression/improvement badges and a plain-English summary section - scripts/summarize-results.js — pretty-prints a single result file to stdout - .github/workflows/benchmark.yml — runs base + PR branches in parallel, uploads artifacts, posts/updates a PR comment - npm scripts: benchmark, benchmark:compare, benchmark:summary Key design decisions: - Schema configs (narrow/wide) run independently to stay within the ~3 GB WASM heap; each table is dropped immediately after benchmarking - Wide schema capped at 1M rows; narrow runs to 10M - Cloud phase exports the 10k narrow table as parquet, serves it over localhost HTTP with byte-range support, and registers it via DuckDB's HTTP protocol — exercising the same code path as real S3 sources - Results are written to benchmark-results-<branch>.json so runs on different branches never clobber each other Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8a646d6 to
34bc28f
Compare
Removes automatic PR trigger in favor of a manual run where the user selects base_branch and compare_branch. Results are written to the workflow run summary instead of a PR comment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract buildFetchAnnotationsSQL, buildGetFilesSQL, buildGetCountSQL, and buildDistinctValuesSQL from their respective services. Services now call these functions internally, and the benchmark imports them directly so query changes are automatically reflected without manual sync. Also adds hidden_bff_uid to the synthetic benchmark table so SQLBuilder's deterministic-pagination ORDER BY clause resolves correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Playwright's waitForFunction signature is (fn, arg, options) — passing
{ timeout } as the second argument was treating it as the page function
arg, leaving the actual timeout at Playwright's 30s default.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…een queries
Two structural problems caused inconsistent results:
1. Fixed query order — early queries always ran cold, later ones always
ran warm. This made queries toward the end of the list look
artificially faster regardless of branch.
2. Single per-query warmup — the warmup for the first query was truly
cold (table just created) while later queries got a free warm buffer
pool. Warmup quality was inconsistent across queries.
Fix: replace per-query serial timing with a round-robin approach:
- WARMUP_ROUNDS (3) full passes through all queries before any timing,
so every query starts from a consistently warm DuckDB state.
- ITERATIONS (20) timed rounds where all queries run in a freshly
shuffled order each round, distributing cache effects evenly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace generic "Base" / "PR" / "Benchmark (base)" labels with the actual branch names so the workflow steps and result table are self-describing without a base/compare framing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GITHUB_REF_NAME is a protected GitHub Actions env var set by the runner to the workflow trigger ref — step-level env overrides are silently ignored. Both jobs were getting the same trigger branch name. Switch to a custom BENCHMARK_BRANCH var that each job sets to its respective input branch, and read that in run-benchmark.js. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous two-job design ran base and compare on separate VMs. Since GitHub Actions runners vary in CPU speed and load, the hardware variance (observed at ~15%) swamped the signal from actual query changes — queries we never touched showed "improvements" simply because the compare runner was faster. Fix: collapse to a single job that runs both benchmarks back-to-back on the same machine. Identical hardware means differences in the comparison table reflect genuine query performance changes. Tradeoffs accepted: - Sequential (takes ~2× longer) vs parallel - Compare run starts with a warmer OS page cache, but this is consistent across all queries so it doesn't bias individual query comparisons Also bumps the job timeout to 60 minutes to accommodate sequential runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the in-memory TABLE approach with the same pattern DatabaseService
uses in production:
1. Create a staging table from range() (no hidden_bff_uid stored)
2. COPY to parquet, register the buffer back into the WASM filesystem
3. DROP the staging table, CREATE a VIEW over parquet_scan with
file_row_number AS hidden_bff_uid — identical to createParquetDirectView
This means DuckDB now runs benchmark queries through its parquet reader,
enabling row group skipping, column projection, and predicate pushdown —
the same optimizations that apply to real S3/HTTP data sources. Previously
all data lived fully deserialized in the WASM heap with no parquet metadata.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
registerFileBuffer transfers the ArrayBuffer to the WASM worker via postMessage, detaching it in the JS context. Save a .slice() copy of the fixture buffer before the transfer so it remains readable when Playwright reads it as __fixtureParquet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
registerFileBuffer transfers ownership of the ArrayBuffer to the WASM worker. Rather than trying to preserve a copy with .slice(), re-export the fixture via a fresh COPY after the parquet_scan view is set up. This produces a clean buffer that was never transferred and avoids any lifecycle ambiguity with the WASM worker's memory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove 10,000 rows from ALL_SCALES — too few rows produce noisy results and false positives from random variance - Cloud benchmark now runs at 100k / 1M / 10M rows (narrow schema) to test HTTP range-request predicate pushdown at realistic data sizes - CloudQueryResult gains a `scale` field; fixtures are keyed by scale so Playwright writes fixture-<scale>.parquet for each - compare-results.js groups cloud results by scale with sub-headers and reports per-scale network baselines Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Array.from() on a 10M row parquet buffer (~50-100MB) crashes V8 when trying to allocate the intermediate plain JS array for CDP transfer via page.evaluate. Cap CLOUD_FIXTURE_SCALES to ≤1M rows — sufficient to test HTTP range-request predicate pushdown, and safely within CDP limits. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SeanDuHare
left a comment
There was a problem hiding this comment.
didn't finish, but here are some thoughts so far
|
Remembered I wasn't sure if I checked the following in this, can you add the following if it isn't yet (or just lmk your thoughts):
|
Replaces the raw-SQL benchmark with a task-based suite that calls the same service layer methods the app calls at runtime (fetchValues, getFiles, fetchAvailableAnnotationsForHierarchy, etc.), making benchmark times directly comparable to real user-perceived latency. Key changes: - benchmark/src/tasks.ts: 11 tasks covering fetch_annotations, filter picker opens at 3 cardinality levels, browse/sort/filter file list, null group counts, change_grouping, and folder expansion - benchmark/src/index.ts: times full task round-trips (service → IPC → DuckDB → deserialization) instead of raw queryWorker calls - duckdb-worker.worker.ts: bff_query_timing flag logs every query with label and elapsed time; accumulates timings for __getQueryTimings() report - DatabaseServiceWeb: exposes __getQueryTimings() on window when timing enabled, dumps p50/p95/max table from accumulated worker timings - run-local.js: --scale, --iterations, --warmup, --local, --full flags - run-regression.js: uses local fixtures (no S3 protocol issues), supports --iterations and --warmup flags - benchmark.yml: downloads fixtures via curl, adds iterations/warmup inputs, raises timeout to 180 min, drops S3 env var dependency from run steps - Removes one-time-use generate-fixtures.py and upload-synthetic-fixtures.js Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
git clean -ffdx (run by actions/checkout) was wiping the fixtures directory before the benchmark could use them. Moving the download step after checkout fixes this. Cache keyed on fixture version (v1) so subsequent runs skip the ~3GB download entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
a865df9 to
f49eb3d
Compare
Replace main-thread performance.now() wrapping with worker-side timing that measures only connection.query() — excluding Arrow toArray() and JSON serialization overhead that was inflating results 2-6x. Key changes: - Add enableQueryTiming(), clearTimings(), sumTimings(), clearAnnotationCache() to DatabaseServiceWebWorker - benchmarkSource clears timings per-task and reads sumTimings() instead of wrapping task.run() with performance.now() - fetch_annotations: clear annotation cache before each timed iteration so warmup cache hits don't report 0ms - change_grouping: use wall-clock timing — parallel queries produce O(N²) worker-side sums due to cumulative wait time in DuckDB's serial executor - Split sort_file_list into two tasks (limit 50 / limit 100) to cover both DuckDB execution plans: Top-N heap+pruning vs full materialized sort - Pre-warm DuckDB's parquet scanner before timing registrations to avoid 620ms cold-start on the first source - Hardcode staging S3 fixture URLs as defaults in run-local.js - Make forceFullHTTPReads a parameter on initializeDuckDB Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Keeps this.sourceMetadataName assignment added in main alongside our buildFetchAnnotationsSQL refactor in DatabaseService.fetchAnnotations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
BFF Query Benchmark — Local Run
Observations:
|
…ker.ts Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>
…ker.ts Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>
…ker.ts Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>
Reverts exported SQL builder helpers (buildGetFilesSQL, buildGetCountSQL, buildDistinctValuesSQL, buildFetchAnnotationsSQL) and createParquetDirectView visibility change back to their original forms — these were added only to support the benchmark but the benchmark now calls service methods directly. Adds a design rationale comment at the top of tasks.ts explaining the service-layer approach and maintenance expectations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace timingsMap.get(task.name)! with a ?? fallback pattern to satisfy the no-non-null-assertion lint rule. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pgarrison
left a comment
There was a problem hiding this comment.
I have been able to build on this successfully in #806 and #809, and it's been useful for benchmarking my code, so I think it's a good starting point that we should move forward with.
Some of my feedback are things that I've addressed already in #806, so I think we can just move forward from there.
At a high level, though, I have a concern about large LLM-generated PRs. They are difficult to review due to their length, and code readability is hampered by verbosity, poor use of DRY, vague or over-abbreviated variable names, and inconsistent comment quality. If we want to maintain high readability standards and thorough understanding of the codebase, there is a risk that future PRs like this shift effort from PR authors to reviewers.
For these reasons, I have opted not to fully read and understand the details of all code here. Instead, I've focused my reading on looking for security concerns.
Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>
Summary
Adds three tools for measuring and monitoring DuckDB-WASM query performance.
run-local.js) — runs the task suite in headless Chromium against local or S3 fixtures, prints a p50/p95 table, and writes a result JSON for comparisonbenchmark.yml) —workflow_dispatchthat benchmarks two branches sequentially on the same VM and posts a Markdown diff to the workflow summarylocalStorage.setItem("bff_query_timing", "1")See
dev-docs/07-query-benchmarking.mdfor full usage.Description of added code
CI / Workflow
Browser-side benchmark bundle (packages/web/benchmark/)
Node.js scripts (packages/web/scripts/)
Important changes to review
Core service layer (packages/core/)
DuckDB web worker (packages/web/src/services/DatabaseServiceWeb/)