feat: query benchmark system with DuckDB-WASM by BrianWhitneyAI · Pull Request #739 · AllenInstitute/biofile-finder

BrianWhitneyAI · 2026-04-15T20:17:56Z

Summary

Adds three tools for measuring and monitoring DuckDB-WASM query performance.

Local benchmark (run-local.js) — runs the task suite in headless Chromium against local or S3 fixtures, prints a p50/p95 table, and writes a result JSON for comparison
CI regression workflow (benchmark.yml) — workflow_dispatch that benchmarks two branches sequentially on the same VM and posts a Markdown diff to the workflow summary
Dev console timing — enable per-query DuckDB timing in the running app via localStorage.setItem("bff_query_timing", "1")

See `dev-docs/07-query-benchmarking.md` for full usage.
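
For readers unfamiliar with the p50/p95 columns, here is a minimal sketch of how such a summary could be computed from raw timings. It assumes the nearest-rank percentile definition; the actual scripts may use a different one, and `percentileMs` and `summarizeTimings` are hypothetical names, not functions from this PR.

```typescript
// Nearest-rank percentile over timings in milliseconds.
// Assumption: the benchmark scripts may define percentiles differently.
function percentileMs(samples: readonly number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentileMs requires at least one sample");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in (0, 100]");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) {
    throw new Error("rank out of bounds");
  }
  return value;
}

interface TimingSummary {
  p50: number;
  p95: number;
  max: number;
}

// Collapses one task's samples into the columns shown in the result table.
function summarizeTimings(samples: readonly number[]): TimingSummary {
  return {
    p50: percentileMs(samples, 50),
    p95: percentileMs(samples, 95),
    max: Math.max(...samples),
  };
}
```

With a single iteration, p50, p95, and max all collapse to the same sample, so multi-iteration runs are needed before the p95 column carries any information.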

Description of added code

CI / Workflow

benchmark.yml — new workflow_dispatch workflow; checks out two branches sequentially on the same VM and posts a Markdown comparison to the step summary

Browser-side benchmark bundle (packages/web/benchmark/)

src/index.ts — Playwright page entry; registers sources, runs warmup + timed iterations in shuffled order, signals results back
src/tasks.ts — 11 benchmark tasks defined at the service layer (same code paths as the real app)
src/types.ts — shared config/result types
index.html — minimal HTML host for the benchmark bundle
webpack.config.js — webpack config for the benchmark bundle

Node.js scripts (packages/web/scripts/)

run-local.js — developer tool; local or S3 fixtures, any scale, side-by-side cloud vs local (--full)
run-regression.js — CI tool; local fixtures only, writes branch-stamped JSON
compare-results.js — diffs two JSONs into a Markdown table with regression badges
summarize-results.js — pretty-prints a single result file
lib/run-benchmark-page.js — shared Playwright runner used by run-local.js and run-regression.js
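
As a rough illustration of what the comparison step does, the sketch below diffs two sets of p50 timings into a Markdown table. The result shape (`P50ByTask`), the 10% threshold, and the plain-text status labels are all assumptions made for this example; the real compare-results.js may structure its input and badges differently.

```typescript
// Hypothetical result shape: task name -> p50 in milliseconds.
type P50ByTask = Record<string, number>;

// Assumed threshold: flag changes larger than 10% in either direction.
const CHANGE_THRESHOLD = 0.1;

function classifyChange(baseMs: number, compareMs: number): string {
  const change = (compareMs - baseMs) / baseMs;
  if (change > CHANGE_THRESHOLD) return "regression";
  if (change < -CHANGE_THRESHOLD) return "improvement";
  return "unchanged";
}

function compareToMarkdown(base: P50ByTask, compare: P50ByTask): string {
  const rows: string[] = [];
  for (const task of Object.keys(base)) {
    const baseMs = base[task];
    const compareMs = compare[task];
    // Skip tasks missing from either run, and zero baselines (no meaningful ratio).
    if (baseMs === undefined || compareMs === undefined || baseMs <= 0) continue;
    const pct = (((compareMs - baseMs) / baseMs) * 100).toFixed(1);
    const status = classifyChange(baseMs, compareMs);
    rows.push(
      `| ${task} | ${baseMs.toFixed(2)}ms | ${compareMs.toFixed(2)}ms | ${pct}% | ${status} |`,
    );
  }
  return [
    "| Task | Base p50 | Compare p50 | Change | Status |",
    "| --- | --- | --- | --- | --- |",
    ...rows,
  ].join("\n");
}
```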

Important changes to review

Core service layer (packages/core/)

DatabaseAnnotationService, DatabaseFileService, DatabaseService — SQL-building logic extracted into exported functions so the benchmark calls the exact same SQL paths as the app

DuckDB web worker (packages/web/src/services/DatabaseServiceWeb/)

duckdb-worker.worker.ts — adds query timing instrumentation: queryTimingEnabled flag, accumulatedTimings map, per-query console logging, enableQueryTiming() / clearTimings() / sumTimings() / clearAnnotationCache() for benchmark use
index.ts — wires localStorage flag to timing activation on init
types.ts — minor type updates for the INIT payload
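
The localStorage wiring can be pictured roughly as follows. This is a sketch, not the code in index.ts: `FlagStore` and `isQueryTimingEnabled` are hypothetical names, and treating only the literal value "1" as enabled is an assumption based on the `setItem` call shown in the summary.

```typescript
// Minimal read-only view of a key/value store such as window.localStorage.
interface FlagStore {
  getItem(key: string): string | null;
}

const QUERY_TIMING_KEY = "bff_query_timing";

// Assumption: only the literal value "1" turns timing on.
function isQueryTimingEnabled(store: FlagStore): boolean {
  try {
    return store.getItem(QUERY_TIMING_KEY) === "1";
  } catch {
    // Storage access can throw (for example when storage is blocked); treat as disabled.
    return false;
  }
}

// Usage in a browser context would look like:
//   if (isQueryTimingEnabled(window.localStorage)) { /* enable worker timing */ }
```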

Adds a Playwright-based benchmark that runs in real DuckDB-WASM (headless Chromium) and posts a comparison table as a PR comment on every pull request.

What's included:
- benchmark/src/: runner, synthetic data generator, query suite, types
- scripts/run-benchmark.js: builds bundle, serves with COOP/COEP + range request support, coordinates two-phase run (in-memory → cloud)
- scripts/compare-results.js: diffs two result JSONs, outputs Markdown with regression/improvement badges and a plain-English summary section
- scripts/summarize-results.js: pretty-prints a single result file to stdout
- .github/workflows/benchmark.yml: runs base + PR branches in parallel, uploads artifacts, posts/updates a PR comment
- npm scripts: benchmark, benchmark:compare, benchmark:summary

Key design decisions:
- Schema configs (narrow/wide) run independently to stay within the ~3 GB WASM heap; each table is dropped immediately after benchmarking
- Wide schema capped at 1M rows; narrow runs to 10M
- Cloud phase exports the 10k narrow table as parquet, serves it over localhost HTTP with byte-range support, and registers it via DuckDB's HTTP protocol, exercising the same code path as real S3 sources
- Results are written to benchmark-results-<branch>.json so runs on different branches never clobber each other

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Removes automatic PR trigger in favor of a manual run where the user selects base_branch and compare_branch. Results are written to the workflow run summary instead of a PR comment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extract buildFetchAnnotationsSQL, buildGetFilesSQL, buildGetCountSQL, and buildDistinctValuesSQL from their respective services. Services now call these functions internally, and the benchmark imports them directly so query changes are automatically reflected without manual sync. Also adds hidden_bff_uid to the synthetic benchmark table so SQLBuilder's deterministic-pagination ORDER BY clause resolves correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Playwright's waitForFunction signature is (fn, arg, options). Passing { timeout } as the second argument caused it to be treated as the page function arg, leaving the actual timeout at Playwright's 30s default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
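
A minimal sketch of the corrected call shape follows. `WaitablePage` is a structural stand-in for the part of Playwright's `Page` used here, and `__benchmarkDone` is a hypothetical completion flag; neither name comes from this PR.

```typescript
// Structural subset of Playwright's Page, just enough for this sketch.
interface WaitablePage {
  waitForFunction(
    pageFunction: string | (() => unknown),
    arg?: unknown,
    options?: { timeout?: number },
  ): Promise<unknown>;
}

// `__benchmarkDone` is a hypothetical flag the benchmark page would set when finished.
async function waitForBenchmarkDone(page: WaitablePage, timeoutMs: number): Promise<void> {
  // Wrong: page.waitForFunction(fn, { timeout: timeoutMs })
  //        Here { timeout } fills the `arg` slot and the default timeout still applies.
  // Right: fill the `arg` slot explicitly so the options land in third position.
  await page.waitForFunction("window.__benchmarkDone === true", undefined, {
    timeout: timeoutMs,
  });
}
```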

…een queries

Two structural problems caused inconsistent results:
1. Fixed query order: early queries always ran cold, later ones always ran warm. This made queries toward the end of the list look artificially faster regardless of branch.
2. Single per-query warmup: the warmup for the first query was truly cold (table just created) while later queries got a free warm buffer pool. Warmup quality was inconsistent across queries.

Fix: replace per-query serial timing with a round-robin approach:
- WARMUP_ROUNDS (3) full passes through all queries before any timing, so every query starts from a consistently warm DuckDB state.
- ITERATIONS (20) timed rounds where all queries run in a freshly shuffled order each round, distributing cache effects evenly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
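
The round-robin scheme described in this commit can be sketched as below. It is a simplified, synchronous illustration: `measure` is a hypothetical callback that runs one query and returns its elapsed milliseconds, whereas the real runner is asynchronous. Only `WARMUP_ROUNDS` and `ITERATIONS` (with their values) are taken from the commit message.

```typescript
const WARMUP_ROUNDS = 3;
const ITERATIONS = 20;

// Fisher-Yates shuffle that returns a new array; `random` is injectable for tests.
function shuffled<T>(items: readonly T[], random: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// `measure` is hypothetical: it runs one query and returns elapsed milliseconds.
function runRoundRobin(
  queryNames: readonly string[],
  measure: (name: string) => number,
  random: () => number = Math.random,
): Map<string, number[]> {
  const timings = new Map<string, number[]>();
  for (const name of queryNames) timings.set(name, []);

  // Warmup: full passes through every query, results discarded.
  for (let round = 0; round < WARMUP_ROUNDS; round++) {
    for (const name of queryNames) measure(name);
  }

  // Timed rounds: a fresh shuffle per round spreads cache effects across queries.
  for (let round = 0; round < ITERATIONS; round++) {
    for (const name of shuffled(queryNames, random)) {
      timings.get(name)?.push(measure(name));
    }
  }
  return timings;
}
```

Because every query runs once per round, each ends up with exactly `ITERATIONS` samples regardless of where the shuffle places it.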

Replace generic "Base" / "PR" / "Benchmark (base)" labels with the actual branch names so the workflow steps and result table are self-describing without a base/compare framing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GITHUB_REF_NAME is a protected GitHub Actions env var set by the runner to the workflow trigger ref — step-level env overrides are silently ignored. Both jobs were getting the same trigger branch name. Switch to a custom BENCHMARK_BRANCH var that each job sets to its respective input branch, and read that in run-benchmark.js. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
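
One way the script side of this change might look is sketched below. The fallback order (to `GITHUB_REF_NAME`, then to "local") and the file-name sanitizing are assumptions for illustration; the commit only states that run-benchmark.js reads `BENCHMARK_BRANCH`. `EnvLike`, `resolveBenchmarkBranch`, and `resultFileName` are hypothetical names.

```typescript
// Stand-in for process.env so the sketch does not depend on Node typings.
type EnvLike = Record<string, string | undefined>;

// Assumption: fall back to GITHUB_REF_NAME, then "local", when BENCHMARK_BRANCH is unset or empty.
function resolveBenchmarkBranch(env: EnvLike): string {
  return env["BENCHMARK_BRANCH"] || env["GITHUB_REF_NAME"] || "local";
}

// Assumption: characters such as "/" are replaced so the branch maps to a single file name.
function resultFileName(branch: string): string {
  const safeBranch = branch.replace(/[^A-Za-z0-9._-]/g, "-");
  return `benchmark-results-${safeBranch}.json`;
}

// Usage in a Node script would look like:
//   const file = resultFileName(resolveBenchmarkBranch(process.env));
```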

The previous two-job design ran base and compare on separate VMs. Since GitHub Actions runners vary in CPU speed and load, the hardware variance (observed at ~15%) swamped the signal from actual query changes: queries we never touched showed "improvements" simply because the compare runner was faster.

Fix: collapse to a single job that runs both benchmarks back-to-back on the same machine. Identical hardware means differences in the comparison table reflect genuine query performance changes.

Tradeoffs accepted:
- Sequential (takes ~2× longer) vs parallel
- Compare run starts with a warmer OS page cache, but this is consistent across all queries so it doesn't bias individual query comparisons

Also bumps the job timeout to 60 minutes to accommodate sequential runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the in-memory TABLE approach with the same pattern DatabaseService uses in production:
1. Create a staging table from range() (no hidden_bff_uid stored)
2. COPY to parquet, register the buffer back into the WASM filesystem
3. DROP the staging table, CREATE a VIEW over parquet_scan with file_row_number AS hidden_bff_uid (identical to createParquetDirectView)

This means DuckDB now runs benchmark queries through its parquet reader, enabling row group skipping, column projection, and predicate pushdown: the same optimizations that apply to real S3/HTTP data sources. Previously all data lived fully deserialized in the WASM heap with no parquet metadata.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
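
The staging-table-to-view pattern in this commit can be illustrated with the SQL it would roughly issue. This is a sketch under assumptions: the single `id` column, the `_staging` suffix, and the file naming are invented for the example, and the actual createParquetDirectView SQL is not shown in this PR page. The buffer registration in step 2 happens outside SQL and is only noted in a comment.

```typescript
interface FixtureSql {
  createStaging: string;
  copyToParquet: string;
  dropStaging: string;
  createView: string;
}

// Builds illustrative DuckDB statements for the three steps described above.
// Assumption: table/file names and the single `id` column are placeholders.
function buildFixtureSql(viewName: string, rowCount: number): FixtureSql {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(viewName)) {
    throw new Error(`unsafe view name: ${viewName}`);
  }
  if (!Number.isInteger(rowCount) || rowCount < 0) {
    throw new Error(`rowCount must be a non-negative integer, got ${rowCount}`);
  }
  const staging = `${viewName}_staging`;
  const parquetFile = `${viewName}.parquet`;
  return {
    // Step 1: synthetic rows from range(); no hidden_bff_uid column is stored.
    createStaging: `CREATE TABLE "${staging}" AS SELECT i AS id FROM range(${rowCount}) t(i)`,
    // Step 2: write parquet (the buffer is then registered with the WASM filesystem).
    copyToParquet: `COPY "${staging}" TO '${parquetFile}' (FORMAT PARQUET)`,
    // Step 3: drop the staging table and expose the parquet file through a view.
    dropStaging: `DROP TABLE "${staging}"`,
    createView:
      `CREATE VIEW "${viewName}" AS SELECT *, file_row_number AS hidden_bff_uid ` +
      `FROM parquet_scan('${parquetFile}', file_row_number = true)`,
  };
}
```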

registerFileBuffer transfers the ArrayBuffer to the WASM worker via postMessage, detaching it in the JS context. Save a .slice() copy of the fixture buffer before the transfer so it remains readable when Playwright reads it as __fixtureParquet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

registerFileBuffer transfers ownership of the ArrayBuffer to the WASM worker. Rather than trying to preserve a copy with .slice(), re-export the fixture via a fresh COPY after the parquet_scan view is set up. This produces a clean buffer that was never transferred and avoids any lifecycle ambiguity with the WASM worker's memory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove 10,000 rows from ALL_SCALES: too few rows produce noisy results and false positives from random variance
- Cloud benchmark now runs at 100k / 1M / 10M rows (narrow schema) to test HTTP range-request predicate pushdown at realistic data sizes
- CloudQueryResult gains a `scale` field; fixtures are keyed by scale so Playwright writes fixture-<scale>.parquet for each
- compare-results.js groups cloud results by scale with sub-headers and reports per-scale network baselines

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Array.from() on a 10M row parquet buffer (~50-100MB) crashes V8 when trying to allocate the intermediate plain JS array for CDP transfer via page.evaluate. Cap CLOUD_FIXTURE_SCALES to ≤1M rows — sufficient to test HTTP range-request predicate pushdown, and safely within CDP limits. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

SeanDuHare

didn't finish, but here are some thoughts so far

SeanDuHare · 2026-04-22T20:31:57Z

Remembered I wasn't sure if I checked the following in this, can you add the following if it isn't yet (or just lmk your thoughts):

add to the test matrix testing for cardinality? "Metadata cardinality defines the number of unique values in a data field (column) or the numeric relationship between entities in a dataset"
add testing varying column amounts (I think it is always 100 right now?)

Replaces the raw-SQL benchmark with a task-based suite that calls the same service layer methods the app calls at runtime (fetchValues, getFiles, fetchAvailableAnnotationsForHierarchy, etc.), making benchmark times directly comparable to real user-perceived latency.

Key changes:
- benchmark/src/tasks.ts: 11 tasks covering fetch_annotations, filter picker opens at 3 cardinality levels, browse/sort/filter file list, null group counts, change_grouping, and folder expansion
- benchmark/src/index.ts: times full task round-trips (service → IPC → DuckDB → deserialization) instead of raw queryWorker calls
- duckdb-worker.worker.ts: bff_query_timing flag logs every query with label and elapsed time; accumulates timings for __getQueryTimings() report
- DatabaseServiceWeb: exposes __getQueryTimings() on window when timing enabled, dumps p50/p95/max table from accumulated worker timings
- run-local.js: --scale, --iterations, --warmup, --local, --full flags
- run-regression.js: uses local fixtures (no S3 protocol issues), supports --iterations and --warmup flags
- benchmark.yml: downloads fixtures via curl, adds iterations/warmup inputs, raises timeout to 180 min, drops S3 env var dependency from run steps
- Removes one-time-use generate-fixtures.py and upload-synthetic-fixtures.js

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

git clean -ffdx (run by actions/checkout) was wiping the fixtures directory before the benchmark could use them. Moving the download step after checkout fixes this. Cache keyed on fixture version (v1) so subsequent runs skip the ~3GB download entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace main-thread performance.now() wrapping with worker-side timing that measures only connection.query(), excluding Arrow toArray() and JSON serialization overhead that was inflating results 2-6x.

Key changes:
- Add enableQueryTiming(), clearTimings(), sumTimings(), clearAnnotationCache() to DatabaseServiceWebWorker
- benchmarkSource clears timings per-task and reads sumTimings() instead of wrapping task.run() with performance.now()
- fetch_annotations: clear annotation cache before each timed iteration so warmup cache hits don't report 0ms
- change_grouping: use wall-clock timing, because parallel queries produce O(N²) worker-side sums due to cumulative wait time in DuckDB's serial executor
- Split sort_file_list into two tasks (limit 50 / limit 100) to cover both DuckDB execution plans: Top-N heap+pruning vs full materialized sort
- Pre-warm DuckDB's parquet scanner before timing registrations to avoid 620ms cold-start on the first source
- Hardcode staging S3 fixture URLs as defaults in run-local.js
- Make forceFullHTTPReads a parameter on initializeDuckDB

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
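
A minimal sketch of worker-side timing in this spirit is shown below. The method and field names `enableQueryTiming`, `clearTimings`, `sumTimings`, `queryTimingEnabled`, and `accumulatedTimings` are taken from this PR's description, but the class itself (`QueryTimingRecorder`), the `start`/stop pattern, and the injectable clock are assumptions made for illustration, not the worker's actual implementation.

```typescript
class QueryTimingRecorder {
  private queryTimingEnabled = false;
  private readonly accumulatedTimings = new Map<string, number[]>();
  private readonly now: () => number;

  // The clock is injectable so the sketch can be tested deterministically.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  enableQueryTiming(): void {
    this.queryTimingEnabled = true;
  }

  clearTimings(): void {
    this.accumulatedTimings.clear();
  }

  // Total elapsed milliseconds across all recorded queries.
  sumTimings(): number {
    let total = 0;
    this.accumulatedTimings.forEach((samples) => {
      for (const ms of samples) total += ms;
    });
    return total;
  }

  // Starts timing one query and returns a function that stops and records it.
  start(label: string): () => void {
    if (!this.queryTimingEnabled) return () => undefined;
    const startedAt = this.now();
    return () => {
      const elapsed = this.now() - startedAt;
      const samples = this.accumulatedTimings.get(label) ?? [];
      samples.push(elapsed);
      this.accumulatedTimings.set(label, samples);
    };
  }
}

// Usage around the query call only, so result conversion is excluded:
//   const stop = recorder.start("getFiles");
//   try { result = await connection.query(sql); } finally { stop(); }
```

Keeping the stop call in a `finally` block means failed queries are still recorded, which may or may not match what the worker does.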

Keeps this.sourceMetadataName assignment added in main alongside our buildFetchAnnotationsSQL refactor in DatabaseService.fetchAnnotations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

BrianWhitneyAI · 2026-04-24T19:59:50Z

BFF Query Benchmark — Local Run
feature/query-benchmark @ 4d209f5b · 2026-04-24 · DuckDB init: 822ms
Timings shown as p50 (1 iteration). Scales: local fixtures served over localhost; cloud via S3.

| Query | 100k-cloud | 100k-local | 1m-cloud | 1m-local | 10m-cloud | 10m-local |
| --- | --- | --- | --- | --- | --- | --- |
| registration | 55.9ms | 6.91ms | 589.3ms | 17.0ms | 1023.2ms | 118.6ms |
| fetch_annotations | 1.82ms | 2.12ms | 1.64ms | 2.02ms | 1.63ms | 2.00ms |
| open_filter_picker_low_cardinality | 2.49ms | 3.85ms | 14.1ms | 22.0ms | 97.2ms | 203.5ms |
| open_filter_picker_medium_cardinality | 2.97ms | 7.73ms | 57.1ms | 22.3ms | ⚠️ 18218.9ms | 185.3ms |
| open_filter_picker_high_cardinality | 6.06ms | 6.29ms | 1679.8ms | 65.3ms | ⚠️ 40263.8ms | 1029.3ms |
| browse_file_list | 40.1ms | 85.9ms | 55.4ms | 93.1ms | ⚠️ 8676.3ms | 201.2ms |
| sort_file_list | 82.5ms | 178.4ms | ⚠️ 111064.3ms | 1594.8ms | ⚠️ 288693.8ms | 5817.3ms |
| sort_file_list_large_page | 85.5ms | 178.7ms | ⚠️ 120892.8ms | 1914.3ms | ⚠️ 1932389.7ms | 18555.9ms |
| filter_count | 11.4ms | 9.67ms | 65.1ms | 74.9ms | 611.1ms | 729.9ms |
| filter_browse | 41.2ms | 75.7ms | 54.1ms | 91.3ms | ⚠️ 7109.0ms | 209.6ms |
| null_group_count | 2.89ms | 2.32ms | 6.87ms | 6.57ms | 61.5ms | 55.4ms |
| change_grouping | 299.5ms | 412.8ms | 777.6ms | 906.8ms | 5549.5ms | 5848.6ms |
| expand_folder | 9.78ms | 11.3ms | 68.7ms | 88.8ms | 652.7ms | 868.9ms |

Observations:

Cloud vs. local gap widens sharply at 10M rows for sort/browse queries, and for the sort queries it is already large at 1M rows (about 111s vs. 1.6s for sort_file_list). This is expected since cloud reads require S3 range requests per query while local reads hit the in-memory buffer. The sort queries at 10m-cloud (about 289s and 1932s) are likely doing full-table downloads on each iteration with no warm cache.
Filter and count queries scale well: filter_count and null_group_count stay under 1s even at 10M, likely because DuckDB can prune row groups.
open_filter_picker_high_cardinality at 1m-cloud (about 1680ms) and 10m-cloud (about 40s) shows that high-cardinality DISTINCT scans are the most expensive user-facing operation at scale after the sort queries.

…ker.ts Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>

Reverts exported SQL builder helpers (buildGetFilesSQL, buildGetCountSQL, buildDistinctValuesSQL, buildFetchAnnotationsSQL) and createParquetDirectView visibility change back to their original forms — these were added only to support the benchmark but the benchmark now calls service methods directly. Adds a design rationale comment at the top of tasks.ts explaining the service-layer approach and maintenance expectations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace timingsMap.get(task.name)! with a ?? fallback pattern to satisfy the no-non-null-assertion lint rule. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
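
The lint fix amounts to the pattern below. The empty-array fallback and the `samplesFor` helper are assumptions for illustration; the commit does not say which fallback value was chosen.

```typescript
// Returns the recorded samples for a task, or an empty list when none exist.
function samplesFor(timingsMap: Map<string, number[]>, taskName: string): number[] {
  // Before: timingsMap.get(taskName)!   (flagged by the no-non-null-assertion rule)
  // After: an explicit fallback makes the missing-key case well defined.
  return timingsMap.get(taskName) ?? [];
}
```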

pgarrison

I have been able to build on this successfully in #806 and #809, and it's been useful for benchmarking my code, so I think it's a good starting point that we should move forward with.

Some of my feedback covers things I've already addressed in #806, so I think we can just move forward from there.

At a high level, though, I have a concern about large LLM-generated PRs. They are difficult to review due to their length, and code readability is hampered by verbosity, poor use of DRY, vague or over-abbreviated variable names, and inconsistent comment quality. If we want to maintain high readability standards and thorough understanding of the codebase, there is a risk that future PRs like this shift effort from PR authors to reviewers.

For these reasons, I have opted not to fully read and understand the details of all code here. Instead, I've focused my reading on looking for security concerns.

Co-authored-by: Philip Garrison <pgarrison@users.noreply.github.com>

BrianWhitneyAI and others added 2 commits April 15, 2026 13:19

poc commit

df902b5

BrianWhitneyAI force-pushed the feature/query-benchmark branch from 8a646d6 to 34bc28f Compare April 15, 2026 20:20

BrianWhitneyAI and others added 6 commits April 15, 2026 13:28

BrianWhitneyAI mentioned this pull request Apr 15, 2026

test: source-level query regressions to verify benchmark detection #740

Closed

8 tasks

BrianWhitneyAI and others added 7 commits April 15, 2026 14:51

docs: explain why benchmark runs sequentially on the same VM

b3d1449

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

SeanDuHare reviewed Apr 16, 2026

Comment thread packages/web/benchmark/src/queries.ts Outdated

Comment thread packages/web/scripts/run-benchmark.js Outdated

Comment thread packages/web/scripts/run-benchmark.js Outdated

Comment thread packages/web/scripts/run-benchmark.js Outdated

SeanDuHare reviewed Apr 16, 2026

Comment thread packages/web/benchmark/src/index.ts Outdated

BrianWhitneyAI and others added 2 commits April 22, 2026 16:22

BrianWhitneyAI force-pushed the feature/query-benchmark branch from a865df9 to f49eb3d Compare April 23, 2026 03:33

BrianWhitneyAI and others added 3 commits April 23, 2026 12:37

cleanup

fb85966

merge: resolve conflicts with main

a0b20a8

BrianWhitneyAI commented Apr 27, 2026

Comment thread packages/core/services/DatabaseService/index.ts Outdated

BrianWhitneyAI added 2 commits April 27, 2026 11:32

cleanup

ee5c03f

benchmark documentation

5bd2bbd

Merge branch 'main' into feature/query-benchmark

6a8f88a