Benchmark suite for driving BTCA prompts through multiple model providers and scoring the answers with a judge council.
Install dependencies:
```sh
bun install
```

Ensure:

- `btca` is installed and on `PATH` (the script runs `btca serve` internally).
- `OPENCODE_API_KEY` is set (required for primary models + judge models via the OpenCode endpoint).
- `OPENROUTER_API_KEY` is set (required for judge model access).
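
Before kicking off a run, a minimal sketch of the kind of pre-flight check the script could perform for these variables (the variable names come from the list above; the actual validation in `src/bench.ts` may differ):

```ts
// Hypothetical pre-flight check; src/bench.ts may validate differently.
const required = ["OPENCODE_API_KEY", "OPENROUTER_API_KEY"];
const missing = required.filter((name) => !process.env[name]);

if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
  process.exit(1);
}
```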

```sh
bun run bench
```

Flags:

- `--runs <n>` or `-r <n>`: set per-test run count (default `1`).
- `--model <name>` or `-m <name>`: run only one model.
Example:
```sh
bun run bench --runs 3 --model gpt-5.3-codex
```

If `--model` is not in the default list, it still runs with provider `opencode`.
The default model set is:
- `gpt-5.3-codex` (openai)
- `gpt-5.3-codex-spark` (openai)
- `claude-opus-4-6` (opencode)
- `claude-haiku-4-5` (opencode)
- `gemini-3-flash` (opencode)
- `minimax-m2.5-free` (opencode)
- `z-ai/glm-5` (openrouter)
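
For reference, a sketch of how this model/provider mapping could be declared; the model names and providers come from the list above, but the constant name and types are assumptions rather than the actual code in `src/bench.ts`:

```ts
// Hypothetical declaration; see src/bench.ts for the real default list.
type Provider = "openai" | "opencode" | "openrouter";

interface BenchModel {
  name: string;
  provider: Provider;
}

const DEFAULT_MODELS: BenchModel[] = [
  { name: "gpt-5.3-codex", provider: "openai" },
  { name: "gpt-5.3-codex-spark", provider: "openai" },
  { name: "claude-opus-4-6", provider: "opencode" },
  { name: "claude-haiku-4-5", provider: "opencode" },
  { name: "gemini-3-flash", provider: "opencode" },
  { name: "minimax-m2.5-free", provider: "opencode" },
  { name: "z-ai/glm-5", provider: "openrouter" },
];
```

A `--model` value that is not in this list runs with the `opencode` provider, as noted earlier.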
The suite runs the fixed `TESTS` list in `src/bench.ts` (a sketch of one entry follows the list below):
- 10 test prompts total
- Resources used:
  - `svelte`, `tailwindcss`, `justBash`, `sveltePkg`, `daytona`
  - npm resources are supported (`type: "npm"` entries).
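
As promised above, a sketch of what a single `TESTS` entry might look like; the field names and the example prompt are illustrative assumptions, and only the resource names are taken from the list above:

```ts
// Hypothetical test shape; the real TESTS entries live in src/bench.ts.
interface BenchTest {
  id: string;          // used for grouping the summary when --model is set
  prompt: string;      // the question sent to POST /question/stream
  resources: string[]; // resources the workspace must have configured
}

// Example entry (prompt text is invented for illustration).
const exampleTest: BenchTest = {
  id: "svelte-stores",
  prompt: "How do writable stores work in Svelte?",
  resources: ["svelte", "sveltePkg"],
};
```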
For each selected model, `src/bench.ts` (see the sketch after this list):
- Creates a model-specific workspace under `.btca-bench/<model>-<timestamp>`.
- Writes `btca.config.jsonc` with model/provider and resources.
- Starts `btca serve` on a random port in that workspace.
- Ensures each required resource exists, posting missing resources to `POST /config/resources`.
- Sets the active model with `PUT /config/model`.
- Executes every test for each run via `POST /question/stream`.
- Parses streaming SSE events for:
  - final text
  - tool updates/calls
  - token usage
  - throughput and timing metrics
  - cost (USD)
- Scores each answer with a 3-model judge council:
  - `gpt-5.2-codex`
  - `gemini-3-pro`
  - `claude-opus-4-6`
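
To make the flow above concrete, here is a condensed sketch of one model's run loop, assuming `btca serve` is already listening on `port` and that the endpoints accept simple JSON payloads (the payload shapes, helper names, and `BenchTest` type are assumptions; only the endpoint paths and the `btca.config.jsonc` filename come from the list above):

```ts
// Hypothetical run loop; src/bench.ts is the source of truth.
import { mkdir, writeFile } from "node:fs/promises";
import { join } from "node:path";

async function runModel(model: string, provider: string, port: number, tests: BenchTest[]) {
  // Model-specific workspace + config file.
  const workspace = join(".btca-bench", `${model}-${Date.now()}`);
  await mkdir(workspace, { recursive: true });
  await writeFile(
    join(workspace, "btca.config.jsonc"),
    JSON.stringify({ model, provider, resources: [] }, null, 2),
  );

  const base = `http://localhost:${port}`;
  const json = { "Content-Type": "application/json" };

  // Ensure required resources exist, then select the active model.
  for (const name of new Set(tests.flatMap((t) => t.resources))) {
    await fetch(`${base}/config/resources`, {
      method: "POST",
      headers: json,
      body: JSON.stringify({ name }),
    });
  }
  await fetch(`${base}/config/model`, {
    method: "PUT",
    headers: json,
    body: JSON.stringify({ model, provider }),
  });

  // Stream each test and read the SSE response body chunk by chunk.
  for (const test of tests) {
    const res = await fetch(`${base}/question/stream`, {
      method: "POST",
      headers: json,
      body: JSON.stringify({ question: test.prompt }),
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    let raw = "";
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      raw += decoder.decode(value, { stream: true });
    }
    // "data:" lines in `raw` carry JSON events: text deltas, tool updates,
    // token usage, timing/throughput, and cost (USD).
  }
}
```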
When running all models, raw records are written to:
`results/bench-results-<timestamp>.jsonl`
Each line is a full JSON record with:
- question metadata
- raw answer
- token/cost/timing metrics
- tool usage and error fields
- judge score/clarity/votes and disagreement stats
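
Taken together, those fields suggest a record shape roughly like the following; the property names here are assumptions, and the authoritative keys are whatever `src/bench.ts` writes to the JSONL file:

```ts
// Hypothetical JSONL record shape; actual field names may differ.
interface BenchRecord {
  // question metadata
  testId: string;
  model: string;
  provider: string;
  run: number;
  // raw answer
  answer: string;
  // token / cost / timing metrics
  inputTokens: number;
  outputTokens: number;
  outputTokensPerSec: number;
  durationMs: number;
  costUsd: number;
  // tool usage and error fields
  toolCalls: number;
  error?: string;
  // judge results
  judgeScore: number;
  clarity: number;
  judgeVotes: number[];
  judgeDisagreement: number;
}
```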
The terminal summary prints:
- average duration
- average tool calls
- average judge score
- average clarity
- average input/output tokens
- average output tokens/sec
- average cost
- judge disagreement
- failed run count
With `--model`, the summary is grouped by `testId` instead of model.
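
A minimal sketch of that grouping and averaging, assuming records shaped like the hypothetical `BenchRecord` above; the real summary logic lives in `src/bench.ts`:

```ts
// Hypothetical summary aggregation; field and function names are assumptions.
function summarize(records: BenchRecord[], singleModel: boolean) {
  const groups = new Map<string, BenchRecord[]>();
  for (const r of records) {
    const key = singleModel ? r.testId : r.model; // group by testId with --model
    groups.set(key, [...(groups.get(key) ?? []), r]);
  }

  for (const [key, rows] of groups) {
    const ok = rows.filter((r) => !r.error);
    const avg = (pick: (r: BenchRecord) => number) =>
      ok.reduce((sum, r) => sum + pick(r), 0) / Math.max(ok.length, 1);

    console.log(key, {
      avgDurationMs: avg((r) => r.durationMs),
      avgToolCalls: avg((r) => r.toolCalls),
      avgJudgeScore: avg((r) => r.judgeScore),
      avgClarity: avg((r) => r.clarity),
      avgOutputTokensPerSec: avg((r) => r.outputTokensPerSec),
      avgCostUsd: avg((r) => r.costUsd),
      avgDisagreement: avg((r) => r.judgeDisagreement),
      failedRuns: rows.length - ok.length,
    });
  }
}
```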