Benchmark suite for driving BTCA prompts through multiple model providers and scoring the answers with a judge council.
Install dependencies:
```sh
bun install
```

Ensure:

- `btca` is installed and on `PATH` (the script runs `btca serve` internally).
- `OPENCODE_API_KEY` is set (required for primary models + judge models via the OpenCode endpoint).
- `OPENROUTER_API_KEY` is set (required for judge model access).
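
Before kicking off a run, a minimal sketch of the kind of pre-flight check the script could perform for these variables (the variable names come from the list above; the actual validation in `src/bench.ts` may differ):

```ts
// Hypothetical pre-flight check; src/bench.ts may validate differently.
const required = ["OPENCODE_API_KEY", "OPENROUTER_API_KEY"];
const missing = required.filter((name) => !process.env[name]);

if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
  process.exit(1);
}
```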

```sh
bun run bench
```

Flags:

- `--runs <n>` or `-r <n>`: set per-test run count (default `1`).
- `--model <name>` or `-m <name>`: run only one model.
Example:
```sh
bun run bench --runs 3 --model gpt-5.3-codex
```

If `--model` is not in the default list, it still runs with provider `opencode`.
The default model set is:
- `gpt-5.3-codex` (openai)
- `gpt-5.3-codex-spark` (openai)
- `claude-opus-4-6` (opencode)
- `claude-haiku-4-5` (opencode)
- `gemini-3-flash` (opencode)
- `minimax-m2.5-free` (opencode)
- `z-ai/glm-5` (openrouter)
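
For reference, a sketch of how this model/provider mapping could be declared; the model names and providers come from the list above, but the constant name and types are assumptions rather than the actual code in `src/bench.ts`:

```ts
// Hypothetical declaration; see src/bench.ts for the real default list.
type Provider = "openai" | "opencode" | "openrouter";

interface BenchModel {
  name: string;
  provider: Provider;
}

const DEFAULT_MODELS: BenchModel[] = [
  { name: "gpt-5.3-codex", provider: "openai" },
  { name: "gpt-5.3-codex-spark", provider: "openai" },
  { name: "claude-opus-4-6", provider: "opencode" },
  { name: "claude-haiku-4-5", provider: "opencode" },
  { name: "gemini-3-flash", provider: "opencode" },
  { name: "minimax-m2.5-free", provider: "opencode" },
  { name: "z-ai/glm-5", provider: "openrouter" },
];
```

A `--model` value that is not in this list runs with the `opencode` provider, as noted earlier.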
The suite runs the fixed `TESTS` list in `src/bench.ts` (a sketch of one entry follows the list below):
- 10 test prompts total
- Resources used:
  - `svelte`, `tailwindcss`, `justBash`, `sveltePkg`, `daytona`
  - npm resources are supported (`type: "npm"` entries).
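
As promised above, a sketch of what a single `TESTS` entry might look like; the field names and the example prompt are illustrative assumptions, and only the resource names are taken from the list above:

```ts
// Hypothetical test shape; the real TESTS entries live in src/bench.ts.
interface BenchTest {
  id: string;          // used for grouping the summary when --model is set
  prompt: string;      // the question sent to POST /question/stream
  resources: string[]; // resources the workspace must have configured
}

// Example entry (prompt text is invented for illustration).
const exampleTest: BenchTest = {
  id: "svelte-stores",
  prompt: "How do writable stores work in Svelte?",
  resources: ["svelte", "sveltePkg"],
};
```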
For each selected model, `src/bench.ts` (see the sketch after this list):
- Creates a model-specific workspace under `.btca-bench/<model>-<timestamp>`.
- Writes `btca.config.jsonc` with model/provider and resources.
- Starts `btca serve` on a random port in that workspace.
- Ensures each required resource exists, posting missing resources to `POST /config/resources`.
- Sets the active model with `PUT /config/model`.
- Executes every test for each run via `POST /question/stream`.
- Parses streaming SSE events for:
  - final text
  - tool updates/calls
  - token usage
  - throughput and timing metrics
  - cost (USD)
- Scores each answer with a 3-model judge council:
  - `gpt-5.2-codex`
  - `gemini-3-pro`
  - `claude-opus-4-6`
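
To make the flow above concrete, here is a condensed sketch of one model's run loop, assuming `btca serve` is already listening on `port` and that the endpoints accept simple JSON payloads (the payload shapes, helper names, and `BenchTest` type are assumptions; only the endpoint paths and the `btca.config.jsonc` filename come from the list above):

```ts
// Hypothetical run loop; src/bench.ts is the source of truth.
import { mkdir, writeFile } from "node:fs/promises";
import { join } from "node:path";

async function runModel(model: string, provider: string, port: number, tests: BenchTest[]) {
  // Model-specific workspace + config file.
  const workspace = join(".btca-bench", `${model}-${Date.now()}`);
  await mkdir(workspace, { recursive: true });
  await writeFile(
    join(workspace, "btca.config.jsonc"),
    JSON.stringify({ model, provider, resources: [] }, null, 2),
  );

  const base = `http://localhost:${port}`;
  const json = { "Content-Type": "application/json" };

  // Ensure required resources exist, then select the active model.
  for (const name of new Set(tests.flatMap((t) => t.resources))) {
    await fetch(`${base}/config/resources`, {
      method: "POST",
      headers: json,
      body: JSON.stringify({ name }),
    });
  }
  await fetch(`${base}/config/model`, {
    method: "PUT",
    headers: json,
    body: JSON.stringify({ model, provider }),
  });

  // Stream each test and read the SSE response body chunk by chunk.
  for (const test of tests) {
    const res = await fetch(`${base}/question/stream`, {
      method: "POST",
      headers: json,
      body: JSON.stringify({ question: test.prompt }),
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    let raw = "";
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      raw += decoder.decode(value, { stream: true });
    }
    // "data:" lines in `raw` carry JSON events: text deltas, tool updates,
    // token usage, timing/throughput, and cost (USD).
  }
}
```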
When running all models, raw records are written to:
`results/bench-results-<timestamp>.jsonl`
Each line is a full JSON record with:
- question metadata
- raw answer
- token/cost/timing metrics
- tool usage and error fields
- judge score/clarity/votes and disagreement stats
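
Taken together, those fields suggest a record shape roughly like the following; the property names here are assumptions, and the authoritative keys are whatever `src/bench.ts` writes to the JSONL file:

```ts
// Hypothetical JSONL record shape; actual field names may differ.
interface BenchRecord {
  // question metadata
  testId: string;
  model: string;
  provider: string;
  run: number;
  // raw answer
  answer: string;
  // token / cost / timing metrics
  inputTokens: number;
  outputTokens: number;
  outputTokensPerSec: number;
  durationMs: number;
  costUsd: number;
  // tool usage and error fields
  toolCalls: number;
  error?: string;
  // judge results
  judgeScore: number;
  clarity: number;
  judgeVotes: number[];
  judgeDisagreement: number;
}
```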
The terminal summary prints:
- average duration
- average tool calls
- average judge score
- average clarity
- average input/output tokens
- average output tokens/sec
- average cost
- judge disagreement
- failed run count
With `--model`, the summary is grouped by `testId` instead of model.
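
A minimal sketch of that grouping and averaging, assuming records shaped like the hypothetical `BenchRecord` above; the real summary logic lives in `src/bench.ts`:

```ts
// Hypothetical summary aggregation; field and function names are assumptions.
function summarize(records: BenchRecord[], singleModel: boolean) {
  const groups = new Map<string, BenchRecord[]>();
  for (const r of records) {
    const key = singleModel ? r.testId : r.model; // group by testId with --model
    groups.set(key, [...(groups.get(key) ?? []), r]);
  }

  for (const [key, rows] of groups) {
    const ok = rows.filter((r) => !r.error);
    const avg = (pick: (r: BenchRecord) => number) =>
      ok.reduce((sum, r) => sum + pick(r), 0) / Math.max(ok.length, 1);

    console.log(key, {
      avgDurationMs: avg((r) => r.durationMs),
      avgToolCalls: avg((r) => r.toolCalls),
      avgJudgeScore: avg((r) => r.judgeScore),
      avgClarity: avg((r) => r.clarity),
      avgOutputTokensPerSec: avg((r) => r.outputTokensPerSec),
      avgCostUsd: avg((r) => r.costUsd),
      avgDisagreement: avg((r) => r.judgeDisagreement),
      failedRuns: rows.length - ok.length,
    });
  }
}
```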