# grep-bench

Benchmark suite for driving BTCA prompts through multiple model providers and scoring the answers with a judge council.

## Setup

Install dependencies:

```sh
bun install
```

Ensure that:

- btca is installed and on PATH (the script runs `btca serve` internally).
- OPENCODE_API_KEY is set (required for the primary and judge models accessed via the OpenCode endpoint).
- OPENROUTER_API_KEY is set (required for judge model access).
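Before a run, it can be worth sanity-checking these prerequisites. The sketch below is a hypothetical pre-flight script (not part of the repo) that checks the two environment variables named above and looks for `btca` on PATH via `Bun.which`:

```ts
// preflight.ts -- hypothetical sketch; not part of grep-bench itself.
const requiredEnv = ["OPENCODE_API_KEY", "OPENROUTER_API_KEY"];

// Report any missing API keys before the benchmark starts.
const missing = requiredEnv.filter((name) => !process.env[name]);
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
  process.exit(1);
}

// Bun.which returns null when the executable cannot be resolved on PATH.
if (Bun.which("btca") === null) {
  console.error("btca not found on PATH (the benchmark runs `btca serve` internally).");
  process.exit(1);
}

console.log("Prerequisites look good.");
```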

## Run

```sh
bun run bench
```

Flags:

- `--runs <n>` or `-r <n>`: set the per-test run count (default 1).
- `--model <name>` or `-m <name>`: run only one model.

Example:

```sh
bun run bench --runs 3 --model gpt-5.3-codex
```

If the model passed to `--model` is not in the default list, it is still run, using the `opencode` provider. A rough sketch of the flag handling is shown below.
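For orientation only, the flag parsing could look something like the following; this sketch uses Node's `util.parseArgs` and the defaults stated above, and is an assumption about the approach rather than the actual code in src/bench.ts:

```ts
// Hypothetical sketch of how --runs/-r and --model/-m might be parsed.
import { parseArgs } from "node:util";

const { values } = parseArgs({
  args: process.argv.slice(2),
  options: {
    runs: { type: "string", short: "r", default: "1" }, // per-test run count
    model: { type: "string", short: "m" },              // optional: run a single model
  },
});

const runs = Number(values.runs);  // defaults to 1
const onlyModel = values.model;    // undefined => run the full default model set

console.log({ runs, onlyModel });
```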

## What is benchmarked

The default model set is:

- gpt-5.3-codex (openai)
- gpt-5.3-codex-spark (openai)
- claude-opus-4-6 (opencode)
- claude-haiku-4-5 (opencode)
- gemini-3-flash (opencode)
- minimax-m2.5-free (opencode)
- z-ai/glm-5 (openrouter)

The suite runs the fixed TESTS list in src/bench.ts:

- 10 test prompts total
- Resources used: svelte, tailwindcss, justBash, sveltePkg, daytona
- npm resources are supported (type: "npm" entries); an illustrative sketch of these shapes follows the list.
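The real definitions live in src/bench.ts; the TypeScript below is only an assumed illustration of what a model entry, a test entry, and an npm resource entry might look like. Everything not named above (field names, the Provider union layout) is hypothetical:

```ts
// Hypothetical shapes -- illustrative only; see src/bench.ts for the real definitions.
type Provider = "openai" | "opencode" | "openrouter";

interface BenchModel {
  name: string;       // e.g. "gpt-5.3-codex"
  provider: Provider; // e.g. "openai"
}

interface BenchTest {
  id: string;       // testId used for grouping when --model is set
  prompt: string;   // the question sent to the model
  resource: string; // e.g. "svelte", "tailwindcss", "justBash", "sveltePkg", "daytona"
}

// npm-backed resources are declared with type: "npm".
interface NpmResource {
  type: "npm";
  name: string; // package name
}
```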

## Runtime flow

For each selected model, src/bench.ts does the following (a condensed sketch follows the list):

- Creates a model-specific workspace under `.btca-bench/<model>-<timestamp>`.
- Writes `btca.config.jsonc` with the model/provider and resources.
- Starts `btca serve` on a random port in that workspace.
- Ensures each required resource exists, posting missing resources to `POST /config/resources`.
- Sets the active model with `PUT /config/model`.
- Executes every test for each run via `POST /question/stream`.
- Parses streaming SSE events for:
  - final text
  - tool updates/calls
  - token usage
  - throughput and timing metrics
  - cost (USD)
- Scores each answer with a 3-model judge council:
  - gpt-5.2-codex
  - gemini-3-pro
  - claude-opus-4-6
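Condensed, the per-model loop looks roughly like the sketch below. The endpoint paths, workspace layout, and config file name come from the list above; every payload shape, SSE event field, CLI flag, and question string is an assumption made for illustration and not the actual src/bench.ts:

```ts
// Hypothetical condensed per-model loop. Payload and SSE event shapes are assumptions.
import { mkdir } from "node:fs/promises";

async function runModel(model: { name: string; provider: string }, resources: object[]) {
  // Model-specific workspace and config file.
  const workspace = `.btca-bench/${model.name}-${Date.now()}`;
  await mkdir(workspace, { recursive: true });
  await Bun.write(
    `${workspace}/btca.config.jsonc`,
    JSON.stringify({ model: model.name, provider: model.provider, resources }, null, 2),
  );

  // Start btca serve on a random port (how the port is passed is an assumption).
  const port = 3000 + Math.floor(Math.random() * 1000);
  const server = Bun.spawn(["btca", "serve", "--port", String(port)], { cwd: workspace });
  const base = `http://localhost:${port}`;

  try {
    // Ensure each required resource exists, then set the active model.
    for (const resource of resources) {
      await fetch(`${base}/config/resources`, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify(resource),
      });
    }
    await fetch(`${base}/config/model`, {
      method: "PUT",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ model: model.name, provider: model.provider }),
    });

    // Stream one question and collect events (placeholder prompt, naive SSE parsing).
    const res = await fetch(`${base}/question/stream`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ question: "placeholder test prompt" }),
    });
    const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
    let finalText = "";
    let toolCalls = 0;
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // Assumes each `data:` line carries one complete JSON event.
      for (const line of value.split("\n")) {
        if (!line.startsWith("data:")) continue;
        const event = JSON.parse(line.slice(5));
        if (event.type === "text") finalText += event.text;
        if (event.type === "tool") toolCalls += 1;
      }
    }
    return { finalText, toolCalls };
  } finally {
    server.kill();
  }
}
```

Judge-council scoring by the three judge models listed above would then run over `finalText`; it is omitted from the sketch for brevity.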

## Output

When running all models, raw records are written to:

```
results/bench-results-<timestamp>.jsonl
```

Each line is a full JSON record (sketched below) with:

- question metadata
- raw answer
- token/cost/timing metrics
- tool usage and error fields
- judge score/clarity/votes and disagreement stats
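The field names in this sketch are illustrative assumptions grouped by the categories above; the actual schema is defined in src/bench.ts:

```ts
// Hypothetical record shape -- field names are illustrative, not the real schema.
interface BenchRecord {
  // question metadata
  testId: string;
  model: string;
  run: number;
  // raw answer
  answer: string;
  // token/cost/timing metrics
  inputTokens: number;
  outputTokens: number;
  outputTokensPerSec: number;
  durationMs: number;
  costUsd: number;
  // tool usage and error fields
  toolCalls: number;
  error?: string;
  // judge score/clarity/votes and disagreement stats
  judgeScore: number;
  clarity: number;
  judgeVotes: number[];
  judgeDisagreement: number;
}
```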

The terminal summary prints:

- average duration
- average tool calls
- average judge score
- average clarity
- average input/output tokens
- average output tokens/sec
- average cost
- judge disagreement
- failed run count

With `--model`, the summary is grouped by testId instead of by model. A rough sketch of this kind of aggregation over the JSONL output is shown below.
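For example, an average judge score per group could be computed from a results file like this; the `judgeScore` and `model` field names follow the hypothetical BenchRecord sketch above and are assumptions, not the actual summary code:

```ts
// Hypothetical sketch: average judge score per model from a results file.
const path = "results/bench-results-<timestamp>.jsonl"; // substitute a real file name
const lines = (await Bun.file(path).text()).trim().split("\n");

const byModel = new Map<string, { total: number; count: number }>();
for (const line of lines) {
  const record = JSON.parse(line); // one full JSON record per line
  const entry = byModel.get(record.model) ?? { total: 0, count: 0 };
  entry.total += record.judgeScore;
  entry.count += 1;
  byModel.set(record.model, entry);
}

for (const [model, { total, count }] of byModel) {
  console.log(`${model}: avg judge score ${(total / count).toFixed(2)} over ${count} runs`);
}
```

With `--model`, the same aggregation would simply key on `record.testId` instead of `record.model`.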
