EVMBench is a benchmark and execution harness for evaluating AI agents on smart
contract security tasks. This repository now documents only EVMBench. The
in-repo ploit and veto projects are support tools used by EVMBench for
exploit setup, transaction extraction, and RPC filtering; they are not separate
benchmark suites.
EVMBench supports three task modes:
- `detect`: write an audit report to `submission/audit.md`.
- `patch`: produce a patch diff for known vulnerable code.
- `exploit`: produce executable attack transactions in `submission/txs.json`.

The current Forest-of-Thought test-time scaling work targets detect mode.
- Repository Layout
- Execution Architecture
- Forest-of-Thought Test-Time Scaling
- Where Forest-of-Thought Is Implemented
- Getting Started
- Build Docker Images
- Configure API Keys
- Run EVMBench
- Run the Baseline vs Forest Comparison
- Outputs
- Troubleshooting
## Repository Layout

| Path | Purpose |
|---|---|
| `audits/` | EVMBench audit tasks, ground-truth findings, hints, patches, exploit harnesses, and per-audit Dockerfiles. |
| `splits/` | Task lists such as debug, detect-tasks, patch-tasks, and exploit-tasks. |
| `evmbench/nano/` | Nanoeval task, solver, runtime, entrypoint, and grading integration. |
| `evmbench/agents/` | Agent registry plus container, Codex, Claude, Gemini, OpenCode, and mini-swe-agent adapters. |
| `evmbench/agents/mini-swe-agent/` | Local mini-swe-agent runner, Modal baseline runner, Forest-of-Thought runner, and Phase 6 comparison harness. |
| `docker_build.py` | Builds the EVMBench base image and per-audit Docker images. |
| `ploit/` | Helper binary used by exploit tasks. Built into `ploit-builder:latest`. |
| `veto/` | Optional RPC filtering helper used by selected exploit tasks. |
| `runs/` | Default output directory for benchmark logs, submissions, metadata, and Phase 6 summaries. |

## Execution Architecture

```mermaid
flowchart TD
    User[User or CI] --> Entry[evmbench.nano.entrypoint]
    Entry --> Task[EVMTask setup]
    Task --> Solver[EVMbenchSolver]
    Solver --> AgentRegistry[Agent registry]
    AgentRegistry --> ContainerAgent[Container runner]
    AgentRegistry --> ModalAgent[Modal runner]
    ContainerAgent --> LocalComputer[Alcatraz Docker task computer]
    LocalComputer --> StartSh[start.sh plus AGENTS.md]
    StartSh --> SubmissionA[submission/audit.md or agent artifacts]
    ModalAgent --> ModalRunner[evmbench.agents.modal_runner]
    ModalRunner --> MiniEntrypoint[mini-swe-agent entrypoint.py]
    MiniEntrypoint --> ModalBaseline[Modal single-agent baseline]
    MiniEntrypoint --> ModalForest[Modal Forest-of-Thought]
    ModalBaseline --> ModalSubmission[modal/submission/audit.md]
    ModalForest --> ModalSubmission
    ModalSubmission --> SubmissionA
    SubmissionA --> Grader[EVMBench grader]
    Grader --> RunLogs[runs/]
```

EVMBench owns task construction, Docker image selection, rendered instructions, submission contracts, and grading. Agents own the search strategy used to produce the submission.

## Forest-of-Thought Test-Time Scaling

Forest-of-Thought is the test-time scaling strategy implemented here for smart contract audits. Instead of asking one agent trajectory to find every bug, the forest spends more inference budget at test time across independent specialist searches and then uses judge passes to merge only high-confidence findings.
The scaling knobs are:
- Breadth: number of specialist trees, controlled by `MAX_TREE_ROLES` or `--max-tree-roles`.
- Diversity: number of independent branch workers per tree, controlled by `BRANCHES_PER_TREE` or `--branches-per-tree`.
- Verification: tree-local judges synthesize branch reports before global review.
- Aggregation: one global judge writes the only EVMBench-compatible final report.
- Parallelism: Modal runs independent workers concurrently, controlled by `FOREST_WORKER_CONCURRENCY`.

The budget shape is:

```text
total_test_time_budget =
    scout_budget
    + selected_roles * branches_per_tree * branch_budget
    + selected_roles * tree_judge_budget
    + global_judge_budget
```

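
As a worked example, the sketch below evaluates the budget shape in agent steps, using the step limits and forest sizes from the `mini-swe-agent-modal-forest` configuration in this README. Treating each step limit as that worker's budget is an assumption made for illustration: it yields an upper bound on steps, not a token or dollar cost. `ForestBudget` and `total_steps` are hypothetical names, not part of EVMBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForestBudget:
    """Step limits and forest shape (hypothetical helper, not an EVMBench API)."""

    scout_steps: int = 16       # SCOUT_STEP_LIMIT
    branch_steps: int = 36      # BRANCH_STEP_LIMIT
    judge_steps: int = 24       # JUDGE_STEP_LIMIT
    global_steps: int = 36      # GLOBAL_STEP_LIMIT
    branches_per_tree: int = 2  # BRANCHES_PER_TREE
    selected_roles: int = 4     # at most MAX_TREE_ROLES

    def total_steps(self) -> int:
        """Upper bound on agent steps, term by term as in the budget shape."""
        branch_total = self.selected_roles * self.branches_per_tree * self.branch_steps
        judge_total = self.selected_roles * self.judge_steps
        return self.scout_steps + branch_total + judge_total + self.global_steps


default = ForestBudget()
print(default.total_steps())  # 16 + 4*2*36 + 4*24 + 36 = 436

# A reduced forest: 2 roles, 1 branch each, 10-step branch and global limits.
reduced = ForestBudget(branch_steps=10, global_steps=10, branches_per_tree=1, selected_roles=2)
print(reduced.total_steps())  # 16 + 2*1*10 + 2*24 + 10 = 94
```

The branch term dominates at the default settings (288 of 436 steps), so `BRANCHES_PER_TREE`, `MAX_TREE_ROLES`, and `BRANCH_STEP_LIMIT` are the knobs that move the total most.
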
This is useful for audit tasks because different vulnerability classes require different reading strategies. A token-flow specialist may trace balance movement, an accounting specialist may stress share math, and an access-control specialist may inspect privileged paths. Branch workers are isolated so they can form independent hypotheses. Judges are then used to reduce duplicated or weak claims before grading.
The important benchmark discipline is matched comparison: the forest should be compared against single-agent baselines with clear runner labels, budgets, logs, and grading artifacts. Phase 6 is the harness for that comparison.

```mermaid
flowchart TD
    Start[EVMBench detect task] --> Scout[Scout worker]
    Scout --> Select[Select specialist roles]
    Select --> Token[Token-flow tree]
    Select --> Accounting[Accounting tree]
    Select --> Access[Access-control tree]
    Select --> Cross[Cross-contract tree]
    Select --> Exploit[Exploitability tree]
    Token --> TokenBranches[Branch workers]
    Accounting --> AccountingBranches[Branch workers]
    Access --> AccessBranches[Branch workers]
    Cross --> CrossBranches[Branch workers]
    Exploit --> ExploitBranches[Branch workers]
    TokenBranches --> TokenJudge[Tree-local judge]
    AccountingBranches --> AccountingJudge[Tree-local judge]
    AccessBranches --> AccessJudge[Tree-local judge]
    CrossBranches --> CrossJudge[Tree-local judge]
    ExploitBranches --> ExploitJudge[Tree-local judge]
    TokenJudge --> Global[Global judge]
    AccountingJudge --> Global
    AccessJudge --> Global
    CrossJudge --> Global
    ExploitJudge --> Global
    Global --> Final[submission/audit.md]
    Final --> Grade[EVMBench detect grading]
```

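
The staged flow in the diagram can be sketched as a small local simulation. This is a toy model under stated assumptions, not the repository's implementation: every worker is a stub that returns a string, the scout simply keeps the first N roles instead of reading the audit, and a thread pool stands in for Modal concurrency. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

ROLES = ["token-flow", "accounting", "access-control", "cross-contract", "exploitability"]


def scout(max_tree_roles: int) -> list[str]:
    """Stand-in for the scout worker: keep the first N roles."""
    return ROLES[:max_tree_roles]


def branch_worker(role: str, branch_index: int) -> str:
    """Stand-in for one isolated branch worker; returns a fake report."""
    return f"{role}/branch-{branch_index}: candidate finding"


def tree_judge(role: str, reports: list[str]) -> str:
    """Stand-in for a tree-local judge that merges one tree's branch reports."""
    return f"{role}: {len(reports)} branch reports reviewed"


def global_judge(tree_reports: list[str]) -> str:
    """Stand-in for the global judge, the only stage that writes the final report."""
    return "# Audit report\n" + "\n".join(f"- {r}" for r in tree_reports)


def run_forest(max_tree_roles: int = 4, branches_per_tree: int = 2, concurrency: int = 4) -> str:
    roles = scout(max_tree_roles)
    specs = [(role, i) for role in roles for i in range(branches_per_tree)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Branch workers are independent, so they can run concurrently.
        branch_reports = list(pool.map(lambda spec: branch_worker(*spec), specs))
        by_role = {role: [r for r in branch_reports if r.startswith(f"{role}/")] for role in roles}
        # Tree judges start only after every branch report exists.
        tree_reports = list(pool.map(lambda role: tree_judge(role, by_role[role]), roles))
    return global_judge(tree_reports)


print(run_forest())
```

The two `pool.map` calls mirror the two parallel stages in the diagram, and the global judge runs once, after both have finished.
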

## Where Forest-of-Thought Is Implemented

The forest is wired through normal EVMBench `agent_id` selection. These are the
main implementation points.

evmbench/agents/mini-swe-agent/config.yaml registers the full Modal baseline
and forest variants:

```yaml
mini-swe-agent-modal-baseline:
  <<: *common_settings
  runner: modal_baseline
  env_vars:
    <<: *env-config-vars
    MODAL_OPENAI_SECRET_NAME: openai-api-key

mini-swe-agent-modal-forest:
  <<: *common_settings
  runner: modal_forest
  env_vars:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    MODEL: openai/gpt-5
    SCOUT_STEP_LIMIT: "16"
    BRANCH_STEP_LIMIT: "36"
    JUDGE_STEP_LIMIT: "24"
    GLOBAL_STEP_LIMIT: "36"
    BRANCHES_PER_TREE: "2"
    MAX_TREE_ROLES: "4"
    FOREST_WORKER_CONCURRENCY: "4"
```

`evmbench/nano/solver.py` chooses between the normal container path and the
Modal path. Modal output is copied back into the EVMBench task computer so the
standard grader remains the source of truth:

```python
agent = agent_registry.get_agent(self.runtime_config.agent_id)
if agent.runner == "container":
    await self._prepare_container_agent(computer, task, agent)
    agent_output = await self._run_agent(computer, task)
else:
    agent_output = await self._run_modal_agent(computer, task, agent)

# Both paths are graded the same way.
grade: EVMbenchGrade = await task.grade(computer, self.runtime_config)
grade.evmbench_result.agent_output = agent_output
```

`evmbench/agents/modal_runner.py` maps agent runner types onto the
mini-swe-agent entrypoint and forwards the forest budget controls:

```python
MODAL_RUNNER_COMMANDS = {
    "modal_baseline": "baseline",
    "modal_forest": "forest",
}

# ...

elif agent.runner == "modal_forest":
    _append_env_flag(command, env, "--scout-step-limit", "SCOUT_STEP_LIMIT")
    _append_env_flag(command, env, "--branch-step-limit", "BRANCH_STEP_LIMIT")
    _append_env_flag(command, env, "--judge-step-limit", "JUDGE_STEP_LIMIT")
    _append_env_flag(command, env, "--global-step-limit", "GLOBAL_STEP_LIMIT")
    _append_env_flag(command, env, "--branches-per-tree", "BRANCHES_PER_TREE")
    _append_env_flag(command, env, "--max-tree-roles", "MAX_TREE_ROLES")
    _append_env_flag(command, env, "--worker-concurrency", "FOREST_WORKER_CONCURRENCY")
```

`evmbench/agents/mini-swe-agent/modal_forest.py` runs the staged workflow:

```python
# Stage 1: a single scout worker.
scout_result = _run_worker(config, audit, instructions, _scout_spec(config), openai_api_key=openai_api_key)
worker_results.append(scout_result)
scout_decision, selected_roles = _select_roles(config)

# Stage 2: branch workers for every selected role, in parallel.
branch_results = _run_specs_parallel(
    config,
    audit,
    instructions,
    _worker_specs_for_branches(config, selected_roles),
    openai_api_key=openai_api_key,
)

# Stage 3: one tree-local judge per selected role, in parallel.
tree_judge_results = _run_specs_parallel(
    config,
    audit,
    instructions,
    _worker_specs_for_tree_judges(config, selected_roles),
    openai_api_key=openai_api_key,
)

# Stage 4: a single global judge.
global_result = _run_worker(
    config,
    audit,
    instructions,
    _global_judge_spec(config, selected_roles),
    openai_api_key=openai_api_key,
)
```

`evmbench/agents/mini-swe-agent/scout.py` defines the initial forest role set:

```python
TREE_ROLES = {
    "token-flow": TreeRole(...),
    "accounting": TreeRole(...),
    "access-control": TreeRole(...),
    "cross-contract": TreeRole(...),
    "exploitability": TreeRole(...),
}
```

`evmbench/agents/mini-swe-agent/judge.py` keeps branch outputs separate and
allows only the global judge to write the final graded report:

```python
FOREST_DIR = "/home/agent/forest"
FINAL_SUBMISSION_PATH = "/home/agent/submission/audit.md"


def branch_report_remote_path(role: TreeRole, branch_index: int) -> str:
    return f"{FOREST_DIR}/{role.name}/{branch_id(branch_index)}/branch.md"


def tree_judge_remote_path(role: TreeRole) -> str:
    return f"{FOREST_DIR}/{role.name}/judge.md"
```

## Getting Started

Run all commands from the repository root.

- Python 3.11 or newer.
- `uv`.
- Docker with build access.
- Modal CLI/account for Modal baseline and forest runs.
- An OpenAI API key for Codex and `mini-swe-agent` runs.
- Optional provider keys for non-OpenAI agents.

Install Python dependencies:

```bash
uv sync
```

Check that the main entrypoints import:

```bash
uv run python -m evmbench.nano.entrypoint --help
uv run python evmbench/agents/mini-swe-agent/entrypoint.py baseline --help
uv run python evmbench/agents/mini-swe-agent/entrypoint.py forest --help
```

## Build Docker Images

EVMBench uses one base image plus one image per audit. The base image depends on
the local `ploit-builder:latest` image, so build that first.

```bash
docker build -f ploit/Dockerfile -t ploit-builder:latest --target ploit-builder .
```

For the fastest smoke test, build only the debug split:

```bash
uv run docker_build.py --split debug
```

For the Forest-of-Thought comparison, build detect images:

```bash
uv run docker_build.py --split detect-tasks
```

For every EVMBench task image:

```bash
uv run docker_build.py --split all
```

Build one audit when debugging:

```bash
uv run docker_build.py --no-build-base --audit 2024-01-canto
```

If Docker build networking is flaky, use host networking or a different Ubuntu mirror:

```bash
uv run docker_build.py --split detect-tasks --build-network host
uv run docker_build.py --split detect-tasks --ubuntu-mirror http://mirrors.edge.kernel.org/ubuntu
```

Modal workers cannot pull local Docker tags. Push audit images to a registry and point the Phase 6 wrappers at that repository.

Example using GHCR:

```bash
export MODAL_AUDIT_IMAGE_REPO=ghcr.io/YOUR_GHCR_OWNER/evmbench-audit
uv run docker_build.py --split detect-tasks --tag-prefix "$MODAL_AUDIT_IMAGE_REPO" --build-network host
```

Push the tags you plan to run. For the default `first5` comparison, the audit
IDs are:

```text
2023-07-pooltogether
2023-10-nextgen
2023-12-ethereumcreditguild
2024-01-canto
2024-01-curves
```

Push one tag like this:

```bash
docker push "$MODAL_AUDIT_IMAGE_REPO:2023-07-pooltogether"
```

Push the remaining selected tags the same way. If you do not set
`MODAL_AUDIT_IMAGE_REPO`, `run_phase6_variants.sh` defaults to
`ghcr.io/pranay5255/evmbench-audit`.

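
To avoid typing five push commands by hand, a small script can generate them. The sketch below only prints the command lines and does not run Docker; `push_commands` is a hypothetical helper, and the repository name is the same placeholder used above.

```python
import shlex

# Audit IDs for the default first5 comparison, as listed above.
FIRST5_AUDITS = [
    "2023-07-pooltogether",
    "2023-10-nextgen",
    "2023-12-ethereumcreditguild",
    "2024-01-canto",
    "2024-01-curves",
]


def push_commands(image_repo: str, audits: list[str]) -> list[str]:
    """Return one shell-quoted `docker push` command line per audit tag."""
    return [shlex.join(["docker", "push", f"{image_repo}:{audit}"]) for audit in audits]


for line in push_commands("ghcr.io/YOUR_GHCR_OWNER/evmbench-audit", FIRST5_AUDITS):
    print(line)
```

Review the printed lines, then paste them into a shell or pipe the output to `sh` once the repository name is correct.
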

## Configure API Keys

Create a local `.env` file:

```bash
OPENAI_API_KEY=sk-...

# Optional, only needed for these agent families.
ANTHROPIC_API_KEY=...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=...

# Required for Modal runs unless you are using the wrapper default.
MODAL_AUDIT_IMAGE_REPO=ghcr.io/YOUR_GHCR_OWNER/evmbench-audit
```

Load it in your shell:

```bash
set -a
. ./.env
set +a
```

Set up Modal and create the secret expected by
`mini-swe-agent-modal-baseline`:

```bash
modal setup
modal secret create openai-api-key OPENAI_API_KEY="$OPENAI_API_KEY"
```

The Phase 6 shell wrappers load `.env` automatically when it exists.

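
Before starting a paid run, it can help to confirm that the variables above are set in the current shell. The following preflight check is a minimal sketch: `missing_vars` is a hypothetical helper, and which variables count as required is an assumption drawn from this section (`MODAL_AUDIT_IMAGE_REPO` is optional when you rely on the wrapper default).

```python
import os

# Variables named in this README. Which ones a given run needs is an assumption:
# OPENAI_API_KEY for Codex and mini-swe-agent runs, MODAL_AUDIT_IMAGE_REPO for Modal runs.
REQUIRED = ["OPENAI_API_KEY"]
REQUIRED_FOR_MODAL = ["MODAL_AUDIT_IMAGE_REPO"]


def missing_vars(env: dict[str, str], modal: bool = False) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    names = REQUIRED + (REQUIRED_FOR_MODAL if modal else [])
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars(dict(os.environ), modal=True)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Environment looks complete for a Modal run.")
```
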

## Run EVMBench

The following run uses the `human` agent to verify Docker setup and grading without spending model budget:

```bash
uv run python -m evmbench.nano.entrypoint \
  evmbench.audit_split=debug \
  evmbench.mode=detect \
  evmbench.apply_gold_solution=False \
  evmbench.log_to_run_dir=True \
  evmbench.solver=evmbench.nano.solver.EVMbenchSolver \
  evmbench.solver.agent_id=human \
  runner.concurrency=1
```

Example Codex detect run on one audit:

```bash
uv run python -m evmbench.nano.entrypoint \
  evmbench.audit=2024-01-canto \
  evmbench.mode=detect \
  evmbench.hint_level=none \
  evmbench.log_to_run_dir=True \
  evmbench.solver=evmbench.nano.solver.EVMbenchSolver \
  evmbench.solver.agent_id=codex-default \
  runner.concurrency=1
```

| Setting | Values |
|---|---|
| `evmbench.mode` | `detect`, `patch`, `exploit` |
| `evmbench.audit` | One audit ID, such as `2024-01-canto` |
| `evmbench.audit_split` | `debug`, `detect-tasks`, `patch-tasks`, `exploit-tasks`, `all` |
| `evmbench.hint_level` | `none`, `low`, `med`, `high`, `max` |
| `evmbench.apply_gold_solution` | `True` for harness validation, `False` for real agent runs |
| `evmbench.solver.agent_id` | Any ID registered in `evmbench/agents/**/config.yaml` |
| `runner.concurrency` | Number of EVMBench tasks to run concurrently |

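
When scripting several runs, the `key=value` settings can be kept in a dictionary and turned into a command line. The sketch below rebuilds the Codex detect example from this section; `entrypoint_command` is a hypothetical helper, not part of EVMBench, and it only formats arguments without validating them.

```python
import shlex


def entrypoint_command(settings: dict[str, object]) -> list[str]:
    """Build the argv for an EVMBench run from key=value settings."""
    argv = ["uv", "run", "python", "-m", "evmbench.nano.entrypoint"]
    argv += [f"{key}={value}" for key, value in settings.items()]
    return argv


codex_detect = {
    "evmbench.audit": "2024-01-canto",
    "evmbench.mode": "detect",
    "evmbench.hint_level": "none",
    "evmbench.log_to_run_dir": True,
    "evmbench.solver": "evmbench.nano.solver.EVMbenchSolver",
    "evmbench.solver.agent_id": "codex-default",
    "runner.concurrency": 1,
}

# Print a copy-pasteable command line instead of launching the run.
print(shlex.join(entrypoint_command(codex_detect)))
```

To launch the run from Python instead of printing it, pass the list to `subprocess.run(argv, check=True)` from the repository root.
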

## Run the Baseline vs Forest Comparison

The Phase 6 comparison harness is:

```text
evmbench/agents/mini-swe-agent/evaluate_phase6.py
```

The recommended wrapper is:

```text
evmbench/agents/mini-swe-agent/run_phase6_variants.sh
```

It loads `.env`, sets a default `MODAL_AUDIT_IMAGE_REPO`, and forwards to the
Python harness.

| Group | Runners |
|---|---|
| `presentation` | `codex-default`, `modal-baseline`, `modal-forest` |
| `smoke` | `codex-default`, `mini-smoke-10`, `modal-baseline-smoke-10`, `modal-forest-smoke` |
| `modal` | `modal-baseline`, `modal-forest` |
| `local` | local mini-swe-agent variants |
| `all` | every registered Phase 6 variant |

List every runner:

```bash
evmbench/agents/mini-swe-agent/run_phase6_variants.sh variants
```

Preview the smoke matrix:

```bash
evmbench/agents/mini-swe-agent/run_phase6_variants.sh plan --scope smoke --runners smoke
```

Run the low-budget smoke comparison:

```bash
evmbench/agents/mini-swe-agent/run_phase6_variants.sh run --scope smoke --runners smoke --stop-on-failure
```

Preview the default five-audit baseline comparison:

```bash
evmbench/agents/mini-swe-agent/run_phase6_variants.sh plan --scope first5 --runners presentation
```

Run the default comparison:

```bash
evmbench/agents/mini-swe-agent/run_phase6_variants.sh run --scope first5 --runners presentation --stop-on-failure
```

Run only the Modal forest:

```bash
evmbench/agents/mini-swe-agent/run_phase6_variants.sh run --scope first5 --runners modal-forest --stop-on-failure
```

Summarize an existing run:

```bash
evmbench/agents/mini-swe-agent/run_phase6_variants.sh summarize --output-root runs/phase6/RUN_ID
```

```mermaid
sequenceDiagram
    participant User
    participant Wrapper as run_phase6_variants.sh
    participant Phase6 as evaluate_phase6.py
    participant Bench as evmbench.nano.entrypoint
    participant Solver as EVMbenchSolver
    participant Modal as Modal runners
    participant Grade as EVMBench grader
    User->>Wrapper: run --scope first5 --runners presentation
    Wrapper->>Phase6: load .env and build matrix
    Phase6->>Bench: codex-default audit run
    Phase6->>Bench: modal-baseline audit run
    Phase6->>Bench: modal-forest audit run
    Bench->>Solver: setup task computer
    Solver->>Modal: dispatch baseline or forest when agent runner is Modal
    Modal-->>Solver: modal/submission/audit.md
    Solver->>Grade: copy submission and grade normally
    Grade-->>Phase6: run.log grade event
    Phase6-->>User: phase6-results.json, summary, slide data
```

## Outputs

Normal EVMBench runs write under `runs/`. Phase 6 writes a timestamped output
root under `runs/phase6/` unless `--output-root` is provided.

```mermaid
flowchart LR
    Root[runs/phase6/timestamp] --> Matrix[phase6-run-matrix.json]
    Root --> CommandLogs[_phase6_command_logs]
    Root --> RunnerDirs[runner directories]
    Root --> Results[phase6-results.json]
    Root --> Summary[phase6-summary.md]
    Root --> SlideJson[phase6-slide-data.json]
    Root --> SlideCsv[phase6-slide-data.csv]
    RunnerDirs --> RunDir[group/audit_run]
    RunDir --> RunLog[run.log]
    RunDir --> Submission[submission/audit.md]
    RunDir --> ModalDir[modal]
    ModalDir --> ModalLogs[logs]
    ModalLogs --> BaselineMeta[modal-baseline-result.json]
    ModalLogs --> ForestMeta[modal-forest-result.json]
    ModalLogs --> Trajectories[*.traj.json]
```

Important files:

| File | Purpose |
|---|---|
| `submission/audit.md` | Grading source of truth for detect mode. |
| `run.log` | EVMBench task events and grade payloads. |
| `modal/logs/modal-runner-command.json` | Exact Modal command emitted by the adapter. |
| `modal/logs/modal-baseline-result.json` | Baseline metadata, runtime, and optional grade. |
| `modal/logs/modal-forest-result.json` | Forest roles, worker metadata, errors, and runtimes. |
| `phase6-results.json` | Canonical structured comparison results. |
| `phase6-summary.md` | Human-readable aggregate and per-audit summary. |
| `phase6-slide-data.json` | Chart-ready data for presentations. |
| `phase6-slide-data.csv` | Spreadsheet-friendly per-audit rows. |

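
A quick way to tell whether a Phase 6 run finished writing its summaries is to check for the top-level files named in the table. The sketch below assumes only those file names; it does not parse the files, since their schemas are not described here. `missing_phase6_files` is a hypothetical helper and `RUN_ID` is a placeholder.

```python
from pathlib import Path

# Top-level files Phase 6 writes under its output root, per the table above.
PHASE6_FILES = [
    "phase6-results.json",
    "phase6-summary.md",
    "phase6-slide-data.json",
    "phase6-slide-data.csv",
]


def missing_phase6_files(output_root: Path) -> list[str]:
    """Return the expected top-level Phase 6 files that are absent."""
    return [name for name in PHASE6_FILES if not (output_root / name).is_file()]


if __name__ == "__main__":
    root = Path("runs/phase6/RUN_ID")  # placeholder; substitute a real run directory
    for name in missing_phase6_files(root):
        print(f"missing: {root / name}")
```
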

## Troubleshooting

The Phase 6 wrappers load `.env` automatically, but direct Python commands do not. Load `.env` in the shell before running them:

```bash
set -a
. ./.env
set +a
```

For Modal, also create the Modal secret:

```bash
modal secret create openai-api-key OPENAI_API_KEY="$OPENAI_API_KEY"
```

Modal workers cannot pull local Docker tags. Use a registry tag instead:

```bash
export MODAL_AUDIT_IMAGE_REPO=ghcr.io/YOUR_GHCR_OWNER/evmbench-audit
uv run docker_build.py --audit 2024-01-canto --tag-prefix "$MODAL_AUDIT_IMAGE_REPO"
docker push "$MODAL_AUDIT_IMAGE_REPO:2024-01-canto"
```

If Docker build networking is flaky, use host networking:

```bash
uv run docker_build.py --split detect-tasks --build-network host
```

or a different Ubuntu mirror:

```bash
uv run docker_build.py --split detect-tasks --ubuntu-mirror http://mirrors.edge.kernel.org/ubuntu
```

To shrink the forest budget, lower the forest knobs in `.env` or in `config.yaml`:

```bash
BRANCHES_PER_TREE=1
MAX_TREE_ROLES=2
FOREST_WORKER_CONCURRENCY=2
BRANCH_STEP_LIMIT=10
GLOBAL_STEP_LIMIT=10
```

Then run:

```bash
evmbench/agents/mini-swe-agent/run_phase6_variants.sh run --scope smoke --runners modal-forest-smoke --stop-on-failure
```

Focused tests for the mini-swe-agent Modal and Phase 6 path:

```bash
uv run pytest tests/test_mini_swe_agent_phase5.py tests/test_mini_swe_agent_forest.py tests/test_mini_swe_agent_phase6.py
```

Licensed under the Apache License, Version 2.0.