pranay5255/forestOfAudits

EVMBench

EVMBench is a benchmark and execution harness for evaluating AI agents on smart contract security tasks. This repository now documents only EVMBench. The in-repo ploit and veto projects are support tools used by EVMBench for exploit setup, transaction extraction, and RPC filtering; they are not separate benchmark suites.

EVMBench supports three task modes:

  • detect: write an audit report to submission/audit.md.
  • patch: produce a patch diff for known vulnerable code.
  • exploit: produce executable attack transactions in submission/txs.json.

The current Forest-of-Thought test-time scaling work targets detect mode.

Repository Layout

| Path | Purpose |
| --- | --- |
| audits/ | EVMBench audit tasks, ground-truth findings, hints, patches, exploit harnesses, and per-audit Dockerfiles. |
| splits/ | Task lists such as debug, detect-tasks, patch-tasks, and exploit-tasks. |
| evmbench/nano/ | Nanoeval task, solver, runtime, entrypoint, and grading integration. |
| evmbench/agents/ | Agent registry plus container, Codex, Claude, Gemini, OpenCode, and mini-swe-agent adapters. |
| evmbench/agents/mini-swe-agent/ | Local mini-swe-agent runner, Modal baseline runner, Forest-of-Thought runner, and Phase 6 comparison harness. |
| docker_build.py | Builds the EVMBench base image and per-audit Docker images. |
| ploit/ | Helper binary used by exploit tasks. Built into ploit-builder:latest. |
| veto/ | Optional RPC filtering helper used by selected exploit tasks. |
| runs/ | Default output directory for benchmark logs, submissions, metadata, and Phase 6 summaries. |

Execution Architecture

flowchart TD
    User[User or CI] --> Entry[evmbench.nano.entrypoint]
    Entry --> Task[EVMTask setup]
    Task --> Solver[EVMbenchSolver]

    Solver --> AgentRegistry[Agent registry]
    AgentRegistry --> ContainerAgent[Container runner]
    AgentRegistry --> ModalAgent[Modal runner]

    ContainerAgent --> LocalComputer[Alcatraz Docker task computer]
    LocalComputer --> StartSh[start.sh plus AGENTS.md]
    StartSh --> SubmissionA[submission/audit.md or agent artifacts]

    ModalAgent --> ModalRunner[evmbench.agents.modal_runner]
    ModalRunner --> MiniEntrypoint[mini-swe-agent entrypoint.py]
    MiniEntrypoint --> ModalBaseline[Modal single-agent baseline]
    MiniEntrypoint --> ModalForest[Modal Forest-of-Thought]
    ModalBaseline --> ModalSubmission[modal/submission/audit.md]
    ModalForest --> ModalSubmission
    ModalSubmission --> SubmissionA

    SubmissionA --> Grader[EVMBench grader]
    Grader --> RunLogs[runs/]

EVMBench owns task construction, Docker image selection, rendered instructions, submission contracts, and grading. Agents own the search strategy used to produce the submission.

Forest-of-Thought Test-Time Scaling

Forest-of-Thought is the test-time scaling strategy implemented here for smart contract audits. Instead of asking one agent trajectory to find every bug, the forest spends more inference budget at test time across independent specialist searches and then uses judge passes to merge only high-confidence findings.

The scaling knobs are:

  • Breadth: number of specialist trees, controlled by MAX_TREE_ROLES or --max-tree-roles.
  • Diversity: number of independent branch workers per tree, controlled by BRANCHES_PER_TREE or --branches-per-tree.
  • Verification: tree-local judges synthesize branch reports before global review.
  • Aggregation: one global judge writes the only EVMBench-compatible final report.
  • Parallelism: Modal runs independent workers concurrently, controlled by FOREST_WORKER_CONCURRENCY.

The budget shape is:

total_test_time_budget =
  scout_budget
  + selected_roles * branches_per_tree * branch_budget
  + selected_roles * tree_judge_budget
  + global_judge_budget

This is useful for audit tasks because different vulnerability classes require different reading strategies. A token-flow specialist may trace balance movement, an accounting specialist may stress share math, and an access-control specialist may inspect privileged paths. Branch workers are isolated so they can form independent hypotheses. Judges are then used to reduce duplicated or weak claims before grading.

The important benchmark discipline is matched comparison: the forest should be compared against single-agent baselines with clear runner labels, budgets, logs, and grading artifacts. Phase 6 is the harness for that comparison.
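To make the budget shape concrete, the sketch below evaluates it with the step limits listed in the mini-swe-agent-modal-forest configuration. It treats the step limits as the budget unit, which is an assumption: they are upper bounds on agent steps, not token or dollar costs. The `ForestBudget` class and its field names are illustrative and are not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForestBudget:
    """Step limits and fan-out for one forest run (hypothetical helper)."""

    scout_steps: int
    branch_steps: int
    tree_judge_steps: int
    global_judge_steps: int
    selected_roles: int
    branches_per_tree: int

    def total_steps(self) -> int:
        # Mirrors the budget shape: scout + branches + tree judges + global judge.
        return (
            self.scout_steps
            + self.selected_roles * self.branches_per_tree * self.branch_steps
            + self.selected_roles * self.tree_judge_steps
            + self.global_judge_steps
        )


# Step limits from the forest config, assuming the scout selects the
# maximum of four roles (MAX_TREE_ROLES) with two branches per tree.
default_budget = ForestBudget(
    scout_steps=16,
    branch_steps=36,
    tree_judge_steps=24,
    global_judge_steps=36,
    selected_roles=4,
    branches_per_tree=2,
)
print(default_budget.total_steps())  # 436
```

Under these assumptions the branch term (4 * 2 * 36 = 288 steps) dominates, so BRANCHES_PER_TREE, MAX_TREE_ROLES, and BRANCH_STEP_LIMIT are the first knobs to lower when cost matters.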

Forest Pipeline

flowchart TD
    Start[EVMBench detect task] --> Scout[Scout worker]
    Scout --> Select[Select specialist roles]

    Select --> Token[Token-flow tree]
    Select --> Accounting[Accounting tree]
    Select --> Access[Access-control tree]
    Select --> Cross[Cross-contract tree]
    Select --> Exploit[Exploitability tree]

    Token --> TokenBranches[Branch workers]
    Accounting --> AccountingBranches[Branch workers]
    Access --> AccessBranches[Branch workers]
    Cross --> CrossBranches[Branch workers]
    Exploit --> ExploitBranches[Branch workers]

    TokenBranches --> TokenJudge[Tree-local judge]
    AccountingBranches --> AccountingJudge[Tree-local judge]
    AccessBranches --> AccessJudge[Tree-local judge]
    CrossBranches --> CrossJudge[Tree-local judge]
    ExploitBranches --> ExploitJudge[Tree-local judge]

    TokenJudge --> Global[Global judge]
    AccountingJudge --> Global
    AccessJudge --> Global
    CrossJudge --> Global
    ExploitJudge --> Global

    Global --> Final[submission/audit.md]
    Final --> Grade[EVMBench detect grading]

Where Forest-of-Thought Is Implemented

The forest is wired through normal EVMBench agent_id selection. These are the main implementation points.

Agent Variants

evmbench/agents/mini-swe-agent/config.yaml registers the full Modal baseline and forest variants:

mini-swe-agent-modal-baseline:
  <<: *common_settings
  runner: modal_baseline
  env_vars:
    <<: *env-config-vars
    MODAL_OPENAI_SECRET_NAME: openai-api-key

mini-swe-agent-modal-forest:
  <<: *common_settings
  runner: modal_forest
  env_vars:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    MODEL: openai/gpt-5
    SCOUT_STEP_LIMIT: "16"
    BRANCH_STEP_LIMIT: "36"
    JUDGE_STEP_LIMIT: "24"
    GLOBAL_STEP_LIMIT: "36"
    BRANCHES_PER_TREE: "2"
    MAX_TREE_ROLES: "4"
    FOREST_WORKER_CONCURRENCY: "4"

Solver Dispatch

evmbench/nano/solver.py chooses between the normal container path and the Modal path. Modal output is copied back into the EVMBench task computer so the standard grader remains the source of truth:

agent = agent_registry.get_agent(self.runtime_config.agent_id)
if agent.runner == "container":
    await self._prepare_container_agent(computer, task, agent)
    agent_output = await self._run_agent(computer, task)
else:
    agent_output = await self._run_modal_agent(computer, task, agent)

grade: EVMbenchGrade = await task.grade(computer, self.runtime_config)
grade.evmbench_result.agent_output = agent_output

Modal Command Builder

evmbench/agents/modal_runner.py maps agent runner types onto the mini-swe-agent entrypoint and forwards the forest budget controls:

MODAL_RUNNER_COMMANDS = {
    "modal_baseline": "baseline",
    "modal_forest": "forest",
}

# Excerpt: earlier runner branches in the command builder are omitted here.
elif agent.runner == "modal_forest":
    _append_env_flag(command, env, "--scout-step-limit", "SCOUT_STEP_LIMIT")
    _append_env_flag(command, env, "--branch-step-limit", "BRANCH_STEP_LIMIT")
    _append_env_flag(command, env, "--judge-step-limit", "JUDGE_STEP_LIMIT")
    _append_env_flag(command, env, "--global-step-limit", "GLOBAL_STEP_LIMIT")
    _append_env_flag(command, env, "--branches-per-tree", "BRANCHES_PER_TREE")
    _append_env_flag(command, env, "--max-tree-roles", "MAX_TREE_ROLES")
    _append_env_flag(command, env, "--worker-concurrency", "FOREST_WORKER_CONCURRENCY")
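The implementation of `_append_env_flag` is not shown in this README. As a minimal sketch of how a helper with that call shape could behave, the example below appends a flag and its value only when the environment variable is set. The function name `append_env_flag_sketch` and the skip-when-unset behaviour are assumptions, not the repository's code.

```python
def append_env_flag_sketch(
    command: list[str], env: dict[str, str], flag: str, env_name: str
) -> None:
    """Append `flag value` to command when env_name has a non-empty value."""
    value = env.get(env_name)
    if value:
        # Assumed behaviour: unset or empty variables are skipped so that
        # the entrypoint can fall back to its own default.
        command.extend([flag, value])


command = ["forest"]
env = {"BRANCHES_PER_TREE": "2", "MAX_TREE_ROLES": "4"}
for flag, name in [
    ("--branches-per-tree", "BRANCHES_PER_TREE"),
    ("--max-tree-roles", "MAX_TREE_ROLES"),
    ("--worker-concurrency", "FOREST_WORKER_CONCURRENCY"),  # unset, so skipped
]:
    append_env_flag_sketch(command, env, flag, name)

print(command)
# ['forest', '--branches-per-tree', '2', '--max-tree-roles', '4']
```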

Forest Orchestration

evmbench/agents/mini-swe-agent/modal_forest.py runs the staged workflow:

scout_result = _run_worker(config, audit, instructions, _scout_spec(config), openai_api_key=openai_api_key)
worker_results.append(scout_result)

scout_decision, selected_roles = _select_roles(config)

branch_results = _run_specs_parallel(
    config,
    audit,
    instructions,
    _worker_specs_for_branches(config, selected_roles),
    openai_api_key=openai_api_key,
)

tree_judge_results = _run_specs_parallel(
    config,
    audit,
    instructions,
    _worker_specs_for_tree_judges(config, selected_roles),
    openai_api_key=openai_api_key,
)

global_result = _run_worker(
    config,
    audit,
    instructions,
    _global_judge_spec(config, selected_roles),
    openai_api_key=openai_api_key,
)
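The staged workflow relies on running many worker specs concurrently while capping how many are in flight. The sketch below shows one way to get that behaviour with a thread pool; it is not the body of `_run_specs_parallel`, and the names `run_specs_parallel_sketch`, `fake_worker`, and the `role/branch-N` spec strings are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_specs_parallel_sketch(
    specs: list[str],
    run_one: Callable[[str], str],
    worker_concurrency: int,
) -> list[str]:
    """Run every spec with at most worker_concurrency in flight."""
    with ThreadPoolExecutor(max_workers=worker_concurrency) as pool:
        # Executor.map yields results in input order, so reports line up
        # with their specs even when workers finish out of order.
        return list(pool.map(run_one, specs))


def fake_worker(spec: str) -> str:
    # Stand-in for a real worker call; returns a placeholder report.
    return f"report:{spec}"


roles = ["token-flow", "accounting"]
branch_specs = [f"{role}/branch-{i}" for role in roles for i in range(2)]

# The call returns only after every branch has finished, which gives the
# stage barrier needed before tree-local judges start.
branch_reports = run_specs_parallel_sketch(branch_specs, fake_worker, worker_concurrency=4)
print(branch_reports)
```

Because each stage call blocks until all of its workers return, the tree judges never read a partially written set of branch reports.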

Specialist Roles

evmbench/agents/mini-swe-agent/scout.py defines the initial forest role set:

TREE_ROLES = {
    "token-flow": TreeRole(...),
    "accounting": TreeRole(...),
    "access-control": TreeRole(...),
    "cross-contract": TreeRole(...),
    "exploitability": TreeRole(...),
}

Final Submission Contract

evmbench/agents/mini-swe-agent/judge.py keeps branch outputs separate and allows only the global judge to write the final graded report:

FOREST_DIR = "/home/agent/forest"
FINAL_SUBMISSION_PATH = "/home/agent/submission/audit.md"

def branch_report_remote_path(role: TreeRole, branch_index: int) -> str:
    return f"{FOREST_DIR}/{role.name}/{branch_id(branch_index)}/branch.md"

def tree_judge_remote_path(role: TreeRole) -> str:
    return f"{FOREST_DIR}/{role.name}/judge.md"

Getting Started

Run all commands from the repository root.

Prerequisites

  • Python 3.11 or newer.
  • uv.
  • Docker with build access.
  • Modal CLI/account for Modal baseline and forest runs.
  • An OpenAI API key for Codex and mini-swe-agent runs.
  • Optional provider keys for non-OpenAI agents.

Install Python dependencies:

uv sync

Check that the main entrypoints import:

uv run python -m evmbench.nano.entrypoint --help
uv run python evmbench/agents/mini-swe-agent/entrypoint.py baseline --help
uv run python evmbench/agents/mini-swe-agent/entrypoint.py forest --help

Build Docker Images

EVMBench uses one base image plus one image per audit. The base image depends on the local ploit-builder:latest image, so build that first.

1. Build ploit-builder

docker build -f ploit/Dockerfile -t ploit-builder:latest --target ploit-builder .

2. Build local audit images

For the fastest smoke test, build only the debug split:

uv run docker_build.py --split debug

For the Forest-of-Thought comparison, build detect images:

uv run docker_build.py --split detect-tasks

For every EVMBench task image:

uv run docker_build.py --split all

Build one audit when debugging:

uv run docker_build.py --no-build-base --audit 2024-01-canto

If Docker build networking is flaky, use host networking or a different Ubuntu mirror:

uv run docker_build.py --split detect-tasks --build-network host
uv run docker_build.py --split detect-tasks --ubuntu-mirror http://mirrors.edge.kernel.org/ubuntu

3. Build registry-pullable images for Modal

Modal workers cannot pull local Docker tags. Push audit images to a registry and point the Phase 6 wrappers at that repository.

Example using GHCR:

export MODAL_AUDIT_IMAGE_REPO=ghcr.io/YOUR_GHCR_OWNER/evmbench-audit
uv run docker_build.py --split detect-tasks --tag-prefix "$MODAL_AUDIT_IMAGE_REPO" --build-network host

Push the tags you plan to run. For the default first5 comparison, the audit IDs are:

2023-07-pooltogether
2023-10-nextgen
2023-12-ethereumcreditguild
2024-01-canto
2024-01-curves

Push one tag like this:

docker push "$MODAL_AUDIT_IMAGE_REPO:2023-07-pooltogether"

Push the remaining selected tags the same way. If you do not set MODAL_AUDIT_IMAGE_REPO, run_phase6_variants.sh defaults to ghcr.io/pranay5255/evmbench-audit.
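Pushing five tags by hand is error-prone, so a small script can generate the commands. The sketch below only prints `docker push` lines for the first5 audit IDs listed above; the helper `push_commands` is hypothetical, and the repository placeholder must be replaced with your own registry path.

```python
import shlex

FIRST5_AUDITS = [
    "2023-07-pooltogether",
    "2023-10-nextgen",
    "2023-12-ethereumcreditguild",
    "2024-01-canto",
    "2024-01-curves",
]


def push_commands(image_repo: str, audits: list[str]) -> list[str]:
    """Return one shell-quoted `docker push` command line per audit tag."""
    return [
        shlex.join(["docker", "push", f"{image_repo}:{audit}"])
        for audit in audits
    ]


# Print the commands for review instead of executing them.
for line in push_commands("ghcr.io/YOUR_GHCR_OWNER/evmbench-audit", FIRST5_AUDITS):
    print(line)
```

Printing rather than executing keeps the script side-effect free; pipe the output to a shell once the tags look right.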

Configure API Keys

Create a local .env file:

OPENAI_API_KEY=sk-...

# Optional, only needed for these agent families.
ANTHROPIC_API_KEY=...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=...

# Required for Modal runs unless you are using the wrapper default.
MODAL_AUDIT_IMAGE_REPO=ghcr.io/YOUR_GHCR_OWNER/evmbench-audit

Load it in your shell:

set -a
. ./.env
set +a

Set up Modal and create the secret expected by mini-swe-agent-modal-baseline:

modal setup
modal secret create openai-api-key OPENAI_API_KEY="$OPENAI_API_KEY"

The Phase 6 shell wrappers load .env automatically when it exists.

Run EVMBench

Harness smoke test with the no-op human agent

This verifies Docker setup and grading without spending model budget:

uv run python -m evmbench.nano.entrypoint \
  evmbench.audit_split=debug \
  evmbench.mode=detect \
  evmbench.apply_gold_solution=False \
  evmbench.log_to_run_dir=True \
  evmbench.solver=evmbench.nano.solver.EVMbenchSolver \
  evmbench.solver.agent_id=human \
  runner.concurrency=1

Single real-agent run

Example Codex detect run on one audit:

uv run python -m evmbench.nano.entrypoint \
  evmbench.audit=2024-01-canto \
  evmbench.mode=detect \
  evmbench.hint_level=none \
  evmbench.log_to_run_dir=True \
  evmbench.solver=evmbench.nano.solver.EVMbenchSolver \
  evmbench.solver.agent_id=codex-default \
  runner.concurrency=1

Useful configuration switches

| Setting | Values |
| --- | --- |
| evmbench.mode | detect, patch, exploit |
| evmbench.audit | One audit ID, such as 2024-01-canto |
| evmbench.audit_split | debug, detect-tasks, patch-tasks, exploit-tasks, all |
| evmbench.hint_level | none, low, med, high, max |
| evmbench.apply_gold_solution | True for harness validation, False for real agent runs |
| evmbench.solver.agent_id | Any ID registered in evmbench/agents/**/config.yaml |
| runner.concurrency | Number of EVMBench tasks to run concurrently |
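When scripting many runs, it can help to build the entrypoint command from a dictionary of these settings rather than editing long shell lines. The sketch below assembles the same `key=value` arguments used in the smoke run earlier in this section; `entrypoint_command` is a hypothetical helper, not part of EVMBench.

```python
def entrypoint_command(overrides: dict[str, object]) -> list[str]:
    """Build the argv for evmbench.nano.entrypoint from key=value overrides."""
    argv = ["uv", "run", "python", "-m", "evmbench.nano.entrypoint"]
    # Python booleans render as True/False, matching the README examples.
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


smoke = entrypoint_command(
    {
        "evmbench.audit_split": "debug",
        "evmbench.mode": "detect",
        "evmbench.apply_gold_solution": False,
        "evmbench.log_to_run_dir": True,
        "evmbench.solver": "evmbench.nano.solver.EVMbenchSolver",
        "evmbench.solver.agent_id": "human",
        "runner.concurrency": 1,
    }
)
print(" ".join(smoke))
```

The resulting list can be passed directly to `subprocess.run(smoke, check=True)` from the repository root.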

Run the Baseline vs Forest Comparison

The Phase 6 comparison harness is:

evmbench/agents/mini-swe-agent/evaluate_phase6.py

The recommended wrapper is:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh

It loads .env, sets a default MODAL_AUDIT_IMAGE_REPO, and forwards to the Python harness.

Runner groups

| Group | Runners |
| --- | --- |
| presentation | codex-default, modal-baseline, modal-forest |
| smoke | codex-default, mini-smoke-10, modal-baseline-smoke-10, modal-forest-smoke |
| modal | modal-baseline, modal-forest |
| local | local mini-swe-agent variants |
| all | every registered Phase 6 variant |

List every runner:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh variants

Preview the smoke matrix:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh plan --scope smoke --runners smoke

Run the low-budget smoke comparison:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh run --scope smoke --runners smoke --stop-on-failure

Preview the default five-audit baseline comparison:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh plan --scope first5 --runners presentation

Run the default comparison:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh run --scope first5 --runners presentation --stop-on-failure

Run only the Modal forest:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh run --scope first5 --runners modal-forest --stop-on-failure

Summarize an existing run:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh summarize --output-root runs/phase6/RUN_ID

Phase 6 Comparison Flow

sequenceDiagram
    participant User
    participant Wrapper as run_phase6_variants.sh
    participant Phase6 as evaluate_phase6.py
    participant Bench as evmbench.nano.entrypoint
    participant Solver as EVMbenchSolver
    participant Modal as Modal runners
    participant Grade as EVMBench grader

    User->>Wrapper: run --scope first5 --runners presentation
    Wrapper->>Phase6: load .env and build matrix
    Phase6->>Bench: codex-default audit run
    Phase6->>Bench: modal-baseline audit run
    Phase6->>Bench: modal-forest audit run
    Bench->>Solver: setup task computer
    Solver->>Modal: dispatch baseline or forest when agent runner is Modal
    Modal-->>Solver: modal/submission/audit.md
    Solver->>Grade: copy submission and grade normally
    Grade-->>Phase6: run.log grade event
    Phase6-->>User: phase6-results.json, summary, slide data

Outputs

Normal EVMBench runs write under runs/. Phase 6 writes a timestamped output root under runs/phase6/ unless --output-root is provided.

flowchart LR
    Root[runs/phase6/timestamp] --> Matrix[phase6-run-matrix.json]
    Root --> CommandLogs[_phase6_command_logs]
    Root --> RunnerDirs[runner directories]
    Root --> Results[phase6-results.json]
    Root --> Summary[phase6-summary.md]
    Root --> SlideJson[phase6-slide-data.json]
    Root --> SlideCsv[phase6-slide-data.csv]

    RunnerDirs --> RunDir[group/audit_run]
    RunDir --> RunLog[run.log]
    RunDir --> Submission[submission/audit.md]
    RunDir --> ModalDir[modal]
    ModalDir --> ModalLogs[logs]
    ModalLogs --> BaselineMeta[modal-baseline-result.json]
    ModalLogs --> ForestMeta[modal-forest-result.json]
    ModalLogs --> Trajectories[*.traj.json]

Important files:

| File | Purpose |
| --- | --- |
| submission/audit.md | Grading source of truth for detect mode. |
| run.log | EVMBench task events and grade payloads. |
| modal/logs/modal-runner-command.json | Exact Modal command emitted by the adapter. |
| modal/logs/modal-baseline-result.json | Baseline metadata, runtime, and optional grade. |
| modal/logs/modal-forest-result.json | Forest roles, worker metadata, errors, and runtimes. |
| phase6-results.json | Canonical structured comparison results. |
| phase6-summary.md | Human-readable aggregate and per-audit summary. |
| phase6-slide-data.json | Chart-ready data for presentations. |
| phase6-slide-data.csv | Spreadsheet-friendly per-audit rows. |
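A quick way to sanity-check a finished run is to confirm that the files listed above exist in the run directory. The sketch below assumes the per-run layout from the output diagram (run.log, submission/, and modal/ side by side); `missing_artifacts` and the choice of which files count as required are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# Relative paths from the file list above. The Modal command log is only
# expected when the run used a Modal runner.
CORE_FILES = ["submission/audit.md", "run.log"]
MODAL_FILES = ["modal/logs/modal-runner-command.json"]


def missing_artifacts(run_dir: Path, modal: bool = False) -> list[str]:
    """Return the expected relative paths that are absent from run_dir."""
    expected = CORE_FILES + (MODAL_FILES if modal else [])
    return [rel for rel in expected if not (run_dir / rel).is_file()]


# Demo on a throwaway directory that contains only the audit report.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "submission").mkdir()
    (run_dir / "submission" / "audit.md").write_text("# Audit\n")
    demo_missing = missing_artifacts(run_dir, modal=True)

print(demo_missing)  # ['run.log', 'modal/logs/modal-runner-command.json']
```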

Troubleshooting

OPENAI_API_KEY is not set

Load .env in the shell before running direct Python commands:

set -a
. ./.env
set +a

For Modal, also create the Modal secret:

modal secret create openai-api-key OPENAI_API_KEY="$OPENAI_API_KEY"

Modal cannot pull an audit image

Use a registry tag, not a local Docker tag:

export MODAL_AUDIT_IMAGE_REPO=ghcr.io/YOUR_GHCR_OWNER/evmbench-audit
uv run docker_build.py --audit 2024-01-canto --tag-prefix "$MODAL_AUDIT_IMAGE_REPO"
docker push "$MODAL_AUDIT_IMAGE_REPO:2024-01-canto"

Docker build cannot reach Ubuntu mirrors

Use:

uv run docker_build.py --split detect-tasks --build-network host

or:

uv run docker_build.py --split detect-tasks --ubuntu-mirror http://mirrors.edge.kernel.org/ubuntu

Modal baseline works but forest is too expensive

Lower the forest knobs in .env or in config.yaml:

BRANCHES_PER_TREE=1
MAX_TREE_ROLES=2
FOREST_WORKER_CONCURRENCY=2
BRANCH_STEP_LIMIT=10
GLOBAL_STEP_LIMIT=10

Then run:

evmbench/agents/mini-swe-agent/run_phase6_variants.sh run --scope smoke --runners modal-forest-smoke --stop-on-failure

Development Checks

Focused tests for the mini-swe-agent Modal and Phase 6 path:

uv run pytest tests/test_mini_swe_agent_phase5.py tests/test_mini_swe_agent_forest.py tests/test_mini_swe_agent_phase6.py

License

Licensed under the Apache License, Version 2.0.

About

Test-time scaling on EVMBench using the Forest-of-Thought approach with mini-swe-agent.
