Auditor Skill Registry Benchmarks

Public benchmarks for agentic smart contract auditing skills.

Each subdirectory is a self-contained benchmark that evaluates competing skills targeting the same task. Benchmarks run as autoresearch-style loops that submit results to the public record and also produce optimized skills locally for the benchmark runner.

How it works

Anyone can propose a benchmark by opening a PR that adds a new subdirectory.
Approved benchmarks are listed on forefy.com/benchmarks and linked to the skills that target the same task.
Anyone who wants to visibly improve benchmark accuracy can go to one of the benchmark's listed skills and copy an agentic prompt. The prompt instructs their local agent to run an autoresearch-style loop over <benchmark>/expected.json, using <benchmark>/program.md as the agent prompt.
Agents submit scores to forefy.com/benchmarks, and results are tracked on the leaderboard.
Auditors certify results against a private held-out corpus to confirm they generalize.
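
As a rough illustration of the loop described above, the sketch below shows one possible reading of an "autoresearch-style" run: propose a variant of a skill, score it, and keep it only if the score improves. The names improve_loop, propose, and score are hypothetical and not part of this repository; in a real run the agent plays the role of propose and scorer.py plays the role of score.

```python
from typing import Callable


def improve_loop(
    initial_skill: str,
    propose: Callable[[str], str],
    score: Callable[[str], float],
    rounds: int = 5,
) -> tuple[str, float]:
    """Return the best-scoring skill variant seen over a fixed number of rounds.

    Hypothetical sketch: `propose` stands in for the agent editing a skill,
    and `score` stands in for a deterministic scorer such as scorer.py.
    """
    best_skill = initial_skill
    best_score = score(initial_skill)
    for _ in range(rounds):
        candidate = propose(best_skill)
        candidate_score = score(candidate)
        # Accept only strict improvements so the kept result never regresses.
        if candidate_score > best_score:
            best_skill, best_score = candidate, candidate_score
    return best_skill, best_score
```

Because the scorer is deterministic, repeating the loop with the same proposals gives the same result, which is what makes submitted scores comparable.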

Benchmark structure

my-benchmark/
  program.md            single source of truth: frontmatter (name, description, model, temperature) + runner instructions
  expected.json         public test cases with ground truth (platform-required, committed)
  expected-private.json auditor-only private test cases (gitignored)
  scorer.py             deterministic Python scorer (no LLM calls)
  corpus/
    public/             committed - community runs against these
      case-001/         one folder per test case (may contain multiple .sol files)
      case-002/
    private/            gitignored - auditor only, used for certification
      case-011/
      case-012/

Competing skills are registered as targets on the benchmark detail page at the skill registry (forefy.com/skills) and fetched live at run time - no local copies needed.
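
Before opening a PR, it can help to check that a new benchmark directory matches the layout above. The following is a minimal sketch, not a tool shipped with this repository: the function name missing_entries is hypothetical, and it only checks the committed entries (program.md, expected.json, scorer.py, corpus/public/), since the private files are gitignored and auditor-only.

```python
from pathlib import Path

# Committed entries taken from the layout above. The private files
# (expected-private.json, corpus/private/) are gitignored and not checked.
REQUIRED_FILES = ("program.md", "expected.json", "scorer.py")
REQUIRED_DIRS = ("corpus/public",)


def missing_entries(benchmark_dir: str) -> list[str]:
    """Return the required files and directories absent from benchmark_dir."""
    root = Path(benchmark_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing
```

For example, missing_entries("my-benchmark") returns an empty list when every committed entry is present.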

Running a benchmark

Load program.md as your agent prompt in any agent environment with file and bash access (autoresearch, Claude Code, etc.). The agent fetches target skills from the registry (GET /api/benchmarks/<id>/targets), scores each against corpus/public/ with scorer.py, and reports the ranked results.
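
The two bookkeeping steps in that flow, building the targets URL and ranking the scores, can be sketched as below. This is an illustration under assumptions: the helper names targets_url and rank_scores are hypothetical, the API host is passed in because this README does not state it, and the response format of the targets endpoint is not documented here, so no parsing is shown.

```python
from urllib.parse import quote


def targets_url(base_url: str, benchmark_id: str) -> str:
    """Build the GET /api/benchmarks/<id>/targets URL for a benchmark.

    base_url is an assumption supplied by the caller; the README names
    only the path, not the host that serves it.
    """
    safe_id = quote(benchmark_id, safe="")
    return f"{base_url.rstrip('/')}/api/benchmarks/{safe_id}/targets"


def rank_scores(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Sort skills by score, highest first, breaking ties by skill name."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Breaking ties by name keeps the ranking deterministic, in the same spirit as the LLM-free scorer.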

Contributors

_forefy

Submitting a new benchmark

Your research knowledge is the only prerequisite to contribute, whether it's a methodology, specific knowledge of a protocol or language, or even corrections. Everything is highly welcome! Help secure and improve the community!

See CONTRIBUTING.md.

