- Your benchmark will target specific skills - you select which skills it evaluates after registering on https://forefy.com/skills.
- Fork this repo.
- Create a new subdirectory using lowercase letters, numbers, and hyphens (e.g. `reentrancy-detection`).
- Add the required files (see structure below).
- Open a PR. The title must be: `benchmark: <subdirectory-name>`.
- Once merged, a community auditor will register your benchmark at forefy.com/benchmarks using the subdirectory URL.
- On the benchmark detail page, select which skills this benchmark targets.
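The subdirectory naming rule above can be checked mechanically before opening a PR. A minimal sketch — the helper name and regex are illustrative, not part of the spec:

```python
import re

# Lowercase letters, numbers, and hyphens; no leading/trailing or doubled hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def is_valid_benchmark_name(name: str) -> bool:
    """Hypothetical check that a subdirectory name follows the contribution rules."""
    return NAME_RE.fullmatch(name) is not None

print(is_valid_benchmark_name("reentrancy-detection"))  # True
print(is_valid_benchmark_name("Reentrancy_Detection"))  # False
```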
The single source of truth for the benchmark. Contains YAML frontmatter (read by forefy.com for registry ingestion) and the agent runner instructions.
Frontmatter fields:
- `name` (required): benchmark name shown on the leaderboard
- `description` (required): one sentence describing what this benchmark evaluates
```yaml
---
name: My Benchmark Name
description: One sentence describing what this benchmark evaluates.
---
```
# My Benchmark - Runner
## Invocation
When running each skill against a case, use this exact prompt followed by the input:
> [task-specific invocation matching the skill category]
## Required Output Format
...
## Instructions
1. Fetch the list of target skills from the registry: `GET https://forefy.com/api/benchmarks/<benchmark_id>/targets`.
2. For each skill, fetch it from the registry (`GET /api/skills/<id>`) and run it against every case in `corpus/public/` using the invocation above.
3. Score each skill's output with `scorer.py` and record the results.
4. Print a ranked leaderboard.

The invocation line is critical: it defines the exact prompt used to call each skill, ensuring a fair and reproducible comparison across all competitors.
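The four runner steps above can be sketched as a loop. `fetch_json`, `run_skill`, and `score_outputs` are hypothetical callables supplied by the agent harness; only the ranking helper is concrete:

```python
REGISTRY = "https://forefy.com/api"

def rank_results(scores):
    """Sort skills by score, highest first, for the leaderboard printout."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def run_benchmark(benchmark_id, cases, fetch_json, run_skill, score_outputs):
    # Step 1: fetch the list of target skills from the registry.
    targets = fetch_json(f"{REGISTRY}/benchmarks/{benchmark_id}/targets")
    scores = {}
    for target in targets:
        # Step 2: fetch each skill and run it against every public case.
        skill = fetch_json(f"{REGISTRY}/skills/{target['id']}")
        outputs = [run_skill(skill, case) for case in cases]
        # Step 3: score (score_outputs stands in for a scorer.py call).
        scores[skill["name"]] = score_outputs(outputs)
    # Step 4: rank for the leaderboard.
    return rank_results(scores)
```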
- `temperature`: must be 0 for reproducibility.
- `output_schema`: the JSON schema the skill must output; `scorer.py` validates against it.
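The schema check can be illustrated with a simplified stand-in that verifies required keys and basic types (a real validator would more likely use a full JSON Schema library; the field names here mirror the example output format):

```python
def validate_output(record: dict, schema: dict) -> bool:
    """Simplified stand-in for schema validation: required keys + basic types."""
    for key, expected_type in schema.items():
        if key not in record or not isinstance(record[key], expected_type):
            return False
    return True

OUTPUT_SCHEMA = {"id": str, "vulnerable": bool, "vulnerability_type": str}

ok = validate_output(
    {"id": "case-001", "vulnerable": True, "vulnerability_type": "reentrancy"},
    OUTPUT_SCHEMA,
)
print(ok)  # True
```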
25 public test cases with ground truth (committed, platform-required). The community uses these for optimization and scoring.
```json
[
  {
    "id": "case-001",
    "vulnerable": true,
    "vulnerability_type": "reentrancy",
    "affected_function": "withdraw",
    "explanation": "one sentence root cause"
  }
]
```

Case IDs map to contract folders in `corpus/public/<id>/`. Each folder may contain one or more `.sol` files.
- Keep contracts realistic - use anonymized or synthetic code.
- `id` must be unique and stable across benchmark versions.
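Both invariants — unique IDs and a matching corpus folder per case — can be checked with a small script. A sketch assuming `expected.json` has already been loaded into a list of case dicts (the helper names are illustrative):

```python
from pathlib import Path

def duplicate_ids(cases):
    """Return any case ids that appear more than once in expected.json."""
    seen, dupes = set(), []
    for case in cases:
        if case["id"] in seen:
            dupes.append(case["id"])
        seen.add(case["id"])
    return dupes

def missing_folders(cases, corpus_root="corpus/public"):
    """Return case ids with no matching corpus/public/<id>/ folder."""
    root = Path(corpus_root)
    return [c["id"] for c in cases if not (root / c["id"]).is_dir()]
```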
Test case contracts are split by visibility:

```
corpus/
  public/      committed - community runs against these
    case-001/
      Contract.sol
  private/     gitignored - auditor only, for certification
    case-011/
      Contract.sol
```
`corpus/private/` and `expected-private.json` are listed in `.gitignore` and never committed. The auditor maintains them locally to certify community submissions against unseen cases:

```
python scorer.py <skill>-output.json expected-private.json
```
Pure Python, no LLM calls, deterministic. Accepts two positional arguments: skill output JSON and expected JSON.
```
python scorer.py <skill>-output.json expected.json
python scorer.py <skill>-output.json expected-private.json
```
Prints a float 0.0-1.0 to stdout.
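A minimal sketch of a scorer honoring this contract: pure Python, deterministic, two positional arguments, a single float on stdout. The exact-match rule on `vulnerable` is an illustrative assumption; the real scoring rule is benchmark-specific:

```python
#!/usr/bin/env python3
import json
import sys

def score(outputs, expected):
    """Fraction of cases where the skill's 'vulnerable' verdict matches ground truth."""
    by_id = {case["id"]: case for case in expected}
    hits = sum(
        1 for out in outputs
        if out.get("vulnerable") == by_id.get(out.get("id"), {}).get("vulnerable")
    )
    return hits / len(expected) if expected else 0.0

if __name__ == "__main__":
    outputs = json.load(open(sys.argv[1]))
    expected = json.load(open(sys.argv[2]))
    print(score(outputs, expected))
```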
Natural language instructions for the optimization agent. Explain:
- What the skill should do.
- What the required JSON output format is.
- How to run the scorer.
- Any domain-specific guidance.
Target skills are managed on the benchmark detail page at forefy.com - no frontmatter needed. The runner fetches them live from the registry at run time via `GET /api/benchmarks/<id>/targets`.
Skills are always fetched at their current registry version. Each run records `start_commit_sha` / `end_commit_sha` to pin exactly which version was scored.
The private corpus (corpus/private/ + expected-private.json) lives only on the auditor's machine - never committed. Use it to certify community submissions against unseen cases and detect overfitting.
Breaking changes to `expected.json` or the output schema require a new subdirectory (e.g. `reentrancy-detection-v2`). Non-breaking additions (new cases, scorer improvements that do not change pass/fail outcomes) can be done in-place via PR.