— scores A across all tool-calling dimensions. Free on OpenRouter.
Use `nvidia/nemotron-3-nano-30b-a3b:free` for tool calling in your agent pipeline.
Full results: results/RESULTS.md · Methodology
Click a badge to view the model on OpenRouter.
- Best for tool calling (T0+T1): nemotron-3-nano-30b-a3b — 100% T0+T1
- Best for schema compliance (T1): nemotron-3-nano-30b-a3b — 100% T1
- Best for restraint (R0): trinity-large-preview — 100% R0
- Best for multi-turn agency (A1): trinity-large-preview — 100% A1
These models fail the basic tool invocation test (T0 < 20%). They will silently break your agent pipeline.
- llama-3.3-70b-instruct: text-instead-of-tool (T0=0%)
- minimax-m2.5: text-instead-of-tool (T0=0%)
- mistral-small-3.1-24b-instruct: text-instead-of-tool (T0=0%)
- gpt-oss-20b: text-instead-of-tool (T0=0%)
- qwen3-4b: text-instead-of-tool (T0=0%)
- qwen3-coder: text-instead-of-tool (T0=0%)
- qwen3-next-80b-a3b-instruct: text-instead-of-tool (T0=0%)
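The text-instead-of-tool failure mode is easy to detect mechanically. A minimal sketch of that check, assuming the OpenAI-style message shape that OpenRouter returns (this is an illustration, not the repo's actual probe code):

```python
def classify_t0(message: dict) -> str:
    """Classify a chat-completion message for the T0 probe.

    Assumes the OpenAI-style response shape: a real tool invocation
    appears in message["tool_calls"], prose in message["content"].
    """
    if message.get("tool_calls"):
        return "tool-call"
    if (message.get("content") or "").strip():
        return "text-instead-of-tool"
    return "empty"

# A model that narrates instead of calling the tool:
print(classify_t0({"content": "Sure! The weather in Paris is...",
                   "tool_calls": None}))   # -> text-instead-of-tool
# A model that actually invokes the tool:
print(classify_t0({"content": None,
                   "tool_calls": [{"function": {"name": "get_weather"}}]}))  # -> tool-call
```

A model that emits prose where a tool call was required scores 0% on T0 no matter how plausible the prose sounds.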
Five probes across orthogonal capability dimensions:
| Probe | Dimension | Question |
|---|---|---|
| T0 | Invoke | Can it call a tool at all? |
| T1 | Schema | Does it respect parameter types? |
| T2 | Selection | Can it choose the right tool from many? |
| A1 | Linear | Can it chain tool calls across turns? |
| R0 | Abstain | Does it know when NOT to use tools? |
Wilson score intervals. 10 trials per test. Grades based on lowest dimension score. Full methodology.
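Why Wilson intervals matter at n=10: even a perfect score leaves real uncertainty. A minimal sketch using the standard Wilson score formula (illustrative only, not the repo's implementation):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

# Even 10/10 only supports "probably above ~72%":
lo, hi = wilson_interval(10, 10)
print(f"10/10 -> [{lo:.2f}, {hi:.2f}]")  # roughly [0.72, 1.00]
```

This is why a reported 100% at 10 trials is a strong signal but not a guarantee, and why a naive p ± z·sqrt(p(1-p)/n) interval, which collapses to zero width at p=1, would be misleading here.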
We also run GPU-accelerated CoreWars battles where LLMs write assembly to fight for shared memory.
```mermaid
flowchart LR
    subgraph Turn["Each Turn (x10)"]
        A[LLM] -->|writes| B[Redcode]
        B -->|battles| C[GPU MARS]
        C -->|10K fights| D[Results]
        D -->|feedback| A
    end
    subgraph Surprise["Turn 6-7"]
        E[Champion]
        E -->|boss fight| C
    end
    style A fill:#4a9eff
    style C fill:#ff6b6b
    style E fill:#ffd93d
```
Each model starts with a basic IMP (MOV 0, 1). They watch 10,000 battles. They write improved code. They repeat for 10 turns. At Turn 6, a surprise champion appears.
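For readers new to CoreWars: the IMP is the simplest viable warrior, a single instruction that endlessly copies itself one cell ahead. A toy Python sketch of that behavior (not the project's GPU MARS, and ignoring real Redcode addressing modes):

```python
CORE_SIZE = 8000  # standard '94 core size

def imp_step(core: list, pc: int) -> int:
    """Execute one MOV 0, 1: copy the current cell one slot ahead,
    then advance the program counter (addresses wrap around the core)."""
    core[(pc + 1) % CORE_SIZE] = core[pc]
    return (pc + 1) % CORE_SIZE

# Seed the core with a single IMP and watch it crawl:
core = ["DAT 0, 0"] * CORE_SIZE
core[0] = "MOV 0, 1"
pc = 0
for _ in range(3):
    pc = imp_step(core, pc)
print(pc, core[:4])  # pc=3; cells 1-3 are now copies of the IMP
```

Beating the IMP is the first hurdle; the 10-turn feedback loop is where the models either learn to or don't.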
8,192 battles. Real-time visualization. 27,845 battles/sec on GPU (RTX 5090).
Test your own models on your own hardware.
```bash
git clone https://github.com/jw409/modelforecast && cd modelforecast
curl -LsSf https://astral.sh/uv/install.sh | sh && uv sync
export OPENROUTER_API_KEY=your_key

# Full sweep with canary test
uv run python scripts/run_sweep.py

# Resume interrupted sweep
uv run python scripts/run_sweep.py --resume

# Regenerate results
uv run python scripts/generate_readme_results.py
```

Price doesn't predict performance. NVIDIA's free nemotron-3-nano-30b scores 100% across all dimensions — matching or beating most paid models.
Half of "tool-capable" free models can't actually call tools. 8 of 16 free models that advertise tool support score 0% on the basic T0 invocation test. Don't trust the label — test it.
Small samples lie. Wilson score intervals or you're fooling yourself.
Thanks to these wonderful people (emoji key):
jw 💻 📖 🤔 🚧 🚇 · Jeff Whitehead 💻 🤔
This project follows the all-contributors specification. Contributions of any kind welcome!
MIT License · Not affiliated with OpenRouter