
ModelForecast


Best free model right now: nemotron-3-nano-30b-a3b

Grade A — scores A across all tool-calling dimensions. Free on OpenRouter.

Use nvidia/nemotron-3-nano-30b-a3b:free for tool calling in your agent pipeline.

Full results: results/RESULTS.md · Methodology
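As a concrete sketch of "use it for tool calling", here is how a request for this model might be assembled against OpenRouter's OpenAI-compatible chat completions endpoint. This is an illustration, not this repo's harness: the `get_weather` tool is a hypothetical example, and only the model string comes from the results above.

```python
import json

def build_request(user_prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload with one tool offered.

    The get_weather tool is a made-up example; substitute your own schema.
    """
    return {
        "model": "nvidia/nemotron-3-nano-30b-a3b:free",
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

payload = build_request("What's the weather in Oslo?")
print(json.dumps(payload, indent=2))
# Send with any HTTP client:
#   POST https://openrouter.ai/api/v1/chat/completions
#   Authorization: Bearer $OPENROUTER_API_KEY
```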

Top performers

  • nemotron-3-nano-30b-a3b: Grade A
  • nemotron-3-super-120b-a12b: Grade A
  • gpt-oss-120b: Grade A
  • step-3.5-flash: Grade A
  • glm-4.5-air: Grade A


Avoid these models for tool calling

These models fail the basic tool-invocation test (T0 < 20%). They will fail silently in your agent pipeline.

  • llama-3.3-70b-instruct: text-instead-of-tool (T0=0%)
  • minimax-m2.5: text-instead-of-tool (T0=0%)
  • mistral-small-3.1-24b-instruct: text-instead-of-tool (T0=0%)
  • gpt-oss-20b: text-instead-of-tool (T0=0%)
  • qwen3-4b: text-instead-of-tool (T0=0%)
  • qwen3-coder: text-instead-of-tool (T0=0%)
  • qwen3-next-80b-a3b-instruct: text-instead-of-tool (T0=0%)
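The "text-instead-of-tool" failure mode can be detected mechanically: in an OpenAI-style response, a genuine invocation arrives as a `tool_calls` entry on the assistant message, while a failing model answers in plain `content` despite tools being offered. A minimal sketch of that check (my simplification, not the repo's actual probe code):

```python
def classify_t0(message: dict) -> str:
    """Classify a single T0 probe response.

    A passing response invokes a tool; a "text-instead-of-tool" failure
    replies in prose even though tools were offered in the request.
    """
    if message.get("tool_calls"):
        return "tool_call"
    if message.get("content"):
        return "text-instead-of-tool"
    return "empty"

# A failing model replies in prose instead of calling the tool:
print(classify_t0({"role": "assistant", "content": "The weather is sunny."}))
```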

How We Test

Five probes across orthogonal capability dimensions:

| Probe | Dimension | Question |
| ----- | --------- | -------- |
| T0 | Invoke | Can it call a tool at all? |
| T1 | Schema | Does it respect parameter types? |
| T2 | Selection | Can it choose the right tool from many? |
| A1 | Linear | Can it chain tool calls across turns? |
| R0 | Abstain | Does it know when NOT to use tools? |

Wilson score intervals. 10 trials per test. Grades based on lowest dimension score. Full methodology.
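For reference, a 95% Wilson score interval can be computed as below (the standard formula, not necessarily this repo's exact code). With only 10 trials, even a perfect 10/10 has a lower bound near 72%, which is why point estimates on small samples are misleading:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials
                           + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

lo, hi = wilson_interval(10, 10)  # 10/10 passes on a T-probe
print(f"{lo:.3f} .. {hi:.3f}")   # lower bound ~0.72, not 1.0
```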


The Colosseum

We also run GPU-accelerated CoreWars battles where LLMs write assembly to fight for shared memory.

flowchart LR
    subgraph Turn["Each Turn (x10)"]
        A[LLM] -->|writes| B[Redcode]
        B -->|battles| C[GPU MARS]
        C -->|10K fights| D[Results]
        D -->|feedback| A
    end

    subgraph Surprise["Turn 6-7"]
        E[Champion]
        E -->|boss fight| C
    end

    style A fill:#4a9eff
    style C fill:#ff6b6b
    style E fill:#ffd93d

Each model starts with a basic IMP (MOV 0, 1). It watches 10,000 battles, writes improved code, and repeats for 10 turns. At turns 6-7, a surprise champion appears.
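To make the starting point concrete, here is a toy core showing why the IMP is a sensible seed warrior: it copies its own instruction one cell ahead each tick, so execution always lands on a fresh copy of itself. This is a deliberately crude sketch with MOV-only, single-warrior semantics; real Redcode has addressing modes, DAT kills, and multiple processes, and the GPU MARS here is nothing like this.

```python
CORE_SIZE = 8

def run_imp(steps: int):
    """Step a lone IMP through a simplified circular core.

    Simplified semantics (an assumption, not real Redcode): MOV A B copies
    the instruction at pc+A to pc+B, then the program counter advances by 1.
    """
    core = [None] * CORE_SIZE
    core[0] = ("MOV", 0, 1)  # the classic IMP
    pc = 0
    for _ in range(steps):
        op, a, b = core[pc]
        if op == "MOV":
            core[(pc + b) % CORE_SIZE] = core[(pc + a) % CORE_SIZE]
        pc = (pc + 1) % CORE_SIZE
    return core, pc

core, pc = run_imp(3)
print(core[:4], pc)  # cells 0..3 now all hold the IMP; pc sits on cell 3
```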

WATCH THE BATTLES LIVE

8,192 battles. Real-time visualization. 27,845 battles/sec on GPU (RTX 5090).


Run It Yourself

Test your own models on your own hardware.

git clone https://github.com/jw409/modelforecast && cd modelforecast
curl -LsSf https://astral.sh/uv/install.sh | sh && uv sync
export OPENROUTER_API_KEY=your_key

# Full sweep with canary test
uv run python scripts/run_sweep.py

# Resume interrupted sweep
uv run python scripts/run_sweep.py --resume

# Regenerate results
uv run python scripts/generate_readme_results.py

What We Learned

Price doesn't predict performance. NVIDIA's free nemotron-3-nano-30b-a3b scores 100% across all dimensions, matching or beating most paid models.

Half of "tool-capable" free models can't actually call tools. 8 of 16 free models that advertise tool support fail the basic T0 invocation test at 0%. Don't trust the label — test it.

Small samples lie. Use Wilson score intervals or you're fooling yourself.


Contributors

Thanks to these wonderful people (emoji key):

  • jw: 💻 📖 🤔 🚧 🚇
  • Jeff Whitehead: 💻 🤔

This project follows the all-contributors specification. Contributions of any kind welcome!


Check the forecast before you deploy. Tool-calling capability benchmarks for free LLM models.

MIT License · Not affiliated with OpenRouter