
ModelForecast


Best free model right now: nemotron-3-nano-30b-a3b

Grade A — scores A across all tool-calling dimensions. Free on OpenRouter.

Use nvidia/nemotron-3-nano-30b-a3b:free for tool calling in your agent pipeline.

Full results: results/RESULTS.md · Methodology
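As a concrete sketch of "use it for tool calling", here is how a request for this model might be assembled against OpenRouter's OpenAI-compatible chat completions endpoint. This is an illustration, not this repo's harness: the `get_weather` tool is a hypothetical example, and only the model string comes from the results above.

```python
import json

def build_request(user_prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload with one tool offered.

    The get_weather tool is a made-up example; substitute your own schema.
    """
    return {
        "model": "nvidia/nemotron-3-nano-30b-a3b:free",
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

payload = build_request("What's the weather in Oslo?")
print(json.dumps(payload, indent=2))
# Send with any HTTP client:
#   POST https://openrouter.ai/api/v1/chat/completions
#   Authorization: Bearer $OPENROUTER_API_KEY
```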

Top performers

  • nemotron-3-nano-30b-a3b: Grade A
  • nemotron-3-super-120b-a12b: Grade A
  • gpt-oss-120b: Grade A
  • step-3.5-flash: Grade A
  • glm-4.5-air: Grade A


Avoid these models for tool calling

These models fail the basic tool-invocation test (T0 < 20%). They will fail silently in your agent pipeline.

  • llama-3.3-70b-instruct: text-instead-of-tool (T0=0%)
  • minimax-m2.5: text-instead-of-tool (T0=0%)
  • mistral-small-3.1-24b-instruct: text-instead-of-tool (T0=0%)
  • gpt-oss-20b: text-instead-of-tool (T0=0%)
  • qwen3-4b: text-instead-of-tool (T0=0%)
  • qwen3-coder: text-instead-of-tool (T0=0%)
  • qwen3-next-80b-a3b-instruct: text-instead-of-tool (T0=0%)
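The "text-instead-of-tool" failure mode can be detected mechanically: in an OpenAI-style response, a genuine invocation arrives as a `tool_calls` entry on the assistant message, while a failing model answers in plain `content` despite tools being offered. A minimal sketch of that check (my simplification, not the repo's actual probe code):

```python
def classify_t0(message: dict) -> str:
    """Classify a single T0 probe response.

    A passing response invokes a tool; a "text-instead-of-tool" failure
    replies in prose even though tools were offered in the request.
    """
    if message.get("tool_calls"):
        return "tool_call"
    if message.get("content"):
        return "text-instead-of-tool"
    return "empty"

# A failing model replies in prose instead of calling the tool:
print(classify_t0({"role": "assistant", "content": "The weather is sunny."}))
```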

How We Test

Five probes across orthogonal capability dimensions:

| Probe | Dimension | Question |
| ----- | --------- | -------- |
| T0 | Invoke | Can it call a tool at all? |
| T1 | Schema | Does it respect parameter types? |
| T2 | Selection | Can it choose the right tool from many? |
| A1 | Linear | Can it chain tool calls across turns? |
| R0 | Abstain | Does it know when NOT to use tools? |

Wilson score intervals. 10 trials per test. Grades based on lowest dimension score. Full methodology.
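For reference, a 95% Wilson score interval can be computed as below (the standard formula, not necessarily this repo's exact code). With only 10 trials, even a perfect 10/10 has a lower bound near 72%, which is why point estimates on small samples are misleading:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials
                           + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

lo, hi = wilson_interval(10, 10)  # 10/10 passes on a T-probe
print(f"{lo:.3f} .. {hi:.3f}")   # lower bound ~0.72, not 1.0
```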


The Colosseum

We also run GPU-accelerated CoreWars battles where LLMs write assembly to fight for shared memory.

flowchart LR
    subgraph Turn["Each Turn (x10)"]
        A[LLM] -->|writes| B[Redcode]
        B -->|battles| C[GPU MARS]
        C -->|10K fights| D[Results]
        D -->|feedback| A
    end

    subgraph Surprise["Turn 6-7"]
        E[Champion]
        E -->|boss fight| C
    end

    style A fill:#4a9eff
    style C fill:#ff6b6b
    style E fill:#ffd93d

Each model starts with a basic IMP (MOV 0, 1). It watches 10,000 battles, writes improved code, and repeats for 10 turns. At turns 6-7, a surprise champion appears.
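To make the starting point concrete, here is a toy core showing why the IMP is a sensible seed warrior: it copies its own instruction one cell ahead each tick, so execution always lands on a fresh copy of itself. This is a deliberately crude sketch with MOV-only, single-warrior semantics; real Redcode has addressing modes, DAT kills, and multiple processes, and the GPU MARS here is nothing like this.

```python
CORE_SIZE = 8

def run_imp(steps: int):
    """Step a lone IMP through a simplified circular core.

    Simplified semantics (an assumption, not real Redcode): MOV A B copies
    the instruction at pc+A to pc+B, then the program counter advances by 1.
    """
    core = [None] * CORE_SIZE
    core[0] = ("MOV", 0, 1)  # the classic IMP
    pc = 0
    for _ in range(steps):
        op, a, b = core[pc]
        if op == "MOV":
            core[(pc + b) % CORE_SIZE] = core[(pc + a) % CORE_SIZE]
        pc = (pc + 1) % CORE_SIZE
    return core, pc

core, pc = run_imp(3)
print(core[:4], pc)  # cells 0..3 now all hold the IMP; pc sits on cell 3
```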

WATCH THE BATTLES LIVE

8,192 battles. Real-time visualization. 27,845 battles/sec on GPU (RTX 5090).


Run It Yourself

Test your own models on your own hardware.

git clone https://github.com/jw409/modelforecast && cd modelforecast
curl -LsSf https://astral.sh/uv/install.sh | sh && uv sync
export OPENROUTER_API_KEY=your_key

# Full sweep with canary test
uv run python scripts/run_sweep.py

# Resume interrupted sweep
uv run python scripts/run_sweep.py --resume

# Regenerate results
uv run python scripts/generate_readme_results.py

What We Learned

Price doesn't predict performance. NVIDIA's free nemotron-3-nano-30b-a3b scores 100% across all dimensions, matching or beating most paid models.

Half of "tool-capable" free models can't actually call tools. 8 of 16 free models that advertise tool support fail the basic T0 invocation test at 0%. Don't trust the label — test it.

Small samples lie. Use Wilson score intervals or you're fooling yourself.


Contributors

Thanks to these wonderful people (emoji key):

  • jw: 💻 📖 🤔 🚧 🚇
  • Jeff Whitehead: 💻 🤔

This project follows the all-contributors specification. Contributions of any kind welcome!


Check the forecast before you deploy. Tool-calling capability benchmarks for free LLM models.

MIT License · Not affiliated with OpenRouter