Volo

Deterministic testing for AI agents — in CI, at ~$0. Record your agent run once, replay it against a high-fidelity simulated environment that never hallucinates, and block reliability regressions in your pull requests. Like unit tests and git bisect, but for agents.

Most agent "evals" re-run a live LLM-as-judge on every check — which costs API money and is itself non-deterministic. Volo flips that: it replays a recorded run deterministically against mocked tools, so the same PR check gives the same answer every time, for free.

Why

Teams ship AI agents that fail non-deterministically and can't be tested like normal software. Going from "80% works in a pilot" to "99%+ in production" can take 100× the original build effort. Today's tools either trace agents after they fail in production or run shallow benchmarks that don't reflect reality.

Volo makes agents testable the way regular software is testable:

Record a real run once — every model call, tool call, and decision.
Simulate the agent's full environment — not just cache-replay, but a stateful, source-informed simulator that handles inputs you never recorded (and flags, never hallucinates, when it can't).
Generate adversarial scenarios automatically: dropped tool results, ambiguous turns, prompt injection, long-horizon drift.
Measure reliability across orthogonal dimensions: trajectory determinism, decision determinism, faithfulness, consistency-under-repetition.
Block regressions in CI — every PR runs the suite deterministically at $0 marginal cost.
Root-cause and diff — "git bisect for agents" pinpoints the breaking step and the commit.

Quickstart (local)

Requires: uv ≥ 0.5, Node ≥ 20 with corepack enabled.

git clone https://github.com/abhay-codes07/VOLO.git
cd VOLO
make setup        # installs Python (uv) + JS toolchains, syncs deps
uv run volo --help

# fastest end-to-end try — record an example agent and score it in one step:
uv run volo init examples.calc_agent:run --input '{"a":2,"b":3,"c":4}'

A minimal record → replay loop on a bundled example:

uv run volo record examples.echo_agent:run --out ./.volo/recordings/echo.json
uv run volo sim ./.volo/recordings/echo.json     # deterministic replay, $0

Score an agent against the adversarial scenario suite and get a ship / no-ship verdict:

uv run volo run ./.volo/recordings/echo.json --agent examples.echo_agent:run

Optional: a free LLM judge

Faithfulness can be scored by a local heuristic (default, free, deterministic), local Ollama, or a free OpenAI-compatible API (Groq by default). Drop a key in .env (copy .env.example):

VOLO_OPENAI_COMPAT_OPT_IN=true
GROQ_API_KEY=gsk_...

uv run volo run <recording> --agent <module:fn> --judge groq

Or just write pytest tests

pytest-volo puts the whole engine behind ordinary fixtures — one marker binds a recording, volo_scenario parametrizes the test over every adversarial world, and volo_run returns the ship/no-ship verdict:

import pytest
from pytest_volo import assert_ship

@pytest.mark.volo_recording("recordings/checkout.json")
def test_survives_adversity(volo_scenario):      # runs once per hostile world
    assert run_my_agent({"order": 42})["status"] in ("ok", "refused")

@pytest.mark.volo_recording("recordings/checkout.json")
def test_ships(volo_run):
    assert_ship(volo_run(run_my_agent, n_runs=3))

Deterministic tests for your MCP stack

Volo can record and replay Model Context Protocol servers — so agents that depend on MCP tools get deterministic integration tests with no network, no credentials, and no API cost.

Put the recording proxy between any MCP client and the real server once:

printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18"}}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/list"}' \
  '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"add","arguments":{"a":2,"b":40}}}' \
| uv run volo mcp record --out calc.json -- python examples/mcp_calc_server.py

From then on, replay the recording as a simulated MCP server — byte-identical answers, fully offline:

printf '%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"add","arguments":{"a":2,"b":40}}}' \
| uv run volo mcp serve calc.json
# {"jsonrpc":"2.0","id":1,"result":{"content":[{"type":"text","text":"42"}],"isError":false}}

Replay matches on the tool name + arguments (not request ids), recorded protocol errors replay as errors, and anything un-recorded returns JSON-RPC error -32042 — the simulator never invents a tool result. Recorded tools/list responses are auto-distilled into tool schemas, so the Tier-2 simulator can take over for near-miss inputs. Full guide: volo mcp --help or the docs site.

Does the simulator make things up? No — it flags.

The hard part of replaying against a simulated environment is inputs you never recorded. Volo either reconstructs them faithfully (from tool specs/sources) or refuses and flags — it never fabricates a tool result. Measured on a held-out benchmark (deterministic, reproducible):

Configuration	Fidelity	Wrong answers
Tier-1 only (cache replay)	20%	0
Tier-2 (source-informed)	100%	0

Wrong = 0 in every configuration is the whole point — see benchmarks/ and run uv run python benchmarks/fidelity.py yourself.

Architecture

Seven subsystems behind a CLI and a Next.js dashboard:

Capture SDK → Environment Simulator → Scenario Generator → Reliability Engine
       ↓                                                            ↓
    CI Runner ───────── Root-Cause / Diff ─────────────── Cost-Routing Brain

Repo layout

packages/       # 10 Python packages (uv workspace) — one per subsystem + core + CLI + MCP
services/api/   # FastAPI backend (local dashboard + cloud seam)
apps/web/       # Next.js dashboard
integrations/   # framework adapters: langgraph, openai_agents, crewai
examples/       # runnable demo agents (also the CI dogfood targets)
tests/          # cross-package integration, e2e, and fidelity benchmarks

Tech stack

Python 3.12+ (uv workspace), FastAPI, SQLModel + SQLite, Ollama for local judging, and a free OpenAI-compatible provider for optional LLM judging. Frontend is Next.js + TypeScript + Tailwind. Everything runs fully locally with zero cloud accounts.

License

Apache-2.0. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.githooks		.githooks
.github/workflows		.github/workflows
apps/web		apps/web
benchmarks		benchmarks
docs		docs
examples		examples
integrations		integrations
packages		packages
services/api		services/api
tests		tests
website		website
.dockerignore		.dockerignore
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
docker-compose.yml		docker-compose.yml
package-lock.json		package-lock.json
package.json		package.json
pnpm-workspace.yaml		pnpm-workspace.yaml
pyproject.toml		pyproject.toml
turbo.json		turbo.json
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Volo

Why

Quickstart (local)

Optional: a free LLM judge

Or just write pytest tests

Deterministic tests for your MCP stack

Does the simulator make things up? No — it flags.

Architecture

Repo layout

Tech stack

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Volo

Why

Quickstart (local)

Optional: a free LLM judge

Or just write pytest tests

Deterministic tests for your MCP stack

Does the simulator make things up? No — it flags.

Architecture

Repo layout

Tech stack

License

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages