Use this script before publishing results or tagging a release to confirm that the CLI, run logging, trace viewer, and baseline diagnostics all behave as expected for TraceCore.
- python -m venv .venv && .venv\Scripts\activate (or reuse an existing environment)
- pip install -e .[dev]
- Ensure .agent_bench/runs/ exists (it is created automatically after the first run)
- Optional: set PYTHONWARNINGS=default so surfaced issues are visible during verification
- Optional: create an agent-bench.toml with your preferred agent, task, and seed defaults to avoid retyping flags during this checklist (read by both tracecore and the legacy agent-bench alias).
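As a rough illustration of the optional config file, the sketch below writes an agent-bench.toml holding agent, task, and seed defaults. The flat top-level layout and the key names are assumptions inferred from the flag names used in this checklist, and `write_defaults` is a hypothetical helper, so check the project documentation for the actual schema before relying on it.

```python
from pathlib import Path


def write_defaults(path: Path, agent: str, task: str, seed: int) -> str:
    """Write a minimal defaults file and return its text.

    Assumption: top-level `agent`, `task`, and `seed` keys. The real
    schema may nest them under a table or use different names.
    """
    text = (
        f'agent = "{agent}"\n'
        f'task = "{task}"\n'
        f"seed = {seed}\n"
    )
    path.write_text(text, encoding="utf-8")
    return text
```

For example, `write_defaults(Path("agent-bench.toml"), "agents/toy_agent.py", "filesystem_hidden_config@1", 42)` would produce a three-line file with those values.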
- Run the colorful wizard if you want a guided agent/task/seed selection before manual commands (optional, but it uses the same run pipeline you’ll exercise below):
tracecore interactive
# Try --dry-run to preview the command without executing
# Try --save-session to persist your choices
# Try typing partial names to filter agents/tasks
# Press ? during prompts for inline help
# If you have baseline data, you'll see suggested pairings
- Run a deterministic task twice to confirm persistence:
tracecore run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
tracecore run --agent agents/rate_limit_agent.py --task rate_limited_api@1 --seed 11
tracecore run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 7
tracecore run --agent agents/ops_triage_agent.py --task log_alert_triage@1 --seed 21
- List recent artifacts and confirm the run IDs you just produced appear at the top:
tracecore runs list --limit 5
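If you want to cross-check the listing against the files on disk, a minimal sketch like the following reads the newest artifacts directly. It assumes each run is stored as <run_id>.json directly inside .agent_bench/runs/ (consistent with the paths used elsewhere in this checklist); `latest_run_ids` is a hypothetical helper, not part of the CLI.

```python
from pathlib import Path


def latest_run_ids(runs_dir: Path, limit: int = 5) -> list[str]:
    """Return run IDs (file stems) of the newest JSON artifacts, newest first.

    Assumption: one <run_id>.json file per run, directly in runs_dir.
    Ordering uses file modification time, not any timestamp in the JSON.
    """
    artifacts = sorted(
        runs_dir.glob("*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return [p.stem for p in artifacts[:limit]]
```

Calling `latest_run_ids(Path(".agent_bench/runs"))` should then return the same IDs, in the same order, as the CLI listing, provided the CLI also sorts by recency.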
- Generate a baseline snapshot for each agent/task pair and sanity-check the metrics:
tracecore baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1
tracecore baseline --agent agents/rate_limit_agent.py --task rate_limited_api@1
- Compare two representative runs (mix and match run IDs or explicit artifact paths) and confirm the diff flags step-level deltas you expect:
tracecore baseline --compare .agent_bench/runs/<run_a>.json .agent_bench/runs/<run_b>.json
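To form an expectation of which steps should differ before reading the CLI diff, a small independent check can help. The sketch below is not how the compare command works internally; it assumes each artifact stores its trace as a list under a "steps" key, which is a hypothetical name you may need to change.

```python
import json
from pathlib import Path


def load_run(path: str) -> dict:
    """Load one run artifact from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def differing_steps(run_a: dict, run_b: dict, key: str = "steps") -> list[int]:
    """Return indices of steps that differ between two run artifacts.

    Assumption: each artifact holds its trace as a list under `key`.
    A step present in only one run counts as a difference.
    """
    steps_a, steps_b = run_a.get(key, []), run_b.get(key, [])
    diffs = []
    for i in range(max(len(steps_a), len(steps_b))):
        a = steps_a[i] if i < len(steps_a) else None
        b = steps_b[i] if i < len(steps_b) else None
        if a != b:
            diffs.append(i)
    return diffs
```

An empty result means the two traces are identical step for step; any index returned is a step the CLI diff should also flag.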
- Export a frozen baseline for the UI (creates .agent_bench/baselines/baseline-<ts>.json):
tracecore baseline --export latest
- Note the run_id values; you’ll load them in the UI next.
- Replay a prior run deterministically (overrides allowed) and confirm the output matches the original artifact:
tracecore run --replay <run_id>
# Optional overrides:
tracecore run --replay <run_id> --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
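One way to confirm that a replay matches its original is to hash both artifacts after dropping fields that are expected to change between runs. This is a sketch under assumptions: the ignored key names below ("run_id", "started_at", "finished_at") are placeholders, not confirmed artifact fields, so adjust them to whatever your artifacts record per run.

```python
import hashlib
import json

# Hypothetical names of top-level fields that legitimately vary per run.
VOLATILE_KEYS = ("run_id", "started_at", "finished_at")


def artifact_digest(artifact: dict, ignore=VOLATILE_KEYS) -> str:
    """Hash an artifact after dropping top-level fields expected to vary."""
    stable = {k: v for k, v in artifact.items() if k not in ignore}
    # Sorted keys and fixed separators make the serialization canonical.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replay_matches(original: dict, replay: dict) -> bool:
    """True when both artifacts agree on everything except volatile fields."""
    return artifact_digest(original) == artifact_digest(replay)
```

Comparing digests rather than raw files keeps the check insensitive to key order and whitespace, which are not meaningful differences.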
- Validate bundled task manifests and registry entries:
tracecore tasks validate --registry
- Start the server (module form avoids PATH issues):
python -m uvicorn agent_bench.webui.app:app --reload
- Visit http://localhost:8000 in a fresh browser tab.
- Run the same agent/task combinations from the form and verify the result JSON (including seed) matches the CLI output.
- Click any “trace” link (or navigate to /?trace_id=<run_id>#trace-viewer) and confirm:
  - The Trace Viewer section auto-scrolls into view.
  - Step entries include observation, action, and result payloads.
  - The “Download JSON” link serves the /api/traces/<run_id> response.
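The payload check can also be scripted against the /api/traces/<run_id> endpoint. The sketch below assumes the server is running at http://localhost:8000 and that the trace JSON keeps its step entries in a list under a "steps" key; that key name is hypothetical, while the observation, action, and result fields come from the checklist itself.

```python
import json
from urllib.request import urlopen

REQUIRED_STEP_KEYS = ("observation", "action", "result")


def incomplete_steps(trace: dict, key: str = "steps") -> list[int]:
    """Return indices of step entries missing any required payload.

    Assumption: step entries live in a list under `key`.
    """
    return [
        i
        for i, step in enumerate(trace.get(key, []))
        if not all(k in step for k in REQUIRED_STEP_KEYS)
    ]


def fetch_trace(run_id: str, base_url: str = "http://localhost:8000") -> dict:
    """Download the trace JSON for one run from the local server."""
    with urlopen(f"{base_url}/api/traces/{run_id}") as resp:
        return json.load(resp)
```

With the server up, `incomplete_steps(fetch_trace("<run_id>"))` returning an empty list would indicate that every step carries all three payloads.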
- Scroll to the Baselines panel and confirm it reflects the same success rate / averages seen in the CLI baseline output and shows the "Latest published" card referencing your export.
- Open the Guide page (/guide) and confirm the agent expectations table loads.
- Run the rate_limited_chain@1 task from the UI (or CLI) to verify the pain task renders traces correctly. Even if your reference agent fails, the trace and error should appear in the Trace tab.
- If you need a clean slate, delete artifacts after capturing them elsewhere:
Remove-Item -Recurse -Force .agent_bench\runs\*
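If you prefer a cross-platform alternative that captures the artifacts before deleting them, a sketch along these lines could work. Unlike the recursive Remove-Item command, it only handles files directly inside the runs directory and leaves subdirectories alone; `archive_and_clear` and the archive location are hypothetical.

```python
import shutil
from pathlib import Path


def archive_and_clear(runs_dir: Path, archive_dir: Path) -> int:
    """Copy every file in runs_dir to archive_dir, then delete the original.

    Returns the number of files archived. Subdirectories are left alone.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = 0
    # sorted() materializes the listing before any file is deleted.
    for item in sorted(runs_dir.iterdir()):
        if item.is_file():
            shutil.copy2(item, archive_dir / item.name)
            item.unlink()
            archived += 1
    return archived
```

Each file is copied before it is removed, so an interruption partway through leaves every artifact in at least one of the two directories.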
- Re-run the determinism suite to ensure no drift:
python -m pytest tests/test_determinism.py
Before tagging or sharing results:
- Ensure this checklist has been completed in the current commit.
- Archive the run_id values referenced in reports so they remain a reproducible proof of behavior.
- Run the full test suite (python -m pytest).
- Note the harness version reported in run metadata; it should match the release tag.
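The version check in the last item is easy to get wrong by eye, so here is a minimal sketch of it. It assumes the run metadata exposes the version under a "harness_version" key (a hypothetical name) and that release tags may carry a leading "v"; it requires Python 3.9 or later for `str.removeprefix`.

```python
def version_matches_tag(metadata: dict, tag: str, key: str = "harness_version") -> bool:
    """Compare the harness version in run metadata with a release tag.

    Assumptions: the version lives under `key` (hypothetical name) and
    either side may carry a leading "v", as in "v1.2.0".
    """
    version = str(metadata.get(key, "")).strip()
    if not version:
        return False  # Missing metadata never counts as a match.
    return version.removeprefix("v") == tag.strip().removeprefix("v")
```

Treating a missing version as a mismatch is a deliberate choice here, since an artifact without version metadata cannot serve as proof for a tagged release.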
- agents/toy_agent.py + filesystem_hidden_config@1: run_id e8d59eb459774d59aeddc30c59b3509d
- agents/rate_limit_agent.py + rate_limited_api@1: run_id 5f1d056ced944eeb8a3ae1b98d26a159
- agents/chain_agent.py + rate_limited_chain@1: run_id e0cdfa6774604edcbab0b96238206f67
- agents/ops_triage_agent.py + log_alert_triage@1: run_id e13c79dfdb244385af677d049bb9103b