Manual Verification Checklist

Use this script before publishing results or tagging a release to confirm that the CLI, run logging, trace viewer, and baseline diagnostics all behave as expected for TraceCore.

1. Prerequisites

  • python -m venv .venv && .venv\Scripts\activate (Windows; use source .venv/bin/activate on POSIX, or reuse an existing environment)
  • pip install -e .[dev]
  • Ensure .agent_bench/runs/ exists (it is created automatically after the first run)
  • Optional: set PYTHONWARNINGS=default so surfaced issues are visible during verification
  • Optional: create an agent-bench.toml with your preferred agent, task, and seed defaults to avoid retyping flags during this checklist (read by both tracecore and the legacy agent-bench alias).
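The checklist doesn't show the shape of agent-bench.toml, so here is a minimal sketch; the table and key names (`defaults`, `agent`, `task`, `seed`) are illustrative assumptions, not confirmed by the project docs:

```toml
# Hypothetical agent-bench.toml -- key names are assumptions for illustration
[defaults]
agent = "agents/toy_agent.py"
task = "filesystem_hidden_config@1"
seed = 42
```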

2. CLI flow

  1. Run the colorful wizard if you want a guided agent/task/seed selection before manual commands (optional, but it uses the same run pipeline you’ll exercise below):
    tracecore interactive
    # Try --dry-run to preview the command without executing
    # Try --save-session to persist your choices
    # Try typing partial names to filter agents/tasks
    # Press ? during prompts for inline help
    # If you have baseline data, you'll see suggested pairings
  2. Run each deterministic agent/task pair below to confirm persistence (re-running a pair with the same seed should reproduce identical results):
    tracecore run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
    tracecore run --agent agents/rate_limit_agent.py --task rate_limited_api@1 --seed 11
    tracecore run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 7
    tracecore run --agent agents/ops_triage_agent.py --task log_alert_triage@1 --seed 21
  3. List recent artifacts and confirm the run IDs you just produced appear at the top:
    tracecore runs list --limit 5
  4. Generate a baseline snapshot for each agent/task pair and sanity-check the metrics:
    tracecore baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1
    tracecore baseline --agent agents/rate_limit_agent.py --task rate_limited_api@1
  5. Compare two representative runs (mix and match run IDs or explicit artifact paths) and confirm the diff flags step-level deltas you expect:
    tracecore baseline --compare .agent_bench/runs/<run_a>.json .agent_bench/runs/<run_b>.json
  6. Export a frozen baseline for the UI (creates .agent_bench/baselines/baseline-<ts>.json):
    tracecore baseline --export latest
  7. Note the run_id values; you'll load them in the UI next.
  8. Replay a prior run deterministically (overrides allowed) and confirm the output matches the original artifact:
    tracecore run --replay <run_id>
    # Optional overrides:
    tracecore run --replay <run_id> --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
  9. Validate bundled task manifests and registry entries:
    tracecore tasks validate --registry
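The replay check in step 8 ("confirm the output matches the original artifact") can be scripted. Below is a sketch that compares two run artifacts while ignoring metadata that legitimately differs between a run and its replay; the artifact schema and the volatile field names (`run_id`, `started_at`, `finished_at`) are assumptions:

```python
import json
from pathlib import Path

# Fields expected to differ between an original run and its replay
# (identifiers and timestamps). Field names are illustrative assumptions.
VOLATILE_KEYS = {"run_id", "started_at", "finished_at"}

def strip_volatile(artifact: dict) -> dict:
    """Return a copy of a run artifact without volatile metadata."""
    return {k: v for k, v in artifact.items() if k not in VOLATILE_KEYS}

def runs_match(path_a: str, path_b: str) -> bool:
    """True if two run artifacts are identical apart from volatile keys."""
    a = json.loads(Path(path_a).read_text())
    b = json.loads(Path(path_b).read_text())
    return strip_volatile(a) == strip_volatile(b)
```

Point it at the original artifact and the replayed one under .agent_bench/runs/ to get a yes/no answer instead of eyeballing the JSON.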

3. Web UI flow

  1. Start the server (module form avoids PATH issues):
    python -m uvicorn agent_bench.webui.app:app --reload
  2. Visit http://localhost:8000 in a fresh browser tab.
  3. Run the same agent/task combinations from the form and verify the result JSON (including seed) matches the CLI output.
  4. Click any “trace” link (or navigate to /?trace_id=<run_id>#trace-viewer) and confirm:
    • The Trace Viewer section auto-scrolls into view.
    • Step entries include observation, action, and result payloads.
    • The “Download JSON” link serves the /api/traces/<run_id> response.
  5. Scroll to the Baselines panel and confirm it reflects the same success rate / averages seen in the CLI baseline output and shows the "Latest published" card referencing your export.
  6. Open the Guide page (/guide) and confirm the agent expectations table loads.
  7. Run the rate_limited_chain@1 task from the UI (or CLI) to verify the chain task renders traces correctly; even if your reference agent fails, the trace + error should appear in the Trace tab.
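When checking several runs in step 4, it can help to generate the viewer and download URLs programmatically. This sketch builds both URLs named above; it assumes run IDs are 32-character lowercase hex strings (matching the recorded IDs at the end of this checklist) and that the server is on the default uvicorn address:

```python
import re

BASE_URL = "http://localhost:8000"  # default uvicorn address from this checklist

def trace_urls(run_id: str) -> dict:
    """Build the trace-viewer and JSON-download URLs for a run.

    Assumes run IDs are 32-char lowercase hex, as in the recorded IDs below.
    """
    if not re.fullmatch(r"[0-9a-f]{32}", run_id):
        raise ValueError(f"unexpected run_id format: {run_id!r}")
    return {
        "viewer": f"{BASE_URL}/?trace_id={run_id}#trace-viewer",
        "json": f"{BASE_URL}/api/traces/{run_id}",
    }
```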

4. Cleanup & determinism check

  1. If you need a clean slate, delete artifacts after capturing them elsewhere:
    Remove-Item -Recurse -Force .agent_bench\runs\*
  2. Re-run the determinism suite to ensure no drift:
    python -m pytest tests/test_determinism.py
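The determinism suite's contents aren't shown here, but the property it guards can be illustrated with a toy stand-in: when all randomness flows from a single seeded RNG, identical seeds produce identical traces. The `run_agent` function below is purely illustrative, not the harness API:

```python
import random

def run_agent(seed: int, steps: int = 5) -> list:
    """Toy stand-in for a harness run: all randomness comes from one seeded RNG.
    (Illustrative only; the real checks live in tests/test_determinism.py.)"""
    rng = random.Random(seed)
    return [rng.randint(0, 999) for _ in range(steps)]

def test_same_seed_same_trace():
    # The core determinism property: same seed, byte-identical output.
    assert run_agent(seed=42) == run_agent(seed=42)

def test_different_seed_diverges():
    # Different seeds should produce different traces.
    assert run_agent(seed=42) != run_agent(seed=7)
```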

5. Release gating

Before tagging or sharing results:

  • Ensure this checklist has been completed in the current commit.
  • Archive the run_id values referenced in reports so they remain a reproducible proof of behavior.
  • Run the full test suite (python -m pytest).
  • Note the harness version reported in run metadata; it should match the release tag.
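The last bullet (harness version matching the release tag) is easy to script. This sketch assumes the run metadata exposes a `harness_version` key and that tags carry a `v` prefix like v0.9.0; both are assumptions about conventions not spelled out above:

```python
def version_matches_tag(metadata: dict, release_tag: str) -> bool:
    """Check the harness version recorded in run metadata against a release tag.

    Assumes a 'harness_version' metadata key and a 'v'-prefixed tag
    (e.g. v0.9.0) -- both illustrative assumptions.
    """
    version = metadata.get("harness_version", "")
    return bool(version) and version.lstrip("v") == release_tag.lstrip("v")
```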

Recorded run IDs (v0.9.0)

  • agents/toy_agent.py + filesystem_hidden_config@1: run_id e8d59eb459774d59aeddc30c59b3509d
  • agents/rate_limit_agent.py + rate_limited_api@1: run_id 5f1d056ced944eeb8a3ae1b98d26a159
  • agents/chain_agent.py + rate_limited_chain@1: run_id e0cdfa6774604edcbab0b96238206f67
  • agents/ops_triage_agent.py + log_alert_triage@1: run_id e13c79dfdb244385af677d049bb9103b