Manual Verification Checklist

Use this script before publishing results or tagging a release to confirm that the CLI, run logging, trace viewer, and baseline diagnostics all behave as expected for TraceCore.

1. Prerequisites

  • python -m venv .venv && .venv\Scripts\activate (Windows; use source .venv/bin/activate on POSIX, or reuse an existing environment)
  • pip install -e .[dev]
  • Ensure .agent_bench/runs/ exists (it is created automatically after the first run)
  • Optional: set PYTHONWARNINGS=default so surfaced issues are visible during verification
  • Optional: create an agent-bench.toml with your preferred agent, task, and seed defaults to avoid retyping flags during this checklist (read by both tracecore and the legacy agent-bench alias).
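The checklist doesn't show the shape of agent-bench.toml, so here is a minimal sketch; the table and key names (`defaults`, `agent`, `task`, `seed`) are illustrative assumptions, not confirmed by the project docs:

```toml
# Hypothetical agent-bench.toml -- key names are assumptions for illustration
[defaults]
agent = "agents/toy_agent.py"
task = "filesystem_hidden_config@1"
seed = 42
```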

2. CLI flow

  1. Run the colorful wizard if you want a guided agent/task/seed selection before manual commands (optional, but it uses the same run pipeline you’ll exercise below):
    tracecore interactive
    # Try --dry-run to preview the command without executing
    # Try --save-session to persist your choices
    # Try typing partial names to filter agents/tasks
    # Press ? during prompts for inline help
    # If you have baseline data, you'll see suggested pairings
  2. Run each deterministic agent/task pair below to confirm persistence (re-running a pair with the same seed should reproduce identical results):
    tracecore run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
    tracecore run --agent agents/rate_limit_agent.py --task rate_limited_api@1 --seed 11
    tracecore run --agent agents/chain_agent.py --task rate_limited_chain@1 --seed 7
    tracecore run --agent agents/ops_triage_agent.py --task log_alert_triage@1 --seed 21
  3. List recent artifacts and confirm the run IDs you just produced appear at the top:
    tracecore runs list --limit 5
  4. Generate a baseline snapshot for each agent/task pair and sanity-check the metrics:
    tracecore baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1
    tracecore baseline --agent agents/rate_limit_agent.py --task rate_limited_api@1
  5. Compare two representative runs (mix and match run IDs or explicit artifact paths) and confirm the diff flags step-level deltas you expect:
    tracecore baseline --compare .agent_bench/runs/<run_a>.json .agent_bench/runs/<run_b>.json
  6. Export a frozen baseline for the UI (creates .agent_bench/baselines/baseline-<ts>.json):
    tracecore baseline --export latest
  7. Note the run_id values; you'll load them in the UI next.
  8. Replay a prior run deterministically (overrides allowed) and confirm the output matches the original artifact:
    tracecore run --replay <run_id>
    # Optional overrides:
    tracecore run --replay <run_id> --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
  9. Validate bundled task manifests and registry entries:
    tracecore tasks validate --registry
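The replay check in step 8 ("confirm the output matches the original artifact") can be scripted. Below is a sketch that compares two run artifacts while ignoring metadata that legitimately differs between a run and its replay; the artifact schema and the volatile field names (`run_id`, `started_at`, `finished_at`) are assumptions:

```python
import json
from pathlib import Path

# Fields expected to differ between an original run and its replay
# (identifiers and timestamps). Field names are illustrative assumptions.
VOLATILE_KEYS = {"run_id", "started_at", "finished_at"}

def strip_volatile(artifact: dict) -> dict:
    """Return a copy of a run artifact without volatile metadata."""
    return {k: v for k, v in artifact.items() if k not in VOLATILE_KEYS}

def runs_match(path_a: str, path_b: str) -> bool:
    """True if two run artifacts are identical apart from volatile keys."""
    a = json.loads(Path(path_a).read_text())
    b = json.loads(Path(path_b).read_text())
    return strip_volatile(a) == strip_volatile(b)
```

Point it at the original artifact and the replayed one under .agent_bench/runs/ to get a yes/no answer instead of eyeballing the JSON.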

3. Web UI flow

  1. Start the server (module form avoids PATH issues):
    python -m uvicorn agent_bench.webui.app:app --reload
  2. Visit http://localhost:8000 in a fresh browser tab.
  3. Run the same agent/task combinations from the form and verify the result JSON (including seed) matches the CLI output.
  4. Click any “trace” link (or navigate to /?trace_id=<run_id>#trace-viewer) and confirm:
    • The Trace Viewer section auto-scrolls into view.
    • Step entries include observation, action, and result payloads.
    • The “Download JSON” link serves the /api/traces/<run_id> response.
  5. Scroll to the Baselines panel and confirm it reflects the same success rate / averages seen in the CLI baseline output and shows the "Latest published" card referencing your export.
  6. Open the Guide page (/guide) and confirm the agent expectations table loads.
  7. Run the rate_limited_chain@1 task from the UI (or CLI) to verify the chain task renders traces correctly; even if your reference agent fails, the trace + error should appear in the Trace tab.
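When checking several runs in step 4, it can help to generate the viewer and download URLs programmatically. This sketch builds both URLs named above; it assumes run IDs are 32-character lowercase hex strings (matching the recorded IDs at the end of this checklist) and that the server is on the default uvicorn address:

```python
import re

BASE_URL = "http://localhost:8000"  # default uvicorn address from this checklist

def trace_urls(run_id: str) -> dict:
    """Build the trace-viewer and JSON-download URLs for a run.

    Assumes run IDs are 32-char lowercase hex, as in the recorded IDs below.
    """
    if not re.fullmatch(r"[0-9a-f]{32}", run_id):
        raise ValueError(f"unexpected run_id format: {run_id!r}")
    return {
        "viewer": f"{BASE_URL}/?trace_id={run_id}#trace-viewer",
        "json": f"{BASE_URL}/api/traces/{run_id}",
    }
```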

4. Cleanup & determinism check

  1. If you need a clean slate, delete artifacts after capturing them elsewhere:
    Remove-Item -Recurse -Force .agent_bench\runs\*
  2. Re-run the determinism suite to ensure no drift:
    python -m pytest tests/test_determinism.py
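The determinism suite's contents aren't shown here, but the property it guards can be illustrated with a toy stand-in: when all randomness flows from a single seeded RNG, identical seeds produce identical traces. The `run_agent` function below is purely illustrative, not the harness API:

```python
import random

def run_agent(seed: int, steps: int = 5) -> list:
    """Toy stand-in for a harness run: all randomness comes from one seeded RNG.
    (Illustrative only; the real checks live in tests/test_determinism.py.)"""
    rng = random.Random(seed)
    return [rng.randint(0, 999) for _ in range(steps)]

def test_same_seed_same_trace():
    # The core determinism property: same seed, byte-identical output.
    assert run_agent(seed=42) == run_agent(seed=42)

def test_different_seed_diverges():
    # Different seeds should produce different traces.
    assert run_agent(seed=42) != run_agent(seed=7)
```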

5. Release gating

Before tagging or sharing results:

  • Ensure this checklist has been completed in the current commit.
  • Archive the run_id values referenced in reports so they remain a reproducible proof of behavior.
  • Run the full test suite (python -m pytest).
  • Note the harness version reported in run metadata; it should match the release tag.
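The last bullet (harness version matching the release tag) is easy to script. This sketch assumes the run metadata exposes a `harness_version` key and that tags carry a `v` prefix like v0.9.0; both are assumptions about conventions not spelled out above:

```python
def version_matches_tag(metadata: dict, release_tag: str) -> bool:
    """Check the harness version recorded in run metadata against a release tag.

    Assumes a 'harness_version' metadata key and a 'v'-prefixed tag
    (e.g. v0.9.0) -- both illustrative assumptions.
    """
    version = metadata.get("harness_version", "")
    return bool(version) and version.lstrip("v") == release_tag.lstrip("v")
```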

Recorded run IDs (v0.9.0)

  • agents/toy_agent.py + filesystem_hidden_config@1: run_id e8d59eb459774d59aeddc30c59b3509d
  • agents/rate_limit_agent.py + rate_limited_api@1: run_id 5f1d056ced944eeb8a3ae1b98d26a159
  • agents/chain_agent.py + rate_limited_chain@1: run_id e0cdfa6774604edcbab0b96238206f67
  • agents/ops_triage_agent.py + log_alert_triage@1: run_id e13c79dfdb244385af677d049bb9103b