Background
Issue #198 defined a first-version product contract for comparing scripted agent workflow runs in CI without requiring hosted state.
Evidence
User value
Users running scripted agent tasks in CI should be able to detect regressions across runs, not only evaluate one current report in isolation.
Adoption rationale
Repeatable CI evidence makes agenttrace more useful for teams that need to review agent workflow drift over time while keeping all artifacts local or inside their own CI system.
Suggested scope
- Define and implement a local baseline comparison mode for agenttrace report artifacts.
- Support comparing two local JSON reports, or a current report plus a baseline file supplied by CI.
- Emit deterministic JSON fields suitable for thresholds: slower_than_baseline, cost_delta_pct, token_delta_pct, new_failure_families, broader_tool_surface, broader_file_surface, and new_high_authority_tool_use.
- Keep thresholds explicit in CLI flags or config.
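
A minimal sketch of what the comparison step in the scope above might look like. Only the output field names come from this issue. The input report keys (`duration_s`, `cost_usd`, `total_tokens`, `failure_families`, `tools`, `files`, `high_authority_tools`), the function names, and the choice of booleans versus lists for each output field are assumptions made for illustration, not the actual agenttrace report schema.

```python
import json


def pct_delta(current, baseline):
    """Percent change relative to baseline; None when the baseline is zero or absent."""
    if not baseline:
        return None
    return round((current - baseline) / baseline * 100.0, 2)


def compare_reports(current, baseline):
    """Compare two report dicts (hypothetical schema) and return regression fields."""
    cur_tools = set(current.get("tools", []))
    base_tools = set(baseline.get("tools", []))
    cur_files = set(current.get("files", []))
    base_files = set(baseline.get("files", []))
    cur_fail = set(current.get("failure_families", []))
    base_fail = set(baseline.get("failure_families", []))
    cur_high = set(current.get("high_authority_tools", []))
    base_high = set(baseline.get("high_authority_tools", []))
    return {
        "slower_than_baseline": current.get("duration_s", 0) > baseline.get("duration_s", 0),
        "cost_delta_pct": pct_delta(current.get("cost_usd", 0), baseline.get("cost_usd", 0)),
        "token_delta_pct": pct_delta(current.get("total_tokens", 0), baseline.get("total_tokens", 0)),
        # Sorted so the same inputs always serialize to the same bytes.
        "new_failure_families": sorted(cur_fail - base_fail),
        "broader_tool_surface": bool(cur_tools - base_tools),
        "broader_file_surface": bool(cur_files - base_files),
        "new_high_authority_tool_use": bool(cur_high - base_high),
    }


def to_deterministic_json(comparison):
    """Serialize with sorted keys and fixed separators for byte-stable output."""
    return json.dumps(comparison, sort_keys=True, separators=(",", ":"))


baseline_report = {
    "duration_s": 10, "cost_usd": 1.0, "total_tokens": 1000,
    "failure_families": ["timeout"], "tools": ["read"],
    "files": ["a.py"], "high_authority_tools": [],
}
current_report = {
    "duration_s": 12, "cost_usd": 1.5, "total_tokens": 900,
    "failure_families": ["timeout", "auth"], "tools": ["read", "shell"],
    "files": ["a.py"], "high_authority_tools": ["shell"],
}
comparison = compare_reports(current_report, baseline_report)
print(to_deterministic_json(comparison))
```

The sketch only computes facts about the two reports. It applies no pass/fail policy, which leaves thresholds to explicit flags or config.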
Non-goals
- Do not add hosted storage or remote state.
- Do not invent hidden policy thresholds.
- Do not require billing-grade cost reconciliation.
- Do not change package/release state.
Acceptance criteria
- A CI workflow can compare a current report against a local baseline artifact.
- JSON output exposes deterministic regression fields for time, cost/tokens, failures, and touched surface.
- Missing baseline or incompatible report versions produce clear diagnostics.
- Existing output-contract, deterministic-output, and report-semantics checks are extended or preserved.
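
The diagnostics criterion above could be approached along these lines. This is a sketch under stated assumptions: the `schema_version` key, the `BaselineError` type, and both function names are hypothetical and are not taken from agenttrace.

```python
import json
from pathlib import Path


class BaselineError(Exception):
    """Raised when a baseline cannot be used for comparison (hypothetical type)."""


def load_report(path):
    """Load a local JSON report, turning low-level failures into clear diagnostics."""
    p = Path(path)
    if not p.is_file():
        raise BaselineError(f"baseline not found: {p}")
    try:
        return json.loads(p.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise BaselineError(f"baseline is not valid JSON: {p}: {exc}") from exc


def check_compatible(current, baseline):
    """Refuse to compare reports whose (assumed) schema_version values differ."""
    cur_v = current.get("schema_version")
    base_v = baseline.get("schema_version")
    if cur_v is None or base_v is None:
        raise BaselineError("report is missing schema_version; cannot verify compatibility")
    if cur_v != base_v:
        raise BaselineError(
            f"incompatible report versions: current={cur_v}, baseline={base_v}"
        )
```

Failing early with a specific message keeps a missing or stale baseline from being silently treated as "no regression".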
Suggested lane
lane/quality
Risk
Medium. Baseline comparison can become noisy if thresholds are implicit or unstable; keep comparison fields explicit and deterministic.
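
One way to keep thresholds explicit is to apply a check only when the caller passes a flag for it. The sketch below assumes hypothetical flag names (`--max-cost-delta-pct`, `--max-token-delta-pct`, `--fail-on-new-failure-families`) and a hypothetical `compare-gate` program name; none of these exist in agenttrace today.

```python
import argparse


def build_parser():
    """Hypothetical flags; a default of None means 'no threshold applied'."""
    parser = argparse.ArgumentParser(prog="compare-gate")
    parser.add_argument("--max-cost-delta-pct", type=float, default=None)
    parser.add_argument("--max-token-delta-pct", type=float, default=None)
    parser.add_argument("--fail-on-new-failure-families", action="store_true")
    return parser


def violations(comparison, args):
    """Return human-readable violations for thresholds the caller explicitly set."""
    found = []
    cost = comparison.get("cost_delta_pct")
    if args.max_cost_delta_pct is not None and cost is not None:
        if cost > args.max_cost_delta_pct:
            found.append(f"cost_delta_pct {cost} exceeds {args.max_cost_delta_pct}")
    tokens = comparison.get("token_delta_pct")
    if args.max_token_delta_pct is not None and tokens is not None:
        if tokens > args.max_token_delta_pct:
            found.append(f"token_delta_pct {tokens} exceeds {args.max_token_delta_pct}")
    if args.fail_on_new_failure_families and comparison.get("new_failure_families"):
        families = ", ".join(comparison["new_failure_families"])
        found.append(f"new failure families: {families}")
    return found
```

With no flags given, `violations` returns an empty list, so no policy is applied unless the CI configuration states one.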
Source
Follow-up split from #198 and Discussion #2 feedback: #2 (comment)