Add local baseline comparison for CI agent workflow regressions #200

@luoyuctl

Background

Issue #198 defined a first-version product contract for comparing scripted agent workflow runs in CI without requiring hosted state.

Evidence

User value

Users running scripted agent tasks in CI should be able to detect regressions across runs, not only evaluate one current report in isolation.

Adoption rationale

Repeatable CI evidence makes agenttrace more useful for teams that need to review agent workflow drift over time while keeping all artifacts local or inside their own CI system.

Suggested scope

  • Define and implement a local baseline comparison mode for agenttrace report artifacts.
  • Support comparing two local JSON reports, or a current report plus a baseline file supplied by CI.
  • Emit deterministic JSON fields suitable for thresholds: slower_than_baseline, cost_delta_pct, token_delta_pct, new_failure_families, broader_tool_surface, broader_file_surface, and new_high_authority_tool_use.
  • Keep thresholds explicit in CLI flags or config.
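
A minimal sketch of how the comparison fields listed above might be derived from two local reports. The input keys (`duration_ms`, `cost_usd`, `total_tokens`, `failure_families`, `tools`, `files`, `high_authority_tools`) are assumptions made for illustration, not the actual agenttrace report schema. The output types are also a guess: the `broader_*` and `new_high_authority_tool_use` fields are shown as booleans and `new_failure_families` as a sorted list.

```python
import json


def pct_delta(baseline, current):
    """Percentage change relative to baseline; None when the baseline is zero."""
    if baseline == 0:
        return None
    return round((current - baseline) / baseline * 100.0, 2)


def added(baseline_items, current_items):
    """Items present in the current run but not in the baseline, sorted for stable output."""
    return sorted(set(current_items) - set(baseline_items))


def compare_reports(baseline, current):
    # All input keys below are hypothetical; adapt them to the real report schema.
    return {
        "slower_than_baseline": current["duration_ms"] > baseline["duration_ms"],
        "cost_delta_pct": pct_delta(baseline["cost_usd"], current["cost_usd"]),
        "token_delta_pct": pct_delta(baseline["total_tokens"], current["total_tokens"]),
        "new_failure_families": added(baseline["failure_families"], current["failure_families"]),
        "broader_tool_surface": bool(added(baseline["tools"], current["tools"])),
        "broader_file_surface": bool(added(baseline["files"], current["files"])),
        "new_high_authority_tool_use": bool(
            added(baseline["high_authority_tools"], current["high_authority_tools"])
        ),
    }


def render(comparison):
    # sort_keys plus fixed separators keeps the serialized JSON byte-stable across runs.
    return json.dumps(comparison, sort_keys=True, separators=(",", ":"))
```

Sorting the set differences and serializing with `sort_keys=True` is one way to keep the output deterministic. No thresholds are applied here, so the result only reports differences and leaves pass/fail policy to the caller.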

Non-goals

  • Do not add hosted storage or remote state.
  • Do not invent hidden policy thresholds.
  • Do not require billing-grade cost reconciliation.
  • Do not change package/release state.

Acceptance criteria

  • A CI workflow can compare a current report against a local baseline artifact.
  • JSON output exposes deterministic regression fields for time, cost/tokens, failures, and touched surface.
  • Missing baseline or incompatible report versions produce clear diagnostics.
  • Existing output-contract, deterministic-output, and report-semantics checks are extended or preserved.
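
The diagnostics criterion above could be met along these lines. This is a sketch under assumptions: the `schema_version` key, the supported version set, and the `ComparisonInputError` name are hypothetical and not taken from agenttrace.

```python
import json
from pathlib import Path

# Hypothetical: the set of report versions this comparison understands.
SUPPORTED_SCHEMA_VERSIONS = {1}


class ComparisonInputError(Exception):
    """Raised when a report cannot be used as comparison input."""


def load_report(path, role):
    """Load one report; role is "baseline" or "current" and only shapes the message."""
    file = Path(path)
    if not file.is_file():
        raise ComparisonInputError(f"{role} report not found: {path}")
    try:
        report = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ComparisonInputError(
            f"{role} report is not valid JSON: {path} ({exc.msg})"
        ) from exc
    version = report.get("schema_version") if isinstance(report, dict) else None
    if not isinstance(version, int) or version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ComparisonInputError(
            f"{role} report has unsupported schema_version {version!r}; "
            f"supported: {sorted(SUPPORTED_SCHEMA_VERSIONS)}"
        )
    return report


def load_pair(baseline_path, current_path):
    """Load both reports and reject a version mismatch between them."""
    baseline = load_report(baseline_path, "baseline")
    current = load_report(current_path, "current")
    if baseline["schema_version"] != current["schema_version"]:
        raise ComparisonInputError(
            "baseline and current reports use different schema_version values: "
            f"{baseline['schema_version']} vs {current['schema_version']}"
        )
    return baseline, current
```

A CI step could catch `ComparisonInputError`, print the message to stderr, and exit with a distinct non-zero code so that a missing baseline is distinguishable from a detected regression.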

Suggested lane

lane/quality

Risk

Medium. Baseline comparison can become noisy if thresholds are implicit or unstable; keep comparison fields explicit and deterministic.
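
To illustrate keeping thresholds explicit, here is a sketch in which every limit must be passed as a flag and nothing is enforced by default. The flag names and the `compare-sketch` program name are hypothetical and are not existing agenttrace options.

```python
import argparse


def build_parser():
    # Every threshold defaults to "not set", so no hidden policy applies.
    parser = argparse.ArgumentParser(prog="compare-sketch")
    parser.add_argument("--max-cost-delta-pct", type=float, default=None)
    parser.add_argument("--max-token-delta-pct", type=float, default=None)
    parser.add_argument("--fail-on-new-failure-families", action="store_true")
    return parser


def violations(comparison, args):
    """Return the names of comparison fields that exceed an explicitly set threshold."""
    found = []
    cost = comparison.get("cost_delta_pct")
    if args.max_cost_delta_pct is not None and cost is not None:
        if cost > args.max_cost_delta_pct:
            found.append("cost_delta_pct")
    tokens = comparison.get("token_delta_pct")
    if args.max_token_delta_pct is not None and tokens is not None:
        if tokens > args.max_token_delta_pct:
            found.append("token_delta_pct")
    if args.fail_on_new_failure_families and comparison.get("new_failure_families"):
        found.append("new_failure_families")
    return found
```

The checks run in a fixed order, so the returned list is stable for a given input. A caller could exit non-zero when the list is non-empty.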

Source

Follow-up split from #198 and Discussion #2 feedback: #2 (comment)
