Add local baseline comparison for CI agent workflow regressions

## Background

Issue #198 defined a first-version product contract for comparing scripted agent workflow runs in CI without requiring hosted state.

## Evidence

- Discussion #2 feedback says the strongest CI gate use case is regression testing agent workflows: detecting whether a scripted task became slower, more expensive, more failure-prone, or touched a broader file/tool surface than a previous run.
- #198 proposes baseline-report comparison fields such as elapsed time, token/cost deltas, new failure families, broader tool/file surface, and new high-authority tool use.
- Duplicate search on 2026-05-17 found no separate open task for baseline report comparison.

## User value

Users running scripted agent tasks in CI should be able to detect regressions across runs, not only evaluate one current report in isolation.

## Adoption rationale

Repeatable CI evidence makes agenttrace more useful for teams that need to review agent workflow drift over time while keeping all artifacts local or inside their own CI system.

## Suggested scope

- Define and implement a local baseline comparison mode for agenttrace report artifacts.
- Support comparing two local JSON reports, or a current report plus a baseline file supplied by CI.
- Emit deterministic JSON fields suitable for thresholds: `slower_than_baseline`, `cost_delta_pct`, `token_delta_pct`, `new_failure_families`, `broader_tool_surface`, `broader_file_surface`, and `new_high_authority_tool_use`.
- Keep thresholds explicit in CLI flags or config.
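
As a rough illustration of the comparison described in the scope list, here is a minimal Python sketch. It is not the agenttrace implementation. The input keys (`elapsed_seconds`, `cost_usd`, `total_tokens`, `failure_families`, `tools_used`, `files_touched`, `high_authority_tools`) and the `slower_tolerance_pct` parameter are hypothetical; only the output field names come from this issue. The sketch also assumes the surface fields report newly seen items as sorted lists, which is one possible reading of the contract.

```python
import json
from typing import Any, Dict, List, Optional


def pct_delta(baseline: float, current: float) -> Optional[float]:
    """Percentage change relative to the baseline; None if the baseline is zero."""
    if baseline == 0:
        return None
    return round((current - baseline) / baseline * 100.0, 2)


def compare_reports(
    baseline: Dict[str, Any],
    current: Dict[str, Any],
    slower_tolerance_pct: float = 0.0,  # hypothetical explicit threshold
) -> Dict[str, Any]:
    """Build the regression fields named in this issue from two report dicts."""

    def new_items(key: str) -> List[str]:
        # Items present in the current run but absent from the baseline,
        # sorted so that the output is deterministic.
        return sorted(set(current.get(key, [])) - set(baseline.get(key, [])))

    elapsed_delta = pct_delta(baseline["elapsed_seconds"], current["elapsed_seconds"])
    return {
        "slower_than_baseline": elapsed_delta is not None
        and elapsed_delta > slower_tolerance_pct,
        "cost_delta_pct": pct_delta(baseline["cost_usd"], current["cost_usd"]),
        "token_delta_pct": pct_delta(baseline["total_tokens"], current["total_tokens"]),
        "new_failure_families": new_items("failure_families"),
        "broader_tool_surface": new_items("tools_used"),
        "broader_file_surface": new_items("files_touched"),
        "new_high_authority_tool_use": new_items("high_authority_tools"),
    }


def to_deterministic_json(comparison: Dict[str, Any]) -> str:
    """Serialize with sorted keys so identical inputs give identical bytes."""
    return json.dumps(comparison, sort_keys=True, indent=2)
```

Sorting both the keys and the list values keeps the output stable across runs, which is what lets a CI job diff or threshold on it reliably.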

## Non-goals

- Do not add hosted storage or remote state.
- Do not invent hidden policy thresholds.
- Do not require billing-grade cost reconciliation.
- Do not change package/release state.

## Acceptance criteria

- A CI workflow can compare a current report against a local baseline artifact.
- JSON output exposes deterministic regression fields for time, cost/tokens, failures, and touched surface.
- Missing baseline or incompatible report versions produce clear diagnostics.
- Existing output-contract, deterministic-output, and report-semantics checks are extended or preserved.
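
The diagnostics criterion could be approached along the following lines. This is a sketch under stated assumptions, not agenttrace's actual behavior: the `schema_version` key, the `BaselineError` class, and both function names are hypothetical, and strict version equality is only one possible compatibility rule.

```python
import json
from pathlib import Path
from typing import Any, Dict


class BaselineError(Exception):
    """Raised when a report cannot be loaded or compared (hypothetical type)."""


def load_report(path: str, role: str) -> Dict[str, Any]:
    """Load a JSON report, naming its role ('baseline' or 'current') in errors."""
    report_path = Path(path)
    if not report_path.is_file():
        raise BaselineError(f"{role} report not found: {report_path}")
    try:
        data = json.loads(report_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise BaselineError(
            f"{role} report is not valid JSON: {report_path} ({exc.msg})"
        ) from exc
    if not isinstance(data, dict):
        raise BaselineError(f"{role} report must be a JSON object: {report_path}")
    return data


def check_compatible(baseline: Dict[str, Any], current: Dict[str, Any]) -> None:
    """Refuse to compare reports whose (assumed) schema versions differ."""
    base_version = baseline.get("schema_version")
    cur_version = current.get("schema_version")
    if base_version is None or cur_version is None:
        raise BaselineError("report is missing the schema_version field")
    if base_version != cur_version:
        raise BaselineError(
            "incompatible report versions: "
            f"baseline={base_version!r}, current={cur_version!r}"
        )
```

Failing early with a message that says which file and which version mismatched lets the CI log explain the failure without a rerun.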

## Suggested lane

lane/quality

## Risk

Medium. Baseline comparison can become noisy if thresholds are implicit or unstable; keep comparison fields explicit and deterministic.
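
One way to keep thresholds explicit is to gate only on limits the caller actually supplies, so a run with no thresholds configured never fails on a hidden default. The sketch below illustrates that idea only. The function and parameter names are hypothetical rather than existing agenttrace flags, and the comparison dict is assumed to use the field names listed under Suggested scope.

```python
from typing import Any, Dict, List, Optional


def evaluate_thresholds(
    comparison: Dict[str, Any],
    max_cost_delta_pct: Optional[float] = None,
    max_token_delta_pct: Optional[float] = None,
    fail_on_new_failure_families: bool = False,
    fail_on_new_high_authority_tool_use: bool = False,
) -> List[str]:
    """Return violation messages; an unset threshold never produces one."""
    violations: List[str] = []

    cost = comparison.get("cost_delta_pct")
    if max_cost_delta_pct is not None and cost is not None and cost > max_cost_delta_pct:
        violations.append(f"cost_delta_pct {cost} exceeds limit {max_cost_delta_pct}")

    tokens = comparison.get("token_delta_pct")
    if max_token_delta_pct is not None and tokens is not None and tokens > max_token_delta_pct:
        violations.append(f"token_delta_pct {tokens} exceeds limit {max_token_delta_pct}")

    if fail_on_new_failure_families and comparison.get("new_failure_families"):
        violations.append(
            "new_failure_families: " + ", ".join(comparison["new_failure_families"])
        )

    if fail_on_new_high_authority_tool_use and comparison.get("new_high_authority_tool_use"):
        violations.append(
            "new_high_authority_tool_use: "
            + ", ".join(comparison["new_high_authority_tool_use"])
        )

    return violations
```

A CI step could exit non-zero when the returned list is non-empty and print each message, so every failure traces back to a threshold that someone set deliberately.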

## Source

Follow-up split from #198 and Discussion #2 feedback: https://github.com/luoyuctl/agenttrace/discussions/2#discussioncomment-16895719

Add local baseline comparison for CI agent workflow regressions #200
