feat(testing): add baseline comparison for DSL run artifacts #1558

Open
AdityaShome wants to merge 14 commits into mofa-org:main from AdityaShome:testing-baseline-comparison

@AdityaShome

Summary

This PR is rebased on #1556, #1555, and #1447 and adds baseline comparison for DSL run artifacts on top of the canonical artifact path.

mofa test-dsl can now save the current run artifact as a baseline and compare future runs against a saved baseline artifact. This adds the first regression-oriented workflow for the DSL path without expanding into replay or CI gating yet.

Context

The previous step introduced a canonical AgentRunArtifact and optional artifact output via --artifact-out.

That established a stable execution artifact, but there was still no way to compare a current run against a saved baseline. This PR adds that comparison layer in a narrow MVP form.

This PR closes that gap by:

  • adding a baseline diff model for DSL run artifacts
  • comparing the current artifact against a saved baseline artifact
  • exposing baseline input/output through mofa test-dsl
  • surfacing compact match/mismatch output in the CLI

What Changed

  • Added AgentRunArtifactDiff and ArtifactDifference
  • Added AgentRunArtifact::compare_to(&baseline)
  • Added --baseline-in <file> to compare against a saved baseline artifact
  • Added --baseline-out <file> to write the current artifact as a baseline
  • Added compact CLI output for baseline matches and mismatches
  • Added unit and CLI integration coverage for baseline flows

Files Changed

Core Files

tests/src/artifact.rs

  • Added AgentRunArtifactDiff.
  • Added ArtifactDifference.
  • Added compare_to for MVP artifact comparison.

tests/src/lib.rs

  • Exported the baseline diff types from the root-level tests/ crate.

crates/mofa-cli/src/cli.rs

  • Added --baseline-in and --baseline-out for mofa test-dsl.
  • Added clap parse coverage for the new flags.

crates/mofa-cli/src/commands/test_dsl.rs

  • Loads a saved baseline artifact when --baseline-in is provided.
  • Compares the current artifact to the baseline artifact.
  • Prints baseline: matched or baseline: mismatch.
  • Prints compact difference: <field> lines when mismatches are detected.
  • Writes the current artifact as a baseline when --baseline-out is provided.
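
The summary lines described above could be produced along these lines. This is a minimal sketch, not the code in this PR: `render_baseline_summary` is a hypothetical helper name, and it assumes the comparison result has already been reduced to a list of differing field names.

```rust
/// Hypothetical helper (not taken from the PR): turns the names of the
/// differing fields into the compact CLI lines described above.
/// An empty slice means the current artifact matched the baseline.
pub fn render_baseline_summary(differing_fields: &[&str]) -> Vec<String> {
    if differing_fields.is_empty() {
        return vec!["baseline: matched".to_string()];
    }

    // One header line, then one `difference: <field>` line per mismatch,
    // in the order the fields were reported.
    let mut lines = vec!["baseline: mismatch".to_string()];
    lines.extend(
        differing_fields
            .iter()
            .map(|field| format!("difference: {}", field)),
    );
    lines
}
```

Returning the lines instead of printing them directly keeps the formatting easy to unit test; the command can then print each line in order.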

crates/mofa-cli/src/main.rs

  • Wired baseline flags through CLI dispatch.

tests/tests/artifact_tests.rs

  • Added focused artifact comparison tests for exact-match and mismatch cases.

Supporting Files

crates/mofa-cli/tests/test_dsl_integration_tests.rs: Added CLI integration coverage for baseline write and mismatch reporting.

Baseline Comparison Scope

The initial comparison is intentionally narrow and compares:

  • status
  • output_text
  • assertion signatures (kind and passed)
  • ordered tool call names

This PR does not yet compare:

  • durations or timing thresholds
  • workspace snapshots
  • session snapshots
  • LLM request/response payload details
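
The scope above can be illustrated with a small standalone sketch. The types here (`RunArtifact`, `AssertionOutcome`, `Difference`, `ArtifactDiff`) are simplified stand-ins invented for illustration; the actual `AgentRunArtifact`, `AgentRunArtifactDiff`, and `ArtifactDifference` in tests/src/artifact.rs may have different fields and signatures. The `message` field is an assumed extra used only to show that anything outside the compared set is ignored.

```rust
// Simplified stand-ins for illustration only; not the PR's real types.
#[derive(Debug, Clone, PartialEq)]
pub struct AssertionOutcome {
    pub kind: String,
    pub passed: bool,
    pub message: String, // assumed extra detail, not part of the signature
}

#[derive(Debug, Clone, PartialEq)]
pub struct RunArtifact {
    pub status: String,
    pub output_text: String,
    pub assertions: Vec<AssertionOutcome>,
    pub tool_calls: Vec<String>, // ordered tool call names
    pub duration_ms: u64,        // intentionally not compared
}

#[derive(Debug, Clone, PartialEq)]
pub struct Difference {
    pub field: &'static str,
    pub baseline: String,
    pub current: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ArtifactDiff {
    pub differences: Vec<Difference>,
}

impl ArtifactDiff {
    pub fn is_match(&self) -> bool {
        self.differences.is_empty()
    }
}

impl RunArtifact {
    // An assertion signature is just (kind, passed), as listed in the scope.
    fn assertion_signatures(&self) -> Vec<(&str, bool)> {
        self.assertions
            .iter()
            .map(|a| (a.kind.as_str(), a.passed))
            .collect()
    }

    /// Compares only status, output_text, assertion signatures,
    /// and the ordered list of tool call names.
    pub fn compare_to(&self, baseline: &RunArtifact) -> ArtifactDiff {
        let mut differences = Vec::new();
        let mut check = |field: &'static str, base: String, cur: String| {
            if base != cur {
                differences.push(Difference { field, baseline: base, current: cur });
            }
        };

        check("status", baseline.status.clone(), self.status.clone());
        check("output_text", baseline.output_text.clone(), self.output_text.clone());
        check(
            "assertions",
            format!("{:?}", baseline.assertion_signatures()),
            format!("{:?}", self.assertion_signatures()),
        );
        check(
            "tool_calls",
            format!("{:?}", baseline.tool_calls),
            format!("{:?}", self.tool_calls),
        );

        ArtifactDiff { differences }
    }
}
```

Because tool call names are compared as an ordered list, the same calls in a different order count as a mismatch under this sketch, while a change in `duration_ms` alone does not.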

CLI Usage

Examples:

mofa test-dsl tests/examples/simple_agent.toml --baseline-out dsl-baseline.json
mofa test-dsl tests/examples/tool_agent.toml --baseline-in dsl-baseline.json

Execution Flow

  flowchart TD
      A[TOML DSL file] --> B[mofa test-dsl]
      B --> C[Execute case]
      C --> D[Build current AgentRunArtifact]

      E[Saved baseline artifact] --> F[Load baseline]
      F --> G[Compare artifacts]
      D --> G

      G --> H[baseline matched]
      G --> I[baseline mismatch]
      I --> J[Difference fields]

      D --> K[Write artifact or baseline file]
      G --> L[CLI summary output]

Example Output

mofa test-dsl tests/examples/tool_agent.toml --baseline-in dsl-baseline.json
case: tool_agent_run
status: passed
output: Tool execution complete
tool_calls: echo_tool
duration_ms: 0
baseline: mismatch
difference: output_text
difference: tool_calls

Tests

  • cargo test -p mofa-testing --test artifact_tests
  • cargo test -p mofa-cli --test test_dsl_integration_tests
  • cargo test -p mofa-cli test_test_dsl_

Notes

This PR keeps baseline comparison intentionally small and artifact-backed. It adds the first regression comparison loop for DSL runs while leaving deeper diff semantics, replay, and CI gating to later steps.
