F1.5: live-contract test tier — verified beliefs about real tools (#36) by hartsock · Pull Request #37 · hartsock/modulex-mcp

hartsock · 2026-06-06T02:23:03Z

Summary

Stacked on #35 — retarget to main after it merges (and #34 before it — the usual order).

Tier 3 of the testing strategy: tests gated on `MODULEX_LIVE_TESTS=1` that run REAL tools through the real `TokioSpawner` + leash and assert only the output shapes our mock fixtures encode:

| Live test | Belief verified |
|---|---|
| `live_git_status_and_unpushed_states` | clean/dirty/no-upstream state mapping, via the real handlers against a test-created repo |
| `live_gh_pr_list_json_shape` | `gh --json` emits `number`/`title`/`author.login` (skips if gh is absent or unauthenticated) |
| `live_glab_presence` | glab presence/version (auth-marker belief documented as host-only) |
| `live_harness_contract_end_to_end` | JSON-on-stdout harness contract against a real process |
| `live_plugin_protocol_with_real_python` | `modulex-plugin/1` stdin/stdout round-trip with real python3 |

- Opt-in by construction: without the env var, every test exits early with a notice, so `just check` and PR CI remain host-independent (the probe-seam lesson, kept).
- FIXTURE-SYNC rule: mock fixtures that mimic real CLIs now carry comments citing their live test; `CLAUDE.md` codifies the rule.
- `just live-test` recipe plus `.github/workflows/live-contract.yml` (manual dispatch + weekly cron, `GH_TOKEN`-authed) so fixture drift surfaces on a regular cadence.
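A minimal sketch of the opt-in gate, assuming each live test checks the environment variable itself; `live_tests_enabled` and `skip_unless_live` are hypothetical helper names, not the crate's actual API. A skipped test simply returns early, which the test harness counts as a pass; that is consistent with the "5 passed" skip-path result in the test plan.

```rust
use std::env;

// Hypothetical helper: true only when MODULEX_LIVE_TESTS is exactly "1".
fn live_tests_enabled() -> bool {
    env::var("MODULEX_LIVE_TESTS").map(|v| v == "1").unwrap_or(false)
}

// Hypothetical helper: returns true (after printing a visible notice)
// when the calling live test should be skipped.
fn skip_unless_live(test_name: &str) -> bool {
    if live_tests_enabled() {
        return false;
    }
    eprintln!("{test_name}: skipped (set MODULEX_LIVE_TESTS=1 to run live-contract tests)");
    true
}

#[test]
fn live_example_contract() {
    if skip_unless_live("live_example_contract") {
        return; // early return: reported as passed, with zero tool contact
    }
    // ... drive the real tool through the real spawner here ...
}
```

Keeping the gate inside each test (rather than behind a cargo feature) means the same `cargo test` invocation covers both the skip path and the live path.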

Test plan

- Skip path: default run → 5 passed (all skip notices, zero tool contact)
- Live on a real host: 5/5 green. The gh shape was verified over live PRs from a public repo, real git states match our enums exactly, and real sh/python3 round-trips succeed.
- `just check` green: 126 tests total, clippy with `-D warnings` on all features, `fmt` clean

Fixes #36

WHAT: StepHandler grows description() and data_schema() (required: the compiler enforces the contract on every handler, including plugin seams). All 17 builtins now emit typed StepResult.data:
- git steps carry per-repo state enums (clean/dirty, all-pushed/unpushed/no-upstream, ok/diverged/fetch-failed/...);
- deadline/countdown are refactored to compute-then-render with numeric days/work-day payloads;
- github-pr-scan parses PRs into typed records;
- gitlab steps carry per-target state + raw passthrough;
- mr-sla/board/chores/reminders/url-watch are all typed;
- script reports exit_code;
- harness/python/Python-registered handlers are documented passthroughs.

StepRegistry::specs() + Engine::step_specs() expose (type, description, schema); MCP steps_list returns them; routine_run/step_run/report_get gain format="data" (Report::to_data_json: payloads only, no prose); the CLI gains --data.

Enforcement: tests/data_contract.rs pins every schema to a checked-in golden (tests/golden/<type>.json; UPDATE_GOLDEN_SCHEMAS=1 regenerates them, so a schema change is a visible, reviewed diff and a breaking release) and drives 15 builtins through the mock spawner + in-memory store, validating every executed step's data against its schema (jsonschema, dev-dep only; the runtime never validates).

WHY: FOUNDATION pillar A (#26): agents never parse prose. Verified live: `modulex run demo --data` returns pure typed payloads.

Disclosure tier: step-data contract (no new tools; steps_list enriched in place).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
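A sketch of why making the methods required yields compile-time enforcement: a trait method with no default body must appear in every `impl`, or the build fails with E0046 ("not all trait items implemented"). The trait shape, the `step_type` method, the `&'static str` schema stand-in (the real code presumably uses a JSON value type), and the `git-status` schema are all hypothetical illustrations, not the crate's actual definitions.

```rust
// Hypothetical, trimmed-down handler contract (the real trait also runs steps).
trait StepHandler {
    fn step_type(&self) -> &'static str;
    // No default bodies: every handler, plugin seams included, must supply these.
    fn description(&self) -> &'static str;
    // JSON Schema for StepResult.data, as a string stand-in for a JSON value.
    fn data_schema(&self) -> &'static str;
}

struct GitStatus;

impl StepHandler for GitStatus {
    fn step_type(&self) -> &'static str {
        "git-status" // hypothetical type name
    }
    fn description(&self) -> &'static str {
        "Per-repo clean/dirty state"
    }
    fn data_schema(&self) -> &'static str {
        r#"{"type":"object","properties":{"state":{"enum":["clean","dirty"]}}}"#
    }
}

// The (type, description, schema) triple a specs() call would expose.
#[derive(Debug, PartialEq)]
struct StepSpec {
    step_type: &'static str,
    description: &'static str,
    schema: &'static str,
}

struct StepRegistry {
    handlers: Vec<Box<dyn StepHandler>>,
}

impl StepRegistry {
    // Collects one spec per registered handler, in registration order.
    fn specs(&self) -> Vec<StepSpec> {
        self.handlers
            .iter()
            .map(|h| StepSpec {
                step_type: h.step_type(),
                description: h.description(),
                schema: h.data_schema(),
            })
            .collect()
    }
}
```

Because the schema comes from the same trait object that produces the data, a golden-file test can iterate `specs()` and compare each schema to its checked-in copy without a separate registry of schemas that could drift.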

WHAT: Spawner gains program_available() (default = the real PATH probe; TokioSpawner inherits it). ExecGate::program_available delegates to the spawner, and the engine's soft-skip probe now asks the gate instead of the host PATH. MockSpawner answers for its scripted world (everything is available unless named via the new .missing([..]) builder), and the engine's missing-tool test uses an explicit missing set instead of a hopefully-absent binary name.

WHY: CI failure on the data-contract test: the glab-backed steps soft-skipped on the runner (no glab installed) but ran on the dev box. The probe was an environment dependency outside the seam, violating the 'mocked engines are host-independent' rule.

Regression coverage: the contract test itself (which fails on any host missing a CLI under the old code) plus the reworked missing_tool_soft_skips_the_step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
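A minimal sketch of this seam, assuming a synchronous `program_available(&self, &str) -> bool` signature (the real signatures are not shown in this PR). `RealSpawner` stands in for `TokioSpawner`, and `path_probe` is a simplified, hypothetical PATH lookup; only the delegation pattern is taken from the commit message.

```rust
use std::collections::HashSet;
use std::env;

// Hypothetical, trimmed-down Spawner: the real trait also spawns processes.
trait Spawner {
    // Default implementation: the real PATH probe, inherited by real spawners.
    fn program_available(&self, program: &str) -> bool {
        path_probe(program)
    }
}

// Simplified PATH probe: true if some PATH directory contains a file named
// `program`. Ignores Windows PATHEXT and does not check the executable bit.
fn path_probe(program: &str) -> bool {
    let Some(paths) = env::var_os("PATH") else {
        return false; // no PATH at all: nothing is available
    };
    let found = env::split_paths(&paths).any(|dir| dir.join(program).is_file());
    found
}

// Stand-in for TokioSpawner: takes the default probe unchanged.
struct RealSpawner;
impl Spawner for RealSpawner {}

// Mock answers for its scripted world: every program is available
// unless it was named via `.missing(..)`.
#[derive(Default)]
struct MockSpawner {
    missing_programs: HashSet<String>,
}

impl MockSpawner {
    fn missing<I, S>(mut self, programs: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        for p in programs {
            self.missing_programs.insert(p.into());
        }
        self
    }
}

impl Spawner for MockSpawner {
    fn program_available(&self, program: &str) -> bool {
        !self.missing_programs.contains(program)
    }
}

// The gate asks its spawner, never the host PATH directly.
struct ExecGate<S: Spawner> {
    spawner: S,
}

impl<S: Spawner> ExecGate<S> {
    fn program_available(&self, program: &str) -> bool {
        self.spawner.program_available(program)
    }
}
```

With this shape, an engine test can build `MockSpawner::default().missing(["glab"])` and get the same answer on every host, which is the host-independence the WHY paragraph asks for.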

WHAT: tests/live_contract.rs: tier 3, opt-in via MODULEX_LIVE_TESTS=1 (without it, every test exits with a notice; the default check/CI stay host-independent). Five live verifications through the REAL TokioSpawner + leash:
1. git clean/dirty/no-upstream state mapping driven through the real GitStatus/GitUnpushed handlers against a test-created temp repo;
2. gh pr list --json field shape (number/title/author.login) against a stable public repo, skipping when gh is absent or unauthenticated;
3. glab presence/version probe;
4. the harness JSON-on-stdout contract end-to-end with a real shell script;
5. the modulex-plugin/1 protocol with a real python3 plugin (stdin request, stdout response, typed mapping).

Absent tools skip with a visible notice, which is correct in this tier, whose job is host-dependence. FIXTURE-SYNC citations added at the mock-fixture sites (data_contract, github tests); CLAUDE.md testing rule extended; just live-test recipe; .github/workflows/live-contract.yml (workflow_dispatch + weekly cron, GH_TOKEN-authed) so fixture drift surfaces on a cadence.

WHY: #36: mocks encode beliefs about external tools; this tier makes them verified beliefs. Ran live on this host: 5/5 green (gh shape verified over real PRs).

Disclosure tier: tests only, zero surface change.

Fixes #36

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
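Live check (1) exercises the real GitStatus/GitUnpushed handlers, but this PR does not say which git invocation they use. As a hypothetical illustration of the state mapping being verified, here is a parser for `git status --porcelain=v2 --branch` output. In that real git format, a `# branch.upstream` header appears only when an upstream is configured, and `# branch.ab +<ahead> -<behind>` carries the counts. The enum names and the choice to count untracked files as dirty are assumptions.

```rust
// Hypothetical state enums; the project's real enums may differ.
#[derive(Debug, PartialEq)]
enum WorkTree {
    Clean,
    Dirty,
}

#[derive(Debug, PartialEq)]
enum PushState {
    AllPushed,
    Unpushed,
    NoUpstream,
}

// Maps `git status --porcelain=v2 --branch` output to (work-tree, push) states.
fn classify(porcelain_v2: &str) -> (WorkTree, PushState) {
    let mut dirty = false;
    let mut has_upstream = false;
    let mut ahead: u32 = 0;
    for line in porcelain_v2.lines() {
        if let Some(rest) = line.strip_prefix("# branch.ab ") {
            // Header format: "+<ahead> -<behind>"; only the ahead count matters here.
            if let Some(n) = rest.split_whitespace().next().and_then(|t| t.strip_prefix('+')) {
                ahead = n.parse().unwrap_or(0);
            }
        } else if line.starts_with("# branch.upstream ") {
            has_upstream = true;
        } else if !line.is_empty() && !line.starts_with('#') {
            // Entry lines ("1 ", "2 ", "u ", "? "): changed, renamed, unmerged, untracked.
            dirty = true;
        }
    }
    let tree = if dirty { WorkTree::Dirty } else { WorkTree::Clean };
    let push = if !has_upstream {
        PushState::NoUpstream
    } else if ahead > 0 {
        PushState::Unpushed
    } else {
        PushState::AllPushed
    };
    (tree, push)
}
```

A live test in this tier would feed real git output (from a temp repo it creates) through such a mapping, so a change in git's output format breaks the live test rather than silently diverging from the mock fixtures.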

hartsock · 2026-06-06T10:01:41Z

Retargeted to main — this PR now carries BOTH F1 and F1.5.

#35 was merged into its stale base branch (integrate/stranded-merges) after that branch had already landed, so F1 never reached main — the third stacked-merge incident. This branch contains the full F1 + F1.5 history, so merging this single PR onto main lands the complete data contract + live-test tier.

Process change to stop this recurring: from now on every PR in this repo is based on main directly, even when work is logically stacked — a child PR's diff temporarily shows its parent's commits, but a UI merge then does the right thing in any order. No more retarget-before-merge choreography.

hartsock and others added 3 commits June 5, 2026 22:09

hartsock added the risk:low Scoped, tested, no CI/build changes label Jun 6, 2026

hartsock changed the base branch from f1/data-contract to main June 6, 2026 10:01

hartsock merged commit e6bac69 into main Jun 6, 2026
1 check passed

hartsock merged 3 commits into main from f1.5/live-contract-tests
