feat(e2e): agentic verification loop with MCP Playwright browser layer #128

Draft
informatico-madrid wants to merge 394 commits into tzachbon:main from informatico-madrid:main


informatico-madrid commented Apr 1, 2026

Summary

This PR adds an agentic verification loop to Smart Ralph: the agent reads a user story, reasons about what "working" looks like, explores the system creatively, and self-corrects when something breaks — without scripted steps or Gherkin.

All additions are purely additive and live inside plugins/ralph-specum/. Zero changes to core flow, existing commands, or stop-watcher behavior.


Problem

Classical test coverage strategies were designed for humans writing tests for humans to read. When an agent writes and executes those same tests:

  • Tests pass but don't verify real behavior (mock-only anti-patterns)
  • High coverage numbers give false confidence
  • Nobody maintains the test suite because the agent keeps rewriting it
  • No self-correction when something breaks mid-spec

The root issue: classical testing is a human artifact, not a verification mechanism designed for agentic loops.


The Idea

Give the agent a contract of what to observe and a license to explore — not a script to follow.

We call this the Verification Contract: a lightweight section in each requirements.md that tells the agent what to observe, not how to test.


What's Added

1. Verification Contract in specs (templates/requirements.md, agents/product-manager.md)

Each spec's requirements.md gets a ## Verification Contract section:

## Verification Contract

**Entry points**: routes, endpoints, or UI surfaces this story touches
**Observable signals**: what PASS looks like (HTTP status, visible element, persisted data)
**Hard invariants**: what must NEVER break (auth, permissions, adjacent flows)
**Seed data**: minimum system state needed to verify
**Dependency map**: other specs/modules that share state with this one
**Escalate if**: conditions that require human judgment

product-manager is updated with guidelines to populate this from user stories.
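To make the contract format concrete, here is a minimal sketch of how a tool could read the six bold-labelled fields out of a requirements.md body. The function name `parse_contract` and the sample values are hypothetical; the PR does not say how the agents parse this section, so treat this as one possible reading of the template.

```python
import re

# Field labels copied from the Verification Contract template above.
CONTRACT_FIELDS = (
    "Entry points", "Observable signals", "Hard invariants",
    "Seed data", "Dependency map", "Escalate if",
)

def parse_contract(markdown: str) -> dict[str, str]:
    """Return {field label: value} for the '## Verification Contract' section."""
    # Capture everything between the contract heading and the next '## ' heading.
    section = re.search(
        r"^## Verification Contract[ \t]*\n(.*?)(?=^## |\Z)",
        markdown,
        flags=re.MULTILINE | re.DOTALL,
    )
    if section is None:
        return {}
    fields = {}
    for line in section.group(1).splitlines():
        m = re.match(r"\*\*(.+?)\*\*:\s*(.*)", line.strip())
        if m and m.group(1) in CONTRACT_FIELDS:
            fields[m.group(1)] = m.group(2).strip()
    return fields

# Hypothetical sample spec body, for illustration only.
sample = """## Verification Contract

**Entry points**: /invoices
**Observable signals**: HTTP 200, filtered rows visible
**Dependency map**: billing-export

## Next section
"""
contract = parse_contract(sample)
```

Fields that are absent from the spec are simply missing from the returned dict, which lets a caller decide whether a missing field is an escalation condition.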

2. Exploratory qa-engineer (agents/qa-engineer.md)

New [STORY-VERIFY] mode: given a user story + Verification Contract, the agent derives and executes checks autonomously. No Gherkin. No scripted steps. Emits structured signals: VERIFICATION_PASS, VERIFICATION_FAIL, FINDING.

Example checks derived from a single story ("filter invoices by date"):

  • Does the filter actually filter, or just visually sort?
  • Does the filter state persist across page reload?
  • Does it respect the user's timezone?
  • Can it be combined with other filters?

None of these come from a script. They come from reasoning about intent.
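The structured signals can be picked out of agent output with a simple line scan. The sketch below assumes each signal appears at the start of a line, optionally followed by a colon and free-text detail; that line format and the helper names `extract_signals` and `story_verdict` are assumptions, since the PR only names the three signals.

```python
import re

# Signal names as listed for [STORY-VERIFY] mode.
SIGNALS = ("VERIFICATION_PASS", "VERIFICATION_FAIL", "FINDING")
_SIGNAL_RE = re.compile(r"^(%s)\b[:\s]*(.*)$" % "|".join(SIGNALS))

def extract_signals(transcript: str) -> list[tuple[str, str]]:
    """Return (signal, detail) pairs in the order they appear."""
    found = []
    for line in transcript.splitlines():
        m = _SIGNAL_RE.match(line.strip())
        if m:
            found.append((m.group(1), m.group(2).strip()))
    return found

def story_verdict(signals: list[tuple[str, str]]) -> str:
    """Assumed rule: any FAIL wins over PASS; FINDING alone decides nothing."""
    names = [name for name, _ in signals]
    if "VERIFICATION_FAIL" in names:
        return "fail"
    if "VERIFICATION_PASS" in names:
        return "pass"
    return "unknown"
```

For example, a transcript containing a `FINDING` line followed by a `VERIFICATION_FAIL` line yields two pairs and the verdict `"fail"`.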

3. Repair loop (hooks/scripts/stop-watcher.sh)

When VERIFICATION_FAIL is detected:

→ classify failure (impl_bug / env_issue / spec_ambiguity / flaky)
→ impl_bug: backtrack to originating task, apply targeted fix
→ rerun verification for that story only
→ pass: continue to regression sweep
→ fail again after 2 iterations: escalate to human

New .ralph-state.json fields: repairIteration, failedStory, originTaskIndex.
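The repair loop can be read as a small state machine over the three new state fields. The real logic lives in stop-watcher.sh (a shell script); the Python below is only an illustrative model. It assumes that "2 iterations" means two repair attempts are allowed before escalation, and that classifications other than `impl_bug` are escalated, which the PR does not specify.

```python
from dataclasses import dataclass
from typing import Optional

MAX_REPAIR_ITERATIONS = 2  # from "fail again after 2 iterations: escalate to human"

@dataclass
class RalphState:
    # Field names match the new .ralph-state.json fields.
    repairIteration: int = 0
    failedStory: Optional[str] = None
    originTaskIndex: Optional[int] = None

def on_verification_fail(state: RalphState, story: str,
                         origin_task_index: int, classification: str) -> str:
    """Return the next action after a VERIFICATION_FAIL."""
    if classification != "impl_bug":
        # Assumption: env_issue / spec_ambiguity / flaky go to a human.
        return "escalate"
    if state.repairIteration >= MAX_REPAIR_ITERATIONS:
        return "escalate"
    state.repairIteration += 1
    state.failedStory = story
    state.originTaskIndex = origin_task_index
    return "repair"  # backtrack to originTaskIndex, then rerun this story only

def on_verification_pass(state: RalphState) -> str:
    """Clear repair bookkeeping and continue to the regression sweep."""
    state.repairIteration = 0
    state.failedStory = None
    state.originTaskIndex = None
    return "regression_sweep"
```

Keeping the counter in state rather than in memory matters here because the loop is driven by a hook that runs as a fresh process on every stop.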

4. Regression sweep (hooks/scripts/stop-watcher.sh)

After ALL_TASKS_COMPLETE, reads **Dependency map** from the completed spec and runs targeted [STORY-VERIFY] sweeps on dependent specs only. Three tiers: Local → Invariants → Full (nightly).
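A rough sketch of the target selection follows. It assumes the Dependency map value is a comma-separated list of spec names and that the Full tier runs only on nightly sweeps; both are assumptions, and `sweep_targets` and `tiers_to_run` are hypothetical names.

```python
# Tier names from the description above, lowercased for use as identifiers.
TIERS = ("local", "invariants", "full")

def sweep_targets(dependency_map: str, known_specs: set[str]) -> list[str]:
    """Return dependent specs to re-verify, in listed order, without duplicates."""
    targets = []
    for part in dependency_map.split(","):
        name = part.strip()
        # Skip blanks, unknown names, and repeats.
        if name and name in known_specs and name not in targets:
            targets.append(name)
    return targets

def tiers_to_run(nightly: bool) -> tuple[str, ...]:
    """Local and Invariants always run; Full is added for nightly sweeps."""
    return TIERS if nightly else TIERS[:2]
```

Restricting the sweep to names that exist in `known_specs` keeps a typo in one spec's Dependency map from triggering a verification run against a spec that does not exist.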

5. MCP Playwright browser layer (plugins/ralph-specum/skills/e2e/)

Four skill files covering the full browser verification protocol:

| Skill | Purpose |
| --- | --- |
| playwright-env.skill.md | Project config: app URL, auth mode, token keys, seed command |
| mcp-playwright.skill.md | Full @playwright/mcp protocol: tool selection, verification sequence, devtools tracing, PASS/FAIL/DEGRADED/ESCALATE signals |
| playwright-session.skill.md | Session lifecycle: open → auth → verify → close. Auth branching by mode (cookie / storage-state / token / form). Lock recovery. |
| ui-map-init.skill.md | Living selector registry. MCP-first exploration (Step 1A), static analysis fallback (Step 1B). Incremental updates by spec-executor and qa-engineer. |

task-planner auto-injects a VE0 task (the ui-map-init prerequisite) before any spec that uses Playwright, so nobody has to remember to add it by hand.

Key design decisions:

  • The browser is one signal layer, not the foundation. Verification draws on four layers: CLI, HTTP, browser, and logs.
  • The agent never launches or kills the MCP server. A human configures it in the MCP client (--isolated --caps=testing).
  • No auto-install. If @playwright/mcp is missing, the agent emits VERIFICATION_DEGRADED and escalates to a human.
  • ui-map.local.md is gitignored. It is local to each developer and grown incrementally by the agents that use it.

Files Changed

New files:

  • plugins/ralph-specum/skills/e2e/playwright-env.skill.md
  • plugins/ralph-specum/skills/e2e/mcp-playwright.skill.md
  • plugins/ralph-specum/skills/e2e/playwright-session.skill.md
  • plugins/ralph-specum/skills/e2e/ui-map-init.skill.md
  • plugins/ralph-specum/skills/e2e/e2e-verify-integration.skill.md
  • plugins/ralph-specum/skills/e2e/homeassistant-selector-map.skill.md
  • FORK_GOALS.md

Modified files (additive only):

  • templates/requirements.md: added ## Verification Contract section
  • agents/product-manager.md: guidelines to populate Verification Contract
  • agents/qa-engineer.md: [STORY-VERIFY] mode
  • agents/task-planner.md: VE task auto-injection, VE0 prerequisite
  • agents/spec-executor.md: skill load order for VE tasks, session end reminders, ui-map patch after data-testid changes
  • agents/architect-reviewer.md: mandatory test strategy with mock rules
  • plugins/ralph-specum/hooks/scripts/stop-watcher.sh: repair loop + regression sweep
  • .gitignore: **/ui-map.local.md

What This Is NOT

  • Not a Gherkin/BDD framework
  • Not a replacement for qa-engineer (extends it)
  • Not a browser-testing tool (browser is one signal layer)
  • Not a full rewrite (every change is small and additive)

Summary by CodeRabbit

  • New Features

    • Added E2E verification layer with Playwright browser testing integration
    • Introduced automated UI Map for selector discovery and management
    • Added Verification Contract section in requirements for defining test observables
    • Implemented repair loop for automated issue diagnosis and retries
    • Added regression sweep to verify dependent specs after changes
  • Documentation

    • Updated requirement and design templates with verification guidance
    • Added Playwright selector strategy and configuration examples
    • Expanded spec workflow to include project type classification

tzachbon and others added 30 commits March 11, 2026 21:35
- selector-map.skill.md: stable selector strategy for Playwright tests
- e2e-verify-integration.skill.md: corrected integration with ralph-specum
  signals (VERIFICATION_PASS/FAIL via qa-engineer, ALL_TASKS_COMPLETE via
  stop-watcher transcript detection). Removes incorrect legacy signals
  from ralph-loop.sh. Aligns task format with tasks.md template: one
  checkbox per task, [VERIFY] tag, user stories in requirements.md.
Documents the two skills added to skills/e2e/:
- selector-map.skill.md: stable Playwright selector strategy
- e2e-verify-integration.skill.md: correct signal contract for ralph-specum

Marks Phase 0 fully complete and updates Status.
…ore and FORK_GOALS

- Rename selector-map.skill.md → homeassistant-selector-map.skill.md
  (HA-specific examples, reusable as reference for other projects)
- Add ui-map-init.skill.md: agnostic protocol for generating ui-map.local.md
  in any project (runs once per project, output is gitignored)
- Add **/ui-map.local.md to .gitignore
- Update FORK_GOALS.md Phase 0.1: remove coupled selector hierarchy from
  description, reference new filenames, stay generic
- .gitignore: add **/ui-map.local.md (project-specific selector map, never committed)
- FORK_GOALS.md Phase 0.1: update filenames, remove coupled selector
  hierarchy from description, stay domain-agnostic
…-playwright skills

- Add playwright-env.skill.md: resolves appUrl, authMode, credentials refs,
  seed data, browser config and safety limits before any browser tool call.
  Emits ESCALATE if critical context is missing instead of guessing.

- Add playwright-env.local.md.example: full template covering all auth modes
  (none, form, token, cookie, basic, oauth, storage-state). Gitignored by design.

- Update mcp-playwright.skill.md (v1→v2): add Step -1 that loads playwright-env
  before the dependency check. Add anti-pattern entry for skipping Step -1.

- Update playwright-session.skill.md (v1→v2): reads appUrl/authMode/browser
  config from .ralph-state.json→playwrightEnv instead of hardcoded values.
  Add explicit auth flow per mode (form, token, cookie, basic, storage-state,
  oauth/sso). Add ESCALATE rule for oauth/sso without pre-auth session.

README.fork.md already references playwright-env.skill.md and
playwright-env.local.md.example — both now exist.
- .gitignore: add playwright-env.local.md and .ralph-auth-state.json
  to prevent accidental commit of local auth config and storage-state files.

- playwright-env.skill.md (v1→v2): add explicit connectivity check step
  after appUrl is resolved. Uses curl with 5s timeout before writing
  playwrightEnv to state. Emits ESCALATE(app-not-reachable) if check fails.
  Checklist item updated from vague 'reachable' note to concrete step ref.
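The skill itself uses curl with a 5s timeout. As an illustration of the same check in Python, the sketch below uses `urllib.request.urlopen` with a 5-second timeout. The function name `check_reachable` and the injectable `opener` parameter are hypothetical, and treating an HTTP error response as "reachable" is an assumption (it mirrors plain curl, which exits 0 whenever the server answers).

```python
import urllib.error
import urllib.request

def check_reachable(app_url: str, timeout_s: float = 5.0,
                    opener=urllib.request.urlopen) -> str:
    """Return 'OK' if the app answers, else the ESCALATE signal string."""
    try:
        with opener(app_url, timeout=timeout_s):
            return "OK"
    except urllib.error.HTTPError:
        # The server responded (4xx/5xx), so the app is reachable.
        return "OK"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or malformed URL.
        return "ESCALATE(app-not-reachable)"
```

The `opener` parameter exists only so the check can be exercised without a network; in normal use the default is left in place.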
…w-5 seed order, yellow-6 stable-state timeout, white-7 jq race note

- playwright-session.skill.md (v2→v3):
  - orange-3: expand token auth section with 3 concrete injection patterns
    (localStorage, Authorization header, cookie fallback) so agent never
    improvises. Add tokenBootstrapRule field to playwright-env.local.md format.
  - yellow-5: clarify seedCommand execution order — runs in playwright-env
    AFTER connectivity check, BEFORE writing playwrightEnv to state.
  - yellow-6: replace vague 'stable state' with concrete criterion: call
    browser_snapshot and check no [aria-busy] or loading indicators in tree;
    retry once after 1000ms if found. Document the tool to use.

- ui-map-init.skill.md (v1→v2):
  - orange-4: add Step -1 (load playwright-env) before Step 0, matching
    the pattern in mcp-playwright.skill.md. Stops if ESCALATE received.

- playwright-env.skill.md (v2→v3):
  - yellow-5: add seedCommand execution step between connectivity check
    and writing playwrightEnv to state. Only runs on local/staging.

- mcp-playwright.skill.md (v2→v3):
  - white-7: add note in Step 0 about jq write race condition in parallel
    execution scenarios. Recommend basePath-scoped lock or sequential VE tasks.
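The yellow-6 stable-state criterion above (snapshot, look for busy markers, retry once after 1000ms) can be sketched as follows. The snapshot is modelled as plain text returned by a caller-supplied function standing in for the `browser_snapshot` tool; the marker strings and the names `is_stable` and `wait_for_stable` are assumptions, since the commit message does not show the snapshot format.

```python
import time
from typing import Callable

# Assumed textual markers for "aria-busy or loading indicators in tree".
BUSY_MARKERS = ("aria-busy", "loading")

def is_stable(snapshot_text: str) -> bool:
    """True when no busy marker appears anywhere in the snapshot text."""
    lowered = snapshot_text.lower()
    return not any(marker in lowered for marker in BUSY_MARKERS)

def wait_for_stable(take_snapshot: Callable[[], str],
                    retry_delay_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Check once; if busy, wait retry_delay_s and check exactly one more time."""
    if is_stable(take_snapshot()):
        return True
    sleep(retry_delay_s)
    return is_stable(take_snapshot())
```

A single bounded retry keeps the check deterministic: the caller gets a yes or no after at most two snapshots instead of polling indefinitely.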
Critical fixes:
- playwright-session: fix token Pattern A (localStorage inject before goto fails on blank origin)
- playwright-env: wrap RALPH_SEED_COMMAND in eval + quotes to handle args/spaces
- mcp-playwright: replace npx @latest (downloads) with --no-install check + lock recovery

Cache & isolation:
- mcp-playwright: add RALPH_PLAYWRIGHT_ISOLATED setting, --isolated flag, lock recovery protocol
- playwright-session: add Session End lock-file cleanup on close failure
- playwright-env: add RALPH_PLAYWRIGHT_ISOLATED to Settings Reference + tokenLocalStorageKey

Minor:
- mcp-playwright: clarify snapshot-only vs screenshot-always contradiction
- playwright-session: add CAPTCHA/2FA ESCALATE detection in form auth
- playwright-env: add TTL warning for state fallback from previous session
…on, mcp context reset, auth table header

- ui-map-init: fix date -r (Linux only) → portable stat fallback for macOS
- mcp-playwright: Step 0b — run always when isolated=false, not conditionally
- playwright-session: clarify MCP server restart required between sessions
- playwright-env: fix broken markdown table header in Authentication section
…sk numbering contract, spec-workflow e2e bridge

- reality-verification: add project-type detection (api-only/cli/library skip Playwright entirely)
- reality-verification: document VE task numbering contract (VE0 = ui-map-init, 4.3 = verify fix)
- spec-workflow: reference e2e skills in implement phase, add project-type to requirements phase activities
mcp-playwright MUST load before playwright-session — session start
reads .ralph-state.json → mcpPlaywright which is only written by
mcp-playwright Step 0. Loading playwright-session first causes it
to find the key absent and fall into degraded mode incorrectly.

Correct order (matches spec-workflow/SKILL.md and phase-transitions.md):
  1. playwright-env
  2. mcp-playwright
  3. playwright-session
  4. ui-map-init (VE0 only)
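A small sketch of how this ordering could be checked mechanically. The skill names come from the list above; the helper names `skills_for_task` and `is_valid_order` are hypothetical and not part of the PR.

```python
# Load order as stated in the commit message above.
LOAD_ORDER = ["playwright-env", "mcp-playwright", "playwright-session", "ui-map-init"]

def skills_for_task(task_id: str) -> list[str]:
    """ui-map-init is loaded for VE0 only; other VE tasks get the first three."""
    return list(LOAD_ORDER) if task_id == "VE0" else LOAD_ORDER[:3]

def is_valid_order(skills: list[str]) -> bool:
    """True when the known skills appear in the required relative order."""
    ranks = [LOAD_ORDER.index(s) for s in skills if s in LOAD_ORDER]
    return ranks == sorted(ranks)
```

With this check, loading playwright-session before mcp-playwright is rejected up front instead of surfacing later as a spurious degraded mode.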
informatico-madrid and others added 30 commits April 8, 2026 16:33
Mark task 5.7 complete — Component: Chat Channel section already uses
.ralph-state.json and lastReadLine correctly. Verification passed:
grep "chat-state" and grep "lastReadIndex" both return no matches.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Rationale: lastReadLine is a line cursor, not a message index — messages
in chat.md are multi-line (header + blank line + body), so a line cursor
accurately tracks position.
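To illustrate why a line cursor is sufficient, the sketch below reads only the lines added since the last cursor position. It assumes `lastReadLine` counts the lines already consumed (so 0 means nothing has been read); the function name `read_new_lines` is hypothetical, and the actual agents use a jq pattern that is not shown in this PR text.

```python
def read_new_lines(chat_text: str, last_read_line: int) -> tuple[list[str], int]:
    """Return (lines added after the cursor, new cursor value)."""
    lines = chat_text.splitlines()
    # Everything from the cursor onward is unread, regardless of how many
    # lines each message spans (header + blank line + body).
    return lines[last_read_line:], len(lines)

# Hypothetical chat.md content for illustration.
chat = "### [executor → reviewer]\n\nTask 1 done\n"
new, cursor = read_new_lines(chat, 0)

chat += "### [reviewer → executor]\n\nACK\n"
newer, cursor2 = read_new_lines(chat, cursor)
```

A message index would need a parser that knows where each multi-line message ends; the line cursor needs no knowledge of message boundaries.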

Updated: requirements.md (FR-14), spec-executor.md (JSON field + jq pattern),
external-reviewer.md (JSON field + jq pattern + review cycle steps)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Replaced vitest (Node.js) runner with bats (Bash test framework)
- Updated Test Discovery section with bats commands
- Updated Mock Boundary table with bash-appropriate mocks
- Updated Concurrent writes row with background subshells pattern
- Verified: grep vitest returns CLEAN

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- design.md: added bash to ACK message block, text to CLOSE Thread Example
- requirements.md: added text to FR-2 message format example
- Note: line 289 and 306 had no bare fences in current file state

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ence detection, human as participant

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…e bug, inconsistencies, reviewer improvements

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…-executor, format fixes

- C1: spec-executor.md — add flock-based atomic write (was bare cat >>)
- C2: requirements.md — fix NFR-1 vs FR-13 contradiction (clarify flock for chat.md, temp+rename for state)
- M3/M4: BLOCK→HOLD in spec-executor.md and external-reviewer.md (5+ references)
- M5: chat.md — real entries now follow canonical ### [writer → addressee] format
- M8: design.md — remove language tag from closing backticks
- N1: templates/chat.md — add text language to fenced block
- N5: chat.md (spec instance) — add text language to fenced block
- N8: .progress.md — mark temp+rename note as superseded
- N11: tasks.md — rename lastReadIndex→lastReadLine globally
- N13: task_review.md — remove duplicate timestamp in resolved_at
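The two write strategies named in C1 and C2 (flock for chat.md appends, temp+rename for state) can be sketched in Python as below. The spec files use the shell `flock` command; this version uses `fcntl.flock`, which is Unix-only, and the function names are hypothetical.

```python
import fcntl
import json
import os
import tempfile

def append_locked(path: str, text: str) -> None:
    """Append under an exclusive advisory lock so concurrent writers do not interleave."""
    with open(path, "a", encoding="utf-8") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            fh.write(text)
            fh.flush()
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)

def write_state_atomic(path: str, state: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The split makes sense because the two files are used differently: chat.md is append-only with several writers, so a lock serializes appends, while the state file is rewritten whole, so a rename guarantees readers see either the old or the new version and never a partial one.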

Co-authored-by: Qwen-Coder <[email protected]>
… progress.md typo

- N2: index-state.json — 'complete' → 'completed' for ralph-quality-improvements
- N3: index.md — empty status → 'done' for ralph-quality-improvements
- M6: research.md — clarify two activation thresholds (existence vs existence + 1 message)
- N7: .progress.md — 'docs/agen-chat/' → 'docs/agent-chat/'

Co-authored-by: Qwen-Coder <[email protected]>
Add mandatory chat.md check step between task parsing and delegation.
The coordinator now reads new messages from chat.md (after lastReadLine),
blocks on HOLD/PENDING, responds to OVER, and announces each task before
delegating — activating the bidirectional protocol already defined in
spec-executor.md and external-reviewer.md.
The Chat Protocol section already defined the correct bidirectional
behavior but was disconnected from the numbered Task Loop. Add step 2a
so the executor reads chat.md on every iteration BEFORE checking
task_review.md (step 2b), mirroring the pilot callout pattern now
enforced in coordinator-pattern.md.
Expand Step 4 to cover all signals defined in external-reviewer.md:
- INTENT-FAIL: log + wait 1 cycle before delegating
- DEADLOCK: hard stop, surface to human
- URGENT: treat as HOLD
- CONTINUE: no-op, proceed
- ALIVE / STILL: heartbeats, ignore
Previous commit only handled HOLD, PENDING and OVER.
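The signal handling described above maps naturally onto a lookup table. In the sketch below the signal names come from the commit message, while the action labels, the precedence between simultaneous signals, and the choice to ignore unknown signals are all assumptions made for illustration.

```python
# Signal → action, following the list above (action labels are hypothetical).
SIGNAL_ACTIONS = {
    "HOLD": "block",
    "PENDING": "block",
    "URGENT": "block",                  # "treat as HOLD"
    "OVER": "respond",
    "INTENT-FAIL": "log_and_wait_one_cycle",
    "DEADLOCK": "stop_and_surface_to_human",
    "CONTINUE": "proceed",              # no-op
    "ALIVE": "proceed",                 # heartbeat, ignored
    "STILL": "proceed",                 # heartbeat, ignored
}

# Assumed precedence when several signals are pending: most restrictive first.
_PRIORITY = [
    "stop_and_surface_to_human",
    "block",
    "log_and_wait_one_cycle",
    "respond",
    "proceed",
]

def decide(signals: list[str]) -> str:
    """Return the single action the coordinator should take before delegating."""
    actions = {SIGNAL_ACTIONS[s] for s in signals if s in SIGNAL_ACTIONS}
    for action in _PRIORITY:
        if action in actions:
            return action
    return "proceed"
```

For instance, `decide(["URGENT", "OVER"])` resolves to `"block"` under the assumed precedence, because URGENT is treated as HOLD.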
… layer, EXECUTOR_START as Layer 0

Changes from V2 (agent self-fix):
- Role Definition: add 3 lines making chat.md usage explicit in Integrity Rules
- Chat Protocol: add Step 2b — read task_review.md before delegating (defense-in-depth)
- Verification: add Layer 2b anti-fabrication (independent verify command execution)
- Update all 'all 3 layers' references to 'all 4 layers'

Changes from V3 (external analysis):
- Reposition Step 2b as standalone section BEFORE the Chat Protocol <mandatory> block
- Add defense-in-depth comment explaining intentional duplication vs spec-executor
- Integrate EXECUTOR_START verification as Layer 0 (was floating section, now numbered blocker)
- Update Verification Summary to list all 5 layers (0+4)
…s.md unmark

G4 — Section 0 Bootstrap: add step to read chat.md and check for active
HOLD/PENDING/DEADLOCK signals before starting the Review Cycle. Prevents
reviewer from starting blind when a conversation is already in progress.

G2 — Section 6b unmark: wrap tasks.md demark write in flock exclusive lock
(same pattern as chat.md atomic append) to prevent race condition with
coordinator reading tasks.md to advance taskIndex concurrently.
Add TASK_AMBIGUOUS pre-execution signal (inspired by ChatDev dehallucination
communicative pattern) to break the retry loop caused by ambiguous task blocks:

spec-executor.md:
- New section 'Ambiguity Detection (Pre-Execution)' before Task Loop
- Criteria for emitting TASK_AMBIGUOUS: contradictory instructions, missing
  required context, impossible constraints, undefined references
- Output format mirrors TASK_MODIFICATION_REQUEST (structured, parseable)
- Guard: max 1 TASK_AMBIGUOUS per task to prevent clarification loops

coordinator-pattern.md:
- New handler in 'After Delegation' for TASK_AMBIGUOUS output
- Coordinator enriches task block with clarification and re-delegates
- Does NOT increment taskIteration (ambiguity is spec error, not execution error)
- Logs clarification applied to .progress.md for auditability
- Max 2 clarification rounds per task before escalating to human

channel-map.md (new):
- Documents all shared filesystem channels, writers, readers, timing
- Explicitly marks channels with multiple writers (race condition risk)
- Explains locking strategy per channel
- Reference document for future protocol decisions and new agent onboarding
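The coordinator-side TASK_AMBIGUOUS rules (re-delegate with clarification, leave taskIteration alone, log the clarification, escalate after two rounds) can be modelled roughly as below. `taskIteration` is the field named in the commit message; `clarificationRounds`, the `log` list standing in for .progress.md, and the function name are hypothetical.

```python
from dataclasses import dataclass, field

MAX_CLARIFICATION_ROUNDS = 2  # "Max 2 clarification rounds per task"

@dataclass
class TaskProgress:
    taskIteration: int = 0         # counts execution retries only
    clarificationRounds: int = 0   # hypothetical counter for this sketch
    log: list[str] = field(default_factory=list)  # stands in for .progress.md

def handle_task_ambiguous(progress: TaskProgress, clarification: str) -> str:
    """Return 'redelegate' or 'escalate' for a TASK_AMBIGUOUS output."""
    if progress.clarificationRounds >= MAX_CLARIFICATION_ROUNDS:
        return "escalate"
    progress.clarificationRounds += 1
    # Ambiguity is treated as a spec error, so taskIteration is left untouched.
    progress.log.append(f"clarification applied: {clarification}")
    return "redelegate"
```

Tracking clarification rounds separately from taskIteration means an unclear task cannot burn through the execution retry budget before anyone looks at the spec.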
…lls for VE tasks

- Added detailed E2E / VE task review section in external-reviewer.md to enforce proper review protocols.
- Updated spec-reviewer.md to include a new E2E review rubric for better evaluation of VE tasks.
- Mandated inclusion of `Skills:` metadata in VE tasks within task-planner.md to ensure necessary skills are loaded.
- Revised coordinator-pattern.md to enforce strict anti-patterns for VE tasks and required skills.
- Enhanced phase-rules.md with specific requirements for VE2 tasks to ensure comprehensive user flow verification.
- Updated verification-layers.md to differentiate artifact type instructions for VE/E2E tasks.
- Introduced a new SKILL.md for E2E to outline the skill suite for end-to-end testing.
- Expanded homeassistant-selector-map.skill.md with native navigation documentation for Home Assistant.
- Enhanced playwright-session.skill.md with a mandatory unexpected page recovery protocol.
- Updated tasks.md template to require skills for all VE tasks and clarified task structures.
Fixes applied (7 real issues from 18 comments):
- tasks.md: add playwright-session to VE0 Skills (both POC/TDD blocks)
- tasks.md: add selector-map to VE2 Skills (both POC/TDD blocks)
- task-planner.md: align Skills example to **Skills**: bold format + fix dup </mandatory>
- external-reviewer.md: resolve tool-permissions contradiction (allow post-task test exec)
- homeassistant-selector-map.skill.md: fix false 'English across locales' claim,
  use data-panel-id for sidebar items instead of localized getByRole names
- spec-reviewer.md: simplify 'No fixed waits' FAIL to match absolute PASS prohibition
- coordinator-pattern.md: update 3 'Required Skills' refs to 'Skills:' metadata field

False positives (not changed):
- VE3 platform_skills: cleanup task doesn't need platform-specific skills
- artifactType header in verification-layers: style preference, inline approach works
- || table syntax in playwright-session: valid Markdown single-pipe tables
- [VERIFY] in submode detection: layered design already handles this correctly
- MD040 fence warnings: agent instruction files, not user-facing rendered docs
…-chat-coordinator

feat(e2e): enhance verification processes and introduce mandatory ski…
Update fork with latest changes from tzachbon/smart-ralph

4 participants