feat(e2e): agentic verification loop with MCP Playwright browser layer #128
Draft
informatico-madrid wants to merge 394 commits into tzachbon:main
- selector-map.skill.md: stable selector strategy for Playwright tests
- e2e-verify-integration.skill.md: corrected integration with ralph-specum signals (VERIFICATION_PASS/FAIL via qa-engineer, ALL_TASKS_COMPLETE via stop-watcher transcript detection). Removes incorrect legacy signals from ralph-loop.sh. Aligns task format with tasks.md template: one checkbox per task, [VERIFY] tag, user stories in requirements.md.
Documents the two skills added to skills/e2e/:
- selector-map.skill.md: stable Playwright selector strategy
- e2e-verify-integration.skill.md: correct signal contract for ralph-specum
Marks Phase 0 fully complete and updates Status.
…ore and FORK_GOALS
- Rename selector-map.skill.md → homeassistant-selector-map.skill.md (HA-specific examples, reusable as reference for other projects)
- Add ui-map-init.skill.md: agnostic protocol for generating ui-map.local.md in any project (runs once per project, output is gitignored)
- Add **/ui-map.local.md to .gitignore
- Update FORK_GOALS.md Phase 0.1: remove coupled selector hierarchy from description, reference new filenames, stay generic
…meassistant-selector-map.skill.md)
- .gitignore: add **/ui-map.local.md (project-specific selector map, never committed)
- FORK_GOALS.md Phase 0.1: update filenames, remove coupled selector hierarchy from description, stay domain-agnostic
…uct-manager + exploratory qa-engineer
…-playwright skills
- Add playwright-env.skill.md: resolves appUrl, authMode, credentials refs, seed data, browser config and safety limits before any browser tool call. Emits ESCALATE if critical context is missing instead of guessing.
- Add playwright-env.local.md.example: full template covering all auth modes (none, form, token, cookie, basic, oauth, storage-state). Gitignored by design.
- Update mcp-playwright.skill.md (v1→v2): add Step -1 that loads playwright-env before the dependency check. Add anti-pattern entry for skipping Step -1.
- Update playwright-session.skill.md (v1→v2): reads appUrl/authMode/browser config from .ralph-state.json→playwrightEnv instead of hardcoded values. Add explicit auth flow per mode (form, token, cookie, basic, storage-state, oauth/sso). Add ESCALATE rule for oauth/sso without pre-auth session.
README.fork.md already references playwright-env.skill.md and playwright-env.local.md.example; both now exist.
- .gitignore: add playwright-env.local.md and .ralph-auth-state.json to prevent accidental commit of local auth config and storage-state files.
- playwright-env.skill.md (v1→v2): add explicit connectivity check step after appUrl is resolved. Uses curl with 5s timeout before writing playwrightEnv to state. Emits ESCALATE(app-not-reachable) if check fails. Checklist item updated from vague 'reachable' note to concrete step ref.
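The connectivity check in this commit is described as a curl call with a 5-second timeout. A rough Python equivalent follows, purely as a sketch: `is_reachable` and `connectivity_signal` are hypothetical names, and treating an HTTP error status as "reachable" is an assumption modelled on plain curl without `-f`; the skill file itself may decide differently.

```python
import urllib.error
import urllib.request


def is_reachable(app_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if app_url answers an HTTP request within timeout_s seconds."""
    try:
        with urllib.request.urlopen(app_url, timeout=timeout_s):
            return True
    except urllib.error.HTTPError:
        # Assumption: a 4xx/5xx answer still proves the server is up.
        return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, or malformed URL.
        return False


def connectivity_signal(app_url: str) -> str:
    """Map the check result to the signal name used in the commit message."""
    return "OK" if is_reachable(app_url) else "ESCALATE(app-not-reachable)"
```

Running the check before anything is written to state means a dead app is reported once, up front, instead of surfacing later as a confusing browser failure.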
…w-5 seed order, yellow-6 stable-state timeout, white-7 jq race note
- playwright-session.skill.md (v2→v3):
- orange-3: expand token auth section with 3 concrete injection patterns
(localStorage, Authorization header, cookie fallback) so agent never
improvises. Add tokenBootstrapRule field to playwright-env.local.md format.
- yellow-5: clarify seedCommand execution order — runs in playwright-env
AFTER connectivity check, BEFORE writing playwrightEnv to state.
- yellow-6: replace vague 'stable state' with concrete criterion: call
browser_snapshot and check no [aria-busy] or loading indicators in tree;
retry once after 1000ms if found. Document the tool to use.
- ui-map-init.skill.md (v1→v2):
- orange-4: add Step -1 (load playwright-env) before Step 0, matching
the pattern in mcp-playwright.skill.md. Stops if ESCALATE received.
- playwright-env.skill.md (v2→v3):
- yellow-5: add seedCommand execution step between connectivity check
and writing playwrightEnv to state. Only runs on local/staging.
- mcp-playwright.skill.md (v2→v3):
- white-7: add note in Step 0 about jq write race condition in parallel
execution scenarios. Recommend basePath-scoped lock or sequential VE tasks.
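The yellow-6 stable-state criterion (snapshot, look for busy indicators, retry once after 1000ms) can be sketched as follows. This is an illustration in Python, not the skill's actual wording: the marker strings and the plain-text snapshot are assumptions, and `take_snapshot` stands in for whatever returns the `browser_snapshot` output.

```python
import time
from typing import Callable

# Assumption: illustrative markers only; the real accessibility snapshot
# format depends on the MCP server.
BUSY_MARKERS = ("aria-busy", "loading")


def looks_busy(snapshot_text: str) -> bool:
    """Return True if the snapshot text contains any busy/loading marker."""
    lowered = snapshot_text.lower()
    return any(marker in lowered for marker in BUSY_MARKERS)


def wait_for_stable(take_snapshot: Callable[[], str],
                    retry_delay_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Check once; if busy, wait retry_delay_s and check exactly once more."""
    if not looks_busy(take_snapshot()):
        return True
    sleep(retry_delay_s)
    return not looks_busy(take_snapshot())
```

Passing `sleep` as a parameter keeps the single-retry rule testable without real delays.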
Critical fixes:
- playwright-session: fix token Pattern A (localStorage inject before goto fails on blank origin)
- playwright-env: wrap RALPH_SEED_COMMAND in eval + quotes to handle args/spaces
- mcp-playwright: replace npx @latest (downloads) with --no-install check + lock recovery
Cache & isolation:
- mcp-playwright: add RALPH_PLAYWRIGHT_ISOLATED setting, --isolated flag, lock recovery protocol
- playwright-session: add Session End lock-file cleanup on close failure
- playwright-env: add RALPH_PLAYWRIGHT_ISOLATED to Settings Reference + tokenLocalStorageKey
Minor:
- mcp-playwright: clarify snapshot-only vs screenshot-always contradiction
- playwright-session: add CAPTCHA/2FA ESCALATE detection in form auth
- playwright-env: add TTL warning for state fallback from previous session
…on, mcp context reset, auth table header
- ui-map-init: fix date -r (Linux only) → portable stat fallback for macOS
- mcp-playwright: Step 0b now always runs when isolated=false, not conditionally
- playwright-session: clarify MCP server restart required between sessions
- playwright-env: fix broken markdown table header in Authentication section
…sk numbering contract, spec-workflow e2e bridge
- reality-verification: add project-type detection (api-only/cli/library skip Playwright entirely)
- reality-verification: document VE task numbering contract (VE0 = ui-map-init, 4.3 = verify fix)
- spec-workflow: reference e2e skills in implement phase, add project-type to requirements phase activities
… without issue ref v0.4.0
mcp-playwright MUST load before playwright-session: session start reads .ralph-state.json → mcpPlaywright, which is only written by mcp-playwright Step 0. Loading playwright-session first causes it to find the key absent and fall into degraded mode incorrectly.
Correct order (matches spec-workflow/SKILL.md and phase-transitions.md):
1. playwright-env
2. mcp-playwright
3. playwright-session
4. ui-map-init (VE0 only)
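To make the ordering constraint concrete, here is a small Python sketch that checks a list of loaded skills against the order stated in this commit. `check_load_order` is a hypothetical helper for illustration; nothing in the PR suggests such a checker exists.

```python
# Order taken from the commit message; ui-map-init applies to VE0 only.
REQUIRED_ORDER = ["playwright-env", "mcp-playwright", "playwright-session", "ui-map-init"]


def check_load_order(loaded: list[str]) -> list[str]:
    """Return human-readable violations; an empty list means the order is fine."""
    position = {name: index for index, name in enumerate(loaded)}
    # Skills that were not loaded at all (e.g. ui-map-init outside VE0) are skipped.
    known = [name for name in REQUIRED_ORDER if name in position]
    problems = []
    for earlier, later in zip(known, known[1:]):
        if position[earlier] > position[later]:
            problems.append(f"{earlier} must load before {later}")
    return problems
```

For example, loading playwright-session before mcp-playwright is reported as a single violation, which is exactly the failure this commit fixes.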
Mark task 5.7 complete — Component: Chat Channel section already uses .ralph-state.json and lastReadLine correctly. Verification passed: grep "chat-state" and grep "lastReadIndex" both return no matches. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Rationale: lastReadLine is a line cursor, not a message index — messages in chat.md are multi-line (header + blank line + body), so a line cursor accurately tracks position. Updated: requirements.md (FR-14), spec-executor.md (JSON field + jq pattern), external-reviewer.md (JSON field + jq pattern + review cycle steps) Co-Authored-By: Claude Opus 4.6 <[email protected]>
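The line-cursor idea can be sketched in a few lines of Python. This is an assumption-laden illustration: `read_new_lines` is a hypothetical name, and the project itself reads the cursor from `.ralph-state.json` with jq rather than with code like this.

```python
def read_new_lines(chat_text: str, last_read_line: int) -> tuple[list[str], int]:
    """Return the lines after the cursor and the new cursor value.

    last_read_line counts lines, not messages, so a multi-line message
    (header + blank line + body) advances the cursor by several positions.
    """
    lines = chat_text.splitlines()
    new_lines = lines[last_read_line:]
    return new_lines, len(lines)
```

Because the cursor is just a line count, the reader never has to parse message boundaries to know where it stopped.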
…n .ralph-state.json Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Replaced vitest (Node.js) runner with bats (Bash test framework)
- Updated Test Discovery section with bats commands
- Updated Mock Boundary table with bash-appropriate mocks
- Updated Concurrent writes row with background subshells pattern
- Verified: grep vitest returns CLEAN
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- design.md: added bash to ACK message block, text to CLOSE Thread Example
- requirements.md: added text to FR-2 message format example
- Note: lines 289 and 306 had no bare fences in current file state
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ence detection, human as participant Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ents Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…e bug, inconsistencies, reviewer improvements Co-Authored-By: Claude Opus 4.6 <[email protected]>
…-executor, format fixes
- C1 (spec-executor.md): add flock-based atomic write (was bare cat >>)
- C2 (requirements.md): fix NFR-1 vs FR-13 contradiction (clarify flock for chat.md, temp+rename for state)
- M3/M4: BLOCK→HOLD in spec-executor.md and external-reviewer.md (5+ references)
- M5 (chat.md): real entries now follow canonical ### [writer → addressee] format
- M8 (design.md): remove language tag from closing backticks
- N1 (templates/chat.md): add text language to fenced block
- N5 (chat.md, spec instance): add text language to fenced block
- N8 (.progress.md): mark temp+rename note as superseded
- N11 (tasks.md): rename lastReadIndex→lastReadLine globally
- N13 (task_review.md): remove duplicate timestamp in resolved_at
Co-authored-by: Qwen-Coder <[email protected]>
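C1 and C2 name two different write strategies: an exclusive flock for appends to chat.md, and temp-file-plus-rename for the state file. The Python sketch below shows both patterns under stated assumptions: the function names are hypothetical, it is Unix-only because it relies on `fcntl`, and the project's own implementation is shell-based.

```python
import fcntl
import json
import os
import tempfile


def append_message(chat_path: str, message: str) -> None:
    """Append one message while holding an exclusive advisory lock."""
    with open(chat_path, "a", encoding="utf-8") as chat_file:
        fcntl.flock(chat_file.fileno(), fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            chat_file.write(message.rstrip("\n") + "\n")
            chat_file.flush()
            os.fsync(chat_file.fileno())
        finally:
            fcntl.flock(chat_file.fileno(), fcntl.LOCK_UN)


def write_state(state_path: str, state: dict) -> None:
    """Replace the state file atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
        json.dump(state, tmp_file)
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, state_path)
```

The split is a reasonable fit for the two files: an append-only log only needs writers to take turns, while a state file that is rewritten whole should never be observed half-written.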
… progress.md typo
- N2 (index-state.json): 'complete' → 'completed' for ralph-quality-improvements
- N3 (index.md): empty status → 'done' for ralph-quality-improvements
- M6 (research.md): clarify two activation thresholds (existence vs existence + 1 message)
- N7 (.progress.md): 'docs/agen-chat/' → 'docs/agent-chat/'
Co-authored-by: Qwen-Coder <[email protected]>
Feat/agent chat protocol
Add mandatory chat.md check step between task parsing and delegation. The coordinator now reads new messages from chat.md (after lastReadLine), blocks on HOLD/PENDING, responds to OVER, and announces each task before delegating — activating the bidirectional protocol already defined in spec-executor.md and external-reviewer.md.
The Chat Protocol section already defined the correct bidirectional behavior but was disconnected from the numbered Task Loop. Add step 2a so the executor reads chat.md on every iteration BEFORE checking task_review.md (step 2b), mirroring the pilot callout pattern now enforced in coordinator-pattern.md.
Expand Step 4 to cover all signals defined in external-reviewer.md:
- INTENT-FAIL: log + wait 1 cycle before delegating
- DEADLOCK: hard stop, surface to human
- URGENT: treat as HOLD
- CONTINUE: no-op, proceed
- ALIVE / STILL: heartbeats, ignore
Previous commit only handled HOLD, PENDING and OVER.
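The signal handling listed in this commit, together with the HOLD/PENDING/OVER handling from the earlier one, amounts to a lookup table. A minimal Python sketch follows; the `Action` names and the `action_for` helper are hypothetical, and returning `None` for an unknown signal is an assumption, since the commit does not say what happens in that case.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    BLOCK = "block"                    # do not delegate until the signal clears
    RESPOND = "respond"                # answer the message, then continue
    WAIT_ONE_CYCLE = "wait-one-cycle"  # log, then wait 1 cycle before delegating
    HARD_STOP = "hard-stop"            # stop and surface to a human
    PROCEED = "proceed"                # no-op
    IGNORE = "ignore"                  # heartbeat


SIGNAL_ACTIONS = {
    "HOLD": Action.BLOCK,
    "PENDING": Action.BLOCK,
    "OVER": Action.RESPOND,
    "INTENT-FAIL": Action.WAIT_ONE_CYCLE,
    "DEADLOCK": Action.HARD_STOP,
    "URGENT": Action.BLOCK,  # treated as HOLD
    "CONTINUE": Action.PROCEED,
    "ALIVE": Action.IGNORE,
    "STILL": Action.IGNORE,
}


def action_for(signal: str) -> Optional[Action]:
    """Look up the coordinator action for a signal; None if it is not listed."""
    return SIGNAL_ACTIONS.get(signal.strip().upper())
```

Keeping the mapping in one table makes it easy to see at a glance which signals block delegation and which are informational.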
… layer, EXECUTOR_START as Layer 0
Changes from V2 (agent self-fix):
- Role Definition: add 3 lines making chat.md usage explicit in Integrity Rules
- Chat Protocol: add Step 2b (read task_review.md before delegating, defense-in-depth)
- Verification: add Layer 2b anti-fabrication (independent verify command execution)
- Update all 'all 3 layers' references to 'all 4 layers'
Changes from V3 (external analysis):
- Reposition Step 2b as standalone section BEFORE the Chat Protocol <mandatory> block
- Add defense-in-depth comment explaining intentional duplication vs spec-executor
- Integrate EXECUTOR_START verification as Layer 0 (was floating section, now numbered blocker)
- Update Verification Summary to list all 5 layers (0+4)
…s.md unmark
G4 (Section 0 Bootstrap): add step to read chat.md and check for active HOLD/PENDING/DEADLOCK signals before starting the Review Cycle. Prevents the reviewer from starting blind when a conversation is already in progress.
G2 (Section 6b unmark): wrap the tasks.md unmark write in a flock exclusive lock (same pattern as the chat.md atomic append) to prevent a race condition with the coordinator reading tasks.md to advance taskIndex concurrently.
Add TASK_AMBIGUOUS pre-execution signal (inspired by ChatDev dehallucination communicative pattern) to break the retry loop caused by ambiguous task blocks:
spec-executor.md:
- New section 'Ambiguity Detection (Pre-Execution)' before Task Loop
- Criteria for emitting TASK_AMBIGUOUS: contradictory instructions, missing required context, impossible constraints, undefined references
- Output format mirrors TASK_MODIFICATION_REQUEST (structured, parseable)
- Guard: max 1 TASK_AMBIGUOUS per task to prevent clarification loops
coordinator-pattern.md:
- New handler in 'After Delegation' for TASK_AMBIGUOUS output
- Coordinator enriches task block with clarification and re-delegates
- Does NOT increment taskIteration (ambiguity is spec error, not execution error)
- Logs clarification applied to .progress.md for auditability
- Max 2 clarification rounds per task before escalating to human
channel-map.md (new):
- Documents all shared filesystem channels, writers, readers, timing
- Explicitly marks channels with multiple writers (race condition risk)
- Explains locking strategy per channel
- Reference document for future protocol decisions and new agent onboarding
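The coordinator-side rule (re-delegate with clarification, never touch taskIteration, escalate after 2 rounds) can be sketched as a small state update. The Python below is illustrative only: `TaskState`, `handle_ambiguity`, and the returned strings are hypothetical, and the real coordinator is an agent prompt, not code.

```python
from dataclasses import dataclass

# From the commit message: max 2 clarification rounds per task.
MAX_CLARIFICATION_ROUNDS = 2


@dataclass
class TaskState:
    task_iteration: int = 0        # execution retries; untouched by ambiguity handling
    clarification_rounds: int = 0  # how many times this task was clarified


def handle_ambiguity(state: TaskState) -> str:
    """Decide what to do when the executor emits TASK_AMBIGUOUS for a task."""
    if state.clarification_rounds >= MAX_CLARIFICATION_ROUNDS:
        return "escalate-to-human"
    state.clarification_rounds += 1
    # task_iteration is deliberately not incremented: ambiguity is treated
    # as a spec error, not an execution error.
    return "re-delegate-with-clarification"
```

Tracking clarification rounds separately from execution retries keeps an unclear spec from consuming the retry budget meant for real execution failures.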
…lls for VE tasks
- Added detailed E2E / VE task review section in external-reviewer.md to enforce proper review protocols.
- Updated spec-reviewer.md to include a new E2E review rubric for better evaluation of VE tasks.
- Mandated inclusion of `Skills:` metadata in VE tasks within task-planner.md to ensure necessary skills are loaded.
- Revised coordinator-pattern.md to enforce strict anti-patterns for VE tasks and required skills.
- Enhanced phase-rules.md with specific requirements for VE2 tasks to ensure comprehensive user flow verification.
- Updated verification-layers.md to differentiate artifact type instructions for VE/E2E tasks.
- Introduced a new SKILL.md for E2E to outline the skill suite for end-to-end testing.
- Expanded homeassistant-selector-map.skill.md with native navigation documentation for Home Assistant.
- Enhanced playwright-session.skill.md with a mandatory unexpected page recovery protocol.
- Updated tasks.md template to require skills for all VE tasks and clarified task structures.
…s and improve clarity in task templates
Fixes applied (7 real issues from 18 comments):
- tasks.md: add playwright-session to VE0 Skills (both POC/TDD blocks)
- tasks.md: add selector-map to VE2 Skills (both POC/TDD blocks)
- task-planner.md: align Skills example to **Skills**: bold format + fix dup </mandatory>
- external-reviewer.md: resolve tool-permissions contradiction (allow post-task test exec)
- homeassistant-selector-map.skill.md: fix false 'English across locales' claim, use data-panel-id for sidebar items instead of localized getByRole names
- spec-reviewer.md: simplify 'No fixed waits' FAIL to match absolute PASS prohibition
- coordinator-pattern.md: update 3 'Required Skills' refs to 'Skills:' metadata field
False positives (not changed):
- VE3 platform_skills: cleanup task doesn't need platform-specific skills
- artifactType header in verification-layers: style preference, inline approach works
- || table syntax in playwright-session: valid Markdown single-pipe tables
- [VERIFY] in submode detection: layered design already handles this correctly
- MD040 fence warnings: agent instruction files, not user-facing rendered docs
…-chat-coordinator feat(e2e): enhance verification processes and introduce mandatory ski…
Update fork with latest changes from tzachbon/smart-ralph
Summary
This PR adds an agentic verification loop to Smart Ralph: the agent reads a user story, reasons about what "working" looks like, explores the system creatively, and self-corrects when something breaks — without scripted steps or Gherkin.
All changes are additive and live inside `plugins/ralph-specum/`. Zero changes to core flow, existing commands, or stop-watcher behavior.
Problem
Classical test coverage strategies were designed for humans writing tests for humans to read. When an agent writes and executes those same tests:
The root issue: classical testing is a human artifact, not a verification mechanism designed for agentic loops.
The Idea
We call this the Verification Contract: a lightweight section in each `requirements.md` that tells the agent what to observe, not how to test.
What's Added
1. Verification Contract in specs (`templates/requirements.md`, `agents/product-manager.md`)
Each spec's `requirements.md` gets a `## Verification Contract` section:
`product-manager` is updated with guidelines to populate this from user stories.
2. Exploratory `qa-engineer` (`agents/qa-engineer.md`)
New `[STORY-VERIFY]` mode: given a user story + Verification Contract, the agent derives and executes checks autonomously. No Gherkin. No scripted steps. Emits structured signals: `VERIFICATION_PASS`, `VERIFICATION_FAIL`, `FINDING`.
Example checks derived from a single story ("filter invoices by date"):
None of these come from a script. They come from reasoning about intent.
3. Repair loop (`hooks/scripts/stop-watcher.sh`)
When `VERIFICATION_FAIL` is detected:
New `.ralph-state.json` fields: `repairIteration`, `failedStory`, `originTaskIndex`.
4. Regression sweep (`hooks/scripts/stop-watcher.sh`)
After `ALL_TASKS_COMPLETE`, reads `**Dependency map**` from the completed spec and runs targeted `[STORY-VERIFY]` sweeps on dependent specs only. Three tiers: Local → Invariants → Full (nightly).
5. MCP Playwright browser layer (`plugins/ralph-specum/skills/e2e/`)
Four skill files covering the full browser verification protocol:
- `playwright-env.skill.md`
- `mcp-playwright.skill.md`: `@playwright/mcp` protocol (tool selection, verification sequence, devtools tracing, PASS/FAIL/DEGRADED/ESCALATE signals)
- `playwright-session.skill.md`
- `ui-map-init.skill.md`
`task-planner` auto-injects a `VE0` task (ui-map-init prerequisite) before any spec that uses Playwright. Zero human memory required.
Key design decisions:
- … (`--isolated --caps=testing`).
- `@playwright/mcp` is missing → `VERIFICATION_DEGRADED` + escalate to human.
- `ui-map.local.md` is gitignored. Local to each developer, grown incrementally by the agents that use it.
Files Changed
New files:
- `plugins/ralph-specum/skills/e2e/playwright-env.skill.md`
- `plugins/ralph-specum/skills/e2e/mcp-playwright.skill.md`
- `plugins/ralph-specum/skills/e2e/playwright-session.skill.md`
- `plugins/ralph-specum/skills/e2e/ui-map-init.skill.md`
- `plugins/ralph-specum/skills/e2e/e2e-verify-integration.skill.md`
- `plugins/ralph-specum/skills/e2e/homeassistant-selector-map.skill.md`
- `FORK_GOALS.md`
Modified files (additive only):
- `templates/requirements.md`: added `## Verification Contract` section
- `agents/product-manager.md`: guidelines to populate Verification Contract
- `agents/qa-engineer.md`: `[STORY-VERIFY]` mode
- `agents/task-planner.md`: VE task auto-injection, VE0 prerequisite
- `agents/spec-executor.md`: skill load order for VE tasks, session end reminders, ui-map patch after data-testid changes
- `agents/architect-reviewer.md`: mandatory test strategy with mock rules
- `plugins/ralph-specum/hooks/scripts/stop-watcher.sh`: repair loop + regression sweep
- `.gitignore`: `**/ui-map.local.md`
What This Is NOT
- … `qa-engineer` (extends it)
Summary by CodeRabbit
New Features
Documentation