fix(have:review-cycle): operational discipline — Copilot non-optional, loop convergence, blocker docs#5
Conversation
09e9342 to
de7295d
Compare
|
Added third operational fix to the same PR (commit The rule: loop exits when no P0/P1/P2 findings remain — not when every reviewer returns zero findings. P3/nit findings (polish, narrow factual edges, cosmetic placement) get recorded as accepted non-blockers in the PR body, not looped on. Evidence: pr-review#3 took 7 rounds. Under the new rule it would have exited at round 4 (3 substantive fixes + 1 verify round), with these P3 findings recorded but not fixed in that PR:
Same ship outcome, ~half the wall-clock time. The 'medium' findings (transform-types misattributed to TSX/decorators, shebang in wrong section) would still have looped because those would actually mislead reviewers using the checklist. PR body now has explicit Final Report fields to make the discipline visible:
So this PR now closes three real operational gaps in one shot:
All three came from the same session running review-cycle on pr-review#3 and observing what actually went wrong. Real data, not speculation. |
There was a problem hiding this comment.
Pull request overview
This PR updates the /have:review-cycle command documentation to enforce stricter review-loop discipline (including a non-optional Copilot CLI run) and adds release automation that generates changesets from conventional commits and publishes directly on pushes to main.
Changes:
- Harden
/have:review-cycledocs: make Copilot CLI pre-push review non-optional, document real-world tool blockers/fallbacks, and add explicit loop convergence “hard rules”. - Add
scripts/auto-changeset.tsplustsxdependency and scripts to auto-generate a unified changeset across packages. - Replace the release workflow’s PR-or-publish flow with a direct publish-on-main-push flow that auto-changesets, versions, commits, publishes, tags, and pushes back to
main.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
scripts/auto-changeset.ts |
New CI-oriented script to generate a single changeset from conventional commits since the last release tag. |
package.json |
Adds tsx + new changeset scripts (changeset:auto, changeset:version, changeset:publish). |
pnpm-lock.yaml |
Locks new tsx dependency (and its transitive dependencies). |
codex/plugins/have/commands/review-cycle.md |
Makes Copilot CLI step non-optional; adds loop hard rules and blocker/fallback documentation. |
claude/have/commands/review-cycle.md |
Same review-cycle documentation changes as the codex command file. |
.github/workflows/release.yml |
Changes release strategy to direct publish on push to main, including auto-changeset + versioning + commit + publish + push/tags. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
e176380 to
15f1846
Compare
Belated but the right move — ran pr-review (codex + copilot CLI ensemble) on PR #5 after the rebase. Two findings, both real: 1. [medium, both reviewers] The example `gh copilot` invocation used `--allow-all-tools`, but the surrounding guidance claimed the run stayed "read-only" because the prompt instructed not to modify files. That's a policy/enforcement mismatch — prompt instructions are advisory, tool permissions are the actual enforcement. With write/edit tools available, a "review" pass can mutate the working tree mid-round, breaking the same-commit guarantee the loop relies on. Fixed by switching the example to an explicit read-only tool set (`--available-tools shell,read`) with a note that the prompt instruction is defense-in-depth. Also flagged that the CLI is preview-stage and tool names shift, so consumers should verify against `gh copilot -- --help` for their version. 2. [low, copilot only] P2 policy was internally inconsistent: step 3 said "block by default; loop unless explicitly accepted with rationale in the PR body", but step 6 said "Address all valid P0/P1/P2 findings in priority order" (no acceptance path). Final Report template only had a slot for "Accepted non-blockers (P3/nit)", no slot for accepted P2. Fixed by aligning step 6 ("Address all P0/P1 mandatory; P2 mandatory unless explicitly accepted...") and adding "Accepted P2 (with rationale)" to the Final Report template. Worth noting the asymmetric blind-spot pattern showed up again: - Codex caught the substantive medium (read-only enforcement). - Copilot caught both the same medium AND a low that codex missed (P2 cross-section consistency). The placement/contradiction pattern keeps being copilot's unique signal. Mirror edits in both claude/ and codex/ command files.
Round 3 of pr-review on PR #5 found three real contract issues codex caught (copilot's output truncated mid-exploration this round): 1. [medium] Status `clean` definition omitted validation. So failed validation + reviewers clean → Status=clean → /ship gates on Status=clean and proceeds. Added validation to the contract: clean now requires "validation green"; failed validation maps to `blocked`. 2. [medium] "Record in PR body" / "open as draft PR" language assumed a PR already exists. But /review-cycle runs BEFORE PR creation when invoked from /ship — there's no PR body to write to. Reframed: /review-cycle records all decisions in its FINAL REPORT (canonical record); /ship is responsible for copying the report into the PR body and deciding the draft-PR fallback. When a PR already exists, the report still gets the canonical record; updating the PR body is then optional/derivative. 3. [low] Per-commit convergence rule said "clean on X ≠ clean on Y, re-run all reviewers" — but the P3 rule said "P3 cheap fix → no verify round needed." Contradiction: a P3-only commit would trigger the per-commit rule, forcing another full ensemble pass for a one-line wording tweak. Scoped per-commit rule to behaviour-changing commits (P0/P1/P2 fixes or non-fix changes); explicitly exempted P3-only commits. Mirror edits in both claude/ and codex/ command files. Running round 4 to verify convergence on the workflow contract. Worth noting the round counts: 3 substantive medium-fix rounds plus this commit. Each round caught real workflow-contract issues, not trivia — the loop is doing exactly what it's supposed to. The fact that we keep finding new contract bugs is itself evidence that the spec is dense enough to warrant this many rounds of refinement.
…andles partial Round 4 of pr-review on PR #5: codex caught two more medium contract drifts, both introduced by round-3's fixes. Copilot's output truncated mid-exploration again this round; treating as one substantive reviewer. 1. [medium] Stop condition contradicted P2 acceptance. Step 11 said "stop when no P0/P1/P2 findings remain" — but reviewers can keep surfacing an accepted P2 (they have no way to know it was accepted in the final report). So an accepted P2 would prevent reaching the documented clean state, the loop would hit the cap, and the run would look blocked despite the acceptance being on file. Changed stop condition to "no *unaccepted* P0/P1/P2 findings remain" with an explanatory sentence about why reviewers may keep surfacing accepted items. 2. [medium] Round-3 fix delegated the skipped-Copilot fallback to `/ship` ("downstream decides whether to open as draft") but didn't update `ship.md` to actually handle that case. `/ship`'s gate previously only branched on clean / partial-with-noise / blocked — a `partial` from a skipped required reviewer matched none of those, leaving the agent without an instruction. Expanded the gate to three explicit `partial` sub-cases: - partial because Copilot CLI was skipped → draft PR + bot review + rerun review-cycle + `gh pr ready` when clean - partial because another required reviewer (codex, claude-sub) was skipped → draft PR + flag for human acceptance - partial with only accepted P2/non-blockers → continue, copy acceptances from final report into PR body Same edit in both claude/have/commands/ship.md and codex/plugins/have/commands/ship.md. This expands the PR's scope to touch ship.md, but the changes are logically coupled to the review-cycle contract — the cycle correctly caught that I'd written a delegation without updating the delegate. Running round 5 to verify.
allowlist + ship propagation Two reviewers caught three real findings on the round-4 commit. The most embarrassing one was self-inflicted: I had two Copilot CLI session transcripts (.deny-test.jsonl, .revparse-test.jsonl) leak into the PR via `git add -A`. They were written by Copilot's own review probes (testing tool-permission flags) and got picked up along with my real changes. They leak local /Users/will/.agents/ paths, skill descriptions, session IDs. Copilot caught its OWN debug artifact in PR #5 and used `.deny-test .jsonl:30` (showing `Permission to run this tool was denied ... shell(git rev-parse)`) as concrete evidence for a *separate* finding — that the documented allowlist was missing `git rev-parse`. The meta-loop: artifact A introduced a privacy leak (caught), and also produced empirical evidence for content gap B (also caught). Three fixes: 1. Removed `.deny-test.jsonl` and `.revparse-test.jsonl` from the tree and the index. Added `*.jsonl` to .gitignore so future review-probe runs won't repeat the mistake. 2. Added `shell(git rev-parse)` to both review-cycle command file examples. Copilot's empirical evidence proved the omission causes real review failures. 3. /ship now always copies `Accepted P2`, `Accepted non-blockers`, and `Skipped reviewers` from the review-cycle report into the PR body, regardless of whether the gate returned clean or partial. Codex correctly pointed out that under the new status contract, accepted-P2 produces `clean` (all P2 fixed-or-accepted → clean), so gating the propagation on `partial` would silently drop the audit trail. Same edit in both ship.md files. The broader pattern from this session: I keep using `git add -A` without reviewing what's actually being staged. That's how PR #5 originally picked up PR #4's commits (bad rebase) and how this round's debug transcripts leaked. Worth adding to the pr-review checklist as another infra-hazard bullet: "Files in PR diff that aren't named in commit messages — sign of git add -A picking up debug artifacts." Running round 6 to verify convergence.
Round 6 found four real issues from two reviewers. The most important one is a security finding from copilot: the documented Copilot allowlist permits `shell(rg)`, `shell(cat)`, `shell(head)` without path scoping, leaving a prompt-injection exfiltration path even though writes are blocked. A crafted commit subject could instruct the reviewer to read $HOME secrets. Four fixes: 1. [medium, security] Added `--add-dir "$(git rev-parse --show-toplevel)"` to the Copilot CLI example. Confines file-access surface to the repo. Combined with the per-command allowlist, eliminates the exfiltration path even under prompt injection. Documented the reasoning explicitly so future maintainers don't remove --add-dir thinking the allowlist alone is sufficient. 2. [medium] /ship's Copilot-skipped fallback had a logic gap: rerun review-cycle after bot review still returns partial (CLI is still blocked), so it can't clear status. Reworked the clearance path: bot review + no unaddressed findings + human marks ready = substitute acceptance. The human's `gh pr ready` invocation IS the explicit signal. Document substitution in PR body. 3. [medium] Removed the "Partial with only accepted P2 / accepted non-blockers" sub-branch in /ship — unreachable under the current status contract (accepted P2 → clean per the validation rule). The propagation of those fields is now unconditional (added in round 5), so the unreachable branch is also redundant. 4. [medium] "Commit and PR" section was still gated by "When validation and /review-cycle are clean," contradicting the draft-on-partial logic above. Reworded to "when the Review Cycle Gate has been satisfied," with explicit guidance that draft vs ready follows the gate's branch. 5. [low] Narrowed gitignore from `*.jsonl` (overbroad — hides legitimate fixtures) to specific probe-filename patterns (`*-test.jsonl`, `.copilot-session-*.jsonl`, `copilot-session-*.jsonl`). Plus a comment pointing to the actual rule: review outputs go to /tmp, not the repo. Running round 7 to verify convergence. This is round 6 of looping on PR #5. 17 substantive findings caught across the 6 rounds, all valid. The cycle is doing what it should even if convergence is taking longer than I expected.
… gitignore Round 7 of pr-review on PR #5. Codex caught two more real issues (copilot's output truncated mid-verification this round, same operational pattern as rounds 3-6 — extensive empirical testing, no Findings section emitted). 1. [medium, security] The round-6 `--add-dir <repo>` "fix" was incomplete. Per Copilot CLI's permissions help, the default allowed-paths set includes the system temp dir. `--add-dir` adds dirs, it doesn't restrict them. So /tmp was still reachable — a prompt-injected diff could still ask Copilot to read /tmp/secrets and quote them into findings. Fixed with a three-part path lock: - `-C "$REPO_ROOT"` sets cwd explicitly (default cwd is whatever shell invoked, not necessarily the repo) - `--add-dir "$REPO_ROOT"` adds the repo - `--disallow-temp-dir` removes /tmp from the default set Documented why all three flags are needed so future maintainers don't remove one thinking the others are sufficient. 2. [low] `*-test.jsonl` pattern in .gitignore was still overbroad — would silently hide e.g. `fixtures/payment-test.jsonl` from `git status` / `git add`. Narrowed to just the `.copilot-session-*.jsonl` / `copilot-session-*.jsonl` prefixes. Comment now explicitly says the real rule is "review outputs go to /tmp" — gitignore is defense-in-depth, not the primary prevention. This is round 7 of the cycle. The security finding was iterative — round 6 added --add-dir thinking that was the fix; round 7 caught that --add-dir alone wasn't enough; the three-part lock should be complete. Running round 8 to verify. Cycle metadata so far: - 19 substantive findings caught across 7 rounds - All valid, all addressed - Copilot's output truncation in rounds 3+ suggests its empirical-verification approach exceeds output limits in longer reviews; codex remains reliable - The security-scoping concern took 2 iterations to fully nail (--add-dir then --disallow-temp-dir + -C) — exactly the "fix can introduce or expose the next finding" pattern the loop is designed to catch
…/policy blockers Real failure mode just hit: I ran review-cycle on pr-review#3 from inside a Claude Code session and silently skipped the Copilot step, rationalising "the Copilot bot already reviewed it." That defeats the whole pre-push purpose — the bot only fires AFTER the PR opens. Both reviewer subprocesses other than codex failed in the same run: - `claude -p` from a parent claude session: 401 (OAuth doesn't propagate to children). - `gh copilot`: "Access denied by policy settings" (org Copilot policy disabling CLI use). The command file already specified running gh copilot, but the language was passive enough that I rationalised skipping it. This edit: 1. Marks the Copilot CLI step as non-optional with a one-sentence explanation of why (bot is post-push, CLI is pre-push, mixing them up defeats the catch-before-push purpose). 2. Updates the example invocation to the current `copilot` CLI syntax (`gh copilot -- -p "..." --allow-all-tools --effort xhigh`), replacing the older `--allow-tool 'shell(git)'` form that apparently doesn't apply to the v1+ Copilot CLI. 3. Documents the two known blockers (Copilot CLI policy, claude -p 401) with concrete remediation steps — the policy URL to flip, the env var to set. 4. Defines fallback behaviour: if a reviewer is unavailable, record it in the final report explicitly. Do not silently drop. If Copilot CLI is unavailable, fall back to draft PR + Copilot bot review before marking ready. Mirror edit in both the claude/ and codex/ command files. The command surfaces don't differ enough on this section to justify divergent guidance.
Real failure: I ran review-cycle on pr-review#3 and "stopped" after each reviewer's first pass instead of actually looping. When I caught myself and ran it properly, the cycle took 7 rounds to converge — catching 9 progressively-narrower factual issues that would have shipped if I'd stopped early. The command file already said "Run up to `rounds` review rounds. Default: 3" and "stop the loop as clean" when no findings remain, but the wording was loose enough that I rationalized one-shot behaviour. This commit adds explicit Hard Rules that close that gap: 1. Each round runs all reviewers in parallel against the SAME commit (not sequentially against each other's fixes — that lets findings cascade in misleading ways and obscures whether reviewers actually agree on the latest state). 2. A fix-round is NEVER the final round. Convergence requires at least one round where every reviewer returns 0 findings against the latest commit. Just pushed a fix? Run another round before declaring clean. 3. Convergence is per-commit, not per-finding. Reviewer A clean against commit X doesn't transfer to commit Y (the fix commit). Also updated: - Default cap guidance: 3 is right for code; 5-10 for documentation / reviewer-checklist content where each round catches narrower factual edges (the pr-review#3 cycle was 7 rounds). - Step 10 now explicit: "If a fix was pushed in this round, the next round MUST run." - Step 11 explicit: "Stop as clean only when a verify round (no edits) returns no actionable findings." - Cap-hit guidance distinguishes three cases: spec too detailed, diminishing returns acceptable, genuine gap. Mirror edit in both claude/ and codex/ command files. Evidence: see pr-review#3 (happyvertical/pr-review#3) for the 7-round convergence log with per-round commits, findings counts, and the asymmetric convergence pattern between codex (catches narrow factual edges via deep verification) and copilot CLI (catches structural/placement issues via live web fetch + cross-file grep).
Real signal from running review-cycle on pr-review#3: the loop went 7 rounds, but rounds 5 and arguably 1+3 had only "low" severity findings (narrow Node version-window edge cases). Looping on those was technical perfectionism — they didn't change what shipped, just burnt reviewer cycles. This commit adds an explicit exit-on-trivia rule, paired with the existing exit-on-substantive-clean rule: - P0/P1 (correctness, security, data loss, broken build): always block, always loop. - P2 (likely bug, missing test/docs): block by default, loop unless accepted in PR body with rationale. - **P3 / nit (polish, narrow factual edges, cosmetic): never block, never extend the loop.** Record as accepted non-blockers in the PR body or file as follow-up issues. So the loop exit condition becomes "no P0/P1/P2 findings remain" instead of "every reviewer returns zero findings." Convergence is about substantive risk, not perfect agreement. Re-running the pr-review#3 cycle under this rule: would have exited at round 4 (3 substantive fixes + 1 verify round) instead of round 7. The 3 low-severity findings (Node 25/26 ambiguity, Node 23.0-23.5 window, transform-types-omission narrow case) would be listed in the PR body as accepted non-blockers. Same ship outcome, ~half the wall-clock time. Updated the Final Report template to make the new fields visible: - "Remaining blockers (P0-P2)" — what would have re-triggered the loop - "Accepted non-blockers (P3/nit)" — recorded but not fixed in this PR - "Skipped reviewers" — to enforce the existing "never silently drop" discipline Mirror edit in both claude/ and codex/ command files.
Previous wording on the exit-on-trivia rule implied P3 findings
should be deferred ("record in PR body as accepted non-blockers, or
file as follow-up issues"). That's too restrictive — most P3 fixes
are one-line tweaks (rewording a doc bullet, fixing a comment
typo, etc.) and the right answer is to just fix them inline.
The rule's actual intent is: don't extend the LOOP on P3 — don't
run another full ensemble round just to verify a P3 fix. Whether
to address the P3 finding itself is a separate question with three
answers depending on cost:
1. Cheap to fix → fix inline, group with any other fixes
2. Worth tracking but not blocking → record in PR body
3. Bigger than this PR's scope → follow-up issue
The loop exit cares about "no P0/P1/P2 remaining," not about
whether P3 fixes happened.
Mirror edit in both claude/ and codex/ command files.
Belated but the right move — ran pr-review (codex + copilot CLI ensemble) on PR #5 after the rebase. Two findings, both real: 1. [medium, both reviewers] The example `gh copilot` invocation used `--allow-all-tools`, but the surrounding guidance claimed the run stayed "read-only" because the prompt instructed not to modify files. That's a policy/enforcement mismatch — prompt instructions are advisory, tool permissions are the actual enforcement. With write/edit tools available, a "review" pass can mutate the working tree mid-round, breaking the same-commit guarantee the loop relies on. Fixed by switching the example to an explicit read-only tool set (`--available-tools shell,read`) with a note that the prompt instruction is defense-in-depth. Also flagged that the CLI is preview-stage and tool names shift, so consumers should verify against `gh copilot -- --help` for their version. 2. [low, copilot only] P2 policy was internally inconsistent: step 3 said "block by default; loop unless explicitly accepted with rationale in the PR body", but step 6 said "Address all valid P0/P1/P2 findings in priority order" (no acceptance path). Final Report template only had a slot for "Accepted non-blockers (P3/nit)", no slot for accepted P2. Fixed by aligning step 6 ("Address all P0/P1 mandatory; P2 mandatory unless explicitly accepted...") and adding "Accepted P2 (with rationale)" to the Final Report template. Worth noting the asymmetric blind-spot pattern showed up again: - Codex caught the substantive medium (read-only enforcement). - Copilot caught both the same medium AND a low that codex missed (P2 cross-section consistency). The placement/contradiction pattern keeps being copilot's unique signal. Mirror edits in both claude/ and codex/ command files.
…atus contract Round 2 of pr-review on this PR found two more medium issues, both valid and both newly introduced by round 1's fixes. Exactly the "fix-round is never the final round" failure mode this PR codifies. Copilot literally EXECUTED CLI commands during its review to verify the first finding empirically — strongest convergence signal yet. 1. Round-1 fix replaced `--allow-all-tools` with `--available-tools shell,read`. That's wrong: `--available-tools` only filters which tools the model can SEE, not which it can run without approval. In non-interactive `-p` mode there's no UI to ask for permission, so tool calls get denied with "Permission denied and could not request permission from user." Review then runs with zero repository context. Copilot verified this by running `gh copilot -- -p "Run shell command: git diff --name-only --stat" --available-tools shell --effort low -s` and getting the permission-denied error. Codex independently caught it by reading the CLI's permissions docs. Correct shape: explicit per-command `--allow-tool 'shell(git diff)'`, `--allow-tool 'shell(git log)'`, etc. flags. Enforces read-only at the permission layer with per-command granularity. Added the example pattern + `gh copilot -- help permissions` reference for keeping current. 2. The "Skipped reviewers" rule said to record skips but didn't require Status to drop. So Copilot policy-blocked + codex+claude clean → final report "Status: clean, Skipped: Copilot" → /ship gates on Status==clean and proceeds. Recreates the soft-skip this PR exists to prevent. Fixed by making Status's contract explicit: clean REQUIRES all required reviewers ran. Any required reviewer skipped → Status is at minimum "partial." The Status enum description is now inlined in the final report template so the rule is visible wherever a consumer looks. Copilot CLI specifically gets the "open as draft for bot review" fallback explicit, not just "consider." Mirror edits in both claude/ and codex/ command files. Running round 3 to verify convergence.
Round 3 of pr-review on PR #5 found three real contract issues codex caught (copilot's output truncated mid-exploration this round): 1. [medium] Status `clean` definition omitted validation. So failed validation + reviewers clean → Status=clean → /ship gates on Status=clean and proceeds. Added validation to the contract: clean now requires "validation green"; failed validation maps to `blocked`. 2. [medium] "Record in PR body" / "open as draft PR" language assumed a PR already exists. But /review-cycle runs BEFORE PR creation when invoked from /ship — there's no PR body to write to. Reframed: /review-cycle records all decisions in its FINAL REPORT (canonical record); /ship is responsible for copying the report into the PR body and deciding the draft-PR fallback. When a PR already exists, the report still gets the canonical record; updating the PR body is then optional/derivative. 3. [low] Per-commit convergence rule said "clean on X ≠ clean on Y, re-run all reviewers" — but the P3 rule said "P3 cheap fix → no verify round needed." Contradiction: a P3-only commit would trigger the per-commit rule, forcing another full ensemble pass for a one-line wording tweak. Scoped per-commit rule to behaviour-changing commits (P0/P1/P2 fixes or non-fix changes); explicitly exempted P3-only commits. Mirror edits in both claude/ and codex/ command files. Running round 4 to verify convergence on the workflow contract. Worth noting the round counts: 3 substantive medium-fix rounds plus this commit. Each round caught real workflow-contract issues, not trivia — the loop is doing exactly what it's supposed to. The fact that we keep finding new contract bugs is itself evidence that the spec is dense enough to warrant this many rounds of refinement.
…andles partial Round 4 of pr-review on PR #5: codex caught two more medium contract drifts, both introduced by round-3's fixes. Copilot's output truncated mid-exploration again this round; treating as one substantive reviewer. 1. [medium] Stop condition contradicted P2 acceptance. Step 11 said "stop when no P0/P1/P2 findings remain" — but reviewers can keep surfacing an accepted P2 (they have no way to know it was accepted in the final report). So an accepted P2 would prevent reaching the documented clean state, the loop would hit the cap, and the run would look blocked despite the acceptance being on file. Changed stop condition to "no *unaccepted* P0/P1/P2 findings remain" with an explanatory sentence about why reviewers may keep surfacing accepted items. 2. [medium] Round-3 fix delegated the skipped-Copilot fallback to `/ship` ("downstream decides whether to open as draft") but didn't update `ship.md` to actually handle that case. `/ship`'s gate previously only branched on clean / partial-with-noise / blocked — a `partial` from a skipped required reviewer matched none of those, leaving the agent without an instruction. Expanded the gate to three explicit `partial` sub-cases: - partial because Copilot CLI was skipped → draft PR + bot review + rerun review-cycle + `gh pr ready` when clean - partial because another required reviewer (codex, claude-sub) was skipped → draft PR + flag for human acceptance - partial with only accepted P2/non-blockers → continue, copy acceptances from final report into PR body Same edit in both claude/have/commands/ship.md and codex/plugins/have/commands/ship.md. This expands the PR's scope to touch ship.md, but the changes are logically coupled to the review-cycle contract — the cycle correctly caught that I'd written a delegation without updating the delegate. Running round 5 to verify.
allowlist + ship propagation Two reviewers caught three real findings on the round-4 commit. The most embarrassing one was self-inflicted: I had two Copilot CLI session transcripts (.deny-test.jsonl, .revparse-test.jsonl) leak into the PR via `git add -A`. They were written by Copilot's own review probes (testing tool-permission flags) and got picked up along with my real changes. They leak local /Users/will/.agents/ paths, skill descriptions, session IDs. Copilot caught its OWN debug artifact in PR #5 and used `.deny-test .jsonl:30` (showing `Permission to run this tool was denied ... shell(git rev-parse)`) as concrete evidence for a *separate* finding — that the documented allowlist was missing `git rev-parse`. The meta-loop: artifact A introduced a privacy leak (caught), and also produced empirical evidence for content gap B (also caught). Three fixes: 1. Removed `.deny-test.jsonl` and `.revparse-test.jsonl` from the tree and the index. Added `*.jsonl` to .gitignore so future review-probe runs won't repeat the mistake. 2. Added `shell(git rev-parse)` to both review-cycle command file examples. Copilot's empirical evidence proved the omission causes real review failures. 3. /ship now always copies `Accepted P2`, `Accepted non-blockers`, and `Skipped reviewers` from the review-cycle report into the PR body, regardless of whether the gate returned clean or partial. Codex correctly pointed out that under the new status contract, accepted-P2 produces `clean` (all P2 fixed-or-accepted → clean), so gating the propagation on `partial` would silently drop the audit trail. Same edit in both ship.md files. The broader pattern from this session: I keep using `git add -A` without reviewing what's actually being staged. That's how PR #5 originally picked up PR #4's commits (bad rebase) and how this round's debug transcripts leaked. Worth adding to the pr-review checklist as another infra-hazard bullet: "Files in PR diff that aren't named in commit messages — sign of git add -A picking up debug artifacts." Running round 6 to verify convergence.
Round 6 found four real issues from two reviewers. The most important one is a security finding from copilot: the documented Copilot allowlist permits `shell(rg)`, `shell(cat)`, `shell(head)` without path scoping, leaving a prompt-injection exfiltration path even though writes are blocked. A crafted commit subject could instruct the reviewer to read $HOME secrets. Four fixes: 1. [medium, security] Added `--add-dir "$(git rev-parse --show-toplevel)"` to the Copilot CLI example. Confines file-access surface to the repo. Combined with the per-command allowlist, eliminates the exfiltration path even under prompt injection. Documented the reasoning explicitly so future maintainers don't remove --add-dir thinking the allowlist alone is sufficient. 2. [medium] /ship's Copilot-skipped fallback had a logic gap: rerun review-cycle after bot review still returns partial (CLI is still blocked), so it can't clear status. Reworked the clearance path: bot review + no unaddressed findings + human marks ready = substitute acceptance. The human's `gh pr ready` invocation IS the explicit signal. Document substitution in PR body. 3. [medium] Removed the "Partial with only accepted P2 / accepted non-blockers" sub-branch in /ship — unreachable under the current status contract (accepted P2 → clean per the validation rule). The propagation of those fields is now unconditional (added in round 5), so the unreachable branch is also redundant. 4. [medium] "Commit and PR" section was still gated by "When validation and /review-cycle are clean," contradicting the draft-on-partial logic above. Reworded to "when the Review Cycle Gate has been satisfied," with explicit guidance that draft vs ready follows the gate's branch. 5. [low] Narrowed gitignore from `*.jsonl` (overbroad — hides legitimate fixtures) to specific probe-filename patterns (`*-test.jsonl`, `.copilot-session-*.jsonl`, `copilot-session-*.jsonl`). Plus a comment pointing to the actual rule: review outputs go to /tmp, not the repo. Running round 7 to verify convergence. This is round 6 of looping on PR #5. 17 substantive findings caught across the 6 rounds, all valid. The cycle is doing what it should even if convergence is taking longer than I expected.
… gitignore Round 7 of pr-review on PR #5. Codex caught two more real issues (copilot's output truncated mid-verification this round, same operational pattern as rounds 3-6 — extensive empirical testing, no Findings section emitted). 1. [medium, security] The round-6 `--add-dir <repo>` "fix" was incomplete. Per Copilot CLI's permissions help, the default allowed-paths set includes the system temp dir. `--add-dir` adds dirs, it doesn't restrict them. So /tmp was still reachable — a prompt-injected diff could still ask Copilot to read /tmp/secrets and quote them into findings. Fixed with a three-part path lock: - `-C "$REPO_ROOT"` sets cwd explicitly (default cwd is whatever shell invoked, not necessarily the repo) - `--add-dir "$REPO_ROOT"` adds the repo - `--disallow-temp-dir` removes /tmp from the default set Documented why all three flags are needed so future maintainers don't remove one thinking the others are sufficient. 2. [low] `*-test.jsonl` pattern in .gitignore was still overbroad — would silently hide e.g. `fixtures/payment-test.jsonl` from `git status` / `git add`. Narrowed to just the `.copilot-session-*.jsonl` / `copilot-session-*.jsonl` prefixes. Comment now explicitly says the real rule is "review outputs go to /tmp" — gitignore is defense-in-depth, not the primary prevention. This is round 7 of the cycle. The security finding was iterative — round 6 added --add-dir thinking that was the fix; round 7 caught that --add-dir alone wasn't enough; the three-part lock should be complete. Running round 8 to verify. Cycle metadata so far: - 19 substantive findings caught across 7 rounds - All valid, all addressed - Copilot's output truncation in rounds 3+ suggests its empirical-verification approach exceeds output limits in longer reviews; codex remains reliable - The security-scoping concern took 2 iterations to fully nail (--add-dir then --disallow-temp-dir + -C) — exactly the "fix can introduce or expose the next finding" pattern the loop is designed to catch
Rounds 6-8 ratcheted up the Copilot allowlist docs (--add-dir, --disallow-temp-dir, -C, and a round-8 "HIGH severity" claim about prompt-injection exfiltration). Will pointed out the underlying mistake: codex was applying a remote-service threat model to a local CLI tool. Copilot CLI runs on the user's machine with the user's credentials; it can already see what the user can see. "Exfiltration" only matters when findings reach a third party, which they don't in the normal use case (engineer reviewing own org's PR pre-push, findings going to own terminal). I should have caught this. My own pr-review checklist says "Findings are evidence, not orders — verify each finding against the code before fixing." I just acted on the HIGH severity tag without questioning the threat model. Exact failure mode the checklist warns against. This commit walks back the elaborate security framing: - Kept the actual flags (-C, --add-dir, --disallow-temp-dir, per-command --allow-tool). They're still useful — for **scope hygiene** (keep the reviewer focused on the repo, prevent accidental noise from /tmp files), not security. - Replaced the rounds-6-7 "Why the path-scoping is a three-part lock... no exfiltration even under prompt injection" prose with a simpler "this is scope hygiene, not a security boundary" paragraph. - Added a "When stricter sandboxing actually matters" callout covering the real edge cases where the threat model holds: OSS PR review with auto-posted findings, CI runs with restricted secrets, untrusted contributor diffs. For those, a sanitized temp checkout is the right architecture. - For the normal HappyVertical case, the existing flags are enough. No architectural change needed. Mirror edit in both claude/ and codex/ command files. Lesson worth folding back: the pr-review checklist could use a bullet about "match the threat model to the deployment" or similar, so I (and any future reviewer using the checklist) catch the conflation before applying it. Will think about how to phrase it.
Per Will's request: use the "-cli" suffix consistently when referring to the local CLI tools as concepts, to keep the distinction between "Copilot bot" (cloud service that reviews post-PR) and "copilot-cli" (the local CLI we invoke pre-push) visible everywhere. Same pattern for the other two: - "Codex" / "codex" (as tool reference) → "codex-cli" - "Claude" (as reviewer-name in lists) → "claude-cli" - "Copilot" / "Copilot CLI" → "copilot-cli" PRESERVED (these aren't the CLI tool): - "Claude Code" — the product name / agent identity - "Copilot bot" / "Copilot PR review *bot*" — the cloud bot - "Copilot in Chrome" — the browser extension - "Copilot policies page" — the org admin setting - Branch prefixes `claude/<task-name>` / `codex/<task-name>` - Filesystem paths `/Users/will/.claude/`, `.claude-plugin`, etc. - Literal invocations `claude -p`, `codex exec`, `gh copilot` - "Claude Bash tool" — refers to Claude Code's internal tool Done across all four command files: - claude/have/commands/review-cycle.md - codex/plugins/have/commands/review-cycle.md - claude/have/commands/ship.md - codex/plugins/have/commands/ship.md Mechanical rename via Python with preserve-then-substitute pattern; the first pass produced "codex-cli-cli" doubling which I fixed with a follow-up pass; final hand-fix for "claude-cli Code" → "Claude Code" where the standalone-Claude regex collided with the Claude-Code product name.
…msg in error output PR #5's commits use `fix(review-cycle,ship):` (multi-scope, one edit landing in both slash command surfaces). The old regex `(\([a-z0-9-]+\))?` rejected the comma, blocking the PR from merging despite the commits being well-formed Conventional Commits. Relax to `[a-z0-9][a-z0-9,/-]*`: - First char must be alphanumeric — prevents stray punctuation like `(,foo)` or `(-foo)`. - Allows comma for multi-scope commits. - Allows forward slash for dep-name scopes like `chore(tibdex/github-app-token):`. - Still rejects uppercase, colons in scope, leading punctuation. - Verified against 9 synthetic cases (5 pass, 4 fail as expected) AND all 13 PR #5 commits (all pass). Also fix the existing `echo "::error::Invalid commit message: $msg"` to escape `%`, CR, LF before embedding the user-controlled commit message in a workflow command. This is the same lesson from PR #6 round 14 / round 4: GitHub workflow commands parse `%`, `%0D`, and `%0A` as their respective bytes, so an attacker (or just an inconvenient commit message) containing those bytes can corrupt the command payload or inject additional workflow commands. Extract escape logic into `escape_wc()` for reuse. Also switched `echo "$msg" | grep` to `printf '%s' "$msg" | grep` — `echo` can interpret leading `-n`/`-e`/`-E` as flags in some shells, so a commit subject starting with one of those would silently produce wrong output. `printf '%s'` is unconditional.
Fold in the meta-lesson from have-config#6's loop. That PR ran codex-cli solo for 12 rounds because the ensemble step in this command was described but not enforced. At round 13, codex returned 0 findings; I declared convergence. When copilot-cli was added for round 14, it immediately surfaced two real findings that codex had been blind to across all 12 rounds: 1. The step-0 partial-release detector matched ANY commit subject starting with `chore(release):` — would deadlock on a normal human dependency-bump commit like `chore(release): bump pnpm/action-setup`. Codex missed; copilot caught. 2. A SIGPIPE/pipefail interaction in the same detector would silently bypass it entirely under non-trivial conditions. Codex missed; copilot caught. Both were the kind of failure-path edge case the ensemble exists to catch. Solo runs converge on what one reviewer's prior knowledge covers; ensemble convergence covers the union. Add a Hard Rule covering two specific failure modes: - *Silent solo*: declaring convergence on one reviewer's 0 findings because the others are "redundant" / "slow" / "already passed last round". They're not redundant — they have non-overlapping blind spots. - *Unavailable ≠ clean*: if a reviewer is blocked by auth, policy, or environment, the absence of findings is not affirmative clearance. Record explicitly; don't count silence as agreement. Mirror edit in both `claude/` and `codex/` review-cycle.md.
1a14339 to
d9db0cf
Compare
Ensemble: codex-cli + copilot-cli, 3 distinct findings, no overlap.
All real, all high-confidence.
1. [medium, codex] Final report example omits the required claude-cli
reviewer
The new contract requires three reviewer subprocesses: codex-cli,
claude-cli (separate non-interactive print-mode invocation), and
copilot-cli. The example string was "3 rounds: codex-cli +
copilot-cli + me" — implying the parent agent counts as the
claude-cli reviewer. That contradicts the Hard Rule that claude-cli
must run as an independent subprocess. An agent following the
example could report Status: clean with claude-cli never having
run, and `/ship` would gate-pass on a false-converged report.
Fix: example string lists all three required reviewers; added a
clarification that the orchestrator agent's inline opinion does
not substitute for any subprocess reviewer.
2. [medium, codex] Round-cap `blocked` definition contradicts P3
non-blocking rule
Status definition said `blocked = ... cap hit with findings open`
without scoping to severity. The Hard Rules say P3/nit findings
never block and never extend the loop. So if the loop hit the
cap with only P3 findings remaining (the exact "loop converging
on trivia" pattern these rules are designed to exit cleanly),
the status definition would force `blocked` and `/ship` would
refuse to open a ready PR.
Fix: explicit carve-out that round-cap exit with ONLY P3/nit
findings remaining is NOT blocked — those go in accepted
non-blockers and Status stays clean (or partial if a required
reviewer was skipped).
3. [low, copilot] Scope regex accepted malformed multi-scope strings
The first cut at the scope regex (`[a-z0-9][a-z0-9,/-]*`) was
too permissive — it let separators appear in arbitrary positions:
`fix(a,):`, `fix(a/):`, `fix(a,,b):`, `fix(a-):`, `fix(,b):`,
`fix(/foo):` all passed. Downstream tooling that assumes
comma-separated non-empty scope tokens would break.
Fix: tokenize the grammar — `scope = segment ("," segment)*`,
`segment = [a-z0-9] ([a-z0-9/-]* [a-z0-9])?`. Each segment must
start AND end with alphanumeric; separators only between. Verified
against 16 synthetic cases (8 pass, 8 fail as expected) AND all
15 PR #5 commits (all pass).
Mirror edits in both claude/ and codex/ review-cycle.md files.
…review polish Round-2 ensemble (codex-cli + copilot-cli + me as third). 3 distinct real findings + 1 P3 inline-fix from the self-review pass. 1. [medium, codex] Copilot --allow-tool isn't a hard read-only boundary `--allow-tool 'shell(git diff)'` approves any `git diff …` invocation INCLUDING write-capable forms like `git diff --output=path` which can dirty the working tree. The review-cycle docs claimed the allowlist was the structural enforcement of "no write tools" — that's false. Same applies to `shell(rg)` via redirection-flag forms. This breaks Step 11's same-commit guarantee: if a reviewer modifies the tree mid-round (whether via prompt injection or just accidental tool choice), the commit being reviewed has moved. Fix: document the gap, document the structural mitigation (post-review `git status --porcelain` tree-clean check), and point at the disposable-worktree pattern for repeat offenders. 2. [medium, codex] Draft fallback can wait forever for bot review The ship.md fallback opens a draft PR expecting the Copilot bot to auto-review. But GitHub's default Copilot review only fires on PR open (non-draft) or Draft→Open transition, and doesn't auto-re-review subsequent pushes. Without "Review draft pull requests" + "Review new pushes" settings enabled, the fallback silently waits forever. Fix: document the prerequisite check explicitly, document the manual review-request fallback (`gh api ... requested_reviewers`), and emphasize that "current commit" means re-request after each subsequent push. 3. [medium, copilot] Step 11 stop condition omitted reviewer availability "Stop as clean" only checked findings + validation, not whether all required reviewers ran. The Status contract said skipped required reviewer forces `partial`, but Step 11 could have allowed `clean` to slip through if a reviewer was unavailable. Fix: rewrite Step 11 with three explicit conditions (no unaccepted P0/P1/P2 + green validation + all required reviewers ran). Inline clarifying "no findings surfaced ≠ clean — a reviewer that didn't run produced no findings because it didn't run." 4. [P3, self-review polish, me] Forward-reference cleanup First draft of Step 11 said "(skipped/unavailable reviewer → see the next bullet)" with the actual explanation in the paragraph below. "Next bullet" is ambiguous — the third condition isn't followed by separate bullets. Made the third condition self-contained and dropped the forward pointer. Mirror edits in both claude/ and codex/ files. Reviewer notes: - Three reviewers ran: codex-cli (background subprocess), copilot-cli (background subprocess), and me (Claude orchestrator doing an explicit checklist pass on the staged diff). - claude-cli subprocess was NOT run — the documented OAuth blocker fired (parent OAuth doesn't propagate to child claude -p, keychain token expired). The user-orchestrator (Claude) doing an explicit checklist review is the documented fallback. Acceptable per the contract this PR is editing IF the orchestrator's review is written down, not "I looked, it's fine". That's what the Finding 4 / self-review section above is.
…4 reviewers) Round-3 ensemble: codex-cli + copilot-cli + claude sub-agent (via parent's Agent tool — the cleanest workaround for the OAuth blocker documented in PR #5's own contract; same independence property as a separate `claude -p`) + orchestrator self-review. Eight distinct findings surfaced across the four reviewers (codex 0, copilot 2, claude 6, me 2). Six accepted, two rejected: ACCEPTED (applied this commit): 1. [medium, copilot + claude] `Accepted P2` listed as partial trigger in ship.md, but the Status contract defines `partial` only by skipped required reviewers (Accepted P2 → clean). Removed the ambiguous "or Accepted P2" branch trigger. 2. [medium, claude] Default rounds wording contradicted itself — Line 7 / arg list say "Default: 3", but line 308 said "Default: 3 for code, 5-10 for docs" as if auto-detected. The command does NOT auto-detect change type. Reworded to "default is 3 regardless; for doc work consider passing rounds=5..10" so the user does the adjustment explicitly. 3. [low, copilot] Scope regex allowed `fix(a//b):`, `fix(a--b):` (repeated separators within a segment). Combined with finding 4 below into a single tighter grammar. 4. [low, claude] Scope regex rejected `@scope/pkg` (scoped npm packages) — `feat(@happyvertical/sql):` would have failed commitlint despite the error message explicitly advertising forward slash "e.g. dep names". Combined with finding 3 into: scope = segment ("," segment)* segment = "@"? alpha-run (("/" | "-") alpha-run)* alpha-run = [a-z0-9]+ Verified against 22 synthetic cases + all 17 PR #5 commits. 5. [low, claude] `.gitignore` `*.jsonl` patterns don't match copilot's actual session file format. Per `gh copilot -- --help`, `--share[=path]` writes `copilot-session-<id>.md` (markdown), not jsonl; `--output-format json` writes to stdout, not a file; `--log-dir` defaults to `~/.copilot/logs/` (outside repo). Added `*.md` patterns (the actual leak vector) while keeping `*.jsonl` patterns for defense-in-depth (the original 1fc8677 incident leaked `.deny-test.jsonl` / `.revparse-test.jsonl` from internal probes — that pattern may resurface under different CLI versions). Comment now states what the primary prevention actually is (review outputs go to /tmp). 6. [low, claude] Round-cap section was silent on "P0/P1/P2 fix lands in final permitted round" — Rule 10 demands a verify round; cap forbids it. A literal-following agent could report `clean` based on "no findings remaining" when actually no verify round ran to surface them. Added explicit special case: report status `partial` with note recommending re-run with raised cap. REJECTED (with rationale recorded): 7. [medium, claude] "copilot-cli `-p` requires `--allow-all-tools`" — empirically refuted. Just tested `copilot -p` with per-command `--allow-tool` only (no `--allow-all-tools`), got correct output. The doc's recipe is correct; copilot's `--help` text saying "required for non-interactive mode" appears to be misleading guidance, not a hard requirement. 8. [suggestion, claude] "Use `shell(git:*)` prefix consolidation" — would WEAKEN the doc's read-only safety. `shell(git:*)` enables ALL git subcommands including `git push`, `git apply`, `git commit` (writes). The current enumeration of read-only subcommands is intentional safety, not redundancy. P3 DEFERRED (recorded for follow-up, not fixed this round per the "P3 never extend the loop" rule): - [P3, me] ship.md cites "Settings → Code & automation → Copilot → Code review" UI path. GitHub UI paths drift. Add canonical docs URL as durable reference. - [P3, me] `gh api ... requested_reviewers copilot-pull-request- reviewer` 404s if Copilot isn't installed on the repo. Out of likely threat model but worth a caveat. Mirror edits in both claude/ and codex/ files. Reviewer note: claude-cli subprocess auth (OAuth from parent claude session) remained blocked this round. The Agent-tool workaround (spawning a fresh Claude as a sub-agent with the same review prompt) provides the same fresh-perspective independence — and demonstrably caught 6 findings the other reviewers + the orchestrator missed. Worth folding into the docs as the canonical claude-cli substitute when OAuth fails.
…, no rejections) Round-4 ensemble (codex-cli + copilot-cli + claude sub-agent + me). Surfaced 6 distinct findings; 4 accepted, 1 declined as nit, 1 was my own self-review note that overlapped with #1. 1. [medium, ALL 4 REVIEWERS] Cap-blocked-verify case overloaded `partial` status, breaking `/ship`'s partial-handling contract Round-3 fix #6 said "report status `partial`" when a P0/P1/P2 fix landed in the final permitted round (no verify possible). But the Status contract defines `partial` strictly as "skipped required reviewer". `/ship` then routes `partial` only by `Skipped reviewers`, which would be empty in this case — undefined behavior, agent could fall through to "treat as clean and ship unverified fix". Four-way confirmation: - codex: "make it `blocked`/`needs-verify`" - copilot: "either add explicit `verify-round-missed` reason + new ship branch, or classify as `blocked` to keep partial strict" - claude sub-agent: "either add third partial sub-branch in ship.md, or use different status (`blocked` with verify-needed marker)" - me (self-review): "contract overload — needs broadened definition or different status" Fix: reclassify as `blocked` with reason `verify-round-blocked-by-cap`. Three coordinated edits: (a) Round-cap section: change "report status as `partial`" to "report status as `blocked` with reason `verify-round-blocked-by-cap`". (b) Status contract `blocked =` definition: extend to enumerate "P0/P1/P2 fix landed in the final permitted round with no verify round possible". Reasoning: an unverified fix counts as potentially unaccepted because we don't yet know if the fix introduced new findings. (c) Status contract `partial =` definition: add "Single cause only — other 'incomplete' states (unverified fix, validation failed) are `blocked`, not `partial`" to lock down the single-cause invariant. (d) ship.md `blocked` branch: add explicit sub-case for `verify-round-blocked-by-cap` directing operator to re-run `/review-cycle rounds=N+1`. 2. [low, copilot + claude] `.gitignore` comment cited prior incident filenames the patterns don't actually catch Round-3 comment claimed defense-in-depth for `.deny-test.jsonl` / `.revparse-test.jsonl` from the 1fc8677 incident, but the narrow `copilot-session-*.jsonl` patterns don't match those names. Misleading framing. Fix: rewrite the comment to be honest about what the patterns cover and don't cover. Note explicitly that the narrow patterns were a deliberate round-7 walkback (avoided `*-test.jsonl` wildcard that would hide legitimate fixtures), and that the structural defense for arbitrary probe filenames is the /tmp rule, not gitignore. 3. [low, codex] Commitlint workflow had unused `pull-requests: read` permission The job only does `actions/checkout` + `git log` over commit SHAs from the event payload. No PR API calls. `pull-requests: read` is dead scope. Per least-privilege, dropped it. 4. [low, claude] Commitlint error message didn't mention `@` (the regex was updated to allow it in round 3 but the help text lagged) A user hitting the error with `feat(@foo/bar): ...` for an unrelated reason would read the help message and conclude `@` isn't supported. One-line fix to add "optional leading @ (scoped npm packages)" to the chars list. DECLINED: 5. [nit, claude] Regex accepts `feat(@scope):` (no `/pkg` suffix) — not a valid npm-scoped reference Tightening to require `/` after `@` would reject legitimate non-npm uses (e.g. someone wanting to use `@`-prefix for their own scope convention). The regex isn't claiming to enforce npm-semantic validity. Out of scope for a commit-message linter. Round-4 ran the full 4-reviewer ensemble in parallel against the same commit (1dd7688). Three rounds total of 4-reviewer ensembles now — the claude-sub-agent-via-Agent-tool workaround is proving durable enough to fold into the docs as the canonical claude-cli substitute when OAuth fails.
Round-5 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
Six distinct findings; five accepted, one declined.
1. [medium, codex] Wrong API path for manual Copilot review request
Round-2 fix told operators to call `gh api ...
requested_reviewers -F 'reviewers[]=copilot-pull-request-reviewer'`
to manually request Copilot review when the auto-review opt-in
isn't enabled. Per GitHub's current docs, the supported CLI path
is `gh pr edit <PR> --add-reviewer @copilot`, and
`copilot-pull-request-reviewer` is a workflow/app slug used for
billing, not a reviewer alias. The original recipe could 422
silently while the operator believes the fallback was satisfied
— leaving the PR un-reviewed.
Fix: replace the `gh api` recipe with the documented `gh pr edit
--add-reviewer @copilot` form. Link to GitHub's "request a code
review" docs. For re-reviews, point at the Reviewers-menu
re-request button (the documented re-review UX). Also linked
the prerequisite-settings docs URL instead of a UI path that
can drift as GitHub renames sections.
2. [low, claude sub-agent] codex variant ship.md frontmatter unquoted
`description: Prepare ... shipping: validate ...` — embedded
colon-space is interpreted as a nested mapping by strict YAML
parsers. The mirror claude variant already quotes the string;
the codex variant didn't. One-character fix (add outer quotes).
Pre-existing but in a file this PR touches substantively;
normalize as part of this round.
3. [low, claude sub-agent + me (overlap)] Top-line said "three
reviewers" but practice has been "four" (with orchestrator)
This loop has been running a 4-reviewer ensemble for several
rounds (codex-cli + copilot-cli + claude sub-agent + orchestrator
self-review), but the docs still said "three independent reviewer
subprocesses". Reviewers can keep surfacing the same ambiguity
each round unless the docs match practice.
Fix: substantial docs update:
(a) Intro paragraph in both variants now describes a
"4-reviewer ensemble": three subprocess reviewers + the
orchestrator's own checklist pass.
(b) Claude variant: explicitly documents the claude-sub-agent
via Agent tool as the canonical claude-cli substitute when
OAuth fails (preferred: `claude -p` with valid auth;
fallback: sub-agent with same prompt). Validated across
3 rounds in this loop.
(c) Both variants: explicit rule that the orchestrator's pass
is NOT silent-solo — must be an enumerated checklist run
with written-out findings, not "I looked, it's fine".
(d) Reviews field in Final Report updated to list ALL FOUR
slots and require explicit skip/substitute reasons.
4. [P3, me — self-review] Voice consistency
Special-case wording used "we don't yet know whether the fix
introduced new findings" — colloquial "we" vs the doc's
imperative voice elsewhere. Reworded to "its effect on the
codebase is unobserved" (third-person, consistent).
5. [P3, me — self-review] P3-only carve-out ambiguous re
validation-failed
Round-2 carve-out said "Status stays clean (or partial if a
required reviewer was skipped)" but didn't address validation
failure. Could read as implying clean/partial overrides
validation-failed → blocked. Clarified: "(or blocked if
validation failed — the carve-out only suppresses the P3-only
at-cap path; other blocked causes still apply)".
DECLINED:
6. [low (low confidence), claude sub-agent] Bare `Merge` subject
(no trailing space) bypasses neither check
The current `if [[ "$msg" =~ ^Merge\ ]]` matches `Merge ` only.
A subject of just `Merge` with no following text would fail
both this guard and the regex, rejecting the merge commit.
Declined because `git merge` never produces a bare `Merge`
subject — it always emits `Merge branch ...` or `Merge pull
request ...`. The only way to get a bare `Merge` subject is a
manually authored message, in which case the operator should
conform. Not worth tightening the guard for an unobserved edge
case.
…o rejections)
Round-6 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
Seven distinct findings, all accepted. Most are fallout from round-5's
contract evolution interacting with existing rules.
1. [medium, codex + copilot + me (3-way)] Loop steps still said
"three reviewers" while intro said "four"
Round-5 added the orchestrator as the 4th reviewer slot in the
intro, but the procedural loop (`Run codex-cli, claude-cli, and
copilot-cli`) and the "For all three" header still said three.
A literal-following operator would skip the orchestrator pass
entirely and still report `clean` because the 4th slot wasn't
required by the loop steps.
Fix:
(a) Loop step 2: list all FOUR slots explicitly and require
each to produce findings before dedup.
(b) Per-reviewer guidance: rename "For all three" → "For all
three subprocess reviewers" + add new "Orchestrator
self-review (the 4th reviewer slot)" section with explicit
rules: same JSON shape as subprocesses, "no findings" must
be explicit, runs concurrently with subprocesses.
2. [medium, copilot] Slot-based redefinition needed for claude
subprocess vs sub-agent
The claude-cli reviewer can be filled via subprocess (`claude -p`)
OR sub-agent (via Agent tool when OAuth fails). But the Status
contract said skipped required reviewer forces `partial`, and the
`Skipped reviewers` field looked at the subprocess specifically.
Strict reading: if `claude -p` fails and sub-agent succeeds, the
subprocess is technically "skipped" → `partial` even though the
slot was filled.
Fix: think in terms of SLOTS, not specific invocations. The
claude slot can be filled by EITHER subprocess or sub-agent;
the slot is "skipped" only if BOTH fail. Same for ship.md
partial branch.
3. [medium, codex] Tree-clean post-check breaks for dirty/uncommitted reviews
Round-2's tree-clean check (`git status --porcelain` after each
reviewer) only works when the tree was clean BEFORE the reviewer
ran. For uncommitted-work reviews, `git status` is already
non-empty, so the check either invalidates every round or
misses same-status mutations (e.g. reviewer modifies an
already-modified file; status stays `M path`).
Fix: rewrite as **pre/post tree-snapshot comparison**:
- Committed-work case: pre-clean, post-check `git status --porcelain`
for any output (existing behaviour, documented explicitly).
- Uncommitted-work case: either stash/commit before reviewing
(recommended) OR capture pre/post snapshots of status + diff
+ untracked content hashes and diff them.
Documented both with the "never just 'is git status clean now'
as the post-check" reminder.
4. [medium, claude] `gh pr edit --add-reviewer @copilot` requires
gh CLI v2.88.0+
Round-5 replaced the wrong `gh api ... requested_reviewers`
recipe with `gh pr edit --add-reviewer @copilot`. Per
cli/cli#v2.88.0 release notes, `@copilot` on `--add-reviewer`
was added in v2.88.0; older gh returns `Could not request
reviewer: '@copilot' not found` and silently skips the request
— same silent-fallback-fails-forever pattern round-5 was trying
to prevent.
Fix: document the version requirement explicitly, with
`gh --version` check and upgrade path. Re-reviews use the PR
page's Reviewers menu re-request button (not gh pr edit, which
is initial-add only).
5. [medium, claude] claude variant ship.md "claude-cli subprocess
auth fails" → partial contradicts round-5's sub-agent fallback
Round-5 documented the sub-agent fallback for claude-cli, but
the ship.md partial-branch still listed "claude-cli subprocess
auth fails" as a partial trigger. If sub-agent succeeded, the
slot is filled and `/review-cycle` should return `clean`, not
`partial`. Round-5 contract update wasn't propagated to ship.md.
Fix: update claude variant ship.md to require "claude-cli AND
sub-agent fallback both failed" before treating the slot as
skipped. Codex variant unchanged (no Agent-tool equivalent).
Added defensive note that if `/review-cycle` returned `partial`
despite a successful sub-agent fill, the bug is in the
orchestrator's classification, not ship.md's handling.
6. [low, claude] codex variant Reviews template cited undefined
"sub-agent fallback"
Round-5's mirror edit added "(subprocess OR sub-agent fallback)"
to the Reviews template in both variants. But the codex variant
has no body explanation of "sub-agent fallback" — the Codex CLI
orchestrator doesn't have an Anthropic-Agent-tool equivalent.
Operator reading the codex variant report template would be
confused.
Fix: drop "OR sub-agent fallback" from the codex variant
Reviews template; explicitly note that the Codex CLI
orchestrator has no documented substitute for claude-cli, so
unavailability means accept the reduced-coverage tradeoff.
7. [low, claude] ship.md step 7 auto-`gh pr ready` collided with
partial-branch human-clearance requirement
Step 7 said "if existing PR is draft and the work is now clean,
`gh pr ready`". "Now clean" was ambiguous — could be read as
subjective ("the tree looks fine now") rather than the Review
Cycle Gate output specifically. On `partial` (e.g. Copilot CLI
blocked, draft opened, bot reviewed manually), a literal reader
could auto-ready before the human explicitly accepted the
bot-for-CLI substitution.
Fix: tie step 7 explicitly to "/review-cycle returned status
`clean`" (not `partial`, not `blocked`). On `partial`, the
human runs `gh pr ready` after the documented clearance path.
…aveat
Round-7 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
7 findings; 6 accepted, 1 declined (false positive on my part — claude
sub-agent claimed claude variant ship.md frontmatter was unquoted but
empirical check shows both variants are already quoted).
1. [medium, codex + copilot + claude (3-way)] claude variant
required-reviewer wording said "claude-cli subprocess" but the
slot-based model elsewhere allows EITHER subprocess OR sub-agent
Round-5 documented the sub-agent fallback. Round-6 updated some
contract text to slot-based language but missed the explicit
"Status MUST drop to partial when ... claude-cli subprocess is
skipped" clause. Strict reading: subprocess fails → partial,
even when sub-agent fills the slot. ship.md already calls this
"a bug in classification".
Fix: rewrite the required-reviewer list as an enumerated slot
list. Claude variant explicitly says "the claude reviewer slot —
filled by either `claude -p` subprocess OR Agent-tool sub-agent
fallback when OAuth fails. The slot is 'filled' if EITHER
succeeds; only 'skipped' if BOTH fail."
2. [medium, codex] codex variant treated blocked-claude as
"filled with caveat" in the loop step
Round-6 wrote "claude-cli (or accepted-tradeoff if blocked)" in
the codex variant's loop step 2, which mixes filled and skipped
semantics. A later run can treat the blocked slot as filled and
bypass the partial-status gate.
Fix: drop the "(or accepted-tradeoff if blocked)" parenthetical.
Explicit: if claude-cli fails on Codex CLI (no sub-agent
substitute), the slot is skipped → partial. Accepted-tradeoff
rationale lives in the final report, not in the slot list.
3. [medium, copilot] codex variant required-reviewer list omitted
the orchestrator slot
Round-6 added the 4-reviewer ensemble but the codex variant's
required-reviewer-for-status-gate definition only listed the
three subprocess reviewers. A run that skipped the orchestrator
pass could still report `clean` because the gate's required
list didn't include it.
Fix: enumerate all four slots in both variants' required-
reviewer definitions. The orchestrator slot fills itself but
must be listed as required for the partial-status gate to bite
when an operator forgets the self-review pass.
4. [medium, claude] "Run claude-cli review" section had no sub-agent
fallback recipe
Round-5 documented the fallback existence in the intro but the
actual "Run claude-cli review (as a subprocess)" section only
showed the subprocess invocation. An orchestrator hitting the
401 error and looking for fallback instructions in the recipe
section would find none, then either skip the slot wrong or
ad-hoc invoke a sub-agent that may not produce the standard
JSON shape.
Fix: rename section to "Run claude-cli review (subprocess
preferred, sub-agent fallback)". Document both paths:
- Subprocess: existing `claude -p` recipe, plus explicit note
about CLAUDE_CODE_OAUTH_TOKEN / ANTHROPIC_API_KEY setup so
subprocess auth actually works from a Claude Code parent.
- Sub-agent fallback: concrete Agent({}) shape with
`subagent_type: "general-purpose"`, the standard JSON output
instruction, and run_in_background.
Also note explicitly the model-family overlap caveat (sub-agent
is same model as orchestrator, so independence is
"no-shared-conversation-context" not "different-model-family").
5. [low, claude] "For all three subprocess reviewers" rules don't
cleanly apply to the sub-agent fallback path
The renamed section bundles rules that are subprocess-specific
(15-min background-Bash timeout, stdout/stderr capture to
separate files). The sub-agent doesn't run as a subprocess, so
these don't apply.
Fix: rename to "For all three external reviewer slots" and
prepend a note explaining which rules map to the subprocess
path vs the sub-agent fallback path.
6. [low, claude — philosophical but valid] Orchestrator-as-4th-
reviewer has structural confirmation bias the doc didn't
acknowledge
The ensemble's stated justification is "Different models have
different blind spots." The 4th slot (orchestrator self-review)
is the SAME model as the claude-cli reviewer AND the agent that
wrote the code being reviewed. It has the LEAST blind-spot
coverage of all four slots and has full author context (=
confirmation bias). Treating "4/4 clean" as equivalent to "3/3
clean from independent reviewers" over-weights the orchestrator.
Fix: add an explicit caveat in the orchestrator self-review
section: this slot's role is "explicit checklist
accountability" (forces the discipline), not independent
blind-spot coverage. A finding the orchestrator alone surfaces
is real; a "no findings" pass from the orchestrator alone is
weak. Findings flagged ONLY by orchestrator get extra scrutiny
for over-flagging (the bias works both ways).
DECLINED:
7. [low (low conf), claude] Cross-variant frontmatter inconsistency
— empirically refuted. Checked both files: both already use
quoted `description: "..."`. Claude sub-agent's claim was wrong.
…bug, 5 low polish)
Round-8 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
6 findings; all accepted. Most were polish, but one was a real bug
I introduced in round 6.
1. [medium, codex] `git stash` recommendation actively breaks the
dirty-tree review workflow
Round-6 said "create a `wip` commit or `git stash`, run the
round on the committed/stashed state, then unstash/reset
after." That works for the WIP commit (changes stay in the
tree). But `git stash` REMOVES changes from the worktree — so
reviewers run against the pre-WIP tree, report findings on
code that wasn't being reviewed, then unstash restores changes
that never got reviewed.
Real concrete failure: stash WIP → reviewers report clean on
pre-WIP state → unstash → `/ship` continues with unreviewed
restored changes.
Fix: drop `git stash` from the recommended path. Document the
WIP-commit-then-`git reset --mixed HEAD~1` dance explicitly,
with the warning that stash is the wrong tool here.
2. [low, codex] Dirty snapshot recipe missed staged/index state
The snapshot recipe captured `git status --porcelain` + `git
diff` + untracked-file hashes. But `git diff` only covers
UNSTAGED worktree changes. If the baseline has staged changes
and a reviewer mutates the index while leaving the status
shape as `M path`, snapshots can compare equal even though the
bytes being reviewed changed.
Fix: include `git diff --cached` (staged/index) in both before
and after captures.
3. [low, claude] `/ship` Final Report `Reviews:` field never
upgraded to the 4-slot enumeration
Round-7 updated `/review-cycle`'s Reviews field to require all
four slots be enumerated. `/ship`'s own Final Report template
still said `Reviews: <rounds and tools>` — vague pre-slot-model
wording. The `Accepted P2` / `Accepted non-blockers` /
`Skipped reviewers` fields were also missing from `/ship`'s
own report even though `/ship` is required to copy them into
the PR body.
Fix: rewrite the `/ship` Final Report Reviews field to copy
from `/review-cycle`'s report verbatim, and add the three
missing fields. `/ship`'s own report is now self-contained.
4. [low, claude] codex variant orchestrator concurrency wording
was weaker than claude variant's
The claude variant said "Run the orchestrator pass in parallel
with the subprocesses (while they run in the background, the
orchestrator reads the diff against the checklist)." The codex
variant said only "Run the orchestrator pass while the
subprocesses are running in the background" — could be
misread as "wait for them, then run while they idle."
Fix: mirror the claude variant's exact wording in the codex
variant. Wall-clock parallelism is the intent.
5. [low, claude] "All reviewers in parallel" rule didn't carve
out the sub-agent launch-on-failure semantic
The loop rule says "all reviewers in parallel from t=0", but
the sub-agent fallback is inherently sequential — it can only
launch after `claude -p` returns auth failure. A strict
reading could push an orchestrator into launching both
concurrently as a defensive precaution, doubling Anthropic
API spend on rounds where the subprocess succeeds.
Fix: one-paragraph carve-out below the parallel rule —
launch subprocess first, wait briefly for the 401, spawn the
sub-agent only if auth failed. The fallback still counts as
"parallel" for the same-commit guarantee.
6. [P3, me — self-review] Agent({}) pseudo-code in sub-agent
fallback recipe could mislead readers into using JS-like
syntax
The example used `Agent({ subagent_type: "...", ... })` shape
which looks like a JavaScript function call. The actual tool
invocation uses JSON parameter values.
Fix: rewrote as "Agent tool invocation:" + key-value list +
explicit "illustrative — not a literal JavaScript call" note.
…viewers)
Round-9 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
5 findings; 4 accepted, 1 rejected as empirically refuted.
1. [medium, claude] codex variant intro contradicted the new
partial-on-skip Status contract
Round-5 intro paragraph (which I added for codex variant) said
the parent should "explicitly accept the reduced-coverage
tradeoff with rationale" when a reviewer is blocked. The
tightened Status contract from rounds 6-7 explicitly forbids
this: skipped slot → partial, no "accept and continue" path.
Driver reading top-to-bottom hits the old framing first, applies
it, never re-reads the contract → unreviewed code through
`/ship`'s gate.
Fix: rewrite both spots (intro and Reviews template) to point
at the Status contract: skipped means partial, rationale lives
in the Skipped reviewers field as documentation, not as a
loophole that keeps Status=clean.
2+3. [medium, codex + claude] Dirty-tree WIP-commit recipe from
round-8 has multiple real footguns
Codex caught: `git add -A` stages unrelated untracked files
(`.copilot-session-*.md`, scratch notes), which end up in the
WIP commit AND in `pr-review --base`'s diff sent to external
reviewers — potential local-data leak.
Claude sub-agent caught: `git reset --mixed HEAD~1` is a
`git reset` (Hard Rules normally prohibit `git reset --hard /
--mixed`), and `HEAD~1` is positional — if HEAD moved during
the review for any reason, the reset silently drops the wrong
commit.
Both reviewers caught (independently): the recipe destroys the
user's pre-WIP staging discipline (specific staged/unstaged
split → everything unstaged after reset --mixed).
Fix: substantial rework of the dirty-work section.
(a) Flip recommendation: **snapshot comparison is now the
primary path** (no footguns — preserves staging, doesn't
touch commits, doesn't leak untracked files).
(b) WIP commit demoted to "manual-friendly alternative" with
a fully-specified safe recipe that:
- Captures pre-state (HEAD SHA, staged patch, unstaged
patch) for restoration.
- Uses explicit pathspecs (never `git add -A`).
- Verifies HEAD didn't move during review.
- Verifies the commit subject matches "wip: review
snapshot" before resetting.
- Restores staging discipline via stored patches.
(c) Hard Rules carve-out: the WIP-undo `git reset --hard` is
permitted ONLY as part of this verified-undo dance with
the shape shown. Carve-out is scoped and explicit.
(d) Three explicit DON'T warnings (stash, `add -A`, unguarded
`reset --mixed HEAD~1`) explaining why each is wrong.
4. [low, claude] ship.md partial-branch didn't cover orchestrator-
slot-unfilled
The Status contract requires `partial` whenever ANY required
slot is unfilled. The orchestrator is in the required list,
but ship.md's two partial sub-branches only covered copilot-cli
skipped and claude/codex slot unfilled. If the orchestrator
skips its own checklist pass, a literal reader maps to neither
branch and stalls.
Fix: extend partial-(b)'s parenthetical in both variants to
include "orchestrator slot was unfilled because no explicit
`{findings: []}` checklist pass was produced this round".
REJECTED:
5. [low, claude] `--effort xhigh` may not be valid for copilot-cli
— empirically refuted. Just ran `copilot --help`: choices are
`none, low, medium, high, xhigh, max`. `xhigh` IS valid. The
docs are correct. Claude sub-agent's concern about "codex
value reused for different binary without verification" was a
plausible heuristic but the value happens to be supported in
both.
… converged) Round-10 ensemble (codex + copilot + claude sub-agent + me). 4 distinct findings, all converging on the same conclusion: the WIP-commit recipe added in round 8 and reworked in round 9 has accumulated too many independent footguns to keep patching. The footgun inventory (caught across this and prior rounds): 1. [HIGH, round 10, codex + copilot 2-way] `git reset --hard` step loses untracked files included in the WIP commit. `git diff` doesn't capture them, so the restore patch can't recreate them. Concrete local data loss for new files/binary assets. 2. [medium, round 10, claude sub-agent] `git apply` of the unstaged patch is unsound when staged + unstaged changes touch overlapping hunks in the same file. The unstaged patch's context is the INDEX state, but we apply it to a worktree at pre-WIP HEAD (no staged applied yet). Hunks fuzz-fail, leave `.rej` files, silent failure if not run with `set -e`. 3. [medium, round 10, claude sub-agent] `git commit` failures (extremely common — commitlint hooks, gitleaks, gpg signing, biome, plus this very repo's own commitlint workflow which rejects `wip:` as a non-Conventional-Commits type) silently make `WIP_SHA` empty, and the downstream verify check misdiagnoses it as "HEAD moved during review". User goes hunting for non-existent HEAD movement. 4. [medium, round 10, copilot + claude + me 3-way] Fixed `/tmp/` paths collide across concurrent runs and across retry runs after a failure (second invocation overwrites recovery state from first failed run). 5. [round 9, codex + claude] `git add -A` stages unrelated untracked files which end up in the WIP commit AND in `pr-review --base`'s diff sent to external reviewers — potential local data leak. 6. [round 9, claude] `git reset --mixed HEAD~1` is positional; if HEAD moved (interrupted run, stray amend, side commit), wrong commit gets reset. 7. [round 9, claude + me 2-way] Hard Rules section forbids `git reset --hard`; recipe's "carve-out" was only inline, not in the Hard Rules section itself. 8. [round 9, me] WIP dance destroys user's pre-WIP staging discipline (carefully staged/unstaged split → everything unstaged after reset). That's 8 distinct sharp edges across 2 rounds. Each round of fixes introduces another caveat to document. Same pattern as PR #6's round-6.5 BREAKING-CHANGE-scanning removal: when a recipe needs more caveats than recipe, delete it. Fix: remove the WIP-commit alternative entirely. Replace with a fully-fleshed-out snapshot-comparison recipe that: - Uses `mktemp -d` for per-run snapshot directory (no /tmp collisions). - Captures status + unstaged diff + staged diff + untracked file hashes BEFORE the reviewer. - Captures same shape AFTER. - Diffs the four pairs; ANY difference → round invalid. - Catches same-status-but-different-content cases (e.g. `M path → M path` with different bytes) that a simple "is status empty" check would miss. - Catches added/modified untracked files via the hash list. - Preserves staging discipline. - No `git reset` calls → Hard Rules don't need a carve-out. Brief "why not WIP" explanation kept inline as a tombstone so future contributors don't reinvent the WIP recipe — enumerates the 8 footguns so the choice is documented, not just made. Net: -71 / +49 lines across both variants. Removes the most caveat-heavy section of these docs and replaces it with a single correct recipe.
…(4 reviewers)
Round-11 was the verify round for round-10's WIP-recipe deletion +
snapshot-recipe replacement. The snapshot recipe itself had a real
HIGH severity shell-injection bug — 3-way confirmation across all
three external reviewers + my self-review caught it.
**The bug**: round-10's snapshot recipe used:
```bash
git ls-files --others --exclude-standard -z \
| xargs -0 -I{} sh -c 'printf "%s %s\n" "$(git hash-object "{}")" "{}"'
```
`xargs -I{}` performs TEXTUAL substitution of `{}` into the
`sh -c` string BEFORE sh parses it. So a filename containing
`$(...)` gets evaluated as command substitution inside the
double-quoted argument.
Empirically reproduced:
```
$ touch 'evil$(echo PWNED >&2).txt'
$ git ls-files --others -z | xargs -0 -I{} sh -c 'printf "%s %s\n" "$(git hash-object "{}")" "{}"'
PWNED ← echo executed
fatal: could not open 'evil.txt' for reading
PWNED ← executed again for the second {}
```
This is especially bad because:
1. It directly contradicts the doc's load-bearing claim from
round 10 — "Use snapshot comparison — it's the only approach
with no footguns" was the justification for deleting the
WIP recipe. A footgun in the replacement undermines the
whole simplification.
2. The recipe targets dirty-tree review of `codex review
--uncommitted` flows — exactly the situations where untrusted
content is most likely to be in the worktree (extracted
archives, fetched untrusted contributor diffs, etc.).
3. Operators are told to copy this recipe verbatim into their
review-cycle implementation.
**Reviewer ratings**:
- codex: HIGH
- copilot: HIGH (x2 — one per file)
- claude sub-agent: medium (with empirical reproduction)
- me (self-review): low (correctness/edge case framing, missed
the security angle)
The two HIGH ratings + concrete reproduction are the right call.
**Fix**: replace with a null-delimited `while read` loop that
passes each filename as a bash variable (data) rather than
substituting into shell source:
```bash
git ls-files --others --exclude-standard -z \
| while IFS= read -r -d '' f; do
printf '%s %s\n' "$(git hash-object -- "$f")" "$f"
done \
| sort > "$SNAPSHOT_DIR/pre-untracked.txt"
```
Verified clean: the same `evil$(echo PWNED >&2).txt` filename
that triggered execution under the old form is now printed
literally with its hash.
Added an inline comment explaining the gotcha so future
maintainers don't "simplify" back to the xargs form. The
comment cites the exact failure shape.
Same fix in both claude/ and codex/ variants (recipe is
byte-identical across them).
Summary
Two real failures with
/have:review-cyclesurfaced in the same session, both about implicit discipline that was easy to rationalize away. This PR closes both gaps in the command file:Gap 1: I silently skipped the Copilot step
Rationalized "the Copilot bot already reviewed the PR" — but the bot only fires post-push. The whole point of pre-PR review-cycle is catching issues before the bot can. The command's wording was passive enough that this rationalization was easy.
Fix: mark the Copilot CLI step as non-optional with a one-sentence explanation of why (bot = post-push, CLI = pre-push). Updated invocation to current
copilotCLI syntax (gh copilot -- -p "..." --allow-all-tools --effort xhigh).Gap 2: I did one-shot rounds instead of actually looping
Ran each reviewer once, declared "done" after the first batch of fixes. When I caught myself and ran the cycle properly, it took 7 rounds to converge on a 37-line documentation PR — catching 9 progressively-narrower issues that would have shipped otherwise. See pr-review#3 for the per-round commits.
Fix: add explicit Hard Rules for the loop:
Gap 3: Real environment blockers undocumented
Both non-codex reviewers failed during the live run for environment reasons that weren't in the command:
claude -pfrom a parent claude session → 401 (OAuth doesn't propagate to children)gh copilot→ "Access denied by policy settings" (org Copilot policy disabled CLI use)Fix: document both with concrete remediation (policy URL to flip;
ANTHROPIC_API_KEYenv var to set). Define explicit fallback behavior: if a reviewer is unavailable, record it in the final report. Don't silently drop. If Copilot CLI is unreachable, open the PR as draft so the Copilot bot reviews before merge candidates form.Same edit in both
claude/andcodex/command filesThe command behavior doesn't differ enough on these sections to justify divergence.
Test plan
claude plugin validatepasses (no schema changes)/have:review-cycleon the next non-trivial PR and confirm all three reviewers fire pre-pushANTHROPIC_API_KEYworkaround unblocksclaude -pfrom a parent claude session