Skip to content

fix(have:review-cycle): operational discipline — Copilot non-optional, loop convergence, blocker docs#5

Merged
willgriffin merged 26 commits into
mainfrom
fix/review-cycle-copilot-required
May 23, 2026
Merged

fix(have:review-cycle): operational discipline — Copilot non-optional, loop convergence, blocker docs#5
willgriffin merged 26 commits into
mainfrom
fix/review-cycle-copilot-required

Conversation

@willgriffin
Copy link
Copy Markdown
Contributor

@willgriffin willgriffin commented May 22, 2026

Summary

Two real failures with /have:review-cycle surfaced in the same session, both about implicit discipline that was easy to rationalize away. This PR closes both gaps in the command file:

Gap 1: I silently skipped the Copilot step

Rationalized "the Copilot bot already reviewed the PR" — but the bot only fires post-push. The whole point of pre-PR review-cycle is catching issues before the bot can. The command's wording was passive enough that this rationalization was easy.

Fix: mark the Copilot CLI step as non-optional with a one-sentence explanation of why (bot = post-push, CLI = pre-push). Updated invocation to current copilot CLI syntax (gh copilot -- -p "..." --allow-all-tools --effort xhigh).

Gap 2: I did one-shot rounds instead of actually looping

Ran each reviewer once, declared "done" after the first batch of fixes. When I caught myself and ran the cycle properly, it took 7 rounds to converge on a 37-line documentation PR — catching 9 progressively-narrower issues that would have shipped otherwise. See pr-review#3 for the per-round commits.

Fix: add explicit Hard Rules for the loop:

  • Each round runs all reviewers in parallel against the SAME commit (not sequentially against each other's fixes)
  • A fix-round is never the final round — must run a verify round
  • Convergence is per-commit, not per-finding
  • Default cap stays 3 for code, 5-10 for documentation/checklist content

Gap 3: Real environment blockers undocumented

Both non-codex reviewers failed during the live run for environment reasons that weren't in the command:

  • claude -p from a parent claude session → 401 (OAuth doesn't propagate to children)
  • gh copilot → "Access denied by policy settings" (org Copilot policy disabled CLI use)

Fix: document both with concrete remediation (policy URL to flip; ANTHROPIC_API_KEY env var to set). Define explicit fallback behavior: if a reviewer is unavailable, record it in the final report. Don't silently drop. If Copilot CLI is unreachable, open the PR as draft so the Copilot bot reviews before merge candidates form.

Same edit in both claude/ and codex/ command files

The command behavior doesn't differ enough on these sections to justify divergence.

Test plan

  • claude plugin validate passes (no schema changes)
  • Real-world validation: pr-review#3 ran with the looping discipline now documented — converged at round 7, exactly as the new Hard Rules would have produced from the start
  • Once Copilot CLI policy is enabled (it now is), run /have:review-cycle on the next non-trivial PR and confirm all three reviewers fire pre-push
  • Confirm ANTHROPIC_API_KEY workaround unblocks claude -p from a parent claude session

@willgriffin willgriffin changed the title fix(have:review-cycle): make Copilot CLI step non-optional + document auth/policy blockers fix(have:review-cycle): operational discipline — Copilot non-optional, loop convergence, blocker docs May 22, 2026
@willgriffin willgriffin force-pushed the fix/review-cycle-copilot-required branch from 09e9342 to de7295d Compare May 22, 2026 19:55
@willgriffin
Copy link
Copy Markdown
Contributor Author

Added third operational fix to the same PR (commit ac332a7) — folding in the exit-on-trivia rule based on the actual data from pr-review#3's 7-round cycle.

The rule: loop exits when no P0/P1/P2 findings remain — not when every reviewer returns zero findings. P3/nit findings (polish, narrow factual edges, cosmetic placement) get recorded as accepted non-blockers in the PR body, not looped on.

Evidence: pr-review#3 took 7 rounds. Under the new rule it would have exited at round 4 (3 substantive fixes + 1 verify round), with these P3 findings recorded but not fixed in that PR:

  • Node 25/26 type-strip ambiguity (low — narrow version window)
  • Node 23.0-23.5 strip-types flag window omitted (low — 6-month version window most projects skip)
  • Initial transform-types-omission (low — codex's first finding, also narrow)

Same ship outcome, ~half the wall-clock time. The 'medium' findings (transform-types misattributed to TSX/decorators, shebang in wrong section) would still have looped because those would actually mislead reviewers using the checklist.

PR body now has explicit Final Report fields to make the discipline visible:

  • Remaining blockers (P0-P2) — what would have re-triggered the loop
  • Accepted non-blockers (P3/nit) — recorded but not fixed
  • Skipped reviewers — enforces the no-silently-drop rule

So this PR now closes three real operational gaps in one shot:

  1. Copilot CLI step non-optional (it's pre-push; the bot is post-push)
  2. Loop discipline: parallel reviewers on same commit, fix-round never the final round
  3. Exit-on-trivia: don't loop on P3/nit

All three came from the same session running review-cycle on pr-review#3 and observing what actually went wrong. Real data, not speculation.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the /have:review-cycle command documentation to enforce stricter review-loop discipline (including a non-optional Copilot CLI run) and adds release automation that generates changesets from conventional commits and publishes directly on pushes to main.

Changes:

  • Harden /have:review-cycle docs: make Copilot CLI pre-push review non-optional, document real-world tool blockers/fallbacks, and add explicit loop convergence “hard rules”.
  • Add scripts/auto-changeset.ts plus tsx dependency and scripts to auto-generate a unified changeset across packages.
  • Replace the release workflow’s PR-or-publish flow with a direct publish-on-main-push flow that auto-changesets, versions, commits, publishes, tags, and pushes back to main.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
scripts/auto-changeset.ts New CI-oriented script to generate a single changeset from conventional commits since the last release tag.
package.json Adds tsx + new changeset scripts (changeset:auto, changeset:version, changeset:publish).
pnpm-lock.yaml Locks new tsx dependency (and its transitive dependencies).
codex/plugins/have/commands/review-cycle.md Makes Copilot CLI step non-optional; adds loop hard rules and blocker/fallback documentation.
claude/have/commands/review-cycle.md Same review-cycle documentation changes as the codex command file.
.github/workflows/release.yml Changes release strategy to direct publish on push to main, including auto-changeset + versioning + commit + publish + push/tags.
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/release.yml Outdated
Comment thread .github/workflows/release.yml
Comment thread .github/workflows/release.yml Outdated
Comment thread scripts/auto-changeset.ts Outdated
@willgriffin willgriffin force-pushed the fix/review-cycle-copilot-required branch from e176380 to 15f1846 Compare May 22, 2026 20:18
willgriffin added a commit that referenced this pull request May 22, 2026
Belated but the right move — ran pr-review (codex + copilot CLI
ensemble) on PR #5 after the rebase. Two findings, both real:

1. [medium, both reviewers] The example `gh copilot` invocation
   used `--allow-all-tools`, but the surrounding guidance claimed
   the run stayed "read-only" because the prompt instructed not to
   modify files. That's a policy/enforcement mismatch — prompt
   instructions are advisory, tool permissions are the actual
   enforcement. With write/edit tools available, a "review" pass
   can mutate the working tree mid-round, breaking the same-commit
   guarantee the loop relies on.

   Fixed by switching the example to an explicit read-only tool set
   (`--available-tools shell,read`) with a note that the prompt
   instruction is defense-in-depth. Also flagged that the CLI is
   preview-stage and tool names shift, so consumers should verify
   against `gh copilot -- --help` for their version.

2. [low, copilot only] P2 policy was internally inconsistent:
   step 3 said "block by default; loop unless explicitly accepted
   with rationale in the PR body", but step 6 said "Address all
   valid P0/P1/P2 findings in priority order" (no acceptance path).
   Final Report template only had a slot for "Accepted non-blockers
   (P3/nit)", no slot for accepted P2.

   Fixed by aligning step 6 ("Address all P0/P1 mandatory; P2
   mandatory unless explicitly accepted...") and adding "Accepted
   P2 (with rationale)" to the Final Report template.

Worth noting the asymmetric blind-spot pattern showed up again:
- Codex caught the substantive medium (read-only enforcement).
- Copilot caught both the same medium AND a low that codex missed
  (P2 cross-section consistency). The placement/contradiction
  pattern keeps being copilot's unique signal.

Mirror edits in both claude/ and codex/ command files.
willgriffin added a commit that referenced this pull request May 22, 2026
Round 3 of pr-review on PR #5 found three real contract issues codex
caught (copilot's output truncated mid-exploration this round):

1. [medium] Status `clean` definition omitted validation. So failed
   validation + reviewers clean → Status=clean → /ship gates on
   Status=clean and proceeds. Added validation to the contract:
   clean now requires "validation green"; failed validation maps
   to `blocked`.

2. [medium] "Record in PR body" / "open as draft PR" language
   assumed a PR already exists. But /review-cycle runs BEFORE PR
   creation when invoked from /ship — there's no PR body to write
   to. Reframed: /review-cycle records all decisions in its FINAL
   REPORT (canonical record); /ship is responsible for copying
   the report into the PR body and deciding the draft-PR fallback.
   When a PR already exists, the report still gets the canonical
   record; updating the PR body is then optional/derivative.

3. [low] Per-commit convergence rule said "clean on X ≠ clean on Y,
   re-run all reviewers" — but the P3 rule said "P3 cheap fix → no
   verify round needed." Contradiction: a P3-only commit would
   trigger the per-commit rule, forcing another full ensemble pass
   for a one-line wording tweak. Scoped per-commit rule to
   behaviour-changing commits (P0/P1/P2 fixes or non-fix changes);
   explicitly exempted P3-only commits.

Mirror edits in both claude/ and codex/ command files.

Running round 4 to verify convergence on the workflow contract.

Worth noting the round counts: 3 substantive medium-fix rounds plus
this commit. Each round caught real workflow-contract issues, not
trivia — the loop is doing exactly what it's supposed to. The fact
that we keep finding new contract bugs is itself evidence that the
spec is dense enough to warrant this many rounds of refinement.
willgriffin added a commit that referenced this pull request May 22, 2026
…andles partial

Round 4 of pr-review on PR #5: codex caught two more medium contract
drifts, both introduced by round-3's fixes. Copilot's output truncated
mid-exploration again this round; treating as one substantive reviewer.

1. [medium] Stop condition contradicted P2 acceptance. Step 11 said
   "stop when no P0/P1/P2 findings remain" — but reviewers can keep
   surfacing an accepted P2 (they have no way to know it was
   accepted in the final report). So an accepted P2 would prevent
   reaching the documented clean state, the loop would hit the cap,
   and the run would look blocked despite the acceptance being on
   file. Changed stop condition to "no *unaccepted* P0/P1/P2
   findings remain" with an explanatory sentence about why
   reviewers may keep surfacing accepted items.

2. [medium] Round-3 fix delegated the skipped-Copilot fallback to
   `/ship` ("downstream decides whether to open as draft") but
   didn't update `ship.md` to actually handle that case. `/ship`'s
   gate previously only branched on clean / partial-with-noise /
   blocked — a `partial` from a skipped required reviewer matched
   none of those, leaving the agent without an instruction.

   Expanded the gate to three explicit `partial` sub-cases:
   - partial because Copilot CLI was skipped → draft PR + bot
     review + rerun review-cycle + `gh pr ready` when clean
   - partial because another required reviewer (codex, claude-sub)
     was skipped → draft PR + flag for human acceptance
   - partial with only accepted P2/non-blockers → continue, copy
     acceptances from final report into PR body

   Same edit in both claude/have/commands/ship.md and
   codex/plugins/have/commands/ship.md.

This expands the PR's scope to touch ship.md, but the changes are
logically coupled to the review-cycle contract — the cycle correctly
caught that I'd written a delegation without updating the delegate.
Running round 5 to verify.
willgriffin added a commit that referenced this pull request May 22, 2026
allowlist + ship propagation

Two reviewers caught three real findings on the round-4 commit.
The most embarrassing one was self-inflicted: I had two Copilot CLI
session transcripts (.deny-test.jsonl, .revparse-test.jsonl) leak
into the PR via `git add -A`. They were written by Copilot's own
review probes (testing tool-permission flags) and got picked up
along with my real changes. They leak local /Users/will/.agents/
paths, skill descriptions, session IDs.

Copilot caught its OWN debug artifact in PR #5 and used `.deny-test
.jsonl:30` (showing `Permission to run this tool was denied ...
shell(git rev-parse)`) as concrete evidence for a *separate*
finding — that the documented allowlist was missing `git rev-parse`.
The meta-loop: artifact A introduced a privacy leak (caught), and
also produced empirical evidence for content gap B (also caught).

Three fixes:

1. Removed `.deny-test.jsonl` and `.revparse-test.jsonl` from the
   tree and the index. Added `*.jsonl` to .gitignore so future
   review-probe runs won't repeat the mistake.

2. Added `shell(git rev-parse)` to both review-cycle command file
   examples. Copilot's empirical evidence proved the omission
   causes real review failures.

3. /ship now always copies `Accepted P2`, `Accepted non-blockers`,
   and `Skipped reviewers` from the review-cycle report into the
   PR body, regardless of whether the gate returned clean or
   partial. Codex correctly pointed out that under the new status
   contract, accepted-P2 produces `clean` (all P2 fixed-or-accepted
   → clean), so gating the propagation on `partial` would silently
   drop the audit trail. Same edit in both ship.md files.

The broader pattern from this session: I keep using `git add -A`
without reviewing what's actually being staged. That's how PR #5
originally picked up PR #4's commits (bad rebase) and how this
round's debug transcripts leaked. Worth adding to the pr-review
checklist as another infra-hazard bullet: "Files in PR diff that
aren't named in commit messages — sign of git add -A picking up
debug artifacts."

Running round 6 to verify convergence.
willgriffin added a commit that referenced this pull request May 22, 2026
Round 6 found four real issues from two reviewers. The most important
one is a security finding from copilot: the documented Copilot
allowlist permits `shell(rg)`, `shell(cat)`, `shell(head)` without
path scoping, leaving a prompt-injection exfiltration path even
though writes are blocked. A crafted commit subject could instruct
the reviewer to read $HOME secrets.

Four fixes:

1. [medium, security] Added `--add-dir "$(git rev-parse --show-toplevel)"`
   to the Copilot CLI example. Confines file-access surface to the
   repo. Combined with the per-command allowlist, eliminates the
   exfiltration path even under prompt injection. Documented the
   reasoning explicitly so future maintainers don't remove --add-dir
   thinking the allowlist alone is sufficient.

2. [medium] /ship's Copilot-skipped fallback had a logic gap: rerun
   review-cycle after bot review still returns partial (CLI is still
   blocked), so it can't clear status. Reworked the clearance path:
   bot review + no unaddressed findings + human marks ready =
   substitute acceptance. The human's `gh pr ready` invocation IS
   the explicit signal. Document substitution in PR body.

3. [medium] Removed the "Partial with only accepted P2 / accepted
   non-blockers" sub-branch in /ship — unreachable under the current
   status contract (accepted P2 → clean per the validation rule).
   The propagation of those fields is now unconditional (added in
   round 5), so the unreachable branch is also redundant.

4. [medium] "Commit and PR" section was still gated by "When
   validation and /review-cycle are clean," contradicting the
   draft-on-partial logic above. Reworded to "when the Review Cycle
   Gate has been satisfied," with explicit guidance that draft vs
   ready follows the gate's branch.

5. [low] Narrowed gitignore from `*.jsonl` (overbroad — hides
   legitimate fixtures) to specific probe-filename patterns
   (`*-test.jsonl`, `.copilot-session-*.jsonl`,
   `copilot-session-*.jsonl`). Plus a comment pointing to the
   actual rule: review outputs go to /tmp, not the repo.

Running round 7 to verify convergence.

This is round 6 of looping on PR #5. 17 substantive findings caught
across the 6 rounds, all valid. The cycle is doing what it should
even if convergence is taking longer than I expected.
willgriffin added a commit that referenced this pull request May 22, 2026
… gitignore

Round 7 of pr-review on PR #5. Codex caught two more real issues
(copilot's output truncated mid-verification this round, same
operational pattern as rounds 3-6 — extensive empirical testing,
no Findings section emitted).

1. [medium, security] The round-6 `--add-dir <repo>` "fix" was
   incomplete. Per Copilot CLI's permissions help, the default
   allowed-paths set includes the system temp dir. `--add-dir`
   adds dirs, it doesn't restrict them. So /tmp was still
   reachable — a prompt-injected diff could still ask Copilot to
   read /tmp/secrets and quote them into findings.

   Fixed with a three-part path lock:
   - `-C "$REPO_ROOT"` sets cwd explicitly (default cwd is
     whatever shell invoked, not necessarily the repo)
   - `--add-dir "$REPO_ROOT"` adds the repo
   - `--disallow-temp-dir` removes /tmp from the default set

   Documented why all three flags are needed so future maintainers
   don't remove one thinking the others are sufficient.

2. [low] `*-test.jsonl` pattern in .gitignore was still overbroad —
   would silently hide e.g. `fixtures/payment-test.jsonl` from
   `git status` / `git add`. Narrowed to just the
   `.copilot-session-*.jsonl` / `copilot-session-*.jsonl` prefixes.
   Comment now explicitly says the real rule is "review outputs
   go to /tmp" — gitignore is defense-in-depth, not the primary
   prevention.

This is round 7 of the cycle. The security finding was
iterative — round 6 added --add-dir thinking that was the fix;
round 7 caught that --add-dir alone wasn't enough; the three-part
lock should be complete. Running round 8 to verify.

Cycle metadata so far:
- 19 substantive findings caught across 7 rounds
- All valid, all addressed
- Copilot's output truncation in rounds 3+ suggests its
  empirical-verification approach exceeds output limits in
  longer reviews; codex remains reliable
- The security-scoping concern took 2 iterations to fully nail
  (--add-dir then --disallow-temp-dir + -C) — exactly the
  "fix can introduce or expose the next finding" pattern the
  loop is designed to catch
…/policy blockers

Real failure mode just hit: I ran review-cycle on pr-review#3 from
inside a Claude Code session and silently skipped the Copilot step,
rationalising "the Copilot bot already reviewed it." That defeats
the whole pre-push purpose — the bot only fires AFTER the PR opens.

Both reviewer subprocesses other than codex failed in the same run:
- `claude -p` from a parent claude session: 401 (OAuth doesn't
  propagate to children).
- `gh copilot`: "Access denied by policy settings" (org Copilot
  policy disabling CLI use).

The command file already specified running gh copilot, but the
language was passive enough that I rationalised skipping it. This
edit:

1. Marks the Copilot CLI step as non-optional with a one-sentence
   explanation of why (bot is post-push, CLI is pre-push, mixing
   them up defeats the catch-before-push purpose).
2. Updates the example invocation to the current `copilot` CLI
   syntax (`gh copilot -- -p "..." --allow-all-tools --effort xhigh`),
   replacing the older `--allow-tool 'shell(git)'` form that
   apparently doesn't apply to the v1+ Copilot CLI.
3. Documents the two known blockers (Copilot CLI policy, claude -p
   401) with concrete remediation steps — the policy URL to flip,
   the env var to set.
4. Defines fallback behaviour: if a reviewer is unavailable, record
   it in the final report explicitly. Do not silently drop. If
   Copilot CLI is unavailable, fall back to draft PR + Copilot bot
   review before marking ready.

Mirror edit in both the claude/ and codex/ command files. The
command surfaces don't differ enough on this section to justify
divergent guidance.
Real failure: I ran review-cycle on pr-review#3 and "stopped" after
each reviewer's first pass instead of actually looping. When I caught
myself and ran it properly, the cycle took 7 rounds to converge —
catching 9 progressively-narrower factual issues that would have
shipped if I'd stopped early.

The command file already said "Run up to `rounds` review rounds.
Default: 3" and "stop the loop as clean" when no findings remain,
but the wording was loose enough that I rationalized one-shot
behaviour. This commit adds explicit Hard Rules that close that
gap:

1. Each round runs all reviewers in parallel against the SAME
   commit (not sequentially against each other's fixes — that lets
   findings cascade in misleading ways and obscures whether
   reviewers actually agree on the latest state).

2. A fix-round is NEVER the final round. Convergence requires at
   least one round where every reviewer returns 0 findings against
   the latest commit. Just pushed a fix? Run another round before
   declaring clean.

3. Convergence is per-commit, not per-finding. Reviewer A clean
   against commit X doesn't transfer to commit Y (the fix commit).

Also updated:

- Default cap guidance: 3 is right for code; 5-10 for documentation /
  reviewer-checklist content where each round catches narrower
  factual edges (the pr-review#3 cycle was 7 rounds).
- Step 10 now explicit: "If a fix was pushed in this round, the next
  round MUST run."
- Step 11 explicit: "Stop as clean only when a verify round (no
  edits) returns no actionable findings."
- Cap-hit guidance distinguishes three cases: spec too detailed,
  diminishing returns acceptable, genuine gap.

Mirror edit in both claude/ and codex/ command files.

Evidence: see pr-review#3 (happyvertical/pr-review#3)
for the 7-round convergence log with per-round commits, findings
counts, and the asymmetric convergence pattern between codex (catches
narrow factual edges via deep verification) and copilot CLI (catches
structural/placement issues via live web fetch + cross-file grep).
Real signal from running review-cycle on pr-review#3: the loop went
7 rounds, but rounds 5 and arguably 1+3 had only "low" severity
findings (narrow Node version-window edge cases). Looping on those
was technical perfectionism — they didn't change what shipped, just
burnt reviewer cycles.

This commit adds an explicit exit-on-trivia rule, paired with the
existing exit-on-substantive-clean rule:

- P0/P1 (correctness, security, data loss, broken build): always
  block, always loop.
- P2 (likely bug, missing test/docs): block by default, loop unless
  accepted in PR body with rationale.
- **P3 / nit (polish, narrow factual edges, cosmetic): never block,
  never extend the loop.** Record as accepted non-blockers in the
  PR body or file as follow-up issues.

So the loop exit condition becomes "no P0/P1/P2 findings remain"
instead of "every reviewer returns zero findings." Convergence is
about substantive risk, not perfect agreement.

Re-running the pr-review#3 cycle under this rule: would have exited
at round 4 (3 substantive fixes + 1 verify round) instead of round
7. The 3 low-severity findings (Node 25/26 ambiguity, Node 23.0-23.5
window, transform-types-omission narrow case) would be listed in
the PR body as accepted non-blockers. Same ship outcome, ~half the
wall-clock time.

Updated the Final Report template to make the new fields visible:
- "Remaining blockers (P0-P2)" — what would have re-triggered the loop
- "Accepted non-blockers (P3/nit)" — recorded but not fixed in this PR
- "Skipped reviewers" — to enforce the existing "never silently drop"
  discipline

Mirror edit in both claude/ and codex/ command files.
Previous wording on the exit-on-trivia rule implied P3 findings
should be deferred ("record in PR body as accepted non-blockers, or
file as follow-up issues"). That's too restrictive — most P3 fixes
are one-line tweaks (rewording a doc bullet, fixing a comment
typo, etc.) and the right answer is to just fix them inline.

The rule's actual intent is: don't extend the LOOP on P3 — don't
run another full ensemble round just to verify a P3 fix. Whether
to address the P3 finding itself is a separate question with three
answers depending on cost:

1. Cheap to fix → fix inline, group with any other fixes
2. Worth tracking but not blocking → record in PR body
3. Bigger than this PR's scope → follow-up issue

The loop exit cares about "no P0/P1/P2 remaining," not about
whether P3 fixes happened.

Mirror edit in both claude/ and codex/ command files.
Belated but the right move — ran pr-review (codex + copilot CLI
ensemble) on PR #5 after the rebase. Two findings, both real:

1. [medium, both reviewers] The example `gh copilot` invocation
   used `--allow-all-tools`, but the surrounding guidance claimed
   the run stayed "read-only" because the prompt instructed not to
   modify files. That's a policy/enforcement mismatch — prompt
   instructions are advisory, tool permissions are the actual
   enforcement. With write/edit tools available, a "review" pass
   can mutate the working tree mid-round, breaking the same-commit
   guarantee the loop relies on.

   Fixed by switching the example to an explicit read-only tool set
   (`--available-tools shell,read`) with a note that the prompt
   instruction is defense-in-depth. Also flagged that the CLI is
   preview-stage and tool names shift, so consumers should verify
   against `gh copilot -- --help` for their version.

2. [low, copilot only] P2 policy was internally inconsistent:
   step 3 said "block by default; loop unless explicitly accepted
   with rationale in the PR body", but step 6 said "Address all
   valid P0/P1/P2 findings in priority order" (no acceptance path).
   Final Report template only had a slot for "Accepted non-blockers
   (P3/nit)", no slot for accepted P2.

   Fixed by aligning step 6 ("Address all P0/P1 mandatory; P2
   mandatory unless explicitly accepted...") and adding "Accepted
   P2 (with rationale)" to the Final Report template.

Worth noting the asymmetric blind-spot pattern showed up again:
- Codex caught the substantive medium (read-only enforcement).
- Copilot caught both the same medium AND a low that codex missed
  (P2 cross-section consistency). The placement/contradiction
  pattern keeps being copilot's unique signal.

Mirror edits in both claude/ and codex/ command files.
…atus contract

Round 2 of pr-review on this PR found two more medium issues, both
valid and both newly introduced by round 1's fixes. Exactly the
"fix-round is never the final round" failure mode this PR codifies.

Copilot literally EXECUTED CLI commands during its review to verify
the first finding empirically — strongest convergence signal yet.

1. Round-1 fix replaced `--allow-all-tools` with `--available-tools
   shell,read`. That's wrong: `--available-tools` only filters which
   tools the model can SEE, not which it can run without approval.
   In non-interactive `-p` mode there's no UI to ask for permission,
   so tool calls get denied with "Permission denied and could not
   request permission from user." Review then runs with zero
   repository context.

   Copilot verified this by running `gh copilot -- -p "Run shell
   command: git diff --name-only --stat" --available-tools shell
   --effort low -s` and getting the permission-denied error. Codex
   independently caught it by reading the CLI's permissions docs.

   Correct shape: explicit per-command `--allow-tool 'shell(git
   diff)'`, `--allow-tool 'shell(git log)'`, etc. flags. Enforces
   read-only at the permission layer with per-command granularity.
   Added the example pattern + `gh copilot -- help permissions`
   reference for keeping current.

2. The "Skipped reviewers" rule said to record skips but didn't
   require Status to drop. So Copilot policy-blocked + codex+claude
   clean → final report "Status: clean, Skipped: Copilot" → /ship
   gates on Status==clean and proceeds. Recreates the soft-skip
   this PR exists to prevent.

   Fixed by making Status's contract explicit: clean REQUIRES all
   required reviewers ran. Any required reviewer skipped → Status
   is at minimum "partial." The Status enum description is now
   inlined in the final report template so the rule is visible
   wherever a consumer looks. Copilot CLI specifically gets the
   "open as draft for bot review" fallback explicit, not just
   "consider."

Mirror edits in both claude/ and codex/ command files.

Running round 3 to verify convergence.
Round 3 of pr-review on PR #5 found three real contract issues codex
caught (copilot's output truncated mid-exploration this round):

1. [medium] Status `clean` definition omitted validation. So failed
   validation + reviewers clean → Status=clean → /ship gates on
   Status=clean and proceeds. Added validation to the contract:
   clean now requires "validation green"; failed validation maps
   to `blocked`.

2. [medium] "Record in PR body" / "open as draft PR" language
   assumed a PR already exists. But /review-cycle runs BEFORE PR
   creation when invoked from /ship — there's no PR body to write
   to. Reframed: /review-cycle records all decisions in its FINAL
   REPORT (canonical record); /ship is responsible for copying
   the report into the PR body and deciding the draft-PR fallback.
   When a PR already exists, the report still gets the canonical
   record; updating the PR body is then optional/derivative.

3. [low] Per-commit convergence rule said "clean on X ≠ clean on Y,
   re-run all reviewers" — but the P3 rule said "P3 cheap fix → no
   verify round needed." Contradiction: a P3-only commit would
   trigger the per-commit rule, forcing another full ensemble pass
   for a one-line wording tweak. Scoped per-commit rule to
   behaviour-changing commits (P0/P1/P2 fixes or non-fix changes);
   explicitly exempted P3-only commits.

Mirror edits in both claude/ and codex/ command files.

Running round 4 to verify convergence on the workflow contract.

Worth noting the round counts: 3 substantive medium-fix rounds plus
this commit. Each round caught real workflow-contract issues, not
trivia — the loop is doing exactly what it's supposed to. The fact
that we keep finding new contract bugs is itself evidence that the
spec is dense enough to warrant this many rounds of refinement.
…andles partial

Round 4 of pr-review on PR #5: codex caught two more medium contract
drifts, both introduced by round-3's fixes. Copilot's output truncated
mid-exploration again this round; treating as one substantive reviewer.

1. [medium] Stop condition contradicted P2 acceptance. Step 11 said
   "stop when no P0/P1/P2 findings remain" — but reviewers can keep
   surfacing an accepted P2 (they have no way to know it was
   accepted in the final report). So an accepted P2 would prevent
   reaching the documented clean state, the loop would hit the cap,
   and the run would look blocked despite the acceptance being on
   file. Changed stop condition to "no *unaccepted* P0/P1/P2
   findings remain" with an explanatory sentence about why
   reviewers may keep surfacing accepted items.

2. [medium] Round-3 fix delegated the skipped-Copilot fallback to
   `/ship` ("downstream decides whether to open as draft") but
   didn't update `ship.md` to actually handle that case. `/ship`'s
   gate previously only branched on clean / partial-with-noise /
   blocked — a `partial` from a skipped required reviewer matched
   none of those, leaving the agent without an instruction.

   Expanded the gate to three explicit `partial` sub-cases:
   - partial because Copilot CLI was skipped → draft PR + bot
     review + rerun review-cycle + `gh pr ready` when clean
   - partial because another required reviewer (codex, claude-sub)
     was skipped → draft PR + flag for human acceptance
   - partial with only accepted P2/non-blockers → continue, copy
     acceptances from final report into PR body

   Same edit in both claude/have/commands/ship.md and
   codex/plugins/have/commands/ship.md.

This expands the PR's scope to touch ship.md, but the changes are
logically coupled to the review-cycle contract — the cycle correctly
caught that I'd written a delegation without updating the delegate.
Running round 5 to verify.
allowlist + ship propagation

Two reviewers caught three real findings on the round-4 commit.
The most embarrassing one was self-inflicted: I had two Copilot CLI
session transcripts (.deny-test.jsonl, .revparse-test.jsonl) leak
into the PR via `git add -A`. They were written by Copilot's own
review probes (testing tool-permission flags) and got picked up
along with my real changes. They leak local /Users/will/.agents/
paths, skill descriptions, session IDs.

Copilot caught its OWN debug artifact in PR #5 and used `.deny-test
.jsonl:30` (showing `Permission to run this tool was denied ...
shell(git rev-parse)`) as concrete evidence for a *separate*
finding — that the documented allowlist was missing `git rev-parse`.
The meta-loop: artifact A introduced a privacy leak (caught), and
also produced empirical evidence for content gap B (also caught).

Three fixes:

1. Removed `.deny-test.jsonl` and `.revparse-test.jsonl` from the
   tree and the index. Added `*.jsonl` to .gitignore so future
   review-probe runs won't repeat the mistake.

2. Added `shell(git rev-parse)` to both review-cycle command file
   examples. Copilot's empirical evidence proved the omission
   causes real review failures.

3. /ship now always copies `Accepted P2`, `Accepted non-blockers`,
   and `Skipped reviewers` from the review-cycle report into the
   PR body, regardless of whether the gate returned clean or
   partial. Codex correctly pointed out that under the new status
   contract, accepted-P2 produces `clean` (all P2 fixed-or-accepted
   → clean), so gating the propagation on `partial` would silently
   drop the audit trail. Same edit in both ship.md files.

The broader pattern from this session: I keep using `git add -A`
without reviewing what's actually being staged. That's how PR #5
originally picked up PR #4's commits (bad rebase) and how this
round's debug transcripts leaked. Worth adding to the pr-review
checklist as another infra-hazard bullet: "Files in PR diff that
aren't named in commit messages — sign of git add -A picking up
debug artifacts."

Running round 6 to verify convergence.
Round 6 found four real issues from two reviewers. The most important
one is a security finding from copilot: the documented Copilot
allowlist permits `shell(rg)`, `shell(cat)`, `shell(head)` without
path scoping, leaving a prompt-injection exfiltration path even
though writes are blocked. A crafted commit subject could instruct
the reviewer to read $HOME secrets.

Four fixes:

1. [medium, security] Added `--add-dir "$(git rev-parse --show-toplevel)"`
   to the Copilot CLI example. Confines file-access surface to the
   repo. Combined with the per-command allowlist, eliminates the
   exfiltration path even under prompt injection. Documented the
   reasoning explicitly so future maintainers don't remove --add-dir
   thinking the allowlist alone is sufficient.

2. [medium] /ship's Copilot-skipped fallback had a logic gap: rerun
   review-cycle after bot review still returns partial (CLI is still
   blocked), so it can't clear status. Reworked the clearance path:
   bot review + no unaddressed findings + human marks ready =
   substitute acceptance. The human's `gh pr ready` invocation IS
   the explicit signal. Document substitution in PR body.

3. [medium] Removed the "Partial with only accepted P2 / accepted
   non-blockers" sub-branch in /ship — unreachable under the current
   status contract (accepted P2 → clean per the validation rule).
   The propagation of those fields is now unconditional (added in
   round 5), so the unreachable branch is also redundant.

4. [medium] "Commit and PR" section was still gated by "When
   validation and /review-cycle are clean," contradicting the
   draft-on-partial logic above. Reworded to "when the Review Cycle
   Gate has been satisfied," with explicit guidance that draft vs
   ready follows the gate's branch.

5. [low] Narrowed gitignore from `*.jsonl` (overbroad — hides
   legitimate fixtures) to specific probe-filename patterns
   (`*-test.jsonl`, `.copilot-session-*.jsonl`,
   `copilot-session-*.jsonl`). Plus a comment pointing to the
   actual rule: review outputs go to /tmp, not the repo.

Running round 7 to verify convergence.

This is round 6 of looping on PR #5. 17 substantive findings caught
across the 6 rounds, all valid. The cycle is doing what it should
even if convergence is taking longer than I expected.
… gitignore

Round 7 of pr-review on PR #5. Codex caught two more real issues
(copilot's output truncated mid-verification this round, same
operational pattern as rounds 3-6 — extensive empirical testing,
no Findings section emitted).

1. [medium, security] The round-6 `--add-dir <repo>` "fix" was
   incomplete. Per Copilot CLI's permissions help, the default
   allowed-paths set includes the system temp dir. `--add-dir`
   adds dirs, it doesn't restrict them. So /tmp was still
   reachable — a prompt-injected diff could still ask Copilot to
   read /tmp/secrets and quote them into findings.

   Fixed with a three-part path lock:
   - `-C "$REPO_ROOT"` sets cwd explicitly (default cwd is
     whatever shell invoked, not necessarily the repo)
   - `--add-dir "$REPO_ROOT"` adds the repo
   - `--disallow-temp-dir` removes /tmp from the default set

   Documented why all three flags are needed so future maintainers
   don't remove one thinking the others are sufficient.

2. [low] `*-test.jsonl` pattern in .gitignore was still overbroad —
   would silently hide e.g. `fixtures/payment-test.jsonl` from
   `git status` / `git add`. Narrowed to just the
   `.copilot-session-*.jsonl` / `copilot-session-*.jsonl` prefixes.
   Comment now explicitly says the real rule is "review outputs
   go to /tmp" — gitignore is defense-in-depth, not the primary
   prevention.

This is round 7 of the cycle. The security finding was
iterative — round 6 added --add-dir thinking that was the fix;
round 7 caught that --add-dir alone wasn't enough; the three-part
lock should be complete. Running round 8 to verify.

Cycle metadata so far:
- 19 substantive findings caught across 7 rounds
- All valid, all addressed
- Copilot's output truncation in rounds 3+ suggests its
  empirical-verification approach exceeds output limits in
  longer reviews; codex remains reliable
- The security-scoping concern took 2 iterations to fully nail
  (--add-dir then --disallow-temp-dir + -C) — exactly the
  "fix can introduce or expose the next finding" pattern the
  loop is designed to catch
Rounds 6-8 ratcheted up the Copilot allowlist docs (--add-dir,
--disallow-temp-dir, -C, and a round-8 "HIGH severity" claim about
prompt-injection exfiltration). Will pointed out the underlying
mistake: codex was applying a remote-service threat model to a
local CLI tool. Copilot CLI runs on the user's machine with the
user's credentials; it can already see what the user can see.
"Exfiltration" only matters when findings reach a third party,
which they don't in the normal use case (engineer reviewing own
org's PR pre-push, findings going to own terminal).

I should have caught this. My own pr-review checklist says
"Findings are evidence, not orders — verify each finding against
the code before fixing." I just acted on the HIGH severity tag
without questioning the threat model. Exact failure mode the
checklist warns against.

This commit walks back the elaborate security framing:

- Kept the actual flags (-C, --add-dir, --disallow-temp-dir,
  per-command --allow-tool). They're still useful — for **scope
  hygiene** (keep the reviewer focused on the repo, prevent
  accidental noise from /tmp files), not security.

- Replaced the rounds-6-7 "Why the path-scoping is a three-part
  lock... no exfiltration even under prompt injection" prose with
  a simpler "this is scope hygiene, not a security boundary"
  paragraph.

- Added a "When stricter sandboxing actually matters" callout
  covering the real edge cases where the threat model holds:
  OSS PR review with auto-posted findings, CI runs with restricted
  secrets, untrusted contributor diffs. For those, a sanitized
  temp checkout is the right architecture.

- For the normal HappyVertical case, the existing flags are
  enough. No architectural change needed.

Mirror edit in both claude/ and codex/ command files.

Lesson worth folding back: the pr-review checklist could use a
bullet about "match the threat model to the deployment" or
similar, so I (and any future reviewer using the checklist)
catch the conflation before applying it. Will think about how
to phrase it.
Per Will's request: use the "-cli" suffix consistently when
referring to the local CLI tools as concepts, to keep the
distinction between "Copilot bot" (cloud service that reviews
post-PR) and "copilot-cli" (the local CLI we invoke pre-push)
visible everywhere.

Same pattern for the other two:
- "Codex" / "codex" (as tool reference) → "codex-cli"
- "Claude" (as reviewer-name in lists) → "claude-cli"
- "Copilot" / "Copilot CLI" → "copilot-cli"

PRESERVED (these aren't the CLI tool):
- "Claude Code" — the product name / agent identity
- "Copilot bot" / "Copilot PR review *bot*" — the cloud bot
- "Copilot in Chrome" — the browser extension
- "Copilot policies page" — the org admin setting
- Branch prefixes `claude/<task-name>` / `codex/<task-name>`
- Filesystem paths `/Users/will/.claude/`, `.claude-plugin`, etc.
- Literal invocations `claude -p`, `codex exec`, `gh copilot`
- "Claude Bash tool" — refers to Claude Code's internal tool

Done across all four command files:
- claude/have/commands/review-cycle.md
- codex/plugins/have/commands/review-cycle.md
- claude/have/commands/ship.md
- codex/plugins/have/commands/ship.md

Mechanical rename via Python with preserve-then-substitute pattern;
the first pass produced "codex-cli-cli" doubling which I fixed
with a follow-up pass; final hand-fix for "claude-cli Code" →
"Claude Code" where the standalone-Claude regex collided with the
Claude-Code product name.
…msg in error output

PR #5's commits use `fix(review-cycle,ship):` (multi-scope, one
edit landing in both slash command surfaces). The old regex
`(\([a-z0-9-]+\))?` rejected the comma, blocking the PR from
merging despite the commits being well-formed Conventional Commits.

Relax to `[a-z0-9][a-z0-9,/-]*`:
- First char must be alphanumeric — prevents stray punctuation like
  `(,foo)` or `(-foo)`.
- Allows comma for multi-scope commits.
- Allows forward slash for dep-name scopes like
  `chore(tibdex/github-app-token):`.
- Still rejects uppercase, colons in scope, leading punctuation.
- Verified against 9 synthetic cases (5 pass, 4 fail as expected)
  AND all 13 PR #5 commits (all pass).

Also fix the existing `echo "::error::Invalid commit message: $msg"`
to escape `%`, CR, LF before embedding the user-controlled commit
message in a workflow command. This is the same lesson from PR #6
round 14 / round 4: GitHub workflow commands parse `%`, `%0D`, and
`%0A` as their respective bytes, so an attacker (or just an
inconvenient commit message) containing those bytes can corrupt the
command payload or inject additional workflow commands. Extract
escape logic into `escape_wc()` for reuse.

Also switched `echo "$msg" | grep` to `printf '%s' "$msg" | grep` —
`echo` can interpret leading `-n`/`-e`/`-E` as flags in some shells,
so a commit subject starting with one of those would silently
produce wrong output. `printf '%s'` is unconditional.
Fold in the meta-lesson from have-config#6's loop. That PR ran
codex-cli solo for 12 rounds because the ensemble step in this
command was described but not enforced. At round 13, codex returned
0 findings; I declared convergence. When copilot-cli was added for
round 14, it immediately surfaced two real findings that codex had
been blind to across all 12 rounds:

1. The step-0 partial-release detector matched ANY commit subject
   starting with `chore(release):` — would deadlock on a normal
   human dependency-bump commit like `chore(release): bump
   pnpm/action-setup`. Codex missed; copilot caught.
2. A SIGPIPE/pipefail interaction in the same detector would
   silently bypass it entirely under non-trivial conditions.
   Codex missed; copilot caught.

Both were the kind of failure-path edge case the ensemble exists to
catch. Solo runs converge on what one reviewer's prior knowledge
covers; ensemble convergence covers the union.

Add a Hard Rule covering two specific failure modes:
- *Silent solo*: declaring convergence on one reviewer's 0 findings
  because the others are "redundant" / "slow" / "already passed
  last round". They're not redundant — they have non-overlapping
  blind spots.
- *Unavailable ≠ clean*: if a reviewer is blocked by auth, policy,
  or environment, the absence of findings is not affirmative
  clearance. Record explicitly; don't count silence as agreement.

Mirror edit in both `claude/` and `codex/` review-cycle.md.
@willgriffin willgriffin force-pushed the fix/review-cycle-copilot-required branch from 1a14339 to d9db0cf Compare May 23, 2026 05:19
Ensemble: codex-cli + copilot-cli, 3 distinct findings, no overlap.
All real, all high-confidence.

1. [medium, codex] Final report example omits the required claude-cli
   reviewer

   The new contract requires three reviewer subprocesses: codex-cli,
   claude-cli (separate non-interactive print-mode invocation), and
   copilot-cli. The example string was "3 rounds: codex-cli +
   copilot-cli + me" — implying the parent agent counts as the
   claude-cli reviewer. That contradicts the Hard Rule that claude-cli
   must run as an independent subprocess. An agent following the
   example could report Status: clean with claude-cli never having
   run, and `/ship` would gate-pass on a false-converged report.

   Fix: example string lists all three required reviewers; added a
   clarification that the orchestrator agent's inline opinion does
   not substitute for any subprocess reviewer.

2. [medium, codex] Round-cap `blocked` definition contradicts P3
   non-blocking rule

   Status definition said `blocked = ... cap hit with findings open`
   without scoping to severity. The Hard Rules say P3/nit findings
   never block and never extend the loop. So if the loop hit the
   cap with only P3 findings remaining (the exact "loop converging
   on trivia" pattern these rules are designed to exit cleanly),
   the status definition would force `blocked` and `/ship` would
   refuse to open a ready PR.

   Fix: explicit carve-out that round-cap exit with ONLY P3/nit
   findings remaining is NOT blocked — those go in accepted
   non-blockers and Status stays clean (or partial if a required
   reviewer was skipped).

3. [low, copilot] Scope regex accepted malformed multi-scope strings

   The first cut at the scope regex (`[a-z0-9][a-z0-9,/-]*`) was
   too permissive — it let separators appear in arbitrary positions:
   `fix(a,):`, `fix(a/):`, `fix(a,,b):`, `fix(a-):`, `fix(,b):`,
   `fix(/foo):` all passed. Downstream tooling that assumes
   comma-separated non-empty scope tokens would break.

   Fix: tokenize the grammar — `scope = segment ("," segment)*`,
   `segment = [a-z0-9] ([a-z0-9/-]* [a-z0-9])?`. Each segment must
   start AND end with alphanumeric; separators only between. Verified
   against 16 synthetic cases (8 pass, 8 fail as expected) AND all
   15 PR #5 commits (all pass).

Mirror edits in both claude/ and codex/ review-cycle.md files.
…review polish

Round-2 ensemble (codex-cli + copilot-cli + me as third). 3 distinct
real findings + 1 P3 inline-fix from the self-review pass.

1. [medium, codex] Copilot --allow-tool isn't a hard read-only boundary

   `--allow-tool 'shell(git diff)'` approves any `git diff …`
   invocation INCLUDING write-capable forms like
   `git diff --output=path` which can dirty the working tree. The
   review-cycle docs claimed the allowlist was the structural
   enforcement of "no write tools" — that's false. Same applies to
   `shell(rg)` via redirection-flag forms.

   This breaks Step 11's same-commit guarantee: if a reviewer
   modifies the tree mid-round (whether via prompt injection or just
   accidental tool choice), the commit being reviewed has moved.

   Fix: document the gap, document the structural mitigation
   (post-review `git status --porcelain` tree-clean check), and
   point at the disposable-worktree pattern for repeat offenders.

2. [medium, codex] Draft fallback can wait forever for bot review

   The ship.md fallback opens a draft PR expecting the Copilot bot
   to auto-review. But GitHub's default Copilot review only fires
   on PR open (non-draft) or Draft→Open transition, and doesn't
   auto-re-review subsequent pushes. Without "Review draft pull
   requests" + "Review new pushes" settings enabled, the fallback
   silently waits forever.

   Fix: document the prerequisite check explicitly, document the
   manual review-request fallback (`gh api ... requested_reviewers`),
   and emphasize that "current commit" means re-request after each
   subsequent push.

3. [medium, copilot] Step 11 stop condition omitted reviewer availability

   "Stop as clean" only checked findings + validation, not whether
   all required reviewers ran. The Status contract said skipped
   required reviewer forces `partial`, but Step 11 could have
   allowed `clean` to slip through if a reviewer was unavailable.

   Fix: rewrite Step 11 with three explicit conditions (no
   unaccepted P0/P1/P2 + green validation + all required reviewers
   ran). Inline clarifying "no findings surfaced ≠ clean — a
   reviewer that didn't run produced no findings because it didn't
   run."

4. [P3, self-review polish, me] Forward-reference cleanup

   First draft of Step 11 said "(skipped/unavailable reviewer →
   see the next bullet)" with the actual explanation in the
   paragraph below. "Next bullet" is ambiguous — the third condition
   isn't followed by separate bullets. Made the third condition
   self-contained and dropped the forward pointer.

Mirror edits in both claude/ and codex/ files.

Reviewer notes:
- Three reviewers ran: codex-cli (background subprocess),
  copilot-cli (background subprocess), and me (Claude orchestrator
  doing an explicit checklist pass on the staged diff).
- claude-cli subprocess was NOT run — the documented OAuth blocker
  fired (parent OAuth doesn't propagate to child claude -p, keychain
  token expired). The user-orchestrator (Claude) doing an explicit
  checklist review is the documented fallback. Acceptable per the
  contract this PR is editing IF the orchestrator's review is
  written down, not "I looked, it's fine". That's what the
  Finding 4 / self-review section above is.
…4 reviewers)

Round-3 ensemble: codex-cli + copilot-cli + claude sub-agent (via
parent's Agent tool — the cleanest workaround for the OAuth blocker
documented in PR #5's own contract; same independence property as
a separate `claude -p`) + orchestrator self-review.

Eight distinct findings surfaced across the four reviewers (codex 0,
copilot 2, claude 6, me 2). Six accepted, two rejected:

ACCEPTED (applied this commit):

1. [medium, copilot + claude] `Accepted P2` listed as partial trigger
   in ship.md, but the Status contract defines `partial` only by
   skipped required reviewers (Accepted P2 → clean). Removed the
   ambiguous "or Accepted P2" branch trigger.

2. [medium, claude] Default rounds wording contradicted itself —
   Line 7 / arg list say "Default: 3", but line 308 said "Default: 3
   for code, 5-10 for docs" as if auto-detected. The command does NOT
   auto-detect change type. Reworded to "default is 3 regardless;
   for doc work consider passing rounds=5..10" so the user does the
   adjustment explicitly.

3. [low, copilot] Scope regex allowed `fix(a//b):`, `fix(a--b):`
   (repeated separators within a segment). Combined with finding 4
   below into a single tighter grammar.

4. [low, claude] Scope regex rejected `@scope/pkg` (scoped npm
   packages) — `feat(@happyvertical/sql):` would have failed
   commitlint despite the error message explicitly advertising
   forward slash "e.g. dep names". Combined with finding 3 into:
     scope    = segment ("," segment)*
     segment  = "@"? alpha-run (("/" | "-") alpha-run)*
     alpha-run = [a-z0-9]+
   Verified against 22 synthetic cases + all 17 PR #5 commits.

5. [low, claude] `.gitignore` `*.jsonl` patterns don't match
   copilot's actual session file format. Per `gh copilot -- --help`,
   `--share[=path]` writes `copilot-session-<id>.md` (markdown), not
   jsonl; `--output-format json` writes to stdout, not a file;
   `--log-dir` defaults to `~/.copilot/logs/` (outside repo). Added
   `*.md` patterns (the actual leak vector) while keeping `*.jsonl`
   patterns for defense-in-depth (the original 1fc8677 incident
   leaked `.deny-test.jsonl` / `.revparse-test.jsonl` from internal
   probes — that pattern may resurface under different CLI versions).
   Comment now states what the primary prevention actually is
   (review outputs go to /tmp).

6. [low, claude] Round-cap section was silent on "P0/P1/P2 fix lands
   in final permitted round" — Rule 10 demands a verify round; cap
   forbids it. A literal-following agent could report `clean` based
   on "no findings remaining" when actually no verify round ran to
   surface them. Added explicit special case: report status `partial`
   with note recommending re-run with raised cap.

REJECTED (with rationale recorded):

7. [medium, claude] "copilot-cli `-p` requires `--allow-all-tools`"
   — empirically refuted. Just tested `copilot -p` with per-command
   `--allow-tool` only (no `--allow-all-tools`), got correct output.
   The doc's recipe is correct; copilot's `--help` text saying
   "required for non-interactive mode" appears to be misleading
   guidance, not a hard requirement.

8. [suggestion, claude] "Use `shell(git:*)` prefix consolidation"
   — would WEAKEN the doc's read-only safety. `shell(git:*)` enables
   ALL git subcommands including `git push`, `git apply`, `git
   commit` (writes). The current enumeration of read-only subcommands
   is intentional safety, not redundancy.

P3 DEFERRED (recorded for follow-up, not fixed this round per the
"P3 never extend the loop" rule):

- [P3, me] ship.md cites "Settings → Code & automation → Copilot →
  Code review" UI path. GitHub UI paths drift. Add canonical docs
  URL as durable reference.
- [P3, me] `gh api ... requested_reviewers copilot-pull-request-
  reviewer` 404s if Copilot isn't installed on the repo. Out of
  likely threat model but worth a caveat.

Mirror edits in both claude/ and codex/ files.

Reviewer note: claude-cli subprocess auth (OAuth from parent claude
session) remained blocked this round. The Agent-tool workaround
(spawning a fresh Claude as a sub-agent with the same review prompt)
provides the same fresh-perspective independence — and demonstrably
caught 6 findings the other reviewers + the orchestrator missed.
Worth folding into the docs as the canonical claude-cli substitute
when OAuth fails.
…, no rejections)

Round-4 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
Surfaced 6 distinct findings; 4 accepted, 1 declined as nit, 1 was
my own self-review note that overlapped with #1.

1. [medium, ALL 4 REVIEWERS] Cap-blocked-verify case overloaded
   `partial` status, breaking `/ship`'s partial-handling contract

   Round-3 fix #6 said "report status `partial`" when a P0/P1/P2 fix
   landed in the final permitted round (no verify possible). But the
   Status contract defines `partial` strictly as "skipped required
   reviewer". `/ship` then routes `partial` only by `Skipped
   reviewers`, which would be empty in this case — undefined
   behavior, agent could fall through to "treat as clean and ship
   unverified fix".

   Four-way confirmation:
   - codex: "make it `blocked`/`needs-verify`"
   - copilot: "either add explicit `verify-round-missed` reason +
     new ship branch, or classify as `blocked` to keep partial
     strict"
   - claude sub-agent: "either add third partial sub-branch in
     ship.md, or use different status (`blocked` with verify-needed
     marker)"
   - me (self-review): "contract overload — needs broadened
     definition or different status"

   Fix: reclassify as `blocked` with reason
   `verify-round-blocked-by-cap`. Three coordinated edits:
   (a) Round-cap section: change "report status as `partial`" to
       "report status as `blocked` with reason
       `verify-round-blocked-by-cap`".
   (b) Status contract `blocked =` definition: extend to enumerate
       "P0/P1/P2 fix landed in the final permitted round with no
       verify round possible". Reasoning: an unverified fix counts
       as potentially unaccepted because we don't yet know if the
       fix introduced new findings.
   (c) Status contract `partial =` definition: add "Single cause
       only — other 'incomplete' states (unverified fix, validation
       failed) are `blocked`, not `partial`" to lock down the
       single-cause invariant.
   (d) ship.md `blocked` branch: add explicit sub-case for
       `verify-round-blocked-by-cap` directing operator to re-run
       `/review-cycle rounds=N+1`.

2. [low, copilot + claude] `.gitignore` comment cited prior incident
   filenames the patterns don't actually catch

   Round-3 comment claimed defense-in-depth for `.deny-test.jsonl`
   / `.revparse-test.jsonl` from the 1fc8677 incident, but the
   narrow `copilot-session-*.jsonl` patterns don't match those
   names. Misleading framing.

   Fix: rewrite the comment to be honest about what the patterns
   cover and don't cover. Note explicitly that the narrow patterns
   were a deliberate round-7 walkback (avoided `*-test.jsonl`
   wildcard that would hide legitimate fixtures), and that the
   structural defense for arbitrary probe filenames is the /tmp
   rule, not gitignore.

3. [low, codex] Commitlint workflow had unused `pull-requests: read`
   permission

   The job only does `actions/checkout` + `git log` over commit
   SHAs from the event payload. No PR API calls. `pull-requests:
   read` is dead scope. Per least-privilege, dropped it.

4. [low, claude] Commitlint error message didn't mention `@` (the
   regex was updated to allow it in round 3 but the help text
   lagged)

   A user hitting the error with `feat(@foo/bar): ...` for an
   unrelated reason would read the help message and conclude `@`
   isn't supported. One-line fix to add "optional leading @
   (scoped npm packages)" to the chars list.

DECLINED:

5. [nit, claude] Regex accepts `feat(@scope):` (no `/pkg` suffix) —
   not a valid npm-scoped reference

   Tightening to require `/` after `@` would reject legitimate
   non-npm uses (e.g. someone wanting to use `@`-prefix for their
   own scope convention). The regex isn't claiming to enforce
   npm-semantic validity. Out of scope for a commit-message linter.

Round-4 ran the full 4-reviewer ensemble in parallel against the
same commit (1dd7688). Three rounds total of 4-reviewer ensembles
now — the claude-sub-agent-via-Agent-tool workaround is proving
durable enough to fold into the docs as the canonical claude-cli
substitute when OAuth fails.
Round-5 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
Six distinct findings; five accepted, one declined.

1. [medium, codex] Wrong API path for manual Copilot review request

   Round-2 fix told operators to call `gh api ...
   requested_reviewers -F 'reviewers[]=copilot-pull-request-reviewer'`
   to manually request Copilot review when the auto-review opt-in
   isn't enabled. Per GitHub's current docs, the supported CLI path
   is `gh pr edit <PR> --add-reviewer @copilot`, and
   `copilot-pull-request-reviewer` is a workflow/app slug used for
   billing, not a reviewer alias. The original recipe could 422
   silently while the operator believes the fallback was satisfied
   — leaving the PR un-reviewed.

   Fix: replace the `gh api` recipe with the documented `gh pr edit
   --add-reviewer @copilot` form. Link to GitHub's "request a code
   review" docs. For re-reviews, point at the Reviewers-menu
   re-request button (the documented re-review UX). Also linked
   the prerequisite-settings docs URL instead of a UI path that
   can drift as GitHub renames sections.

2. [low, claude sub-agent] codex variant ship.md frontmatter unquoted

   `description: Prepare ... shipping: validate ...` — embedded
   colon-space is interpreted as a nested mapping by strict YAML
   parsers. The mirror claude variant already quotes the string;
   the codex variant didn't. One-character fix (add outer quotes).
   Pre-existing but in a file this PR touches substantively;
   normalize as part of this round.

3. [low, claude sub-agent + me (overlap)] Top-line said "three
   reviewers" but practice has been "four" (with orchestrator)

   This loop has been running a 4-reviewer ensemble for several
   rounds (codex-cli + copilot-cli + claude sub-agent + orchestrator
   self-review), but the docs still said "three independent reviewer
   subprocesses". Reviewers can keep surfacing the same ambiguity
   each round unless the docs match practice.

   Fix: substantial docs update:
   (a) Intro paragraph in both variants now describes a
       "4-reviewer ensemble": three subprocess reviewers + the
       orchestrator's own checklist pass.
   (b) Claude variant: explicitly documents the claude-sub-agent
       via Agent tool as the canonical claude-cli substitute when
       OAuth fails (preferred: `claude -p` with valid auth;
       fallback: sub-agent with same prompt). Validated across
       3 rounds in this loop.
   (c) Both variants: explicit rule that the orchestrator's pass
       is NOT silent-solo — must be an enumerated checklist run
       with written-out findings, not "I looked, it's fine".
   (d) Reviews field in Final Report updated to list ALL FOUR
       slots and require explicit skip/substitute reasons.

4. [P3, me — self-review] Voice consistency

   Special-case wording used "we don't yet know whether the fix
   introduced new findings" — colloquial "we" vs the doc's
   imperative voice elsewhere. Reworded to "its effect on the
   codebase is unobserved" (third-person, consistent).

5. [P3, me — self-review] P3-only carve-out ambiguous re
   validation-failed

   Round-2 carve-out said "Status stays clean (or partial if a
   required reviewer was skipped)" but didn't address validation
   failure. Could read as implying clean/partial overrides
   validation-failed → blocked. Clarified: "(or blocked if
   validation failed — the carve-out only suppresses the P3-only
   at-cap path; other blocked causes still apply)".

DECLINED:

6. [low (low confidence), claude sub-agent] Bare `Merge` subject
   (no trailing space) bypasses neither check

   The current `if [[ "$msg" =~ ^Merge\  ]]` matches `Merge ` only.
   A subject of just `Merge` with no following text would fail
   both this guard and the regex, rejecting the merge commit.

   Declined because `git merge` never produces a bare `Merge`
   subject — it always emits `Merge branch ...` or `Merge pull
   request ...`. The only way to get a bare `Merge` subject is a
   manually authored message, in which case the operator should
   conform. Not worth tightening the guard for an unobserved edge
   case.
…o rejections)

Round-6 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
Seven distinct findings, all accepted. Most are fallout from round-5's
contract evolution interacting with existing rules.

1. [medium, codex + copilot + me (3-way)] Loop steps still said
   "three reviewers" while intro said "four"

   Round-5 added the orchestrator as the 4th reviewer slot in the
   intro, but the procedural loop (`Run codex-cli, claude-cli, and
   copilot-cli`) and the "For all three" header still said three.
   A literal-following operator would skip the orchestrator pass
   entirely and still report `clean` because the 4th slot wasn't
   required by the loop steps.

   Fix:
   (a) Loop step 2: list all FOUR slots explicitly and require
       each to produce findings before dedup.
   (b) Per-reviewer guidance: rename "For all three" → "For all
       three subprocess reviewers" + add new "Orchestrator
       self-review (the 4th reviewer slot)" section with explicit
       rules: same JSON shape as subprocesses, "no findings" must
       be explicit, runs concurrently with subprocesses.

2. [medium, copilot] Slot-based redefinition needed for claude
   subprocess vs sub-agent

   The claude-cli reviewer can be filled via subprocess (`claude -p`)
   OR sub-agent (via Agent tool when OAuth fails). But the Status
   contract said skipped required reviewer forces `partial`, and the
   `Skipped reviewers` field looked at the subprocess specifically.
   Strict reading: if `claude -p` fails and sub-agent succeeds, the
   subprocess is technically "skipped" → `partial` even though the
   slot was filled.

   Fix: think in terms of SLOTS, not specific invocations. The
   claude slot can be filled by EITHER subprocess or sub-agent;
   the slot is "skipped" only if BOTH fail. Same for ship.md
   partial branch.

3. [medium, codex] Tree-clean post-check breaks for dirty/uncommitted reviews

   Round-2's tree-clean check (`git status --porcelain` after each
   reviewer) only works when the tree was clean BEFORE the reviewer
   ran. For uncommitted-work reviews, `git status` is already
   non-empty, so the check either invalidates every round or
   misses same-status mutations (e.g. reviewer modifies an
   already-modified file; status stays `M path`).

   Fix: rewrite as **pre/post tree-snapshot comparison**:
   - Committed-work case: pre-clean, post-check `git status --porcelain`
     for any output (existing behaviour, documented explicitly).
   - Uncommitted-work case: either stash/commit before reviewing
     (recommended) OR capture pre/post snapshots of status + diff
     + untracked content hashes and diff them.
   Documented both with the "never just 'is git status clean now'
   as the post-check" reminder.

4. [medium, claude] `gh pr edit --add-reviewer @copilot` requires
   gh CLI v2.88.0+

   Round-5 replaced the wrong `gh api ... requested_reviewers`
   recipe with `gh pr edit --add-reviewer @copilot`. Per
   cli/cli#v2.88.0 release notes, `@copilot` on `--add-reviewer`
   was added in v2.88.0; older gh returns `Could not request
   reviewer: '@copilot' not found` and silently skips the request
   — same silent-fallback-fails-forever pattern round-5 was trying
   to prevent.

   Fix: document the version requirement explicitly, with
   `gh --version` check and upgrade path. Re-reviews use the PR
   page's Reviewers menu re-request button (not gh pr edit, which
   is initial-add only).

5. [medium, claude] claude variant ship.md "claude-cli subprocess
   auth fails" → partial contradicts round-5's sub-agent fallback

   Round-5 documented the sub-agent fallback for claude-cli, but
   the ship.md partial-branch still listed "claude-cli subprocess
   auth fails" as a partial trigger. If sub-agent succeeded, the
   slot is filled and `/review-cycle` should return `clean`, not
   `partial`. Round-5 contract update wasn't propagated to ship.md.

   Fix: update claude variant ship.md to require "claude-cli AND
   sub-agent fallback both failed" before treating the slot as
   skipped. Codex variant unchanged (no Agent-tool equivalent).
   Added defensive note that if `/review-cycle` returned `partial`
   despite a successful sub-agent fill, the bug is in the
   orchestrator's classification, not ship.md's handling.

6. [low, claude] codex variant Reviews template cited undefined
   "sub-agent fallback"

   Round-5's mirror edit added "(subprocess OR sub-agent fallback)"
   to the Reviews template in both variants. But the codex variant
   has no body explanation of "sub-agent fallback" — the Codex CLI
   orchestrator doesn't have an Anthropic-Agent-tool equivalent.
   Operator reading the codex variant report template would be
   confused.

   Fix: drop "OR sub-agent fallback" from the codex variant
   Reviews template; explicitly note that the Codex CLI
   orchestrator has no documented substitute for claude-cli, so
   unavailability means accept the reduced-coverage tradeoff.

7. [low, claude] ship.md step 7 auto-`gh pr ready` collided with
   partial-branch human-clearance requirement

   Step 7 said "if existing PR is draft and the work is now clean,
   `gh pr ready`". "Now clean" was ambiguous — could be read as
   subjective ("the tree looks fine now") rather than the Review
   Cycle Gate output specifically. On `partial` (e.g. Copilot CLI
   blocked, draft opened, bot reviewed manually), a literal reader
   could auto-ready before the human explicitly accepted the
   bot-for-CLI substitution.

   Fix: tie step 7 explicitly to "/review-cycle returned status
   `clean`" (not `partial`, not `blocked`). On `partial`, the
   human runs `gh pr ready` after the documented clearance path.
…aveat

Round-7 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
7 findings; 6 accepted, 1 declined (false positive on my part — claude
sub-agent claimed claude variant ship.md frontmatter was unquoted but
empirical check shows both variants are already quoted).

1. [medium, codex + copilot + claude (3-way)] claude variant
   required-reviewer wording said "claude-cli subprocess" but the
   slot-based model elsewhere allows EITHER subprocess OR sub-agent

   Round-5 documented the sub-agent fallback. Round-6 updated some
   contract text to slot-based language but missed the explicit
   "Status MUST drop to partial when ... claude-cli subprocess is
   skipped" clause. Strict reading: subprocess fails → partial,
   even when sub-agent fills the slot. ship.md already calls this
   "a bug in classification".

   Fix: rewrite the required-reviewer list as an enumerated slot
   list. Claude variant explicitly says "the claude reviewer slot —
   filled by either `claude -p` subprocess OR Agent-tool sub-agent
   fallback when OAuth fails. The slot is 'filled' if EITHER
   succeeds; only 'skipped' if BOTH fail."

2. [medium, codex] codex variant treated blocked-claude as
   "filled with caveat" in the loop step

   Round-6 wrote "claude-cli (or accepted-tradeoff if blocked)" in
   the codex variant's loop step 2, which mixes filled and skipped
   semantics. A later run can treat the blocked slot as filled and
   bypass the partial-status gate.

   Fix: drop the "(or accepted-tradeoff if blocked)" parenthetical.
   Explicit: if claude-cli fails on Codex CLI (no sub-agent
   substitute), the slot is skipped → partial. Accepted-tradeoff
   rationale lives in the final report, not in the slot list.

3. [medium, copilot] codex variant required-reviewer list omitted
   the orchestrator slot

   Round-6 added the 4-reviewer ensemble but the codex variant's
   required-reviewer-for-status-gate definition only listed the
   three subprocess reviewers. A run that skipped the orchestrator
   pass could still report `clean` because the gate's required
   list didn't include it.

   Fix: enumerate all four slots in both variants' required-
   reviewer definitions. The orchestrator slot fills itself but
   must be listed as required for the partial-status gate to bite
   when an operator forgets the self-review pass.

4. [medium, claude] "Run claude-cli review" section had no sub-agent
   fallback recipe

   Round-5 documented the fallback existence in the intro but the
   actual "Run claude-cli review (as a subprocess)" section only
   showed the subprocess invocation. An orchestrator hitting the
   401 error and looking for fallback instructions in the recipe
   section would find none, then either skip the slot wrong or
   ad-hoc invoke a sub-agent that may not produce the standard
   JSON shape.

   Fix: rename section to "Run claude-cli review (subprocess
   preferred, sub-agent fallback)". Document both paths:
   - Subprocess: existing `claude -p` recipe, plus explicit note
     about CLAUDE_CODE_OAUTH_TOKEN / ANTHROPIC_API_KEY setup so
     subprocess auth actually works from a Claude Code parent.
   - Sub-agent fallback: concrete Agent({}) shape with
     `subagent_type: "general-purpose"`, the standard JSON output
     instruction, and run_in_background.

   Also note explicitly the model-family overlap caveat (sub-agent
   is same model as orchestrator, so independence is
   "no-shared-conversation-context" not "different-model-family").

5. [low, claude] "For all three subprocess reviewers" rules don't
   cleanly apply to the sub-agent fallback path

   The renamed section bundles rules that are subprocess-specific
   (15-min background-Bash timeout, stdout/stderr capture to
   separate files). The sub-agent doesn't run as a subprocess, so
   these don't apply.

   Fix: rename to "For all three external reviewer slots" and
   prepend a note explaining which rules map to the subprocess
   path vs the sub-agent fallback path.

6. [low, claude — philosophical but valid] Orchestrator-as-4th-
   reviewer has structural confirmation bias the doc didn't
   acknowledge

   The ensemble's stated justification is "Different models have
   different blind spots." The 4th slot (orchestrator self-review)
   is the SAME model as the claude-cli reviewer AND the agent that
   wrote the code being reviewed. It has the LEAST blind-spot
   coverage of all four slots and has full author context (=
   confirmation bias). Treating "4/4 clean" as equivalent to "3/3
   clean from independent reviewers" over-weights the orchestrator.

   Fix: add an explicit caveat in the orchestrator self-review
   section: this slot's role is "explicit checklist
   accountability" (forces the discipline), not independent
   blind-spot coverage. A finding the orchestrator alone surfaces
   is real; a "no findings" pass from the orchestrator alone is
   weak. Findings flagged ONLY by orchestrator get extra scrutiny
   for over-flagging (the bias works both ways).

DECLINED:

7. [low (low conf), claude] Cross-variant frontmatter inconsistency
   — empirically refuted. Checked both files: both already use
   quoted `description: "..."`. Claude sub-agent's claim was wrong.
…bug, 5 low polish)

Round-8 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
6 findings; all accepted. Most were polish, but one was a real bug
I introduced in round 6.

1. [medium, codex] `git stash` recommendation actively breaks the
   dirty-tree review workflow

   Round-6 said "create a `wip` commit or `git stash`, run the
   round on the committed/stashed state, then unstash/reset
   after." That works for the WIP commit (changes stay in the
   tree). But `git stash` REMOVES changes from the worktree — so
   reviewers run against the pre-WIP tree, report findings on
   code that wasn't being reviewed, then unstash restores changes
   that never got reviewed.

   Real concrete failure: stash WIP → reviewers report clean on
   pre-WIP state → unstash → `/ship` continues with unreviewed
   restored changes.

   Fix: drop `git stash` from the recommended path. Document the
   WIP-commit-then-`git reset --mixed HEAD~1` dance explicitly,
   with the warning that stash is the wrong tool here.

2. [low, codex] Dirty snapshot recipe missed staged/index state

   The snapshot recipe captured `git status --porcelain` + `git
   diff` + untracked-file hashes. But `git diff` only covers
   UNSTAGED worktree changes. If the baseline has staged changes
   and a reviewer mutates the index while leaving the status
   shape as `M path`, snapshots can compare equal even though the
   bytes being reviewed changed.

   Fix: include `git diff --cached` (staged/index) in both before
   and after captures.

3. [low, claude] `/ship` Final Report `Reviews:` field never
   upgraded to the 4-slot enumeration

   Round-7 updated `/review-cycle`'s Reviews field to require all
   four slots be enumerated. `/ship`'s own Final Report template
   still said `Reviews: <rounds and tools>` — vague pre-slot-model
   wording. The `Accepted P2` / `Accepted non-blockers` /
   `Skipped reviewers` fields were also missing from `/ship`'s
   own report even though `/ship` is required to copy them into
   the PR body.

   Fix: rewrite the `/ship` Final Report Reviews field to copy
   from `/review-cycle`'s report verbatim, and add the three
   missing fields. `/ship`'s own report is now self-contained.

4. [low, claude] codex variant orchestrator concurrency wording
   was weaker than claude variant's

   The claude variant said "Run the orchestrator pass in parallel
   with the subprocesses (while they run in the background, the
   orchestrator reads the diff against the checklist)." The codex
   variant said only "Run the orchestrator pass while the
   subprocesses are running in the background" — could be
   misread as "wait for them, then run while they idle."

   Fix: mirror the claude variant's exact wording in the codex
   variant. Wall-clock parallelism is the intent.

5. [low, claude] "All reviewers in parallel" rule didn't carve
   out the sub-agent launch-on-failure semantic

   The loop rule says "all reviewers in parallel from t=0", but
   the sub-agent fallback is inherently sequential — it can only
   launch after `claude -p` returns auth failure. A strict
   reading could push an orchestrator into launching both
   concurrently as a defensive precaution, doubling Anthropic
   API spend on rounds where the subprocess succeeds.

   Fix: one-paragraph carve-out below the parallel rule —
   launch subprocess first, wait briefly for the 401, spawn the
   sub-agent only if auth failed. The fallback still counts as
   "parallel" for the same-commit guarantee.

6. [P3, me — self-review] Agent({}) pseudo-code in sub-agent
   fallback recipe could mislead readers into using JS-like
   syntax

   The example used `Agent({ subagent_type: "...", ... })` shape
   which looks like a JavaScript function call. The actual tool
   invocation uses JSON parameter values.

   Fix: rewrote as "Agent tool invocation:" + key-value list +
   explicit "illustrative — not a literal JavaScript call" note.
…viewers)

Round-9 ensemble (codex-cli + copilot-cli + claude sub-agent + me).
5 findings; 4 accepted, 1 rejected as empirically refuted.

1. [medium, claude] codex variant intro contradicted the new
   partial-on-skip Status contract

   Round-5 intro paragraph (which I added for codex variant) said
   the parent should "explicitly accept the reduced-coverage
   tradeoff with rationale" when a reviewer is blocked. The
   tightened Status contract from rounds 6-7 explicitly forbids
   this: skipped slot → partial, no "accept and continue" path.
   Driver reading top-to-bottom hits the old framing first, applies
   it, never re-reads the contract → unreviewed code through
   `/ship`'s gate.

   Fix: rewrite both spots (intro and Reviews template) to point
   at the Status contract: skipped means partial, rationale lives
   in the Skipped reviewers field as documentation, not as a
   loophole that keeps Status=clean.

2+3. [medium, codex + claude] Dirty-tree WIP-commit recipe from
     round-8 has multiple real footguns

   Codex caught: `git add -A` stages unrelated untracked files
   (`.copilot-session-*.md`, scratch notes), which end up in the
   WIP commit AND in `pr-review --base`'s diff sent to external
   reviewers — potential local-data leak.

   Claude sub-agent caught: `git reset --mixed HEAD~1` is a
   `git reset` (Hard Rules normally prohibit `git reset --hard /
   --mixed`), and `HEAD~1` is positional — if HEAD moved during
   the review for any reason, the reset silently drops the wrong
   commit.

   Both reviewers caught (independently): the recipe destroys the
   user's pre-WIP staging discipline (specific staged/unstaged
   split → everything unstaged after reset --mixed).

   Fix: substantial rework of the dirty-work section.
   (a) Flip recommendation: **snapshot comparison is now the
       primary path** (no footguns — preserves staging, doesn't
       touch commits, doesn't leak untracked files).
   (b) WIP commit demoted to "manual-friendly alternative" with
       a fully-specified safe recipe that:
       - Captures pre-state (HEAD SHA, staged patch, unstaged
         patch) for restoration.
       - Uses explicit pathspecs (never `git add -A`).
       - Verifies HEAD didn't move during review.
       - Verifies the commit subject matches "wip: review
         snapshot" before resetting.
       - Restores staging discipline via stored patches.
   (c) Hard Rules carve-out: the WIP-undo `git reset --hard` is
       permitted ONLY as part of this verified-undo dance with
       the shape shown. Carve-out is scoped and explicit.
   (d) Three explicit DON'T warnings (stash, `add -A`, unguarded
       `reset --mixed HEAD~1`) explaining why each is wrong.

4. [low, claude] ship.md partial-branch didn't cover orchestrator-
   slot-unfilled

   The Status contract requires `partial` whenever ANY required
   slot is unfilled. The orchestrator is in the required list,
   but ship.md's two partial sub-branches only covered copilot-cli
   skipped and claude/codex slot unfilled. If the orchestrator
   skips its own checklist pass, a literal reader maps to neither
   branch and stalls.

   Fix: extend partial-(b)'s parenthetical in both variants to
   include "orchestrator slot was unfilled because no explicit
   `{findings: []}` checklist pass was produced this round".

REJECTED:

5. [low, claude] `--effort xhigh` may not be valid for copilot-cli
   — empirically refuted. Just ran `copilot --help`: choices are
   `none, low, medium, high, xhigh, max`. `xhigh` IS valid. The
   docs are correct. Claude sub-agent's concern about "codex
   value reused for different binary without verification" was a
   plausible heuristic but the value happens to be supported in
   both.
… converged)

Round-10 ensemble (codex + copilot + claude sub-agent + me).
4 distinct findings, all converging on the same conclusion: the
WIP-commit recipe added in round 8 and reworked in round 9 has
accumulated too many independent footguns to keep patching.

The footgun inventory (caught across this and prior rounds):

1. [HIGH, round 10, codex + copilot 2-way] `git reset --hard` step
   loses untracked files included in the WIP commit. `git diff`
   doesn't capture them, so the restore patch can't recreate them.
   Concrete local data loss for new files/binary assets.

2. [medium, round 10, claude sub-agent] `git apply` of the
   unstaged patch is unsound when staged + unstaged changes touch
   overlapping hunks in the same file. The unstaged patch's
   context is the INDEX state, but we apply it to a worktree at
   pre-WIP HEAD (no staged applied yet). Hunks fuzz-fail, leave
   `.rej` files, silent failure if not run with `set -e`.

3. [medium, round 10, claude sub-agent] `git commit` failures
   (extremely common — commitlint hooks, gitleaks, gpg signing,
   biome, plus this very repo's own commitlint workflow which
   rejects `wip:` as a non-Conventional-Commits type) silently
   make `WIP_SHA` empty, and the downstream verify check
   misdiagnoses it as "HEAD moved during review". User goes
   hunting for non-existent HEAD movement.

4. [medium, round 10, copilot + claude + me 3-way] Fixed `/tmp/`
   paths collide across concurrent runs and across retry runs
   after a failure (second invocation overwrites recovery state
   from first failed run).

5. [round 9, codex + claude] `git add -A` stages unrelated
   untracked files which end up in the WIP commit AND in
   `pr-review --base`'s diff sent to external reviewers —
   potential local data leak.

6. [round 9, claude] `git reset --mixed HEAD~1` is positional;
   if HEAD moved (interrupted run, stray amend, side commit),
   wrong commit gets reset.

7. [round 9, claude + me 2-way] Hard Rules section forbids
   `git reset --hard`; recipe's "carve-out" was only inline,
   not in the Hard Rules section itself.

8. [round 9, me] WIP dance destroys user's pre-WIP staging
   discipline (carefully staged/unstaged split → everything
   unstaged after reset).

That's 8 distinct sharp edges across 2 rounds. Each round of fixes
introduces another caveat to document. Same pattern as PR #6's
round-6.5 BREAKING-CHANGE-scanning removal: when a recipe needs
more caveats than recipe, delete it.

Fix: remove the WIP-commit alternative entirely. Replace with a
fully-fleshed-out snapshot-comparison recipe that:
- Uses `mktemp -d` for per-run snapshot directory (no /tmp
  collisions).
- Captures status + unstaged diff + staged diff + untracked file
  hashes BEFORE the reviewer.
- Captures same shape AFTER.
- Diffs the four pairs; ANY difference → round invalid.
- Catches same-status-but-different-content cases (e.g.
  `M path → M path` with different bytes) that a simple
  "is status empty" check would miss.
- Catches added/modified untracked files via the hash list.
- Preserves staging discipline.
- No `git reset` calls → Hard Rules don't need a carve-out.

Brief "why not WIP" explanation kept inline as a tombstone so
future contributors don't reinvent the WIP recipe — enumerates the
8 footguns so the choice is documented, not just made.

Net: -71 / +49 lines across both variants. Removes the most
caveat-heavy section of these docs and replaces it with a single
correct recipe.
…(4 reviewers)

Round-11 was the verify round for round-10's WIP-recipe deletion +
snapshot-recipe replacement. The snapshot recipe itself had a real
HIGH severity shell-injection bug — 3-way confirmation across all
three external reviewers + my self-review caught it.

**The bug**: round-10's snapshot recipe used:

```bash
git ls-files --others --exclude-standard -z \
  | xargs -0 -I{} sh -c 'printf "%s  %s\n" "$(git hash-object "{}")" "{}"'
```

`xargs -I{}` performs TEXTUAL substitution of `{}` into the
`sh -c` string BEFORE sh parses it. So a filename containing
`$(...)` gets evaluated as command substitution inside the
double-quoted argument.

Empirically reproduced:

```
$ touch 'evil$(echo PWNED >&2).txt'
$ git ls-files --others -z | xargs -0 -I{} sh -c 'printf "%s  %s\n" "$(git hash-object "{}")" "{}"'
PWNED                                          ← echo executed
fatal: could not open 'evil.txt' for reading
PWNED                                          ← executed again for the second {}
```

This is especially bad because:
1. It directly contradicts the doc's load-bearing claim from
   round 10 — "Use snapshot comparison — it's the only approach
   with no footguns" was the justification for deleting the
   WIP recipe. A footgun in the replacement undermines the
   whole simplification.
2. The recipe targets dirty-tree review of `codex review
   --uncommitted` flows — exactly the situations where untrusted
   content is most likely to be in the worktree (extracted
   archives, fetched untrusted contributor diffs, etc.).
3. Operators are told to copy this recipe verbatim into their
   review-cycle implementation.

**Reviewer ratings**:
- codex: HIGH
- copilot: HIGH (x2 — one per file)
- claude sub-agent: medium (with empirical reproduction)
- me (self-review): low (correctness/edge case framing, missed
  the security angle)

The two HIGH ratings + concrete reproduction are the right call.

**Fix**: replace with a null-delimited `while read` loop that
passes each filename as a bash variable (data) rather than
substituting into shell source:

```bash
git ls-files --others --exclude-standard -z \
  | while IFS= read -r -d '' f; do
      printf '%s  %s\n' "$(git hash-object -- "$f")" "$f"
    done \
  | sort > "$SNAPSHOT_DIR/pre-untracked.txt"
```

Verified clean: the same `evil$(echo PWNED >&2).txt` filename
that triggered execution under the old form is now printed
literally with its hash.

Added an inline comment explaining the gotcha so future
maintainers don't "simplify" back to the xargs form. The
comment cites the exact failure shape.

Same fix in both claude/ and codex/ variants (recipe is
byte-identical across them).
@willgriffin willgriffin merged commit b0023bf into main May 23, 2026
1 check passed
@willgriffin willgriffin deleted the fix/review-cycle-copilot-required branch May 23, 2026 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants