docs(checklist): add hazards from real Copilot catches on have-config + pr-review#3

Merged
willgriffin merged 2 commits into main from docs/checklist-config-hazards
May 22, 2026
@willgriffin
Contributor

Summary

Folds six patterns Copilot caught on this week's PRs (have-config #2 + pr-review #2) into the shared checklist. Acting on the calibration insight immediately rather than batching for pr-review-tune: each pattern is concrete, with a file/line citation from the source PR, so there is no ambiguity.

New bullets

Theme 7 renamed from "Dead config & unwired parameters" → "Config hazards: dead, surprising, or over-active". The original "missing consumer" framing missed the inverse problem (config that has invisible side effects on consumers), which is just as common in shared-config / monorepo / base-config setups.

Three new bullets:

  • Shared config with invisible side effects on consumers (e.g. incremental: true in shared tsconfig → silent .tsbuildinfo files everywhere)
  • Shared config too narrow for documented use cases (e.g. base tsconfig claims SvelteKit support but lacks DOM lib)
  • README install instructions that contradict package.json (telling consumers to install what's already in optionalDependencies)

Theme 8 (Infra/deploy hazards) gains three new bullets:

  • GitHub Actions workflow-command injection (`::error::$user_input` without escaping %/\r/\n)
  • `echo "$user_input" | grep` parsing dashes as flags (use `printf '%s\n'`)
  • GitHub Actions permissions broader than the workflow needs (dead scope, least privilege)

Why now, not via pr-review-tune

The tune loop is built for batch refinement when many findings need triage and dedup. These six patterns are unambiguous: each one is a real bug or DX hazard with a clear fix and a real example. Waiting to batch them just delays the value.

Test plan

  • Conventional commit message check (passes — the PR's commit conforms)
  • Visually verify markdown renders correctly on GitHub
  • Next pr-review run that includes a shared-config change should now catch the "invisible side effect" pattern proactively

… + pr-review

Two waves of Copilot review on this week's PRs surfaced patterns the
checklist didn't cover. Folding them in immediately rather than waiting
for batched pr-review-tune.

Theme 7 (renamed: "Config hazards: dead, surprising, or over-active")
gains three new bullets:

- Shared config with invisible side effects on consumers. Caught
  on have-config PR #2: `incremental: true` in a shared
  tsconfig.base.json silently emits .tsbuildinfo in every consuming
  tree. Generalizes to any extended/published config (tsconfig, eslint,
  prettier, vite, biome).

- Shared config too narrow for documented use cases. Same PR:
  tsconfig-base README claimed SvelteKit support but lib was ES2022
  only (no DOM). Either narrow the docs, add a variant, or require
  explicit override.

- README install instructions that contradict package.json. Same PR:
  README told consumers to install three packages that were already
  optionalDependencies. Causes duplicate installs / version skew.

Theme 8 (Infra/deploy hazards) gains three new bullets from
pr-review PR #2:

- GitHub Actions workflow-command injection from user-controlled
  content. `::error::<commit-subject>` without escaping `%` `\r` `\n`
  lets a crafted commit message inject workflow commands or spoof
  logs. Real security issue, not theoretical.

- `echo "$user_input" | grep` parses dashes as flags. Subjects
  starting with `-n`/`-e` get treated as echo options. Use `printf
  '%s\n'`.

- GitHub Actions permissions broader than the workflow needs.
  pull-requests: read on a workflow that only does git log + event
  payload reads is dead scope. Least privilege.
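
A minimal sketch of the escaping the workflow-command bullet asks for, assuming bash. This is not the helper from commitlint.yml (that file is not reproduced in this thread); `escape_data` and the sample subject are hypothetical names chosen for illustration. The encodings (`%` to `%25`, carriage return to `%0D`, line feed to `%0A`) follow GitHub's workflow-commands documentation.

```bash
#!/usr/bin/env bash
# Escape a value for the message part of a GitHub Actions workflow command.
# '%' is encoded first so that the '%' introduced by the later replacements
# is not encoded a second time.
escape_data() {
  local s=$1
  s=${s//'%'/%25}
  s=${s//$'\r'/%0D}
  s=${s//$'\n'/%0A}
  printf '%s' "$s"
}

# Hypothetical hostile commit subject that tries to start a second command.
subject=$'fix: 100% done\n::warning::spoofed'
printf '::error::%s\n' "$(escape_data "$subject")"
```

With the escaping in place the whole subject stays on one `::error::` line, so the embedded `::warning::` is shown as text rather than parsed as a command.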

All six patterns are concrete with file/line citations from the source
PRs. Should reduce time spent on the same class of issue in future
shared-config and workflow PRs.
@willgriffin willgriffin requested a review from Copilot May 22, 2026 18:27

Copilot AI left a comment

Pull request overview

Updates the shared PR-review checklist to incorporate concrete “hazard patterns” observed in recent reviews, expanding the config section to cover both dead config and config with surprising downstream effects, and adding new infra/deploy safety checks.

Changes:

  • Renames Theme 7 and adds guidance on shared-config hazards (invisible side effects, overly narrow base configs, README install mismatches).
  • Adds new Theme 8 bullets covering GitHub Actions workflow-command injection, safer shell printing patterns, and least-privilege workflow permissions.

Comment thread prompts/checklist.md Outdated
Comment thread prompts/checklist.md Outdated
…llet

Two Copilot catches on PR #3:

1. The escape examples were written as pseudo-sed (`s//\%/%25/` etc.)
   which is malformed and unhelpful. Replaced with the actual bash
   parameter-expansion helper from the commitlint.yml workflow that
   triggered this checklist update — concrete, copy-pasteable, and
   correct. Also linked GitHub's workflow-commands docs as the
   authoritative source for the encoding requirements.

2. Bullet title attributed the dash-parsing problem to the whole
   `echo "$user_input" | grep` pipeline; the issue is actually `echo`
   alone (grep just sees empty stdin). Renamed the bullet to
   "`echo "$user_input"` treats leading `-n`/`-e`/`-E` as flags" and
   expanded the body to explain the downstream silent-empty-input
   failure mode, which is the more dangerous symptom.
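
The failure mode described in item 2 can be reproduced in a few lines. This is a sketch assuming bash's builtin `echo`; the variable names and the sample subject are made up for the demonstration.

```bash
#!/usr/bin/env bash
subject='-n'   # hypothetical commit subject that looks like an echo option

# bash's builtin echo consumes "-n" as an option and prints nothing at all.
via_echo=$(echo "$subject")

# printf treats the value purely as data for the %s conversion.
via_printf=$(printf '%s\n' "$subject")

printf 'echo   -> [%s]\n' "$via_echo"
printf 'printf -> [%s]\n' "$via_printf"

# Downstream effect: grep receives empty stdin and silently matches nothing.
if echo "$subject" | grep -q -- '-n'; then
  echo 'echo|grep: match'
else
  echo 'echo|grep: no match'
fi
if printf '%s\n' "$subject" | grep -q -- '-n'; then
  echo 'printf|grep: match'
else
  echo 'printf|grep: no match'
fi
```

The `echo` pipeline reports no match even though the subject is exactly the string being searched for, which is why the silent-empty-input symptom is the more dangerous one.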
@willgriffin willgriffin merged commit 12d107b into main May 22, 2026
1 check passed
willgriffin added a commit that referenced this pull request May 22, 2026
Pattern catch-rate audit on the 17 Copilot findings from have-config #2,
pr-review #2, pr-review #3, and have-config #4 showed that the
post-#3 checklist catches ~9 of 17 strongly and ~3 borderline. Four
additional patterns surfaced repeatedly without dedicated bullets;
adding them brings the catch rate to ~14 of 17 (~82%), leaving 3
genuinely uncatchable findings (one Copilot false positive + two
meta-writing-quality issues).

New bullets:

Theme 7 (Config hazards):
- Engine/version constraints looser than what the lockfile needs.
  Catches the recurring `engines.node: ">=20"` vs lockfile-requires-
  20.19 pattern plus its sibling on `actions/setup-node node-version`,
  packageManager, .nvmrc, Docker tags, CI tool pins.

Theme 8 (Infra/deploy hazards):
- Third-party GitHub Actions pinned to moving tags instead of commit
  SHAs. Real supply-chain risk; vendored anytown agents already
  SHA-pin. Includes the readable `# v5` comment pattern and link to
  GitHub's hardening docs.
- Extended the existing "Interpolated shell variables into psql/sed/
  perl substitutions without escaping" bullet with a sibling bullet
  covering shell-escape / regex visual-vs-parsed ambiguity (the
  `Merge\ ` two-char issue, `'\n'` vs `$'\n'`, BRE vs ERE vs PCRE
  flavors).

Theme 11 (Mechanical):
- Shebang interpreter doesn't match the file's syntax. Catches the
  `#!/usr/bin/env node` on TypeScript file pattern from have-config
  #4 and generalizes to python version mismatch and `#!/bin/sh`
  bashisms.

Out of scope:
- Library-version-specific quirks like the disableTypeChecked
  array-vs-object spread (Copilot was wrong about that; doesn't
  generalize).
- Meta writing-quality concerns (don't ship pseudo-code as
  examples) — fits Theme 6 in spirit but adding a separate bullet
  would dilute the theme. Skip until it shows up again.
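
One way to spot the moving-tag pattern from the Theme 8 bullet is a plain grep over workflow files. The sketch below assumes bash and POSIX `grep -E`; `find_unpinned_actions`, the sample workflow, and the 40-character hex string in it are all invented for illustration. It flags every `uses:` reference that is not pinned to a full SHA and does not try to tell first-party from third-party actions.

```bash
#!/usr/bin/env bash
# Print every `uses:` line whose ref is not a full 40-character commit SHA.
# Local (./path) references carry no @ref and are skipped by the first grep.
find_unpinned_actions() {
  grep -nE '^[[:space:]]*(-[[:space:]]*)?uses:[[:space:]]*[^#[:space:]]+@' "$@" \
    | grep -vE '@[0-9a-f]{40}([[:space:]]|$)'
}

# Hypothetical workflow fragment; the SHA below is a made-up placeholder.
sample=$(mktemp)
cat > "$sample" <<'EOF'
steps:
  - uses: actions/checkout@v5
  - uses: actions/setup-node@0123456789abcdef0123456789abcdef01234567 # v5
  - uses: ./local-action
EOF

find_unpinned_actions "$sample"
rm -f "$sample"
```

In a repository this would typically be pointed at `.github/workflows/*.yml`; the function returns non-zero when nothing is flagged, which is grep's usual convention.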
@willgriffin
Contributor Author

Folded 4 more patterns from the catch-rate audit (commit ad5d05d):

  • Theme 7 — engine/version constraints looser than lockfile needs (catches the recurring engines.node: ">=20" vs requires-20.19 pattern + siblings on setup-node node-version, packageManager, .nvmrc, Docker tags)
  • Theme 8 — third-party GitHub Actions pinned to moving tags instead of SHAs (supply-chain hardening per GitHub's docs; matches the vendored anytown agent pattern)
  • Theme 8 — extended shell-escape bullet to cover the Merge\ two-char visual-vs-parsed gotcha and friends ('\n' vs $'\n', regex flavor differences)
  • Theme 11 — shebang interpreter doesn't match file syntax (the #!/usr/bin/env node on .ts pattern, plus python version mismatch, sh bashisms)

Net effect: catch rate on the last 17 Copilot findings moves from ~53% (current main) to ~82% (with #3 as-merged). Remaining ~18% is genuinely uncatchable — one Copilot false positive about library shape + two meta-writing-quality issues in the checklist's own prose.
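
The engine-constraint bullet boils down to comparing two version floors. A rough sketch, assuming bash and a `sort` that supports `-V` (GNU coreutils does); `version_ge` and both sample versions are hypothetical, and extracting the real floors from package.json and the lockfile is left out.

```bash
#!/usr/bin/env bash
# Succeeds when version $1 is >= version $2, comparing dotted numeric fields.
version_ge() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

declared_floor='20.0.0'   # lowest version that engines.node ">=20" permits
required_floor='20.19.0'  # hypothetical minimum demanded by a locked dependency

if version_ge "$declared_floor" "$required_floor"; then
  echo "ok: declared floor $declared_floor covers $required_floor"
else
  echo "hazard: engines allows $declared_floor, lockfile needs >= $required_floor"
fi
```

The same comparison applies to the sibling cases the bullet lists (setup-node `node-version`, `.nvmrc`, Docker tags), as long as each can be reduced to a lowest permitted version.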

@willgriffin
Contributor Author

Ran pr-review --base origin/main --pretty against this branch and applied the (now-expanded) checklist to the diff. Two findings on the new bullets themselves, both fixed in commit a91dd07:

[medium] The shell-escape bullet's `Merge\ ` accounting was factually wrong: it claimed escape-plus-space is two chars when it's one. The actual bug being illustrated was the trailing extra space after `^Merge\ `, not the escape. A reader following the original framing would have written subtly wrong patterns themselves, so the bullet was actively causing the class of bug it was meant to prevent. Fix: rewrite to focus on invisible trailing whitespace as the actual failure mode.

[low] The SHA-pin bullet used a concrete SHA (93cb6efe...) as the example. That SHA was current when written but will drift as actions/checkout ships new v5 patches. Fix: replaced with <sha> placeholder + the one-liner gh api repos/actions/checkout/git/refs/tags/v5 -q .object.sha to fetch the current SHA, so readers don't pin to a slowly-aging example.

Nice meta moment — the checklist applied to its own modifications caught two real issues. The calibration loop closing on itself works.
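
Both visual-vs-parsed gotchas from the shell-escape bullet are easy to check directly. A small sketch assuming bash; the pattern and the sample subject are invented for the demonstration.

```bash
#!/usr/bin/env bash
literal='\n'    # single quotes: two characters, a backslash and the letter n
newline=$'\n'   # ANSI-C quoting: one character, an actual line feed

printf 'literal has %d chars, newline has %d char\n' "${#literal}" "${#newline}"

# Bracketing the value makes trailing whitespace in a pattern visible.
pattern='^Merge  '   # two trailing spaces, easy to miss in a diff
printf 'pattern=[%s]\n' "$pattern"

# The extra space changes what matches: one space after Merge no longer does.
if printf '%s\n' 'Merge branch main' | grep -qE "$pattern"; then
  echo 'matched'
else
  echo 'no match: pattern demands two spaces after Merge'
fi
```

Printing the pattern between brackets is a cheap habit that surfaces invisible trailing whitespace before it ends up in a workflow file.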

@willgriffin
Contributor Author

Ran the multi-tool review-cycle (codex + claude -p + Copilot) on this PR. Codex (gpt-5.5, xhigh) caught two real factual issues my own pass and Copilot's both missed. Fixed in commit dc99b8d:

[low — factual accuracy] Shell-escape bullet's regex counts were wrong both as source text (8/9) and as match length (6/7). Codex's fix is cleaner — drop the numeric framing entirely and describe what the regex matches in plain English ("Merge followed by two spaces" vs "...one space"). No counts to get wrong.

[low — outdated] Shebang bullet said Node "can't parse TypeScript natively" with interface/as/generics as examples. That's outdated for Node 22.6+ (--experimental-strip-types) and Node 23.6+/24 (default). All three examples are erasable syntax that Node strips natively. The org standard is Node 24, so the bullet would have produced false positives on this org's own projects. Rewrote to scope the hazard to actually-non-erasable syntax (enums, namespaces, parameter properties, TSX/JSX, decorators) which Node genuinely can't handle.

Operational notes from the run:

  • The claude -p subprocess in the ensemble failed with 401 Invalid authentication credentials — OAuth doesn't propagate from a parent claude session to spawned children. Need ANTHROPIC_API_KEY env var or a different invocation pattern when review-cycle is run from inside an active claude session. Should add to the /have:review-cycle command's hard-rules section.
  • Copilot already reviewed 2 rounds; my third commit didn't re-fire it.
  • codex with xhigh took ~50s to produce findings on a small docs diff. Slower than claude or grep-based linters, but it earned its keep by catching what the others missed.

This is the second consecutive PR where the calibration loop caught real factual bugs in checklist bullets describing the patterns those bugs exemplify. Worth treating the bullet text itself as code subject to the same checklist.

@willgriffin
Contributor Author

Third reviewer (Copilot CLI) ran successfully after the org policy was enabled. Caught two more factual issues that neither my own pass nor codex flagged — both required current external knowledge that Copilot got via live web fetch:

[medium] setup-node behaviour bullet was wrong. I wrote node-version: '20' "resolves to latest 20.x" — that's not how setup-node works. It picks from the runner's tool cache via semver match, which can lag the current latest 20.x. check-latest: true forces a fresh lookup. Copilot fetched setup-node's README to verify. Fixed by describing the actual cache+check-latest mechanism and recommending a specific minimum-minor pin ('20.19') instead of bare '20'.

[medium] Node TS native-strip threshold was outdated. I said "Node 22.6+ (with flag), 23.6+/24 (default)". Per Node's current docs (which Copilot fetched at https://nodejs.org/api/typescript.html), type stripping is default-enabled in 22.18+ as well as 23.6+/24. The flag-required window is 22.6 through 22.17. On a project pinned to 22.18 LTS, the old wording would have generated false-positive findings recommending an unnecessary flag.

Both fixed in commit 5588133.
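
The version windows in the second finding can be written down as a small classifier. This sketch assumes bash and encodes only the thresholds quoted in this comment; `strip_types_mode` is a hypothetical helper, and anything outside the stated ranges is reported as `unknown` rather than guessed.

```bash
#!/usr/bin/env bash
# Classify a Node.js version by how it handles erasable TypeScript syntax,
# using only the thresholds quoted in the comment above:
#   22.6 through 22.17   -> needs --experimental-strip-types
#   22.18+, 23.6+, 24+   -> type stripping enabled by default
strip_types_mode() {
  local v=${1#v}
  local major=${v%%.*} minor=0
  case $v in
    *.*) minor=${v#*.}; minor=${minor%%.*} ;;
  esac
  if   [ "$major" -ge 24 ]; then echo default
  elif [ "$major" -eq 23 ] && [ "$minor" -ge 6 ]; then echo default
  elif [ "$major" -eq 22 ] && [ "$minor" -ge 18 ]; then echo default
  elif [ "$major" -eq 22 ] && [ "$minor" -ge 6 ]; then echo flag
  else echo unknown   # range not covered by the thresholds stated above
  fi
}

for v in 22.5.0 22.17.1 22.18.0 23.6.0 24.0.0; do
  printf '%-8s %s\n' "$v" "$(strip_types_mode "$v")"
done
```

A reviewer applying the shebang bullet would only raise the missing-flag concern for versions that land in the `flag` bucket.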

Reviewer blind-spot map after this run:

| Reviewer | Catches | Misses |
| --- | --- | --- |
| Me (parent claude, reading my own diff) | Structural / framing issues | Off-by-one in counts; subtle factual claims |
| Codex (gpt-5.5, xhigh, training-time knowledge) | Reasoning errors; logical inconsistencies | External-tool / runtime version-specific facts that shifted post-training |
| Copilot CLI (with live web fetch) | Knowledge-currency on external tools, libraries, runtimes | (TBD; likely structural / cross-file refactor patterns) |

Concrete validation that the ensemble pulls its weight: three different reviewers caught three non-overlapping sets of issues on this 50-line documentation PR. The dc99b8d fix was already wrong within a Node release cycle by the time it was written; Copilot's web fetch made the difference between shipping a stale bullet and shipping a current one.

Going to merge this only after one more full-ensemble run to confirm no fourth round of findings.

@willgriffin
Contributor Author

Cycle converged after 7 rounds. 🎉

Round-by-round summary

| Round | Codex | Copilot | Fix | Commit |
| --- | --- | --- | --- | --- |
| Sequential pre-cycle (me) | | | Merge regex accounting + SHA freshness | a91dd07 |
| Sequential pre-cycle (codex one-shot) | 2 | | Regex counts + Node TS examples | dc99b8d |
| Sequential pre-cycle (copilot one-shot) | | 2 | setup-node cache + Node 22.18 threshold | 5588133 |
| Round 1 (ensemble) | 1 | 0 | transform-types option omitted | 3ce6be7 |
| Round 2 (ensemble) | 1 | 1 (same finding) | transform-types wrongly attributed to TSX/decorators | 28e9a5f |
| Round 3 (ensemble) | 1 | 0 | Node 25/26 type-strip ambiguity | 4a4bbea |
| Round 4 (ensemble) | 0 | 2 | Structural: shebang in wrong section + SHA-pin scope | d65931d |
| Round 5 (ensemble) | 1 | 0 | Node 23.0-23.5 strip-types window | 5afbb6d |
| Round 6 (ensemble) | 0 | 0 | (CONVERGED) | |

What the cycle actually surfaced

  • 9 fixes across 7 rounds on a documentation diff that started at 37 lines. Without the loop, I would have shipped at round 1 (the sequential pre-cycle phase) with the round-1 ensemble finding still latent.
  • Asymmetric convergence: copilot tends to converge first (rounds 1, 3, 5 had it at 0). Codex catches narrow factual edge cases via deep verification. Copilot catches structural / placement issues.
  • The three reviewers have non-overlapping blind spots, validated 7 times:
    • Me (parent claude reading own diff): catches structural framing on first read; confirmation bias on factual claims I just wrote
    • Codex (xhigh reasoning, training-time knowledge): deep deliberation on consistency; misses post-training facts
    • Copilot CLI (xhigh + live web fetch): knowledge currency; structural / placement issues
  • The fix from round 1 was itself wrong in a way that needed round 2 to catch (transform-types claim about TSX/decorators). The fix from round 2 was also slightly wrong (round 3 caught the ambiguity). This is exactly the failure mode /review-cycle is designed to catch: fixes can introduce new bugs.

Operational lessons (folding into have-config#5)

  1. Never stop after a fix without a verify round. A fix-round followed by a clean-round is convergence. A fix-round alone is not. I had been doing one-shot rounds — that's how I missed the round-1 ensemble finding.
  2. Default cap of 3 rounds is too low for documentation PRs. Took 7 here. Cap should be advisory, not hard.
  3. Run all reviewers in parallel on the same commit — not sequentially against each other's fixes. Sequential lets findings cascade and obscures whether reviewers actually agree.
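
The three lessons above can be read as a loop shape. The sketch below assumes bash and uses scripted stand-ins for the reviewers, so it shows only the control flow, not the actual /have:review-cycle command; `review_codex`, `review_copilot`, `run_cycle`, and the finding counts are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins: each reviewer prints its finding count for a round.
# Counts are scripted so the loop can be exercised offline.
codex_counts=(1 1 0)
copilot_counts=(0 1 0)
review_codex()   { echo "${codex_counts[$1]:-0}"; }
review_copilot() { echo "${copilot_counts[$1]:-0}"; }

run_cycle() {
  local advisory_cap=$1 round=0 tmp total
  tmp=$(mktemp -d)
  while :; do
    # Both reviewers start against the same state and run concurrently.
    review_codex   "$round" > "$tmp/codex"   &
    review_copilot "$round" > "$tmp/copilot" &
    wait
    total=$(( $(cat "$tmp/codex") + $(cat "$tmp/copilot") ))
    round=$((round + 1))
    echo "round $round: $total finding(s)"
    if [ "$total" -eq 0 ]; then
      # Only a clean verify round ends the cycle.
      echo "converged after $round round(s)"
      break
    fi
    if [ "$round" -ge "$advisory_cap" ]; then
      echo "advisory cap $advisory_cap reached; fixes still unverified"
    fi
    # A fix would be committed here, which forces another round.
  done
  rm -rf "$tmp"
}

run_cycle 2
```

The cap only prints a notice and never breaks the loop, which mirrors lesson 2; in a real setup a human would decide at that point whether to keep going.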

Ready to merge.

willgriffin added a commit to happyvertical/have-config that referenced this pull request May 22, 2026
Real failure: I ran review-cycle on pr-review#3 and "stopped" after
each reviewer's first pass instead of actually looping. When I caught
myself and ran it properly, the cycle took 7 rounds to converge —
catching 9 progressively-narrower factual issues that would have
shipped if I'd stopped early.

The command file already said "Run up to `rounds` review rounds.
Default: 3" and "stop the loop as clean" when no findings remain,
but the wording was loose enough that I rationalized one-shot
behaviour. This commit adds explicit Hard Rules that close that
gap:

1. Each round runs all reviewers in parallel against the SAME
   commit (not sequentially against each other's fixes — that lets
   findings cascade in misleading ways and obscures whether
   reviewers actually agree on the latest state).

2. A fix-round is NEVER the final round. Convergence requires at
   least one round where every reviewer returns 0 findings against
   the latest commit. Just pushed a fix? Run another round before
   declaring clean.

3. Convergence is per-commit, not per-finding. Reviewer A clean
   against commit X doesn't transfer to commit Y (the fix commit).

Also updated:

- Default cap guidance: 3 is right for code; 5-10 for documentation /
  reviewer-checklist content where each round catches narrower
  factual edges (the pr-review#3 cycle was 7 rounds).
- Step 10 now explicit: "If a fix was pushed in this round, the next
  round MUST run."
- Step 11 explicit: "Stop as clean only when a verify round (no
  edits) returns no actionable findings."
- Cap-hit guidance distinguishes three cases: spec too detailed,
  diminishing returns acceptable, genuine gap.

Mirror edit in both claude/ and codex/ command files.

Evidence: see pr-review#3 (happyvertical/pr-review#3)
for the 7-round convergence log with per-round commits, findings
counts, and the asymmetric convergence pattern between codex (catches
narrow factual edges via deep verification) and copilot CLI (catches
structural/placement issues via live web fetch + cross-file grep).