docs(checklist): add hazards from real Copilot catches on have-config + pr-review#3
Conversation
… + pr-review Two waves of Copilot review on this week's PRs surfaced patterns the checklist didn't cover. Folding them in immediately rather than waiting for batched pr-review-tune. Theme 7 (renamed: "Config hazards: dead, surprising, or over-active") gains three new bullets: - Shared config with invisible side effects on consumers. Caught on have-config PR #2: `incremental: true` in a shared tsconfig.base.json silently emits .tsbuildinfo in every consuming tree. Generalizes to any extended/published config (tsconfig, eslint, prettier, vite, biome). - Shared config too narrow for documented use cases. Same PR: tsconfig-base README claimed SvelteKit support but lib was ES2022 only (no DOM). Either narrow the docs, add a variant, or require explicit override. - README install instructions that contradict package.json. Same PR: README told consumers to install three packages that were already optionalDependencies. Causes duplicate installs / version skew. Theme 8 (Infra/deploy hazards) gains three new bullets from pr-review PR #2: - GitHub Actions workflow-command injection from user-controlled content. `::error::<commit-subject>` without escaping `%` `\r` `\n` lets a crafted commit message inject workflow commands or spoof logs. Real security issue, not theoretical. - `echo "$user_input" | grep` parses dashes as flags. Subjects starting with `-n`/`-e` get treated as echo options. Use `printf '%s\n'`. - GitHub Actions permissions broader than the workflow needs. pull-requests: read on a workflow that only does git log + event payload reads is dead scope. Least privilege. All five patterns are concrete with file/line citations from the source PRs. Should reduce time spent on the same class of issue in future shared-config and workflow PRs.
There was a problem hiding this comment.
Pull request overview
Updates the shared PR-review checklist to incorporate concrete “hazard patterns” observed in recent reviews, expanding the config section to cover both dead config and config with surprising downstream effects, and adding new infra/deploy safety checks.
Changes:
- Renames Theme 7 and adds guidance on shared-config hazards (invisible side effects, overly narrow base configs, README install mismatches).
- Adds new Theme 8 bullets covering GitHub Actions workflow-command injection, safer shell printing patterns, and least-privilege workflow permissions.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…llet Two Copilot catches on PR #3: 1. The escape examples were written as pseudo-sed (`s//\%/%25/` etc.) which is malformed and unhelpful. Replaced with the actual bash parameter-expansion helper from the commitlint.yml workflow that triggered this checklist update — concrete, copy-pasteable, and correct. Also linked GitHub's workflow-commands docs as the authoritative source for the encoding requirements. 2. Bullet title attributed the dash-parsing problem to the whole `echo "$user_input" | grep` pipeline; the issue is actually `echo` alone (grep just sees empty stdin). Renamed the bullet to "`echo "$user_input"` treats leading `-n`/`-e`/`-E` as flags" and expanded the body to explain the downstream silent-empty-input failure mode, which is the more dangerous symptom.
Pattern catch-rate audit on the 17 Copilot findings from have-config #2, pr-review #2, pr-review #3, and have-config #4 showed that the post-#3 checklist catches ~9 of 17 strongly and ~3 borderline. Four additional patterns surfaced repeatedly without dedicated bullets; adding them brings the catch rate to ~14 of 17 (~82%), leaving 3 genuinely uncatchable findings (one Copilot false positive + two meta-writing-quality issues). New bullets: Theme 7 (Config hazards): - Engine/version constraints looser than what the lockfile needs. Catches the recurring `engines.node: ">=20"` vs lockfile-requires- 20.19 pattern plus its sibling on `actions/setup-node node-version`, packageManager, .nvmrc, Docker tags, CI tool pins. Theme 8 (Infra/deploy hazards): - Third-party GitHub Actions pinned to moving tags instead of commit SHAs. Real supply-chain risk; vendored anytown agents already SHA-pin. Includes the readable `# v5` comment pattern and link to GitHub's hardening docs. - Extended the existing "Interpolated shell variables into psql/sed/ perl substitutions without escaping" bullet with a sibling bullet covering shell-escape / regex visual-vs-parsed ambiguity (the `Merge\ ` two-char issue, `'\n'` vs `$'\n'`, BRE vs ERE vs PCRE flavors). Theme 11 (Mechanical): - Shebang interpreter doesn't match the file's syntax. Catches the `#!/usr/bin/env node` on TypeScript file pattern from have-config #4 and generalizes to python version mismatch and `#!/bin/sh` bashisms. Out of scope: - Library-version-specific quirks like the disableTypeChecked array-vs-object spread (Copilot was wrong about that; doesn't generalize). - Meta writing-quality concerns (don't ship pseudo-code as examples) — fits Theme 6 in spirit but adding a separate bullet would dilute the theme. Skip until it shows up again.
|
Folded 4 more patterns from the catch-rate audit (commit
Net effect: catch rate on the last 17 Copilot findings moves from ~53% (current main) to ~82% (with #3 as-merged). Remaining ~18% is genuinely uncatchable — one Copilot false positive about library shape + two meta-writing-quality issues in the checklist's own prose. |
|
Ran [medium] The shell-escape bullet's [low] The SHA-pin bullet used a concrete SHA ( Nice meta moment — the checklist applied to its own modifications caught two real issues. The calibration loop closing on itself works. |
|
Ran the multi-tool review-cycle (codex + claude -p + Copilot) on this PR. Codex (gpt-5.5, xhigh) caught two real factual issues my own pass and Copilot's both missed. Fixed in commit [low — factual accuracy] Shell-escape bullet's regex counts were wrong both as source text (8/9) and as match length (6/7). Codex's fix is cleaner — drop the numeric framing entirely and describe what the regex matches in plain English ("Merge followed by two spaces" vs "...one space"). No counts to get wrong. [low — outdated] Shebang bullet said Node "can't parse TypeScript natively" with Operational notes from the run:
This is the second consecutive PR where the calibration loop caught real factual bugs in checklist bullets describing the patterns those bugs exemplify. Worth treating the bullet text itself as code subject to the same checklist. |
|
Third reviewer (Copilot CLI) ran successfully after the org policy was enabled. Caught two more factual issues that neither my own pass nor codex flagged — both required current external knowledge that Copilot got via live web fetch: [medium] [medium] Node TS native-strip threshold was outdated. I said "Node 22.6+ (with flag), 23.6+/24 (default)". Per Node's current docs (which Copilot fetched at https://nodejs.org/api/typescript.html), type stripping is default-enabled in 22.18+ as well as 23.6+/24. The flag-required window is 22.6 through 22.17. On a project pinned to 22.18 LTS, the old wording would have generated false-positive findings recommending an unnecessary flag. Both fixed in commit Reviewer blind-spot map after this run:
Concrete validation that the ensemble pulls its weight: three different reviewers caught three non-overlapping sets of issues on this 50-line documentation PR. The Going to merge this only after one more full-ensemble run to confirm no fourth round of findings. |
|
Cycle converged after 7 rounds. 🎉 Round-by-round summary
What the cycle actually surfaced
Operational lessons (folding into have-config#5)
Ready to merge. |
Real failure: I ran review-cycle on pr-review#3 and "stopped" after each reviewer's first pass instead of actually looping. When I caught myself and ran it properly, the cycle took 7 rounds to converge — catching 9 progressively-narrower factual issues that would have shipped if I'd stopped early. The command file already said "Run up to `rounds` review rounds. Default: 3" and "stop the loop as clean" when no findings remain, but the wording was loose enough that I rationalized one-shot behaviour. This commit adds explicit Hard Rules that close that gap: 1. Each round runs all reviewers in parallel against the SAME commit (not sequentially against each other's fixes — that lets findings cascade in misleading ways and obscures whether reviewers actually agree on the latest state). 2. A fix-round is NEVER the final round. Convergence requires at least one round where every reviewer returns 0 findings against the latest commit. Just pushed a fix? Run another round before declaring clean. 3. Convergence is per-commit, not per-finding. Reviewer A clean against commit X doesn't transfer to commit Y (the fix commit). Also updated: - Default cap guidance: 3 is right for code; 5-10 for documentation / reviewer-checklist content where each round catches narrower factual edges (the pr-review#3 cycle was 7 rounds). - Step 10 now explicit: "If a fix was pushed in this round, the next round MUST run." - Step 11 explicit: "Stop as clean only when a verify round (no edits) returns no actionable findings." - Cap-hit guidance distinguishes three cases: spec too detailed, diminishing returns acceptable, genuine gap. Mirror edit in both claude/ and codex/ command files. Evidence: see pr-review#3 (happyvertical/pr-review#3) for the 7-round convergence log with per-round commits, findings counts, and the asymmetric convergence pattern between codex (catches narrow factual edges via deep verification) and copilot CLI (catches structural/placement issues via live web fetch + cross-file grep).
Real failure: I ran review-cycle on pr-review#3 and "stopped" after each reviewer's first pass instead of actually looping. When I caught myself and ran it properly, the cycle took 7 rounds to converge — catching 9 progressively-narrower factual issues that would have shipped if I'd stopped early. The command file already said "Run up to `rounds` review rounds. Default: 3" and "stop the loop as clean" when no findings remain, but the wording was loose enough that I rationalized one-shot behaviour. This commit adds explicit Hard Rules that close that gap: 1. Each round runs all reviewers in parallel against the SAME commit (not sequentially against each other's fixes — that lets findings cascade in misleading ways and obscures whether reviewers actually agree on the latest state). 2. A fix-round is NEVER the final round. Convergence requires at least one round where every reviewer returns 0 findings against the latest commit. Just pushed a fix? Run another round before declaring clean. 3. Convergence is per-commit, not per-finding. Reviewer A clean against commit X doesn't transfer to commit Y (the fix commit). Also updated: - Default cap guidance: 3 is right for code; 5-10 for documentation / reviewer-checklist content where each round catches narrower factual edges (the pr-review#3 cycle was 7 rounds). - Step 10 now explicit: "If a fix was pushed in this round, the next round MUST run." - Step 11 explicit: "Stop as clean only when a verify round (no edits) returns no actionable findings." - Cap-hit guidance distinguishes three cases: spec too detailed, diminishing returns acceptable, genuine gap. Mirror edit in both claude/ and codex/ command files. Evidence: see pr-review#3 (happyvertical/pr-review#3) for the 7-round convergence log with per-round commits, findings counts, and the asymmetric convergence pattern between codex (catches narrow factual edges via deep verification) and copilot CLI (catches structural/placement issues via live web fetch + cross-file grep).
Real failure: I ran review-cycle on pr-review#3 and "stopped" after each reviewer's first pass instead of actually looping. When I caught myself and ran it properly, the cycle took 7 rounds to converge — catching 9 progressively-narrower factual issues that would have shipped if I'd stopped early. The command file already said "Run up to `rounds` review rounds. Default: 3" and "stop the loop as clean" when no findings remain, but the wording was loose enough that I rationalized one-shot behaviour. This commit adds explicit Hard Rules that close that gap: 1. Each round runs all reviewers in parallel against the SAME commit (not sequentially against each other's fixes — that lets findings cascade in misleading ways and obscures whether reviewers actually agree on the latest state). 2. A fix-round is NEVER the final round. Convergence requires at least one round where every reviewer returns 0 findings against the latest commit. Just pushed a fix? Run another round before declaring clean. 3. Convergence is per-commit, not per-finding. Reviewer A clean against commit X doesn't transfer to commit Y (the fix commit). Also updated: - Default cap guidance: 3 is right for code; 5-10 for documentation / reviewer-checklist content where each round catches narrower factual edges (the pr-review#3 cycle was 7 rounds). - Step 10 now explicit: "If a fix was pushed in this round, the next round MUST run." - Step 11 explicit: "Stop as clean only when a verify round (no edits) returns no actionable findings." - Cap-hit guidance distinguishes three cases: spec too detailed, diminishing returns acceptable, genuine gap. Mirror edit in both claude/ and codex/ command files. Evidence: see pr-review#3 (happyvertical/pr-review#3) for the 7-round convergence log with per-round commits, findings counts, and the asymmetric convergence pattern between codex (catches narrow factual edges via deep verification) and copilot CLI (catches structural/placement issues via live web fetch + cross-file grep).
Real failure: I ran review-cycle on pr-review#3 and "stopped" after each reviewer's first pass instead of actually looping. When I caught myself and ran it properly, the cycle took 7 rounds to converge — catching 9 progressively-narrower factual issues that would have shipped if I'd stopped early. The command file already said "Run up to `rounds` review rounds. Default: 3" and "stop the loop as clean" when no findings remain, but the wording was loose enough that I rationalized one-shot behaviour. This commit adds explicit Hard Rules that close that gap: 1. Each round runs all reviewers in parallel against the SAME commit (not sequentially against each other's fixes — that lets findings cascade in misleading ways and obscures whether reviewers actually agree on the latest state). 2. A fix-round is NEVER the final round. Convergence requires at least one round where every reviewer returns 0 findings against the latest commit. Just pushed a fix? Run another round before declaring clean. 3. Convergence is per-commit, not per-finding. Reviewer A clean against commit X doesn't transfer to commit Y (the fix commit). Also updated: - Default cap guidance: 3 is right for code; 5-10 for documentation / reviewer-checklist content where each round catches narrower factual edges (the pr-review#3 cycle was 7 rounds). - Step 10 now explicit: "If a fix was pushed in this round, the next round MUST run." - Step 11 explicit: "Stop as clean only when a verify round (no edits) returns no actionable findings." - Cap-hit guidance distinguishes three cases: spec too detailed, diminishing returns acceptable, genuine gap. Mirror edit in both claude/ and codex/ command files. Evidence: see pr-review#3 (happyvertical/pr-review#3) for the 7-round convergence log with per-round commits, findings counts, and the asymmetric convergence pattern between codex (catches narrow factual edges via deep verification) and copilot CLI (catches structural/placement issues via live web fetch + cross-file grep).
Summary
Folds five patterns Copilot caught on this week's PRs (have-config #2 + pr-review #2) into the shared checklist. Acting on the calibration insight immediately rather than batching for
pr-review-tune— each pattern is concrete with file/line citation from the source PR, no ambiguity.New bullets
Theme 7 renamed from "Dead config & unwired parameters" → "Config hazards: dead, surprising, or over-active". The original "missing consumer" framing missed the inverse problem (config that has invisible side effects on consumers), which is just as common in shared-config / monorepo / base-config setups.
Three new bullets:
incremental: truein shared tsconfig → silent.tsbuildinfofiles everywhere)Theme 8 (Infra/deploy hazards) gains three new bullets:
%/\r/\n)echo \"\$user_input\" | grepparsing dashes as flags (useprintf '%s\n')Why now, not via
pr-review-tuneThe tune loop is built for batch refinement when many findings need triage and dedup. These five patterns are unambiguous — each one is a real bug or DX hazard with a clear fix and a real example. Waiting to batch them just delays the value.
Test plan