Skip to content

ci(changelog-review): refactor to triage-only + skip-if-exists#1302

Merged
laurigates merged 1 commit into
mainfrom
ci/changelog-review-triage-only
May 11, 2026
Merged

ci(changelog-review): refactor to triage-only + skip-if-exists#1302
laurigates merged 1 commit into
mainfrom
ci/changelog-review-triage-only

Conversation

@laurigates
Copy link
Copy Markdown
Owner

@laurigates laurigates commented May 11, 2026

Summary

Restructures the changelog-review workflow so a 62-version gap no longer blows the turn budget. The previous design tried to do triage AND rule edits in a single Claude invocation; with the 2.1.76 → 2.1.138 gap that pattern failed in run 25651584541 at turn 51/50 after 2h 7min. (Recovery of that run's actual rule-edit work is in #1301.)

Why it failed (data)

The SDK reports total_cost_usd (notional, not billed under Max 20x — but still a clean proxy for token volume since it's computed from token counts). Per-call token counts aren't exposed at the result level, only the aggregate.

Comparison against this workflow's recent weekly runs:

Run Date Turns Duration Notional $ Notes
2026-05-04 success 41 8m 31s $2.01 Normal weekly run
2026-04-27 success 4 47m $57.45 Big-context first-hit; 4 huge turns
2026-04-20 max_turns 41 36m $39.18 Same failure mode, smaller gap
2026-04-15 success 20 1m 30s $0.40 No-op week
2026-04-06 success 9 11m $6.18 Modest gap
2026-05-11 max_turns 51 2h 7m $63.69 The recovered run

For reference, a typical claude-code-review.yml PR review run is $0.50–$1.00 with 15-20 turns. The changelog-review workflow on a fresh gap is 30–100× that, dominated by context size on first hit (4 turns / $57 in one run) rather than turn count alone.

What changed

Triage-only phase (this workflow):

  • Read the excerpt + the JSON tracking file
  • Open ONE tracking issue summarizing relevant changes grouped by candidate rule file
  • Ratchet .claude-code-version-check.json
  • Open one JSON-only PR
  • No rule edits. Bounded to --max-turns 25 with sonnet.

Apply phase (delegated to existing claude.yml):

  • User mentions @claude please update <rule-file> for issue #N on the tracking issue
  • Each apply is small-scope with its own turn budget — single rule file at a time

Skip-if-exists check:

  • Before invoking Claude, query gh issue list --label changelog-review --state open for a title matching the current range. If found, skip. workflow_dispatch + force_full_review=true overrides.
  • Makes the workflow safe to re-run and idempotent across cron triggers.

Expanded additional_permissions:

  • Adds Bash(gh api *), Bash(gh label *), Bash(jq *), Bash(grep *), Bash(awk *), Bash(wc *), Bash(cat *), Bash(date *), etc.
  • The failed run logged 201 permission denials in 51 turns — ~4 wasted turns per usable turn just retrying denied commands. These additions cover the patterns the agent reached for.

Prompt hardening:

  • Strict step-by-step triage flow with explicit non-goals ("DO NOT edit rule files")
  • Final sanity check: confirm exactly one commit, exactly one file changed, no rule-file edits, before finishing the turn
  • Explicit warning about the previous run's self-misdiagnosis loop ("if you think a branch is empty, re-check with git diff/git log before destructive recovery")
  • Excerpt truncated to 200KB if larger — keeps the prompt cache predictable

Why this design over alternatives

Option Why not
Matrix: one job per version 62 jobs is noisy; most versions are pure bug fixes (e.g. 2.1.137 is one bullet)
Sub-agents / agent team within one run Doesn't multiply turn budget — Task/SendMessage count as turns from the parent loop. Only separate jobs get independent budgets.
Keep monolithic single-PR design Will fail again on the next gap
Iterative-with-checkpoint Possible, but the issue/apply split matches the historical pattern (#78, #786, #854, #909, #949) and lets human review gate each apply

Test plan

  • Validate YAML parse + actionlint
  • Dry-run via workflow_dispatch with force_full_review=true AFTER docs(rules): apply Claude Code changelog review 2.1.76 → 2.1.138 #1301 merges (tracked version will be 2.1.138, latest will be whatever's current — small or no gap, ideal for dry-running the new shape)
  • Verify the triage issue is created with grouped per-rule-file sections
  • Verify the JSON-only PR contains exactly one file change
  • Run a second dispatch (without force) and confirm skip-if-exists fires

Follow-up (not blocking)

  • Consider extracting the workflow into laurigates/.github as reusable-changelog-review.yml once the new shape is proven — per .claude/rules/ci-cd-workflows.md we prefer reusable workflows for cross-repo automation, though this one is plugin-collection specific so reuse value is currently low.
  • The show_full_output: false setting in anthropics/claude-code-action keeps the per-call token counts hidden. If we ever want to track token-level efficiency over time, we could either flip that on for the triage job specifically, or scrape the OpenTelemetry log events that the SDK emits (richer detail than the summary result).

🤖 Drafted with Claude Code

The previous design tried to do triage AND rule edits in a single Claude
invocation. With a 62-version gap (2.1.76 → 2.1.138) this failed in run
25651584541 after burning $63.69 and 2h 7min:

- 201 permission denials (4 per turn) bleeding the budget on commands
  that weren't in additional_permissions
- The agent shipped a correct PR (#1299) mid-run, misdiagnosed it as an
  empty branch 72 seconds later, closed it, then spent ~75 minutes
  thrashing to "recreate" it before hitting max-turns at turn 51/50

Restructure splits the work into two phases:

1. Triage (this workflow): open ONE tracking issue summarizing relevant
   changes grouped by candidate rule file, ratchet
   .claude-code-version-check.json, open a JSON-only PR. No rule edits.
   Bounded budget: --max-turns 25.

2. Apply (existing claude.yml @-mention responder): user mentions
   @claude on the tracking issue to drive targeted rule edits. Each
   apply is small-scope and gets its own turn budget.

Other changes:
- Skip-if-exists check: before running Claude, look for an open issue
  matching "Review Claude Code changelog: <tracked> → <latest>". If
  found, skip. force_full_review=true overrides.
- Expand additional_permissions to cover gh api/label, jq, grep, awk,
  cat, wc, date — eliminates the per-turn denial bleed observed in
  the failed run.
- Truncate excerpts above 200KB to keep prompt cache predictable.
- Replace open-ended rule-edit prompt with strict triage steps and a
  final sanity-check (one commit, one file, no rule edits) so the
  agent fails fast on misdiagnosis instead of looping.

The recovery of the failed run's actual work is in PR #1301; this PR
fixes the workflow so the next big gap doesn't repeat the failure.
@laurigates laurigates added github-actions Pull requests that update GitHub Actions code maintenance labels May 11, 2026
@laurigates laurigates self-assigned this May 11, 2026
laurigates added a commit that referenced this pull request May 11, 2026
## Summary

Recovers the rule updates produced by run
[25651584541](https://github.com/laurigates/claude-plugins/actions/runs/25651584541),
which hit `error_max_turns` after running for 2h 7min and 51/50 turns.
The work itself was correct — the bot misdiagnosed its own PR #1299 as
an empty branch 72 seconds after opening it, then spent the remaining
~75 minutes thrashing to re-create what it had already shipped, and hit
max-turns. This PR salvages commit `e58aed8c` as a single clean commit
so the documentation work isn't wasted.

## Recovery scope

- **`.claude-code-version-check.json`** — `lastCheckedVersion` →
`2.1.138`, consolidated `reviewedChanges` entry covering 2.1.77 →
2.1.138
- **`hooks-reference.md`** — `effort.level` in Common Fields (2.1.133+),
`duration_ms` on PostToolUse (2.1.119+), new "PostToolUse — Replace Tool
Output" section with `updatedToolOutput` (2.1.121+)
- **`skill-development.md`** — `${CLAUDE_EFFORT}` string substitution
row (2.1.120+)
- **`agent-development.md`** — `--agent` mode hooks/mcpServers note
(2.1.116+), `worktree.baseRef` setting table (2.1.133+), subagent
Skill-tool discovery note (2.1.133+)
- **`plugin-structure.md`** — `bin/` directory entry, `experimental`
block in Optional Fields (2.1.129+), `skills` entry fix note (2.1.136+)
- **`agentic-permissions.md`** — `autoMode.hard_deny` subsection
(2.1.136+)
- **`sandbox-guidance.md`** — `CLAUDE_CODE_SESSION_ID` env var
(2.1.132+), `sandbox.network.deniedDomains` (2.1.113+)

## Run resource usage

The repo is on Claude Max 20x, so the SDK's notional cost figure isn't
actually invoiced — but it's still a useful workload-size signal because
it's computed directly from token counts. The SDK result line for the
failed run:

```
"subtype": "error_max_turns",
"duration_ms": 7611491,        # 2h 7min
"num_turns": 51,
"total_cost_usd": 63.69354,    # notional, not billed under Max 20x
"permission_denials_count": 201
```

For comparison, the same workflow's prior weekly runs:

| Run | Date | Conclusion | Turns | Duration | Notional $ |
|---|---|---|---|---|---|
|
[25315527872](https://github.com/laurigates/claude-plugins/actions/runs/25315527872)
| 2026-05-04 | success | 41 | 8m 31s | $2.01 |
|
[24991465607](https://github.com/laurigates/claude-plugins/actions/runs/24991465607)
| 2026-04-27 | success | 4 | 47m 13s | $57.45 |
|
[24662476905](https://github.com/laurigates/claude-plugins/actions/runs/24662476905)
| 2026-04-20 | max_turns | 41 | 35m 51s | $39.18 |
|
[24453356948](https://github.com/laurigates/claude-plugins/actions/runs/24453356948)
| 2026-04-15 | success | 20 | 1m 30s | $0.40 |
|
[24027905006](https://github.com/laurigates/claude-plugins/actions/runs/24027905006)
| 2026-04-06 | success | 9 | 11m 02s | $6.18 |
|
[23739805721](https://github.com/laurigates/claude-plugins/actions/runs/23739805721)
| 2026-03-30 | success | 38 | 7m 10s | $1.47 |
|
[25651584541](https://github.com/laurigates/claude-plugins/actions/runs/25651584541)
| 2026-05-11 | **max_turns** | **51** | **2h 7m** | **$63.69** |

The failed run's notional cost is ~60-100× a typical PR review run
($0.50–$1) and ~30× a normal weekly changelog-review tick. Worth noting:
per-call token counts are in the SDK's hidden full output stream — only
the aggregate cost/turn/duration is exposed in the result event. The
4-turn / $57 outlier on 2026-04-27 suggests the cost is dominated by
context size on first hit after a long gap, not by turn count alone.

## Test plan

- [ ] Spot-check each rule edit against the upstream changelog bullet
that motivated it
- [ ] Verify `.claude-code-version-check.json` parses and ratchets
`lastCheckedVersion` to `2.1.138`
- [ ] Confirm no placeholder/template artifacts remain (the failed run's
16 commits included several "fix placeholder" iterations; the final tree
at `e58aed8c` was checked clean before salvage)

## Follow-up

Workflow restructure to triage-as-issues + skip-if-exists is in #1302 to
prevent the next 60-version backlog from blowing the budget again.

🤖 Recovered with [Claude Code](https://claude.com/claude-code)
@laurigates
Copy link
Copy Markdown
Owner Author

@claude let's create the follow-up issues

@laurigates laurigates merged commit 257b69c into main May 11, 2026
3 checks passed
@laurigates laurigates deleted the ci/changelog-review-triage-only branch May 11, 2026 12:03
@claude
Copy link
Copy Markdown

claude Bot commented May 11, 2026

Claude finished @laurigates's task in 3m 34s —— View job


Follow-up issues for PR #1302

  • Gather context from PR description
  • Review the workflow file for additional follow-up candidates
  • Create follow-up issues

Created 3 issues:

Issue Title Source
#1303 ci(changelog-review): validate new triage-only workflow shape PR test plan (unchecked items)
#1304 ci(changelog-review): extract as reusable workflow once shape is proven PR follow-up item 1
#1305 ci(changelog-review): track per-call token efficiency over time PR follow-up item 2

#1303 covers the unverified test plan items (actionlint, dry-run dispatch, skip-if-exists) — these need a live run now that the tracked version is at 2.1.138.

#1304 and #1305 are the two explicit "not blocking" follow-ups from the PR description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

github-actions Pull requests that update GitHub Actions code maintenance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant