forked from steveyegge/gastown
fix(shutdown): Improve gastown shutdown reliability #10
Open
sauerdaniel wants to merge 44 commits into main from polecat/organic-mkabz4tm
sauerdaniel added a commit that referenced this pull request Jan 12, 2026
Changed from nuclear=true to nuclear=false when polecats self-destruct via gt done. The nuclear flag bypasses ALL safety checks including the cleanup_status field that was added as part of ZFC #10 to prevent accidental work loss. Now polecats will validate their self-reported cleanup_status before removing themselves, consistent with how the witness handler handles cleanup. Fixes steveyegge#360
Three related fixes for polecat lifecycle management:

1. Push branch to origin before self-nuke (done.go)
   - Ensures work is preserved on remote before worktree cleanup
   - Prevents orphaned local-only branches
2. Respect cleanup_status in selfNukePolecat (done.go)
   - Changed nuclear=true to nuclear=false
   - Validates cleanup_status before removal
   - Prevents destruction with uncommitted/unpushed work
3. Respawn done polecats with hooked work (manager.go, handlers.go)
   - loadFromBeads now checks hook_bead field
   - Added FindPolecatsWithHookedWork() and RespawnPolecatWithHookedWork()
   - Witness can auto-respawn polecats that have pending work

Fixes steveyegge#360
Two fixes for daemon-managed agent startup:

1. Boot watchdog CLAUDE.md creation (boot.go, templates.go)
   - Add CreateBootCLAUDEmd function to templates package
   - Add EnsureCLAUDEmd method to create context before session spawn
   - Enables Boot to perform intelligent triage decisions
2. Deacon startup auto-execution (deacon.go)
   - Execute gt prime directly via SendKeys instead of nudge message
   - Prevents text appearing in prompt area without execution
   - Fixes endless restart loop in Claude Code v2.1.4+

1. Attach new patrol wisp to hook for autonomous continuation
   - Ensures witness continues patrol after session restart
2. Add --hook flag to SessionStart hooks in createPatrolHooks
   - Properly signals hook attachment during session creation
After polecats push their work branches to origin before self-nuke, the refinery was only deleting local branches after merge, leaving stale remote branches accumulating. Added remote branch deletion in handleSuccess and handleSuccessFromQueue to clean up both local and remote copies after successful merge.
When convoy leg beads complete, they now record output_path metadata so synthesis workflows can discover and aggregate outputs without hunting through worktrees or guessing branch names.

Changes:
- formula.go: parse output section, include output_path in leg descriptions
- convoy.go: add Description field to issueDetails for metadata parsing
- synthesis.go: parse output_path from leg descriptions with template fallback

Fixes steveyegge#303

Adds support for configuring a separate push URL (fork) when the upstream repository is read-only. This allows polecats to push to a personal fork while still pulling from the upstream repository.

Changes:
- Added PushURL field to RigConfig and Rig struct
- Added PushURL to AddRigOptions
- Added ConfigurePushURL function to git package
- Configure push URL in bare repo when PushURL is set

Usage:
  gt rig add --git-url=https://github.com/upstream/repo \
    --push-url=https://github.com/user/fork \
    myrig
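One way a separate push URL can be set on a bare repo is git's `remote set-url --push`, which leaves the fetch URL untouched. The sketch below is illustrative only: the function names `pushURLArgs` and `configurePushURL`, the remote name `origin`, and the example path are assumptions, not the PR's actual ConfigurePushURL implementation.

```go
package main

import (
	"fmt"
	"os/exec"
)

// pushURLArgs builds the git arguments that set a push-only URL on the
// "origin" remote of the repository at repoDir. Fetches keep using the
// remote's regular URL. (Hypothetical helper; "origin" is an assumption.)
func pushURLArgs(repoDir, pushURL string) []string {
	return []string{"-C", repoDir, "remote", "set-url", "--push", "origin", pushURL}
}

// configurePushURL runs git with those arguments and includes git's
// combined output in the error to make failures easier to diagnose.
func configurePushURL(repoDir, pushURL string) error {
	out, err := exec.Command("git", pushURLArgs(repoDir, pushURL)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("git remote set-url --push: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Only prints the command line; call configurePushURL to apply it.
	// The path is a placeholder.
	fmt.Println(pushURLArgs("/path/to/rig/.repo.git", "https://github.com/user/fork"))
}
```

Keeping argument construction separate from execution makes the behavior checkable without a real repository.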
Add Community section with link to Discord server for real-time support and collaboration. Fixes steveyegge#305
…/witness-improvements', 'pr/refinery-branch-cleanup', 'pr/synthesis-output-metadata', 'pr/push-url-config' and 'pr/discord-link'
The post-startup nudges were arriving before Claude Code's input was ready, causing only the Enter key to make it through (empty input).

Changes:
- Pass "gt prime" as CLI argument to Claude Code startup command
- Remove unreliable post-startup nudges and timing delays
- The SessionStart hook provides a backup propulsion mechanism

The CLI prompt approach is more reliable because the prompt is queued before Claude even starts, avoiding timing issues entirely.

Fixes: gt-x7p3
The boot role was added but the test expectation wasn't updated, causing TestRoleNames to fail. Fixes: gt-j7wl
Apply the same fix as Mayor (d509f7c) to Deacon, Witness, Refinery, and Polecat. Post-startup nudges arrive before Claude Code's input is ready, causing only the Enter key to make it through (empty input).

Changes for each agent:
- Pass "gt prime" as CLI argument to startup command
- Remove unreliable post-startup nudges and timing delays
- Keep SessionStart hook as backup propulsion mechanism

The CLI prompt approach is more reliable because the prompt is queued before Claude even starts, avoiding timing issues entirely.

Fixes: gt-mghw

Boot agent was getting wrong settings template due to:
1. RoleTypeFor() missing "boot" - fell through to Interactive
2. spawnTmux() not calling EnsureSettingsForRole()

Add "boot" to autonomous roles list and call EnsureSettingsForRole() in spawnTmux() to create proper .claude/settings.json for Boot.

Fixes: gt-hnjp
Adds per-agent-type health tracking to the Mayor's tmux statusline, showing working/idle counts for Polecats, Witnesses, Refineries, and Deacon. All agent types are always displayed, even when no agents of that type are running (shows as '0/0 😺'). Format: active: 4/4 😺 6/10 👁️ 7/10 🏭 1/1 ⛪
- Abbreviate long rig names (design_forge→df, gastown→gt, etc.)
- Update tests for new abbreviations
- Addresses issue hq-dn15

- Add AgentCrew to tracked agent types in mayor statusline
- Show 👷 icon for crew agents
- Display crew count in statusline (e.g., 👷1/5)
- Removes crew from skip filter so they're properly tracked

Fixes issue where crew agents were not shown in statusline.
Active rigs now appear first (alphabetically), followed by parked/docked rigs (also alphabetically). This makes it easier to see which rigs are operational at a glance.
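The ordering described above (active rigs first, then parked/docked, each group alphabetical) can be sketched with a two-key comparison. The `rigStatus` type and `sortRigs` name are hypothetical and stand in for whatever the statusline code actually uses.

```go
package main

import (
	"fmt"
	"sort"
)

// rigStatus is a hypothetical stand-in for a rig entry in the statusline.
type rigStatus struct {
	Name   string
	Active bool // false stands for parked or docked
}

// sortRigs orders active rigs before inactive ones, and sorts each group
// alphabetically by name.
func sortRigs(rigs []rigStatus) {
	sort.Slice(rigs, func(i, j int) bool {
		if rigs[i].Active != rigs[j].Active {
			return rigs[i].Active // active sorts first
		}
		return rigs[i].Name < rigs[j].Name
	})
}

func main() {
	rigs := []rigStatus{{"zeta", false}, {"gastown", true}, {"alpha", false}, {"beads", true}}
	sortRigs(rigs)
	for _, r := range rigs {
		fmt.Println(r.Name, r.Active)
	}
}
```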
Move dynamic status content from status-right to status-left to utilize available space and prevent rig name truncation.

- SetStatusFormat: Now sets status-right with compact identity
- SetDynamicStatus: Now sets status-left with dynamic content
- Increased status-left-length to 150 for more space
- Removed time from dynamic status (was %H:%M)

Fixes hq-s1il

The issue describes Mayor not monitoring convoys, but the root cause is that Deacon's patrol loop never called the existing infrastructure (gt convoy stranded + mol-convoy-feed). This implements the daemon-driven convoy progression approach (suggested option #1 in the issue).

Changes:
- Added feed-stranded-convoys step to mol-deacon-patrol formula
- Deacon now runs gt convoy stranded --json each patrol cycle
- For each stranded convoy, dispatches mol-convoy-feed dog
- Updated dependency chain
- Bumped formula version from 8 to 9

- Removed space between counts and emojis (e.g., "3 😺" → "3😺")
- Removed space between emojis and counts/subjects (e.g., "📬 3" → "📬3")
- Removed space between hook emoji and text (e.g., "🪝 work" → "🪝work")
IsClaudeRunning was calling IsAgentRunning (which calls GetPaneCommand), then immediately calling GetPaneCommand again. This duplicate subprocess call was slowing down gt startup and daemon heartbeat operations. Changed IsAgentRunning to return (bool, string) - the running status and the pane command it checked. IsClaudeRunning now reuses the command instead of making a redundant tmux subprocess call. Fixes gt-kpii: zombie session detection slows gt up
The notifyRecipient function was using NudgeSession which sends notifications to the input buffer. Changed to use SendNotificationBanner which displays the banner in the message history using echo. This fixes the issue where notification banners appeared in Claude Code's input buffer instead of in the conversation history. Fixes hq-nc9mr Replaces: hq-1qhj
Previously, the witness statusline only showed the crew count when it was greater than 0. Now all agent types (polecats 😺 and crew 👷) are always displayed, even when their count is 0.
For Claude Code sessions, mail notifications now use NudgeSession instead of SendNotificationBanner. This ensures notifications appear in the message history rather than being injected into the input buffer. Fixes: hq-1qhj
Changed the statusline format from "1/10😺" to "😺1/10" to match the documented format in the comment. This ensures the icon appears before the working/total counts for all agent types.
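For reference, the icon-first format with no separating space can be produced by a one-line formatter. `formatAgentCount` is a hypothetical name used only for this sketch.

```go
package main

import "fmt"

// formatAgentCount renders the icon followed immediately by working/total,
// for example "😺1/10". (Hypothetical helper name.)
func formatAgentCount(icon string, working, total int) string {
	return fmt.Sprintf("%s%d/%d", icon, working, total)
}

func main() {
	fmt.Println(formatAgentCount("😺", 1, 10), formatAgentCount("👷", 0, 5))
}
```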
The warning when processes respawn after 'gt down --all' now includes more comprehensive troubleshooting guidance, including checking gt status and mentioning that the gt daemon itself could be the cause.
After the polecat self-nuke fix, branches are now pushed to origin before the polecat's worktree is deleted. The refinery was only deleting local branches after merge, leaving stale remote branches.

Fix: Updated handleSuccess and handleSuccessFromQueue to also delete the remote branch from origin after deleting the local branch.

Related to: hq-nju99, GitHub issue steveyegge#359

Boot was designed as an ephemeral triage agent that runs on each daemon tick, observes Deacon's state, and exits. However, Boot was getting stuck at interactive prompts after completing triage, which prevented the daemon from spawning fresh Boot instances.

Fix: Create CLAUDE.md for Boot that instructs it to:
1. Check Deacon status and heartbeat
2. Take action if needed (nudge/restart Deacon)
3. Exit immediately using `tmux kill-session -t gt-boot`

This ensures Boot functions as designed - ephemeral watchdog that runs triage and exits, allowing the daemon to spawn fresh Boot instances on each heartbeat.

Related: hq-6p7g4

Fixes hq-lglmw

When gt sling assigns work to a polecat, it now automatically attaches the mol-polecat-work molecule to the polecat's agent bead.

Changes:
- Added attachPolecatWorkMolecule() function that cooks the formula and attaches the molecule to the polecat's agent bead
- Added molecule attachment call after hooking work (single sling mode)
- Added molecule attachment call after hooking work (batch sling mode)
- Implementation is idempotent (checks if already attached)
- Non-blocking: logs warnings but doesn't fail sling operation
Issue steveyegge#197: Polecat fails to hook when slinging a bead with a molecule to a rig. Root cause: attachPolecatWorkMolecule was running 'bd cook' from the polecat's worktree (which doesn't have a .beads directory) instead of from the rig directory where the bead database lives. Fix: Use beads.ResolveHookDir() to resolve the correct rig directory for running bd commands, consistent with how the hook command works.
The sling command needs to handle .repo.git symlinks correctly for polecat spawning across all rigs. Related: hq-dp3ss
Fixes bug where work slung to 'done' polecats (no active tmux session) would never get processed. Now when gt sling resolves an existing polecat target and finds no active session, it spawns a fresh polecat instead of failing or leaving the work stuck. This addresses hq-50u3h: 43+ stale convoys were not progressing because polecats in 'done' state had work hooked to them but weren't processing it.
The health tracking loop in runMayorStatusLine was counting all agents regardless of whether their rig was registered in rigs.json. This caused count discrepancies when sessions existed for unregistered rigs. Now the health tracking loop applies the same registeredRigs filter that the earlier rig status loop uses, ensuring consistent counts across all statusline displays. Fixes hq-auhq
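The filter described above amounts to skipping sessions whose rig is not in the registered set before counting. In this sketch the `agentSession` type and `countAgents` function are hypothetical; only the idea of a registeredRigs filter comes from the commit message.

```go
package main

import "fmt"

// agentSession is a hypothetical, simplified view of a running agent.
type agentSession struct {
	Rig     string
	Working bool
}

// countAgents counts working and total agents, ignoring sessions whose rig
// is not registered (for example, not listed in rigs.json).
func countAgents(sessions []agentSession, registered map[string]bool) (working, total int) {
	for _, s := range sessions {
		if !registered[s.Rig] {
			continue // session belongs to an unregistered rig
		}
		total++
		if s.Working {
			working++
		}
	}
	return working, total
}

func main() {
	registered := map[string]bool{"gastown": true}
	sessions := []agentSession{{"gastown", true}, {"gastown", false}, {"oldrig", true}}
	w, t := countAgents(sessions, registered)
	fmt.Printf("%d/%d\n", w, t) // prints 1/2
}
```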
Boot was designed to be a watchdog that runs on daemon ticks and manages Deacon lifecycle, but it wasn't functioning because Boot's CLAUDE.md context file was missing from the boot directory.

Changes:
- Add CreateBootCLAUDEmd function to templates package
- Add EnsureCLAUDEmd method to Boot to create CLAUDE.md from template
- Update spawnTmux to call EnsureCLAUDEmd before creating session
- Add "boot" to RoleNames list

This ensures Boot has proper context when spawned by the daemon, enabling it to perform intelligent triage (start/wake/nudge/interrupt decisions) instead of running without instructions.

Fixes: hq-6p7g4

Fixes steveyegge#210 - Creating a convoy as mayor results in prefix mismatch

The town-level beads database is initialized with issue_prefix=hq, but convoy creation was generating IDs with hq-cv- prefix, causing bd create to fail with prefix mismatch error.

Changed convoy ID generation from hq-cv-<hash> to hq-<hash>. Convoys are distinguished by type=convoy attribute, not by special ID prefix.

Comprehensive research on media processing optimization covering:
- Performance bottleneck analysis (I/O, CPU, memory)
- Parallel processing strategies (pipeline, data, hybrid)
- Multi-layer caching architecture (Redis + local SSD)
- Format optimization matrix and codec comparisons
- Cost reduction opportunities (40-60% estimated savings)
- 6-week proof of concept implementation plan
- Recommended technology stack and code examples

Deliverables complete: Performance audit, optimization recommendations, PoC plan.

Implements GitHub issue steveyegge#220 - Worktree setup hook for injecting local configurations.

When polecats are spawned, their worktrees are created from the rig's repo. Previously, there was no way to inject custom configurations during this process.

Now users can place executable hooks in <rig>/.runtime/setup-hooks/ to run custom scripts during worktree creation:

  rig/
    .runtime/
      setup-hooks/
        01-git-config.sh    <- Inject git config
        02-copy-secrets.sh  <- Copy secrets
        99-finalize.sh      <- Final setup

Features:
- Hooks execute in alphabetical order
- Non-executable files are skipped with a warning
- Hooks run with worktree as working directory
- Environment variables: GT_WORKTREE_PATH, GT_RIG_PATH
- Hook failures are non-fatal (warn but continue)

Example hook to inject git config:

  #!/bin/sh
  git config --local user.signingkey ~/.ssh/key.asc
  git config --local commit.gpgsign true

Related to: hq-fq2zg, GitHub issue steveyegge#220
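The hook behavior listed above (alphabetical order, skip non-executables with a warning, worktree as working directory, non-fatal failures) could look roughly like the following. This is a sketch under assumptions: `listSetupHooks` and `runSetupHooks` are hypothetical names, and the executable check relies on Unix permission bits.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// listSetupHooks returns the executable files in hooksDir in alphabetical
// order. A missing directory yields no hooks and no error.
func listSetupHooks(hooksDir string) ([]string, error) {
	entries, err := os.ReadDir(hooksDir) // ReadDir sorts entries by filename
	if os.IsNotExist(err) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	var hooks []string
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := e.Info()
		if err != nil {
			return nil, err
		}
		if info.Mode()&0o111 == 0 { // no execute bit set
			fmt.Fprintf(os.Stderr, "warning: skipping non-executable hook %s\n", e.Name())
			continue
		}
		hooks = append(hooks, filepath.Join(hooksDir, e.Name()))
	}
	return hooks, nil
}

// runSetupHooks runs each hook with the worktree as working directory.
// Hook failures are logged as warnings and do not stop later hooks.
func runSetupHooks(rigPath, worktreePath string) error {
	// Use an absolute path: a relative command path would otherwise be
	// resolved against cmd.Dir.
	hooksDir, err := filepath.Abs(filepath.Join(rigPath, ".runtime", "setup-hooks"))
	if err != nil {
		return err
	}
	hooks, err := listSetupHooks(hooksDir)
	if err != nil {
		return err
	}
	for _, hook := range hooks {
		cmd := exec.Command(hook)
		cmd.Dir = worktreePath
		cmd.Env = append(os.Environ(),
			"GT_WORKTREE_PATH="+worktreePath,
			"GT_RIG_PATH="+rigPath)
		if out, err := cmd.CombinedOutput(); err != nil {
			fmt.Fprintf(os.Stderr, "warning: hook %s failed: %v\n%s", hook, err, out)
		}
	}
	return nil
}

func main() {
	// Runs any hooks found under ./.runtime/setup-hooks; a no-op if absent.
	if err := runSetupHooks(".", "."); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```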
Fixes steveyegge#291 - gastown is very hard to kill/shutdown/stop

Changes:
- Add shutdown coordination: daemon checks shutdown.lock and skips heartbeat auto-restarts during shutdown to prevent fighting shutdown
- Extend grace period from 100ms to 30 seconds for graceful session exit
- Add polling to detect when sessions exit gracefully before force kill
- Add orphaned Claude/node process detection in shutdown verification

The daemon's heartbeat now checks for shutdown.lock (created by gt down) and skips auto-restart logic when shutdown is in progress. This prevents the daemon from restarting agents that were intentionally killed during shutdown.

Sessions now receive Ctrl-C and have up to 30 seconds to exit cleanly, with polling every 500ms to detect graceful exit. Only sessions that don't exit within the grace period are force-killed.

Shutdown verification now includes detection of orphaned Claude/node processes that may be left behind when tmux sessions are killed but child processes don't terminate.

The sling refactor (cd2de6e) split the 1560-line sling.go into 7 focused modules, but left duplicate function declarations in the original file. This commit removes the duplicates, keeping only the implementations in the split files.

Also fixes related build issues:
- Remove unused claude import from boot/boot.go
- Fix IsAgentRunning() calls to handle multiple return values
- Fix atomic operation on startedAny counter in start.go
- Remove duplicate health tracking code in statusline.go
- Add missing imports (strings, config) to sling.go
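The two mechanisms in this change, the shutdown.lock check and the grace-period polling, can be sketched as below. This is not the PR's code: `shutdownInProgress` and `waitForExit` are hypothetical names, the lock file's directory is an assumption, and the session liveness check is passed in as a function so the sketch runs without tmux.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// shutdownInProgress reports whether a shutdown.lock file exists in dir.
// Where the real lock file lives is an assumption of this sketch.
func shutdownInProgress(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, "shutdown.lock"))
	return err == nil
}

// waitForExit polls alive every poll interval until it returns false or the
// grace period elapses. It returns true if the session exited on its own,
// false if the caller should force-kill it.
func waitForExit(alive func() bool, grace, poll time.Duration) bool {
	deadline := time.Now().Add(grace)
	for {
		if !alive() {
			return true
		}
		if !time.Now().Before(deadline) {
			return false
		}
		time.Sleep(poll)
	}
}

func main() {
	if shutdownInProgress(".") {
		fmt.Println("shutdown in progress: skipping auto-restart")
		return
	}
	// Simulated session that is gone on the third check.
	checks := 0
	alive := func() bool { checks++; return checks < 3 }
	if waitForExit(alive, 30*time.Second, 500*time.Millisecond) {
		fmt.Println("session exited gracefully")
	} else {
		fmt.Println("grace period expired: force-kill the session")
	}
}
```

Polling against a deadline lets a session that exits quickly return early, so the 30-second grace period is only paid in full by sessions that hang.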
f26d421 to eea3230
a67da82 to 60ed204
Summary
Test plan