fix(scripts): add signal handlers to nemoclaw-start.sh for graceful container shutdown#1024
Conversation
📝 WalkthroughWalkthroughThe entrypoint script now captures the auto-pair watcher's PID, defines a Changes
Sequence Diagram(s)sequenceDiagram
participant Entrypoint as Entrypoint script (PID 1)
participant Gateway as Gateway process
participant AutoPair as Auto-pair watcher
Entrypoint->>Gateway: start gateway (capture GATEWAY_PID)
Entrypoint->>AutoPair: start auto-pair watcher in background (capture AUTO_PAIR_PID)
note right of Entrypoint: register trap for SIGTERM/SIGINT
rect rgba(0,128,0,0.5)
External->>Entrypoint: SIGTERM / SIGINT
end
Entrypoint->>Gateway: forward SIGTERM/SIGINT
Entrypoint->>AutoPair: forward SIGTERM/SIGINT (if set)
Entrypoint->>Entrypoint: wait for Gateway and AutoPair to exit
Gateway-->>Entrypoint: exit
AutoPair-->>Entrypoint: exit
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)
250-255: Consider extracting the duplicatedcleanup()function.The
cleanup()function is defined identically in both the non-root (lines 250-254) and root (lines 311-315) code paths. Consider defining it once near the top of the script (afterstart_auto_pairbut before the branching logic), or as a helper function that can be reused.Also applies to: 311-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 250 - 255, The cleanup() function is duplicated in both branches; extract it into a single reusable function placed after start_auto_pair and before the root vs non-root branching so both paths call the same cleanup. Remove the duplicate definitions at the later branch locations, ensure the single cleanup still references GATEWAY_PID and AUTO_PAIR_PID (and uses kill -TERM and wait "$GATEWAY_PID"), and keep the trap registration (trap cleanup SIGTERM SIGINT) once after the unified definition so signal handling remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 311-320: The cleanup() function currently only waits for
GATEWAY_PID and returns (causing the later wait "$GATEWAY_PID" to fail/reap
twice and leaving AUTO_PAIR_PID un-waited); update cleanup() to kill both
"$GATEWAY_PID" and "${AUTO_PAIR_PID:-}" (as already done), then wait for both
PIDs (conditionally if AUTO_PAIR_PID is set), capture the exit code of the
gateway wait into a variable, and call exit with that status from inside
cleanup() so the script doesn't perform a second wait afterwards; reference
GATEWAY_PID, AUTO_PAIR_PID, and cleanup to locate and change the logic
accordingly.
- Around line 250-258: The trap handler cleanup() currently reaps GATEWAY_PID
and doesn't wait for AUTO_PAIR_PID, and the later wait "$GATEWAY_PID" can fail
and overwrite the real exit code; modify cleanup() to forward the signal to both
"$GATEWAY_PID" and "${AUTO_PAIR_PID:-}", then wait for each child (use
conditional waits if AUTO_PAIR_PID may be empty), capture their exit statuses
into variables (e.g., gateway_status and autopair_status), and exit the script
from cleanup() with the gateway_status so the gateway's real exit code is
preserved; remove or guard the standalone wait "$GATEWAY_PID" + exit $? at the
end to avoid double-waiting.
---
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 250-255: The cleanup() function is duplicated in both branches;
extract it into a single reusable function placed after start_auto_pair and
before the root vs non-root branching so both paths call the same cleanup.
Remove the duplicate definitions at the later branch locations, ensure the
single cleanup still references GATEWAY_PID and AUTO_PAIR_PID (and uses kill
-TERM and wait "$GATEWAY_PID"), and keep the trap registration (trap cleanup
SIGTERM SIGINT) once after the unified definition so signal handling remains
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 70f002ed-6a4b-4377-a208-153e8039af56
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
…ontainer shutdown The entrypoint runs as PID 1 but had no trap for SIGTERM/SIGINT. On `docker stop`, wait was interrupted and child processes (gateway, auto-pair watcher) were orphaned until Docker sent SIGKILL after the grace period. - Capture auto-pair watcher PID in AUTO_PAIR_PID - Register trap that forwards SIGTERM to gateway and auto-pair - Apply to both root and non-root code paths Fixes NVIDIA#1015
b6173b9 to
85b8f4e
Compare
…e-wait Address code review feedback: - Wait for AUTO_PAIR_PID in cleanup() to prevent zombie - Exit from within cleanup() to avoid double-waiting on GATEWAY_PID which would fail under set -e and lose the real exit status
There was a problem hiding this comment.
♻️ Duplicate comments (1)
scripts/nemoclaw-start.sh (1)
247-255:⚠️ Potential issue | 🟠 MajorExit from
cleanup()after reaping both children.Because
set -euo pipefailis active, returning fromcleanup()to the surroundingwaitat Line 257 / Line 320 is still risky: the shell can propagate the interrupted wait status instead of the gateway's actual exit code, andAUTO_PAIR_PIDis never waited on at all. Please capture the gateway status insidecleanup(), wait the watcher too, andexitfrom the trap so PID 1 owns the full shutdown path.🐛 Proposed fix pattern for both
cleanup()blockscleanup() { echo "[gateway] received signal, forwarding to children..." - kill -TERM "$GATEWAY_PID" "${AUTO_PAIR_PID:-}" 2>/dev/null - wait "$GATEWAY_PID" 2>/dev/null + local gateway_status=0 + kill -TERM "$GATEWAY_PID" 2>/dev/null || true + if [ -n "${AUTO_PAIR_PID:-}" ]; then + kill -TERM "$AUTO_PAIR_PID" 2>/dev/null || true + fi + wait "$GATEWAY_PID" 2>/dev/null || gateway_status=$? + if [ -n "${AUTO_PAIR_PID:-}" ]; then + wait "$AUTO_PAIR_PID" 2>/dev/null || true + fi + exit "$gateway_status" }Run this read-only check to confirm the current trap/wait shape before updating both branches. Expected result: both
cleanup()blocks showwait "$GATEWAY_PID", there is nowait "$AUTO_PAIR_PID", and the later foregroundwait "$GATEWAY_PID"is still present.#!/bin/bash printf '== cleanup / wait sites ==\n' rg -n -C2 'cleanup\(\)|trap cleanup|AUTO_PAIR_PID|wait "\$GATEWAY_PID"' scripts/nemoclaw-start.sh printf '\n== bash wait builtin ==\n' bash -lc 'help wait' | sed -n '1,25p'Also applies to: 308-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 247 - 255, The cleanup() trap functions currently forward SIGTERM/SIGINT but do not capture the gateway's exit status, fail to wait on AUTO_PAIR_PID, and return to the outer wait (causing set -euo pipefail to propagate an interrupted wait status); update both cleanup() definitions to: store the exit code of the gateway process (GATEWAY_PID) after killing it, explicitly wait for AUTO_PAIR_PID if set, then exit with the saved gateway status (use exit <saved_status>) so PID 1 performs the full shutdown path; ensure trap cleanup SIGTERM SIGINT remains and remove/replace any later standalone wait "$GATEWAY_PID" that would be bypassed by the trap if necessary.
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)
205-206: Add a regression test for the shutdown path.
test/nemoclaw-start.test.js:10-18currently only checks the non-root nohup/log regexes, so neithercleanup()nor the new PID tracking is exercised. A focused test that starts the entrypoint, sendsSIGTERM, and asserts both child processes are signaled/reaped would make this fix much harder to regress.Also applies to: 247-255, 308-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/nemoclaw-start.sh` around lines 205 - 206, Add a regression test that exercises the shutdown path by launching the entrypoint script, sending it SIGTERM, and asserting that cleanup() runs and all child watcher PIDs (e.g., AUTO_PAIR_PID and any other watcher PID variables created around the auto-pair/launcher blocks) are signaled and reaped; specifically, the test should spawn the script, wait for the "launched (pid ...)" outputs to capture child PIDs, send SIGTERM to the parent, then verify both that the child processes exit (reaped) and that any PID-tracking logic (the variables like AUTO_PAIR_PID and the cleanup() handler) were invoked/cleared. Ensure the test replaces the current non-root nohup/log regex checks in the existing test file with this focused lifecycle test so the new PID tracking and cleanup path cannot regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 247-255: The cleanup() trap functions currently forward
SIGTERM/SIGINT but do not capture the gateway's exit status, fail to wait on
AUTO_PAIR_PID, and return to the outer wait (causing set -euo pipefail to
propagate an interrupted wait status); update both cleanup() definitions to:
store the exit code of the gateway process (GATEWAY_PID) after killing it,
explicitly wait for AUTO_PAIR_PID if set, then exit with the saved gateway
status (use exit <saved_status>) so PID 1 performs the full shutdown path;
ensure trap cleanup SIGTERM SIGINT remains and remove/replace any later
standalone wait "$GATEWAY_PID" that would be bypassed by the trap if necessary.
---
Nitpick comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 205-206: Add a regression test that exercises the shutdown path by
launching the entrypoint script, sending it SIGTERM, and asserting that
cleanup() runs and all child watcher PIDs (e.g., AUTO_PAIR_PID and any other
watcher PID variables created around the auto-pair/launcher blocks) are signaled
and reaped; specifically, the test should spawn the script, wait for the
"launched (pid ...)" outputs to capture child PIDs, send SIGTERM to the parent,
then verify both that the child processes exit (reaped) and that any
PID-tracking logic (the variables like AUTO_PAIR_PID and the cleanup() handler)
were invoked/cleared. Ensure the test replaces the current non-root nohup/log
regex checks in the existing test file with this focused lifecycle test so the
new PID tracking and cleanup path cannot regress.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 78dc414f-b793-41c6-9acb-90be152abfb2
📒 Files selected for processing (1)
scripts/nemoclaw-start.sh
Summary
The entrypoint (
scripts/nemoclaw-start.sh) runs as PID 1 inside the sandbox container but had no signal handlers. Ondocker stopornemoclaw <name> destroy, SIGTERM interruptedwaitand child processes (gateway, auto-pair watcher) were orphaned until Docker sent SIGKILL after the grace period.This adds a trap that forwards SIGTERM/SIGINT to both child processes for graceful shutdown.
Related Issue
Fixes #1015
Changes
AUTO_PAIR_PID(previously only echo'd, never stored)cleanup()trap that forwards SIGTERM to gateway and auto-pair processes${AUTO_PAIR_PID:-}for safe expansion if auto-pair wasn't startedType of Change
Testing
npx prek run --all-filespasses (or equivalentlymake check).npm testpasses.make docsbuilds without warnings. (for doc-only changes)Manual verification:
docker stop <container>should now show[gateway] received signal, forwarding to children...in logsChecklist
General
Code Changes
npx prek run --all-filesauto-fixes formatting (ormake formatfor targeted runs).Summary by CodeRabbit