fix(scripts): add signal handlers to nemoclaw-start.sh for graceful container shutdown #1024

Open
latenighthackathon wants to merge 2 commits into NVIDIA:main from latenighthackathon:fix/signal-handler-entrypoint

Conversation

@latenighthackathon latenighthackathon commented Mar 27, 2026

Summary

The entrypoint (scripts/nemoclaw-start.sh) runs as PID 1 inside the sandbox container but had no signal handlers. On docker stop or nemoclaw <name> destroy, SIGTERM interrupted wait and child processes (gateway, auto-pair watcher) were orphaned until Docker sent SIGKILL after the grace period.

This adds a trap that forwards SIGTERM/SIGINT to both child processes for graceful shutdown.

Related Issue

Fixes #1015

Changes

  • Capture the auto-pair watcher PID in AUTO_PAIR_PID (previously it was only echoed, never stored)
  • Register cleanup() trap that forwards SIGTERM to gateway and auto-pair processes
  • Apply to both root and non-root code paths
  • Use ${AUTO_PAIR_PID:-} for safe expansion if auto-pair wasn't started
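
The guarded expansion in the last bullet matters because the script runs with set -u. A minimal sketch of the difference, assuming bash; the variable names guarded and bare_status are illustrative and do not come from the PR:

```shell
#!/usr/bin/env bash
# Sketch only: shows why "${AUTO_PAIR_PID:-}" is safe under `set -u`.
set -u

unset AUTO_PAIR_PID   # simulate a run where the watcher was never started

# Guarded expansion: yields an empty string instead of aborting.
guarded="${AUTO_PAIR_PID:-}"

# Bare expansion: under `set -u` the subshell aborts with an
# "unbound variable" error, so its exit status is nonzero.
( : "$AUTO_PAIR_PID" ) 2>/dev/null
bare_status=$?

echo "guarded='${guarded}' bare_status=${bare_status}"
```

The bare expansion is run in a subshell here only so that the failure can be observed without ending the demo itself.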

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Manual verification:

  • docker stop <container> should now show [gateway] received signal, forwarding to children... in logs
  • Gateway process should receive SIGTERM and exit cleanly before Docker's grace period expires
  • Auto-pair watcher should also be terminated
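
The same checks can be approximated locally without Docker. The sketch below is an assumption-laden stand-in, not the real entrypoint: sleep plays the role of the gateway and the auto-pair watcher, and the temporary file names (entrypoint.sh, pids, log) are made up for the demo. It sends SIGTERM to the toy entrypoint and then checks that the log line appeared and that both children are gone.

```shell
#!/usr/bin/env bash
# Sketch only: a toy entrypoint with `sleep` stand-ins, driven by a small harness.
set -u

workdir=$(mktemp -d)

cat >"$workdir/entrypoint.sh" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

sleep 30 &                 # stand-in for the gateway
GATEWAY_PID=$!
sleep 30 &                 # stand-in for the auto-pair watcher
AUTO_PAIR_PID=$!

cleanup() {
  echo "[gateway] received signal, forwarding to children..."
  kill -TERM "$GATEWAY_PID" "$AUTO_PAIR_PID" 2>/dev/null || true
  wait "$GATEWAY_PID" "$AUTO_PAIR_PID" 2>/dev/null || true
  exit 0
}
trap cleanup SIGTERM SIGINT

echo "$GATEWAY_PID $AUTO_PAIR_PID" >"$1"   # publish the PIDs for the harness
status=0
wait "$GATEWAY_PID" || status=$?
exit "$status"
EOF

bash "$workdir/entrypoint.sh" "$workdir/pids" >"$workdir/log" 2>&1 &
entry_pid=$!

# Wait until the entrypoint has installed its trap and published the PIDs.
until [ -s "$workdir/pids" ]; do sleep 0.1; done
read -r gateway_pid auto_pair_pid <"$workdir/pids"

kill -TERM "$entry_pid"    # the signal `docker stop` delivers first
wait "$entry_pid"

# Count the forwarding message, then report whether the children survived.
forwarded=$(grep -c 'received signal, forwarding to children' "$workdir/log")
echo "forwarding messages logged: $forwarded"
kill -0 "$gateway_pid" 2>/dev/null && echo "gateway stand-in still alive"
kill -0 "$auto_pair_pid" 2>/dev/null && echo "watcher stand-in still alive"
rm -rf "$workdir"
```

Against a real container, the equivalent check is the one listed above: run docker stop and look for the forwarding message in the container logs.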

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Summary by CodeRabbit

  • Bug Fixes
    • Improved shutdown handling so termination signals are reliably forwarded to background processes.
    • Ensures background tasks and gateway services are cleanly stopped on exit to prevent orphaned processes.
    • More robust wait-and-cleanup behavior during shutdown reduces lingering resources and enables more predictable restarts.

Contributor

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

The entrypoint script now captures the auto-pair watcher's PID, defines a cleanup() function, and registers trap cleanup SIGTERM SIGINT in both root and non-root paths; the handler forwards SIGTERM/SIGINT to the gateway and auto-pair processes and waits for their termination.

Changes

Cohort / File(s) | Summary
Signal handling and shutdown (scripts/nemoclaw-start.sh) | Capture and store auto-pair watcher PID in AUTO_PAIR_PID; introduce cleanup() that forwards SIGTERM/SIGINT to the gateway and auto-pair PIDs and waits for them; install trap cleanup SIGTERM SIGINT in both non-root and root execution paths to prevent orphaned children.

Sequence Diagram(s)

sequenceDiagram
  participant Entrypoint as Entrypoint script (PID 1)
  participant Gateway as Gateway process
  participant AutoPair as Auto-pair watcher

  Entrypoint->>Gateway: start gateway (capture GATEWAY_PID)
  Entrypoint->>AutoPair: start auto-pair watcher in background (capture AUTO_PAIR_PID)
  note right of Entrypoint: register trap for SIGTERM/SIGINT

  rect rgba(0,128,0,0.5)
  External->>Entrypoint: SIGTERM / SIGINT
  end

  Entrypoint->>Gateway: forward SIGTERM/SIGINT
  Entrypoint->>AutoPair: forward SIGTERM/SIGINT (if set)
  Entrypoint->>Entrypoint: wait for Gateway and AutoPair to exit
  Gateway-->>Entrypoint: exit
  AutoPair-->>Entrypoint: exit

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I'm a rabbit on the start-up mat,
I catch the signals—just like that.
I tap the gateway, calm the pair,
then wait till both are safe and fair. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically describes the main change: adding signal handlers to the nemoclaw-start.sh script for graceful container shutdown.
Linked Issues check | ✅ Passed | The PR fully implements all coding requirements from issue #1015: captures auto-pair PID in AUTO_PAIR_PID variable, registers trap handler for SIGTERM/SIGINT, forwards signals to both gateway and auto-pair processes, applies cleanup to both root and non-root paths, and waits for graceful shutdown.
Out of Scope Changes check | ✅ Passed | All changes in scripts/nemoclaw-start.sh are directly scoped to implementing signal handlers and graceful shutdown as specified in issue #1015, with no extraneous modifications.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)

250-255: Consider extracting the duplicated cleanup() function.

The cleanup() function is defined identically in both the non-root (lines 250-254) and root (lines 311-315) code paths. Consider defining it once near the top of the script (after start_auto_pair but before the branching logic), or as a helper function that can be reused.

Also applies to: 311-316
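
One way the extraction could look is sketched below, assuming bash. The cleanup() body follows the pattern discussed in this review (forward SIGTERM, wait for both children, exit with the gateway's status); run_path is a hypothetical stand-in for the root and non-root branches, and sleep stands in for the real processes.

```shell
#!/usr/bin/env bash
# Sketch only: cleanup() is defined once and shared by both execution paths.
set -uo pipefail

cleanup() {
  local gateway_status=0
  echo "[gateway] received signal, forwarding to children..."
  kill -TERM "$GATEWAY_PID" 2>/dev/null || true
  if [ -n "${AUTO_PAIR_PID:-}" ]; then
    kill -TERM "$AUTO_PAIR_PID" 2>/dev/null || true
  fi
  wait "$GATEWAY_PID" 2>/dev/null || gateway_status=$?
  if [ -n "${AUTO_PAIR_PID:-}" ]; then
    wait "$AUTO_PAIR_PID" 2>/dev/null || true
  fi
  exit "$gateway_status"
}

# Hypothetical stand-in for one branch of the entrypoint; the subshell
# plays the role of PID 1 so the demo can inspect its exit status.
run_path() {
  (
    echo "path: $1"
    sleep 30 & GATEWAY_PID=$!      # stand-in for the gateway
    sleep 30 & AUTO_PAIR_PID=$!    # stand-in for the auto-pair watcher
    trap cleanup SIGTERM SIGINT    # same handler on every path
    kill -TERM "$BASHPID"          # simulate the signal from `docker stop`
    wait "$GATEWAY_PID"
  )
}

non_root_status=0; run_path non-root || non_root_status=$?
root_status=0;     run_path root     || root_status=$?
echo "non-root: $non_root_status, root: $root_status"
```

Because the handler reads GATEWAY_PID and AUTO_PAIR_PID at the time the signal arrives, it can be defined before either branch has started its children.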

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/nemoclaw-start.sh` around lines 250 - 255, The cleanup() function is
duplicated in both branches; extract it into a single reusable function placed
after start_auto_pair and before the root vs non-root branching so both paths
call the same cleanup. Remove the duplicate definitions at the later branch
locations, ensure the single cleanup still references GATEWAY_PID and
AUTO_PAIR_PID (and uses kill -TERM and wait "$GATEWAY_PID"), and keep the trap
registration (trap cleanup SIGTERM SIGINT) once after the unified definition so
signal handling remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/nemoclaw-start.sh`:
- Around line 311-320: The cleanup() function currently only waits for
GATEWAY_PID and returns (causing the later wait "$GATEWAY_PID" to fail/reap
twice and leaving AUTO_PAIR_PID un-waited); update cleanup() to kill both
"$GATEWAY_PID" and "${AUTO_PAIR_PID:-}" (as already done), then wait for both
PIDs (conditionally if AUTO_PAIR_PID is set), capture the exit code of the
gateway wait into a variable, and call exit with that status from inside
cleanup() so the script doesn't perform a second wait afterwards; reference
GATEWAY_PID, AUTO_PAIR_PID, and cleanup to locate and change the logic
accordingly.
- Around line 250-258: The trap handler cleanup() currently reaps GATEWAY_PID
and doesn't wait for AUTO_PAIR_PID, and the later wait "$GATEWAY_PID" can fail
and overwrite the real exit code; modify cleanup() to forward the signal to both
"$GATEWAY_PID" and "${AUTO_PAIR_PID:-}", then wait for each child (use
conditional waits if AUTO_PAIR_PID may be empty), capture their exit statuses
into variables (e.g., gateway_status and autopair_status), and exit the script
from cleanup() with the gateway_status so the gateway's real exit code is
preserved; remove or guard the standalone wait "$GATEWAY_PID" + exit $? at the
end to avoid double-waiting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 70f002ed-6a4b-4377-a208-153e8039af56

📥 Commits

Reviewing files that changed from the base of the PR and between 5c269c1 and b6173b9.

📒 Files selected for processing (1)
  • scripts/nemoclaw-start.sh

…ontainer shutdown

The entrypoint runs as PID 1 but had no trap for SIGTERM/SIGINT.
On `docker stop`, wait was interrupted and child processes (gateway,
auto-pair watcher) were orphaned until Docker sent SIGKILL after the
grace period.

- Capture auto-pair watcher PID in AUTO_PAIR_PID
- Register trap that forwards SIGTERM to gateway and auto-pair
- Apply to both root and non-root code paths

Fixes NVIDIA#1015
@latenighthackathon latenighthackathon force-pushed the fix/signal-handler-entrypoint branch from b6173b9 to 85b8f4e Compare March 27, 2026 05:52
…e-wait

Address code review feedback:
- Wait for AUTO_PAIR_PID in cleanup() to prevent zombie
- Exit from within cleanup() to avoid double-waiting on
  GATEWAY_PID which would fail under set -e and lose the
  real exit status
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
scripts/nemoclaw-start.sh (1)

247-255: ⚠️ Potential issue | 🟠 Major

Exit from cleanup() after reaping both children.

Because set -euo pipefail is active, returning from cleanup() to the surrounding wait at Line 257 / Line 320 is still risky: the shell can propagate the interrupted wait status instead of the gateway's actual exit code, and AUTO_PAIR_PID is never waited on at all. Please capture the gateway status inside cleanup(), wait the watcher too, and exit from the trap so PID 1 owns the full shutdown path.

🐛 Proposed fix pattern for both cleanup() blocks
  cleanup() {
    echo "[gateway] received signal, forwarding to children..."
-   kill -TERM "$GATEWAY_PID" "${AUTO_PAIR_PID:-}" 2>/dev/null
-   wait "$GATEWAY_PID" 2>/dev/null
+   local gateway_status=0
+   kill -TERM "$GATEWAY_PID" 2>/dev/null || true
+   if [ -n "${AUTO_PAIR_PID:-}" ]; then
+     kill -TERM "$AUTO_PAIR_PID" 2>/dev/null || true
+   fi
+   wait "$GATEWAY_PID" 2>/dev/null || gateway_status=$?
+   if [ -n "${AUTO_PAIR_PID:-}" ]; then
+     wait "$AUTO_PAIR_PID" 2>/dev/null || true
+   fi
+   exit "$gateway_status"
  }

Run this read-only check to confirm the current trap/wait shape before updating both branches. Expected result: both cleanup() blocks show wait "$GATEWAY_PID", there is no wait "$AUTO_PAIR_PID", and the later foreground wait "$GATEWAY_PID" is still present.

#!/bin/bash
printf '== cleanup / wait sites ==\n'
rg -n -C2 'cleanup\(\)|trap cleanup|AUTO_PAIR_PID|wait "\$GATEWAY_PID"' scripts/nemoclaw-start.sh

printf '\n== bash wait builtin ==\n'
bash -lc 'help wait' | sed -n '1,25p'

Also applies to: 308-316
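
The interrupted-wait behavior behind this comment can be reproduced in isolation. In bash, when a trapped signal arrives during the wait builtin, wait returns immediately with a status greater than 128 and the trap then runs. The sketch below is a standalone demo, not code from the PR; on_term, interrupted_status, and real_status are illustrative names, and the handler deliberately returns instead of exiting.

```shell
#!/usr/bin/env bash
# Sketch only: a handler that returns lets the interrupted wait status leak out.
set -uo pipefail

on_term() { echo "trap ran, returning to the main flow"; }
trap on_term SIGTERM

sleep 2 &                          # stand-in for the gateway; exits 0 on its own
child=$!

( sleep 0.2; kill -TERM "$$" ) &   # deliver SIGTERM while the main shell waits

interrupted_status=0
wait "$child" || interrupted_status=$?   # interrupted by the trapped signal

real_status=0
wait "$child" || real_status=$?          # child was still running; this is its real status

echo "interrupted wait: $interrupted_status, real exit: $real_status"
```

A script that exits with the first status would report a signal-style code even though the child later exits cleanly, which is why the proposed fix captures the gateway status inside cleanup() and exits from there.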

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/nemoclaw-start.sh` around lines 247 - 255, The cleanup() trap
functions currently forward SIGTERM/SIGINT but do not capture the gateway's exit
status, fail to wait on AUTO_PAIR_PID, and return to the outer wait (causing set
-euo pipefail to propagate an interrupted wait status); update both cleanup()
definitions to: store the exit code of the gateway process (GATEWAY_PID) after
killing it, explicitly wait for AUTO_PAIR_PID if set, then exit with the saved
gateway status (use exit <saved_status>) so PID 1 performs the full shutdown
path; ensure trap cleanup SIGTERM SIGINT remains and remove/replace any later
standalone wait "$GATEWAY_PID" that would be bypassed by the trap if necessary.
🧹 Nitpick comments (1)
scripts/nemoclaw-start.sh (1)

205-206: Add a regression test for the shutdown path.

test/nemoclaw-start.test.js:10-18 currently only checks the non-root nohup/log regexes, so neither cleanup() nor the new PID tracking is exercised. A focused test that starts the entrypoint, sends SIGTERM, and asserts both child processes are signaled/reaped would make this fix much harder to regress.

Also applies to: 247-255, 308-316

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/nemoclaw-start.sh` around lines 205 - 206, Add a regression test that
exercises the shutdown path by launching the entrypoint script, sending it
SIGTERM, and asserting that cleanup() runs and all child watcher PIDs (e.g.,
AUTO_PAIR_PID and any other watcher PID variables created around the
auto-pair/launcher blocks) are signaled and reaped; specifically, the test
should spawn the script, wait for the "launched (pid ...)" outputs to capture
child PIDs, send SIGTERM to the parent, then verify both that the child
processes exit (reaped) and that any PID-tracking logic (the variables like
AUTO_PAIR_PID and the cleanup() handler) were invoked/cleared. Ensure the test
replaces the current non-root nohup/log regex checks in the existing test file
with this focused lifecycle test so the new PID tracking and cleanup path cannot
regress.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 78dc414f-b793-41c6-9acb-90be152abfb2

📥 Commits

Reviewing files that changed from the base of the PR and between b6173b9 and 85b8f4e.

📒 Files selected for processing (1)
  • scripts/nemoclaw-start.sh

Author

@latenighthackathon latenighthackathon left a comment

reviewed

Development

Successfully merging this pull request may close these issues.

[NemoClaw] nemoclaw-start.sh has no signal handlers — SIGTERM orphans child processes and skips graceful shutdown

1 participant