Fix/onboard spawnsync timeout #1031

Open
Junior00619 wants to merge 1 commit into NVIDIA:main from Junior00619:fix/onboard-spawnsync-timeout

@Junior00619
Contributor

@Junior00619 Junior00619 commented Mar 27, 2026

Summary

Adds timeout to all spawnSync calls in bin/lib/onboard.js that run
external commands. Previously, if a subprocess hung (DNS stall, unresponsive
endpoint, network timeout beyond curl's --max-time), the onboarding wizard
froze indefinitely with no way to recover other than killing the process.

Related Issue

Fixes #1017

Changes

  • curl-based probes/fetches: timeout: 30_000 (30s) — 5 call sites
  • ollama pull: timeout: 300_000 (5 min) — model downloads can be large
  • install-openshell.sh: timeout: 120_000 (2 min)
  • Timed-out calls check result.signal === "SIGTERM" and return a graceful
    { ok: false, message: "timed out" } error instead of hanging
  • Intentional sleep calls left unchanged (blocking by design)
  • Adds a regression test that statically verifies every non-sleep spawnSync
    call includes a timeout property
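
The timeout handling listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `runWithTimeout` is a hypothetical wrapper, and the child commands are stand-ins that use the current Node binary so the snippet runs without curl or ollama installed. It assumes the default `killSignal` (SIGTERM), which is what the `result.signal === "SIGTERM"` check in the description relies on.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical wrapper (not from the PR): runs a command synchronously and
// turns a timeout kill into a { ok: false, message } result instead of hanging.
function runWithTimeout(command, args, timeoutMs) {
  const result = spawnSync(command, args, { encoding: "utf8", timeout: timeoutMs });

  // When the timeout elapses, Node sends killSignal (SIGTERM by default),
  // so the child has no exit status and result.signal is "SIGTERM".
  // This check comes first because result.error is also set on timeout.
  if (result.status == null && result.signal === "SIGTERM") {
    return { ok: false, message: "timed out" };
  }
  if (result.error || result.status !== 0) {
    const message = result.error ? result.error.message : `exit code ${result.status}`;
    return { ok: false, message };
  }
  return { ok: true, stdout: result.stdout };
}

// A child that would otherwise run for 10 seconds is cut off after 200 ms.
const slow = runWithTimeout(process.execPath, ["-e", "setTimeout(() => {}, 10000)"], 200);
console.log(slow);

// A child that finishes quickly is unaffected by the timeout.
const fast = runWithTimeout(process.execPath, ["-e", "process.stdout.write('hi')"], 5000);
console.log(fast);
```

Returning a result object rather than throwing keeps the caller's control flow the same for timeouts and ordinary failures, which matches the "graceful error" behaviour the description aims for.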

Type of Change

Code change for a new feature, bug fix, or refactor.

Testing

  • npm test — 479 passed, 2 skipped
  • npx prek run --all-files — all hooks passed
  • New test: spawnSync timeout safety (2 assertions)

Summary by CodeRabbit

  • Bug Fixes
    • Added explicit timeouts to remote endpoint probes and model-fetch operations to prevent indefinite hangs; updated failure responses to report timeouts consistently. Increased timeouts for some longer local model operations.
  • Tests
    • Added a test suite that enforces all spawned external operations include timeout settings to avoid hanging.

@coderabbitai
Contributor

coderabbitai bot commented Mar 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3f1c2483-3974-45c3-aa28-029544d4dfb2

📥 Commits

Reviewing files that changed from the base of the PR and between 37ec4f5 and 583a90c.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard.test.js
✅ Files skipped from review due to trivial changes (1)
  • test/onboard.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/onboard.js

📝 Walkthrough

Walkthrough

Adds timeout handling to synchronous shell calls in the onboarding module: introduces an isSpawnTimeout helper and applies timeouts (30s for remote probes/fetches; 300s/120s for local operations). Adjusts per-call control flow to return or record failures on timeout to match existing return shapes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| spawnSync Timeout Protection (bin/lib/onboard.js) | Added isSpawnTimeout(result); added timeout options to spawnSync calls: 30,000ms for remote probe/fetch functions (probeOpenAiLikeEndpoint, probeAnthropicEndpoint, fetchNvidiaEndpointModels, fetchOpenAiLikeModels, fetchAnthropicModels); 300,000ms for pullOllamaModel; 120,000ms for installOpenshell. Updated timeout-handling branches to return or record failures consistent with existing shapes. |
| Timeout Validation Test (test/onboard.test.js) | New Vitest suite that statically scans bin/lib/onboard.js for spawnSync( blocks, asserts at least one non-spawnSync("sleep"... ) block exists, and enforces that each collected non-sleep spawnSync block includes a timeout option; failing assertion reports the first line of the offending block. |
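
As a rough illustration of the static scan described above, the sketch below collects `spawnSync(` blocks from a source string and reports the non-sleep ones that lack a `timeout` option. The helper names and the sample source are made up for this example; the real test reads bin/lib/onboard.js and runs under Vitest, and its exact regexes may differ.

```javascript
// Hypothetical helper: groups each spawnSync( call with the lines that follow
// it, up to the line that closes the call (");" or "});").
function collectSpawnBlocks(source) {
  const lines = source.split("\n");
  const blocks = [];
  for (let i = 0; i < lines.length; i++) {
    if (!/spawnSync\(/.test(lines[i])) continue;
    let text = lines[i];
    // Single-line calls already end with ");"; otherwise scan forward.
    if (!/\);\s*$/.test(lines[i])) {
      for (let j = i + 1; j < lines.length; j++) {
        text += "\n" + lines[j];
        if (/^\s*\}?\);/.test(lines[j])) break;
      }
    }
    blocks.push({ line: i + 1, text });
  }
  return blocks;
}

// Hypothetical helper: non-sleep blocks that have no "timeout:" property.
function findMissingTimeouts(source) {
  return collectSpawnBlocks(source)
    .filter(({ text }) => !/spawnSync\("sleep"/.test(text))
    .filter(({ text }) => !/\btimeout\s*:/.test(text));
}

// Made-up sample source standing in for bin/lib/onboard.js.
const sample = [
  'const a = spawnSync("curl", ["-s", url], {',
  '  encoding: "utf8",',
  "  timeout: 30_000,",
  "});",
  'spawnSync("sleep", ["2"]);',
  'const b = spawnSync("bash", ["install.sh"], {',
  '  stdio: "inherit",',
  "});",
].join("\n");

const offenders = findMissingTimeouts(sample);
console.log(offenders.map(({ line, text }) => `${line}: ${text.split("\n")[0]}`));
```

A text scan like this only checks that the word `timeout` appears in the call's option block; it does not prove the value is sensible, so it works as a regression guard rather than a behavioural test.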

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I found the hangs that froze the trail,
I nudged the clocks so shells won't fail.
Thirty seconds for the distant run,
Longer naps when local work is done,
Hops are safe — the onboarding's hale. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Fix/onboard spawnsync timeout' directly addresses the main change: adding timeout options to spawnSync calls in the onboard.js file to prevent hanging. |
| Linked Issues check | ✅ Passed | The PR fully implements the requirements from issue #1017: timeout options added to all spawnSync calls (curl probes/fetches, ollama pull, OpenShell install) with appropriate durations and proper timeout handling. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the timeout issue: timeout additions to spawnSync calls and a regression test to ensure timeouts are present on non-sleep spawnSync calls. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
scripts/check-markdown-links.sh (1)

20-21: Inconsistency between SKIP_DIRS and find exclusions.

The SKIP_DIRS pattern on line 20 includes __pycache__ and nemoclaw/node_modules, but the find command on lines 117-122 doesn't exclude these directories. This means:

  • Running without arguments: find may return files in __pycache__, but they'll be skipped later by the SKIP_DIRS check
  • This works but is inconsistent and slightly inefficient

Consider aligning the exclusions for clarity:

♻️ Align find exclusions with SKIP_DIRS
 done < <(find "$REPO_ROOT" -name '*.md' -not -path '*/node_modules/*' \
   -not -path '*/.venv/*' \
   -not -path '*/dist/*' \
   -not -path '*/_build/*' \
   -not -path '*/_deps/*' \
+  -not -path '*/__pycache__/*' \
   -print0)

Also applies to: 117-122

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/check-markdown-links.sh` around lines 20 - 21, The SKIP_DIRS variable
includes patterns like __pycache__ and nemoclaw/node_modules that are not
currently mirrored in the find command's exclusion (-path ... -prune) logic;
update the find invocation to prune the same directories listed in SKIP_DIRS (or
build the find exclusion list from SKIP_DIRS) so find never descends into
__pycache__ and nemoclaw/node_modules (mirror SKIP_DIRS into the -path/-prune
clauses used in the find command to keep exclusions consistent and avoid extra
work).
test/onboard.test.js (1)

798-817: Make the static scanner less brittle to harmless refactors.

The i + 15 scan window and the hardcoded >= 7 count can fail when formatting or implementation evolves, even if timeout safety remains correct. Consider deriving target calls dynamically and removing the fixed-count assumption.

♻️ Suggested refactor
-      for (let j = i + 1; j < Math.min(i + 15, lines.length); j++) {
+      for (let j = i + 1; j < lines.length; j++) {
         block += "\n" + lines[j];
-        if (/^\s*\}\);/.test(lines[j])) break;
+        if (/^\s*\);/.test(lines[j]) || /^\s*\}\);/.test(lines[j])) break;
       }
...
-  it("has spawnSync calls to test", () => {
-    expect(spawnBlocks.length).toBeGreaterThanOrEqual(7);
+  it("has non-sleep spawnSync calls to test", () => {
+    const actionable = spawnBlocks.filter(({ text }) => !/spawnSync\("sleep"/.test(text));
+    expect(actionable.length).toBeGreaterThan(0);
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/onboard.test.js` around lines 798 - 817, The test's static scanner is
brittle because collectSpawnBlocks uses a fixed 15-line window and the assertion
expects >= 7 calls; update collectSpawnBlocks to scan forward until it detects
the block terminator (e.g., a line matching /^\s*\}\);/ or end of file) instead
of stopping at i + 15, and change the assertion that uses spawnBlocks to derive
the expected count dynamically (for example compute totalSpawnCalls =
lines.filter(l => /spawnSync\(/.test(l)).length and assert spawnBlocks.length
=== totalSpawnCalls or >= totalSpawnCalls if partial grouping is acceptable).
Reference: the collectSpawnBlocks function and spawnBlocks usage in the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/pr.yaml:
- Around line 199-203: The GH Actions step is expanding $MD_FILES unquoted which
breaks filenames with spaces; update the run invocation that calls
scripts/check-markdown-links.sh to pass "$MD_FILES" quoted or pipe $MD_FILES
through a newline-safe mechanism (e.g., echo "$MD_FILES" | xargs -d '\n' -r
./scripts/check-markdown-links.sh) so the check-markdown-links.sh script
receives each filename intact; ensure references to MD_FILES and
scripts/check-markdown-links.sh are updated accordingly.

In `@bin/lib/onboard.js`:
- Around line 584-587: Replace the brittle timeout checks that compare
result.signal === "SIGTERM" with a small helper function (e.g.,
isTimedOut(result)) that returns true when result.error?.code === "ETIMEDOUT" or
as a fallback when result.signal === "SIGTERM" && result.status == null; add
this helper near other utility functions in onboard.js and then change all five
locations (the checks currently at the blocks around result handling where
failures.push is called) to use isTimedOut(result) before setting message:
"timed out" so only true timeouts are reported.

In `@scripts/check-markdown-links.sh`:
- Around line 71-75: The current markdown link extraction regex (link_re) in the
while loop that processes remaining will truncate link targets containing
parentheses (e.g., docs/file_(1).md); update the link_re used in that loop to a
more robust pattern that supports parentheses inside the target (a
balanced-parentheses-aware regex) or, if you want a minimal change, add a clear
code comment next to link_re and the loop documenting this limitation and
suggesting a future fix; locate link_re and the while [[ "$remaining" =~
$link_re ]] loop to apply the change (these symbols identify where to update).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fadd8dea-4085-4ab4-917a-1b23f9b1b2d7

📥 Commits

Reviewing files that changed from the base of the PR and between 6f9d530 and 2317895.

📒 Files selected for processing (4)
  • .github/workflows/pr.yaml
  • bin/lib/onboard.js
  • scripts/check-markdown-links.sh
  • test/onboard.test.js

@Junior00619 Junior00619 force-pushed the fix/onboard-spawnsync-timeout branch from 2317895 to 6228431 on March 27, 2026 18:36
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
bin/lib/onboard.js (1)

640-643: ⚠️ Potential issue | 🟡 Minor

Use a stronger timeout predicate in isSpawnTimeout.

Line 642 currently infers timeout from SIGTERM + null status only. That can misclassify non-timeout terminations as "timed out".

🔧 Proposed fix
 /** True when spawnSync killed the child due to its timeout option. */
 function isSpawnTimeout(result) {
-  return result.status == null && result.signal === "SIGTERM";
+  return (
+    result?.error?.code === "ETIMEDOUT" ||
+    (result?.status == null && result?.signal === "SIGTERM")
+  );
 }
In current Node.js child_process.spawnSync behavior, is `result.error.code === "ETIMEDOUT"` the authoritative timeout indicator, and can `result.signal === "SIGTERM"` occur for non-timeout terminations?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 640 - 643, The timeout detection in
isSpawnTimeout is too weak: replace the current predicate with one that first
checks for result.error && result.error.code === "ETIMEDOUT" (the authoritative
Node timeout indicator) and only as a fallback consider the previous heuristic
(result.status == null && result.signal === "SIGTERM") if no error code is
available; update the function isSpawnTimeout to prefer the error.code check
while preserving backwards compatibility by keeping the signal/status check as
secondary.
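
To make the distinction in this comment concrete, the sketch below compares what `spawnSync` reports for a real timeout with what it reports when a child is terminated by SIGTERM for an unrelated reason. It is an illustration under assumptions: it targets a POSIX system, uses the current Node binary as a stand-in child, and simulates the non-timeout case by having the child signal itself. The predicate has the same shape as the proposed fix; `isTimeoutByErrorCode` is a hypothetical stricter variant added only for comparison.

```javascript
const { spawnSync } = require("node:child_process");

// Same shape as the proposed fix: prefer error.code, fall back to signal/status.
function isSpawnTimeout(result) {
  return (
    result?.error?.code === "ETIMEDOUT" ||
    (result?.status == null && result?.signal === "SIGTERM")
  );
}

// Hypothetical stricter variant: trusts only the error code.
function isTimeoutByErrorCode(result) {
  return result?.error?.code === "ETIMEDOUT";
}

// Case 1: the timeout option fires after 200 ms.
const timedOut = spawnSync(
  process.execPath,
  ["-e", "setTimeout(() => {}, 10000)"],
  { timeout: 200 }
);

// Case 2: the child sends SIGTERM to itself; no timeout is configured.
// The trailing timer keeps the child alive until the signal is delivered.
const selfKilled = spawnSync(process.execPath, [
  "-e",
  "process.kill(process.pid, 'SIGTERM'); setTimeout(() => {}, 10000)",
]);

for (const [label, result] of [["timeout", timedOut], ["self-kill", selfKilled]]) {
  console.log(label, {
    status: result.status,
    signal: result.signal,
    errorCode: result.error?.code,
    isSpawnTimeout: isSpawnTimeout(result),
    isTimeoutByErrorCode: isTimeoutByErrorCode(result),
  });
}
```

Under these assumptions both children end with `signal === "SIGTERM"` and a null status, so the signal-based fallback alone cannot tell them apart; only the `ETIMEDOUT` error code is specific to the timeout case.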
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 498d2caf-cb8c-464c-9f7f-acb24d59b778

📥 Commits

Reviewing files that changed from the base of the PR and between 2317895 and 6228431.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • test/onboard.test.js
✅ Files skipped from review due to trivial changes (1)
  • test/onboard.test.js

@Junior00619 Junior00619 force-pushed the fix/onboard-spawnsync-timeout branch from 6228431 to 37ec4f5 on March 27, 2026 18:45

All spawnSync calls in onboard.js that run external commands (curl probes,
ollama pull, install-openshell) now include a timeout property. Previously,
if a subprocess hung (DNS stall, unresponsive endpoint), the onboarding
wizard would freeze indefinitely with no way to recover.

- curl-based probes/fetches: 30s timeout (5 call sites)
- ollama pull: 300s timeout (large model downloads)
- install-openshell: 120s timeout
- Timed-out calls return a graceful error instead of hanging
- Intentional sleep calls left unchanged

Adds a regression test that statically verifies every non-sleep spawnSync
call in onboard.js includes a timeout property.

Fixes NVIDIA#1017
@Junior00619 Junior00619 force-pushed the fix/onboard-spawnsync-timeout branch from 37ec4f5 to 583a90c on March 27, 2026 19:03
Development

Successfully merging this pull request may close these issues.

[NemoClaw] spawnSync calls in onboard.js have no timeout — hung curl freezes entire onboarding wizard

1 participant