Skip to content

fix(registry): atomic writes and advisory file locking for sandboxes.…#1033

Open
Junior00619 wants to merge 1 commit intoNVIDIA:mainfrom
Junior00619:fix/registry-atomic-writes-locking
Open

fix(registry): atomic writes and advisory file locking for sandboxes.…#1033
Junior00619 wants to merge 1 commit intoNVIDIA:mainfrom
Junior00619:fix/registry-atomic-writes-locking

Conversation

@Junior00619
Copy link
Copy Markdown
Contributor

@Junior00619 Junior00619 commented Mar 27, 2026

Changes in registry.js:

Atomic writes — save() now writes to a .tmp. file and renameSyncs it into place, so a crash mid-write never leaves a half-written sandboxes.json
Advisory file locking — acquireLock()/releaseLock() using mkdirSync (atomic on POSIX), with stale lock detection (10s threshold) and retry loop (50 attempts × 100ms)
withLock() wrapper — all 4 mutating functions (registerSandbox, updateSandbox, removeSandbox, setDefault) now hold the lock during their entire read-modify-write cycle
Zero new dependencies — uses Atomics.wait for sync sleep, mkdirSync for locking
Tests in registry.test.js: (5 new)

Atomic write leaves no .tmp files
Lock acquire/release lifecycle
Lock release on exception (finally block)
Stale lock removal
Concurrent writers — 4 parallel Node processes each writing 5 sandboxes, all 20 must be present (proves no lost updates)

Summary by CodeRabbit

  • Refactor

    • Registry persistence now uses atomic on-disk writes with per-process temp-file replace and robust cleanup; registry mutations are protected by an inter-process advisory locking layer to prevent concurrent-update corruption and ensure locks are released on failure.
  • Bug Fixes

    • Updates that attempt to rename an existing sandbox are now rejected to prevent accidental renames or duplicated entries.
  • Tests

    • Added coverage for atomic writes, temp-file cleanup on error, lock lifecycle and recovery, lock behavior on failures, and concurrent-writer scenarios.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds advisory filesystem locking and atomic registry writes: introduces acquireLock, releaseLock, and withLock; runs mutating registry operations under the lock; changes save(data) to write to a PID‑suffixed temporary file and atomically rename it to the registry file with best‑effort temp cleanup on error.

Changes

Cohort / File(s) Summary
Registry Locking & Atomic Writes
bin/lib/registry.js
Added acquireLock, releaseLock, and withLock (exported). Wrapped mutating APIs (registerSandbox, updateSandbox, removeSandbox, setDefault) to execute inside withLock. updateSandbox now rejects attempts to rename a sandbox via updates.name. Reimplemented save(data) to write to a per‑PID temp file and atomically rename into place, with best‑effort temp cleanup on error.
Concurrency & Persistence Tests
test/registry.test.js
Added tests for atomic writes (temp file + rename, no leftover sandboxes.json.tmp.*, cleanup when renameSync throws), advisory file locking (owner file, stale-lock removal, release on callback throw), and a multi‑process concurrency test verifying concurrent registrations produce all expected entries.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Registry
  participant FS as Filesystem
  Client->>Registry: registerSandbox(data)
  Registry->>FS: acquireLock(lockDir)
  alt lock acquired
    Registry->>FS: read `sandboxes.json`
    Registry->>Registry: modify in-memory registry
    Registry->>FS: write `sandboxes.json.tmp.<pid>`
    Registry->>FS: rename `sandboxes.json.tmp.<pid>` -> `sandboxes.json`
    Registry->>FS: releaseLock(lockDir)
    Registry-->>Client: return result
  else lock acquisition fails
    FS-->>Registry: error
    Registry-->>Client: throw
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I dug a hole called lock and key,
Temp hops wait, then swing to be free.
Writes renamed, no crumbs remain,
Concurrent carrots all sustain.
Rabbits clap — the registry's plain.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: atomic writes and advisory file locking for the registry module.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
bin/lib/registry.js (1)

99-116: Per-operation locking is correct, but callers needing multi-operation atomicity should be aware.

Each mutating operation now holds the lock for its full read-modify-write cycle—this prevents lost updates.

However, based on the onboard.js snippets (lines 1755-1880, 2966-2978), there are workflows that call removeSandbox() followed by registerSandbox(). With per-call locking, another process could interleave between these two operations, potentially leading to:

  • A concurrent listSandboxes() seeing a transient state where the sandbox is absent
  • A concurrent registerSandbox() with the same name succeeding between remove and re-register

If these sequences require atomicity, consider either:

  1. Exporting withLock for callers to wrap their own multi-step workflows, or
  2. Adding a composite operation like replaceSandbox(name, entry) that performs both under one lock

This isn't a bug in the registry module—just a consideration for upstream callers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/registry.js` around lines 99 - 116, The current per-operation locking
in registerSandbox (and removeSandbox) prevents lost updates but does not
provide atomicity for multi-step workflows; export the withLock function from
this module so callers (e.g., the onboarding flows that call removeSandbox()
then registerSandbox()) can wrap multi-operation sequences in a single lock OR
implement a new composite API like replaceSandbox(name, entry) that performs the
remove-and-register logic inside one withLock invocation; update module exports
to include withLock (or add replaceSandbox) and adjust callers in onboard.js to
use the exported withLock or the new replaceSandbox to ensure the
remove/register sequence is executed atomically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/registry.test.js`:
- Around line 136-146: The test currently only exercises normal save behavior;
update it to simulate a write/rename failure and assert the temporary file is
cleaned up: make the test create regDir and regFile as before, then stub or mock
fs.renameSync (or the internal method used by Registry.save) to throw an error
when called, call registry.registerSandbox / registry.save to trigger the
failure, then assert that the temp file (the .tmp or whatever filename
Registry.save writes) no longer exists and that the thrown error is
handled/propagated as expected; ensure the mock is restored after the test to
avoid leaking into other tests. Use symbols: save(), registry.registerSandbox(),
fs.renameSync (or the specific internal rename call) and check regFile/regDir
for temp file removal.

---

Nitpick comments:
In `@bin/lib/registry.js`:
- Around line 99-116: The current per-operation locking in registerSandbox (and
removeSandbox) prevents lost updates but does not provide atomicity for
multi-step workflows; export the withLock function from this module so callers
(e.g., the onboarding flows that call removeSandbox() then registerSandbox())
can wrap multi-operation sequences in a single lock OR implement a new composite
API like replaceSandbox(name, entry) that performs the remove-and-register logic
inside one withLock invocation; update module exports to include withLock (or
add replaceSandbox) and adjust callers in onboard.js to use the exported
withLock or the new replaceSandbox to ensure the remove/register sequence is
executed atomically.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 79d889d3-5d39-48e5-b3d3-e9cf4f6d64d2

📥 Commits

Reviewing files that changed from the base of the PR and between 6f9d530 and 0203910.

📒 Files selected for processing (2)
  • bin/lib/registry.js
  • test/registry.test.js

@Junior00619 Junior00619 force-pushed the fix/registry-atomic-writes-locking branch from 0203910 to ed0edee Compare March 27, 2026 19:44
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
test/registry.test.js (1)

136-151: ⚠️ Potential issue | 🟡 Minor

This test still doesn’t reach renameSync().

chmod 0444 on the directory prevents creating sandboxes.json.tmp.<pid> at all, so save() throws from writeFileSync, not the rename branch this test names. If you want coverage for cleanup-after-rename-failure, stub fs.renameSync or otherwise allow temp creation and force only the final rename to fail.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/registry.test.js` around lines 136 - 151, The test currently fails
before renameSync because chmod 0444 blocks creating the temp file; change the
test to allow temp file creation and instead force the final rename to fail by
stubbing fs.renameSync to throw (e.g. use jest.spyOn(fs,
"renameSync").mockImplementation(() => { throw new Error("EACCES"); })), call
registry.save({ sandboxes: {}, defaultSandbox: null }) to trigger the rename
error branch in registry.save, then restore the stub and verify no leftover
files matching "sandboxes.json.tmp." exist; refer to registry.save,
fs.renameSync, regDir and the "sandboxes.json.tmp.<pid>" temp filename in the
test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/registry.js`:
- Around line 10-13: The current stale-lock policy can lose mutual exclusion
because mtime won't change while the owner is alive; update the lock logic
around LOCK_DIR, LOCK_STALE_MS, LOCK_RETRY_MS and LOCK_MAX_RETRIES to either (1)
implement owner liveness/heartbeat: have the lock owner write an owner record
(PID and heartbeat timestamp) and refresh it periodically while in the critical
section, and change stale detection to consider PID liveness (e.g., kill/ps
check or platform-appropriate process existence) and heartbeat age, or (2) if
you want a minimal change, make the retry budget match the stale threshold by
computing LOCK_MAX_RETRIES = Math.ceil(LOCK_STALE_MS / LOCK_RETRY_MS) (or
increase LOCK_RETRY_MS/LOCK_MAX_RETRIES so retry_wait >= LOCK_STALE_MS) so a
freshly orphaned lock can be reclaimed within the same attempt; pick one
approach and apply consistently to the locking/unlock code paths.
- Around line 44-47: The releaseLock function currently swallows all errors when
removing LOCK_DIR; change the catch to capture the exception (e.g., catch (err))
and only ignore it when err.code === 'ENOENT', otherwise rethrow (or propagate)
the error so EACCES/EPERM/ENOTEMPTY surface; update the releaseLock
implementation that calls fs.rmdirSync(LOCK_DIR) to use this targeted error
handling.

In `@test/registry.test.js`:
- Around line 192-199: The test falsely proves locking by only checking lockDir
absence; instead spy on the filesystem lock operations to assert the lock is
actually created and removed: in the test replace the before/after existence
checks with jest.spyOn(fs, "mkdirSync") and jest.spyOn(fs, "rmdirSync") (or
their equivalents) and assert they were called with lockDir when calling
registry.registerSandbox({ name: "lock-test" }); you can also add a contention
case by stubbing fs.mkdirSync to throw EEXIST once and assert registerSandbox
waits/retries or fails appropriately; reference registry.registerSandbox,
lockDir, and fs.mkdirSync/fs.rmdirSync in the test change.

---

Duplicate comments:
In `@test/registry.test.js`:
- Around line 136-151: The test currently fails before renameSync because chmod
0444 blocks creating the temp file; change the test to allow temp file creation
and instead force the final rename to fail by stubbing fs.renameSync to throw
(e.g. use jest.spyOn(fs, "renameSync").mockImplementation(() => { throw new
Error("EACCES"); })), call registry.save({ sandboxes: {}, defaultSandbox: null
}) to trigger the rename error branch in registry.save, then restore the stub
and verify no leftover files matching "sandboxes.json.tmp." exist; refer to
registry.save, fs.renameSync, regDir and the "sandboxes.json.tmp.<pid>" temp
filename in the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a7611ad0-112b-4700-90b8-98c306df95c0

📥 Commits

Reviewing files that changed from the base of the PR and between 0203910 and ed0edee.

📒 Files selected for processing (2)
  • bin/lib/registry.js
  • test/registry.test.js

@Junior00619 Junior00619 force-pushed the fix/registry-atomic-writes-locking branch from ed0edee to d38a432 Compare March 27, 2026 19:59
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
bin/lib/registry.js (1)

28-38: ⚠️ Potential issue | 🔴 Critical

Still applies: stale-lock reaping is unsafe.

Lines 29-31 use directory mtime as owner liveness, so a live holder becomes reapable after 10s. The statSyncrmdirSync path is also racy: if another waiter recreates .lock between those calls, this code can delete that fresh lock by pathname. Mutual exclusion can still be lost.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/registry.js` around lines 28 - 38, The current stale-lock reaper that
relies on LOCK_DIR mtime is unsafe and racy; replace the mtime-based logic with
an owner-based check: write an owner file (e.g., inside LOCK_DIR or as LOCK_DIR
+ '/owner') containing the holder PID and timestamp when acquiring the lock, and
in the reaper (the try/catch block that reads LOCK_DIR and compares to
LOCK_STALE_MS) read that owner file atomically, parse the PID, check process
liveness with process.kill(pid, 0) (or equivalent) and only attempt to remove
the lock if the owner PID is confirmed dead; before removing call a second read
of the owner file to ensure the PID hasn’t changed (preventing TOCTOU) and
remove the lock directory only when the owner file still matches the earlier
PID; keep the wait behavior using LOCK_RETRY_MS and preserve error handling
around fs operations (the symbols to edit are LOCK_DIR, LOCK_STALE_MS, and the
try/catch + rmdirSync/Atomics.wait block).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/registry.js`:
- Around line 121-125: The merge in the withLock() callback updates
data.sandboxes[name] with updates but allows updates.name to diverge from the
registry key; either forbid renames or perform a proper re-key. Modify the block
in the withLock callback (where load(), save(), data.sandboxes[name], updates,
and defaultSandbox are used) to: 1) if updates.name is present and differs from
name, either throw/reject the update (validation path) or 2) remove the old key,
insert the object under updates.name (copying merged fields), and if
data.defaultSandbox === name update it to updates.name (rename path); ensure
save(data) is called after the change and that no duplicate sandbox keys are
created.

---

Duplicate comments:
In `@bin/lib/registry.js`:
- Around line 28-38: The current stale-lock reaper that relies on LOCK_DIR mtime
is unsafe and racy; replace the mtime-based logic with an owner-based check:
write an owner file (e.g., inside LOCK_DIR or as LOCK_DIR + '/owner') containing
the holder PID and timestamp when acquiring the lock, and in the reaper (the
try/catch block that reads LOCK_DIR and compares to LOCK_STALE_MS) read that
owner file atomically, parse the PID, check process liveness with
process.kill(pid, 0) (or equivalent) and only attempt to remove the lock if the
owner PID is confirmed dead; before removing call a second read of the owner
file to ensure the PID hasn’t changed (preventing TOCTOU) and remove the lock
directory only when the owner file still matches the earlier PID; keep the wait
behavior using LOCK_RETRY_MS and preserve error handling around fs operations
(the symbols to edit are LOCK_DIR, LOCK_STALE_MS, and the try/catch +
rmdirSync/Atomics.wait block).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 274cc2b5-f836-49dd-aea2-9f6f683670af

📥 Commits

Reviewing files that changed from the base of the PR and between ed0edee and d38a432.

📒 Files selected for processing (2)
  • bin/lib/registry.js
  • test/registry.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/registry.test.js

@Junior00619 Junior00619 force-pushed the fix/registry-atomic-writes-locking branch from d38a432 to 4b851c9 Compare March 27, 2026 20:26
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/registry.js`:
- Around line 23-26: The current block creates LOCK_DIR with fs.mkdirSync and
then writes LOCK_OWNER with fs.writeFileSync; if writeFileSync throws the
created LOCK_DIR is left behind. Modify the logic around LOCK_DIR/LOCK_OWNER
creation (the fs.mkdirSync and fs.writeFileSync calls) to track that you created
the directory, wrap the writeFileSync in a try/catch, and on failure remove the
created LOCK_DIR (e.g., fs.rmdirSync or fs.rmSync) before rethrowing the error
so no stale .lock directory remains when owner creation fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3b39e52c-8b5f-43ab-b9d9-0bf3da208c86

📥 Commits

Reviewing files that changed from the base of the PR and between d38a432 and 4b851c9.

📒 Files selected for processing (2)
  • bin/lib/registry.js
  • test/registry.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/registry.test.js

@Junior00619 Junior00619 force-pushed the fix/registry-atomic-writes-locking branch from 4b851c9 to 0ee4d61 Compare March 27, 2026 20:36
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/registry.js`:
- Around line 24-27: The lock ownership file (LOCK_OWNER) is written
non-atomically which allows readers to see an empty or truncated PID and falsely
treat the lock as stale; instead, write the PID to a temporary file (e.g.,
LOCK_OWNER + ".tmp" or similar), fs.writeFileSync that temp file with mode
0o600, then fs.renameSync the temp into LOCK_OWNER to publish atomically, and
update any logic that reads LOCK_OWNER (the readFileSync/owner parsing and the
stale-reaping path) to treat missing/invalid/empty owner content as “lock
acquisition in progress” (i.e., do not consider the lock stale) until a valid
PID is atomically in place. Ensure the same atomic write approach is used
wherever LOCK_OWNER is created or updated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 177d9d60-fbb6-4324-9498-e07c236f8d21

📥 Commits

Reviewing files that changed from the base of the PR and between 4b851c9 and 0ee4d61.

📒 Files selected for processing (2)
  • bin/lib/registry.js
  • test/registry.test.js

@Junior00619 Junior00619 force-pushed the fix/registry-atomic-writes-locking branch from 0ee4d61 to a20b7fa Compare March 27, 2026 20:56
…json

The save() function now writes to a temp file and renames it into place,
preventing half-written JSON from corrupting the registry.  All mutating
operations (registerSandbox, updateSandbox, removeSandbox, setDefault) are
wrapped in a mkdir-based advisory lock so concurrent nemoclaw processes
cannot produce lost updates.

Fixes NVIDIA#1011
@Junior00619 Junior00619 force-pushed the fix/registry-atomic-writes-locking branch from a20b7fa to a1c5bc2 Compare March 27, 2026 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant