fix(registry): atomic writes and advisory file locking for sandboxes.…#1033
fix(registry): atomic writes and advisory file locking for sandboxes.…#1033Junior00619 wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds advisory filesystem locking and atomic registry writes: introduces Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Registry
participant FS as Filesystem
Client->>Registry: registerSandbox(data)
Registry->>FS: acquireLock(lockDir)
alt lock acquired
Registry->>FS: read `sandboxes.json`
Registry->>Registry: modify in-memory registry
Registry->>FS: write `sandboxes.json.tmp.<pid>`
Registry->>FS: rename `sandboxes.json.tmp.<pid>` -> `sandboxes.json`
Registry->>FS: releaseLock(lockDir)
Registry-->>Client: return result
else lock acquisition fails
FS-->>Registry: error
Registry-->>Client: throw
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
bin/lib/registry.js (1)
99-116: Per-operation locking is correct, but callers needing multi-operation atomicity should be aware.Each mutating operation now holds the lock for its full read-modify-write cycle—this prevents lost updates.
However, based on the
onboard.jssnippets (lines 1755-1880, 2966-2978), there are workflows that callremoveSandbox()followed byregisterSandbox(). With per-call locking, another process could interleave between these two operations, potentially leading to:
- A concurrent
listSandboxes()seeing a transient state where the sandbox is absent- A concurrent
registerSandbox()with the same name succeeding between remove and re-registerIf these sequences require atomicity, consider either:
- Exporting
withLockfor callers to wrap their own multi-step workflows, or- Adding a composite operation like
replaceSandbox(name, entry)that performs both under one lockThis isn't a bug in the registry module—just a consideration for upstream callers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/registry.js` around lines 99 - 116, The current per-operation locking in registerSandbox (and removeSandbox) prevents lost updates but does not provide atomicity for multi-step workflows; export the withLock function from this module so callers (e.g., the onboarding flows that call removeSandbox() then registerSandbox()) can wrap multi-operation sequences in a single lock OR implement a new composite API like replaceSandbox(name, entry) that performs the remove-and-register logic inside one withLock invocation; update module exports to include withLock (or add replaceSandbox) and adjust callers in onboard.js to use the exported withLock or the new replaceSandbox to ensure the remove/register sequence is executed atomically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/registry.test.js`:
- Around line 136-146: The test currently only exercises normal save behavior;
update it to simulate a write/rename failure and assert the temporary file is
cleaned up: make the test create regDir and regFile as before, then stub or mock
fs.renameSync (or the internal method used by Registry.save) to throw an error
when called, call registry.registerSandbox / registry.save to trigger the
failure, then assert that the temp file (the .tmp or whatever filename
Registry.save writes) no longer exists and that the thrown error is
handled/propagated as expected; ensure the mock is restored after the test to
avoid leaking into other tests. Use symbols: save(), registry.registerSandbox(),
fs.renameSync (or the specific internal rename call) and check regFile/regDir
for temp file removal.
---
Nitpick comments:
In `@bin/lib/registry.js`:
- Around line 99-116: The current per-operation locking in registerSandbox (and
removeSandbox) prevents lost updates but does not provide atomicity for
multi-step workflows; export the withLock function from this module so callers
(e.g., the onboarding flows that call removeSandbox() then registerSandbox())
can wrap multi-operation sequences in a single lock OR implement a new composite
API like replaceSandbox(name, entry) that performs the remove-and-register logic
inside one withLock invocation; update module exports to include withLock (or
add replaceSandbox) and adjust callers in onboard.js to use the exported
withLock or the new replaceSandbox to ensure the remove/register sequence is
executed atomically.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 79d889d3-5d39-48e5-b3d3-e9cf4f6d64d2
📒 Files selected for processing (2)
bin/lib/registry.jstest/registry.test.js
0203910 to
ed0edee
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
test/registry.test.js (1)
136-151:⚠️ Potential issue | 🟡 MinorThis test still doesn’t reach
renameSync().
chmod 0444on the directory prevents creatingsandboxes.json.tmp.<pid>at all, sosave()throws fromwriteFileSync, not the rename branch this test names. If you want coverage for cleanup-after-rename-failure, stubfs.renameSyncor otherwise allow temp creation and force only the final rename to fail.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/registry.test.js` around lines 136 - 151, The test currently fails before renameSync because chmod 0444 blocks creating the temp file; change the test to allow temp file creation and instead force the final rename to fail by stubbing fs.renameSync to throw (e.g. use jest.spyOn(fs, "renameSync").mockImplementation(() => { throw new Error("EACCES"); })), call registry.save({ sandboxes: {}, defaultSandbox: null }) to trigger the rename error branch in registry.save, then restore the stub and verify no leftover files matching "sandboxes.json.tmp." exist; refer to registry.save, fs.renameSync, regDir and the "sandboxes.json.tmp.<pid>" temp filename in the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/registry.js`:
- Around line 10-13: The current stale-lock policy can lose mutual exclusion
because mtime won't change while the owner is alive; update the lock logic
around LOCK_DIR, LOCK_STALE_MS, LOCK_RETRY_MS and LOCK_MAX_RETRIES to either (1)
implement owner liveness/heartbeat: have the lock owner write an owner record
(PID and heartbeat timestamp) and refresh it periodically while in the critical
section, and change stale detection to consider PID liveness (e.g., kill/ps
check or platform-appropriate process existence) and heartbeat age, or (2) if
you want a minimal change, make the retry budget match the stale threshold by
computing LOCK_MAX_RETRIES = Math.ceil(LOCK_STALE_MS / LOCK_RETRY_MS) (or
increase LOCK_RETRY_MS/LOCK_MAX_RETRIES so retry_wait >= LOCK_STALE_MS) so a
freshly orphaned lock can be reclaimed within the same attempt; pick one
approach and apply consistently to the locking/unlock code paths.
- Around line 44-47: The releaseLock function currently swallows all errors when
removing LOCK_DIR; change the catch to capture the exception (e.g., catch (err))
and only ignore it when err.code === 'ENOENT', otherwise rethrow (or propagate)
the error so EACCES/EPERM/ENOTEMPTY surface; update the releaseLock
implementation that calls fs.rmdirSync(LOCK_DIR) to use this targeted error
handling.
In `@test/registry.test.js`:
- Around line 192-199: The test falsely proves locking by only checking lockDir
absence; instead spy on the filesystem lock operations to assert the lock is
actually created and removed: in the test replace the before/after existence
checks with jest.spyOn(fs, "mkdirSync") and jest.spyOn(fs, "rmdirSync") (or
their equivalents) and assert they were called with lockDir when calling
registry.registerSandbox({ name: "lock-test" }); you can also add a contention
case by stubbing fs.mkdirSync to throw EEXIST once and assert registerSandbox
waits/retries or fails appropriately; reference registry.registerSandbox,
lockDir, and fs.mkdirSync/fs.rmdirSync in the test change.
---
Duplicate comments:
In `@test/registry.test.js`:
- Around line 136-151: The test currently fails before renameSync because chmod
0444 blocks creating the temp file; change the test to allow temp file creation
and instead force the final rename to fail by stubbing fs.renameSync to throw
(e.g. use jest.spyOn(fs, "renameSync").mockImplementation(() => { throw new
Error("EACCES"); })), call registry.save({ sandboxes: {}, defaultSandbox: null
}) to trigger the rename error branch in registry.save, then restore the stub
and verify no leftover files matching "sandboxes.json.tmp." exist; refer to
registry.save, fs.renameSync, regDir and the "sandboxes.json.tmp.<pid>" temp
filename in the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a7611ad0-112b-4700-90b8-98c306df95c0
📒 Files selected for processing (2)
bin/lib/registry.jstest/registry.test.js
ed0edee to
d38a432
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
bin/lib/registry.js (1)
28-38:⚠️ Potential issue | 🔴 CriticalStill applies: stale-lock reaping is unsafe.
Lines 29-31 use directory
mtimeas owner liveness, so a live holder becomes reapable after 10s. ThestatSync→rmdirSyncpath is also racy: if another waiter recreates.lockbetween those calls, this code can delete that fresh lock by pathname. Mutual exclusion can still be lost.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/registry.js` around lines 28 - 38, The current stale-lock reaper that relies on LOCK_DIR mtime is unsafe and racy; replace the mtime-based logic with an owner-based check: write an owner file (e.g., inside LOCK_DIR or as LOCK_DIR + '/owner') containing the holder PID and timestamp when acquiring the lock, and in the reaper (the try/catch block that reads LOCK_DIR and compares to LOCK_STALE_MS) read that owner file atomically, parse the PID, check process liveness with process.kill(pid, 0) (or equivalent) and only attempt to remove the lock if the owner PID is confirmed dead; before removing call a second read of the owner file to ensure the PID hasn’t changed (preventing TOCTOU) and remove the lock directory only when the owner file still matches the earlier PID; keep the wait behavior using LOCK_RETRY_MS and preserve error handling around fs operations (the symbols to edit are LOCK_DIR, LOCK_STALE_MS, and the try/catch + rmdirSync/Atomics.wait block).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/registry.js`:
- Around line 121-125: The merge in the withLock() callback updates
data.sandboxes[name] with updates but allows updates.name to diverge from the
registry key; either forbid renames or perform a proper re-key. Modify the block
in the withLock callback (where load(), save(), data.sandboxes[name], updates,
and defaultSandbox are used) to: 1) if updates.name is present and differs from
name, either throw/reject the update (validation path) or 2) remove the old key,
insert the object under updates.name (copying merged fields), and if
data.defaultSandbox === name update it to updates.name (rename path); ensure
save(data) is called after the change and that no duplicate sandbox keys are
created.
---
Duplicate comments:
In `@bin/lib/registry.js`:
- Around line 28-38: The current stale-lock reaper that relies on LOCK_DIR mtime
is unsafe and racy; replace the mtime-based logic with an owner-based check:
write an owner file (e.g., inside LOCK_DIR or as LOCK_DIR + '/owner') containing
the holder PID and timestamp when acquiring the lock, and in the reaper (the
try/catch block that reads LOCK_DIR and compares to LOCK_STALE_MS) read that
owner file atomically, parse the PID, check process liveness with
process.kill(pid, 0) (or equivalent) and only attempt to remove the lock if the
owner PID is confirmed dead; before removing call a second read of the owner
file to ensure the PID hasn’t changed (preventing TOCTOU) and remove the lock
directory only when the owner file still matches the earlier PID; keep the wait
behavior using LOCK_RETRY_MS and preserve error handling around fs operations
(the symbols to edit are LOCK_DIR, LOCK_STALE_MS, and the try/catch +
rmdirSync/Atomics.wait block).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 274cc2b5-f836-49dd-aea2-9f6f683670af
📒 Files selected for processing (2)
bin/lib/registry.jstest/registry.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/registry.test.js
d38a432 to
4b851c9
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/registry.js`:
- Around line 23-26: The current block creates LOCK_DIR with fs.mkdirSync and
then writes LOCK_OWNER with fs.writeFileSync; if writeFileSync throws the
created LOCK_DIR is left behind. Modify the logic around LOCK_DIR/LOCK_OWNER
creation (the fs.mkdirSync and fs.writeFileSync calls) to track that you created
the directory, wrap the writeFileSync in a try/catch, and on failure remove the
created LOCK_DIR (e.g., fs.rmdirSync or fs.rmSync) before rethrowing the error
so no stale .lock directory remains when owner creation fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3b39e52c-8b5f-43ab-b9d9-0bf3da208c86
📒 Files selected for processing (2)
bin/lib/registry.jstest/registry.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/registry.test.js
4b851c9 to
0ee4d61
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/registry.js`:
- Around line 24-27: The lock ownership file (LOCK_OWNER) is written
non-atomically which allows readers to see an empty or truncated PID and falsely
treat the lock as stale; instead, write the PID to a temporary file (e.g.,
LOCK_OWNER + ".tmp" or similar), fs.writeFileSync that temp file with mode
0o600, then fs.renameSync the temp into LOCK_OWNER to publish atomically, and
update any logic that reads LOCK_OWNER (the readFileSync/owner parsing and the
stale-reaping path) to treat missing/invalid/empty owner content as “lock
acquisition in progress” (i.e., do not consider the lock stale) until a valid
PID is atomically in place. Ensure the same atomic write approach is used
wherever LOCK_OWNER is created or updated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 177d9d60-fbb6-4324-9498-e07c236f8d21
📒 Files selected for processing (2)
bin/lib/registry.jstest/registry.test.js
0ee4d61 to
a20b7fa
Compare
…json The save() function now writes to a temp file and renames it into place, preventing half-written JSON from corrupting the registry. All mutating operations (registerSandbox, updateSandbox, removeSandbox, setDefault) are wrapped in a mkdir-based advisory lock so concurrent nemoclaw processes cannot produce lost updates. Fixes NVIDIA#1011
a20b7fa to
a1c5bc2
Compare
Changes in registry.js:
Atomic writes — save() now writes to a .tmp. file and renameSyncs it into place, so a crash mid-write never leaves a half-written sandboxes.json
Advisory file locking — acquireLock()/releaseLock() using mkdirSync (atomic on POSIX), with stale lock detection (10s threshold) and retry loop (50 attempts × 100ms)
withLock() wrapper — all 4 mutating functions (registerSandbox, updateSandbox, removeSandbox, setDefault) now hold the lock during their entire read-modify-write cycle
Zero new dependencies — uses Atomics.wait for sync sleep, mkdirSync for locking
Tests in registry.test.js: (5 new)
Atomic write leaves no .tmp files
Lock acquire/release lifecycle
Lock release on exception (finally block)
Stale lock removal
Concurrent writers — 4 parallel Node processes each writing 5 sandboxes, all 20 must be present (proves no lost updates)
Summary by CodeRabbit
Refactor
Bug Fixes
Tests