fix: prevent IOKit prepare count underflow with concurrent completions (#557) #648
Conversation
fix: prevent IOKit prepare count underflow with concurrent completions (jundot#557)

Two separate bugs could each independently trigger the 'completeMemory() prepare count underflow' kernel panic on Apple Silicon, both rooted in Metal buffer lifecycle races.

Bug 1 — deferred-clear window not extended for concurrent completions
----------------------------------------------------------------------

The deferred-clear mechanism (introduced in jundot#435) used a countdown integer (_deferred_clear_steps) anchored to the *first* request that finished in a batch. The scheduling code guarded the assignment with "only set if None", so when a second request finished in the same or the immediately following step (as happens with max_num_seqs=2 and the Sisyphus multi-agent workload in jundot#557), its completion was silently ignored.

The first request's KV cache blocks were returned to the free queue immediately on completion and could be re-allocated to the next incoming request within one generation step — long before IOKit had finished the completeMemory() callbacks for those buffers. The new allocation called prepare() on the same Metal buffer, and when the stale callback finally ran, completeMemory() decremented the count below zero → kernel panic.

Fix: replace the countdown integer with an absolute target step (_deferred_clear_at). Each completion computes target = _step_counter + _DEFERRED_CLEAR_DELAY and updates _deferred_clear_at via max(), so every completion gets a full _DEFERRED_CLEAR_DELAY window measured from *its own* finish step, regardless of whether another deferred clear was already pending. The step() tail now checks whether the step counter has reached the target rather than decrementing a countdown, which is both simpler and correct.
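The new bookkeeping can be sketched in plain Python. The surrounding scheduler class is invented for illustration and the delay value is assumed; only the names _step_counter, _deferred_clear_at, and _DEFERRED_CLEAR_DELAY come from the commit message:

```python
_DEFERRED_CLEAR_DELAY = 3  # hypothetical delay, in generation steps


class Scheduler:
    """Toy stand-in for the scheduler; only the deferred-clear logic is shown."""

    def __init__(self):
        self._step_counter = 0
        self._deferred_clear_at = None  # absolute target step, not a countdown

    def on_request_finished(self):
        # Every completion gets a full delay window measured from its *own*
        # finish step; max() extends a pending window instead of ignoring
        # the second completion (the old "only set if None" guard did).
        target = self._step_counter + _DEFERRED_CLEAR_DELAY
        if self._deferred_clear_at is None:
            self._deferred_clear_at = target
        else:
            self._deferred_clear_at = max(self._deferred_clear_at, target)

    def step(self):
        self._step_counter += 1
        # Fire the clear only once the absolute target step is reached.
        if (
            self._deferred_clear_at is not None
            and self._step_counter >= self._deferred_clear_at
        ):
            self._deferred_clear_at = None
            self._do_deferred_clear()

    def _do_deferred_clear(self):
        pass  # placeholder for the actual sync-and-clear work
```

With two completions one step apart, the clear fires a full delay window after the *second* completion, which is the behavior the fix describes.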
Bug 2 — bare mx.clear_cache() calls in SpecPrefill path
---------------------------------------------------------

Two sites in the SpecPrefill code path called mx.clear_cache() directly instead of going through the _sync_and_clear_cache() helper:

• The inter-chunk clear inside the system-prompt full-prefill loop
• The draft-cache teardown after SpecPrefill scoring

Both mx.eval() calls preceding them submit work onto the generation stream asynchronously. Calling mx.clear_cache() without first calling mx.synchronize(generation_stream) can release Metal buffers that are still referenced by in-flight command buffers on that stream, producing the same kernel panic that the _sync_and_clear_cache() wrapper was introduced to prevent (jundot#435).

Fix: replace both bare mx.clear_cache() calls with _sync_and_clear_cache().

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
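The synchronize-before-clear ordering can be illustrated with a toy model in plain Python (no MLX; Stream and BufferCache are invented stand-ins for the generation stream and the Metal buffer cache):

```python
from collections import deque


class Stream:
    """Queue of pending tasks, drained explicitly — like an async GPU stream."""

    def __init__(self):
        self.pending = deque()

    def submit(self, fn):
        self.pending.append(fn)

    def synchronize(self):
        # Run everything that was submitted before this point.
        while self.pending:
            self.pending.popleft()()


class BufferCache:
    """Named buffers whose backing storage can be released in bulk."""

    def __init__(self):
        self.buffers = {}

    def alloc(self, name, data):
        self.buffers[name] = list(data)
        return self.buffers[name]

    def clear(self):
        for buf in self.buffers.values():
            buf.clear()  # simulate releasing the backing memory
        self.buffers = {}


def safe_clear(stream, cache):
    # The _sync_and_clear_cache() pattern: drain in-flight work on the
    # stream *before* releasing cached buffers it may still reference.
    stream.synchronize()
    cache.clear()
```

Clearing first and draining afterwards lets a queued task observe a freed buffer, which is the toy-model analogue of the kernel panic.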
mdevk
left a comment
This is critical for anyone running concurrent VLM inference on Apple Silicon.
Relevant data from our benchmarking
We ran concurrency sweeps (1, 2, 4, 8, 16) on oMLX v0.3.4 with Qwen3-VL-30B-A3B:
| Concurrency | JSON Valid | Note |
|---|---|---|
| 1 | 100% (20/20) | Baseline |
| 2 | 100% (20/20) | Clean |
| 4 | 100% (20/20) | Clean |
| 8 | 95% (19/20) | First quality drop |
| 16 | 40% (8/20) | Severe degradation |
We also observed that on v0.3.5.dev1 (pre-release), concurrency=2 dropped JSON validity to 20% — much worse than v0.3.4 stable. This could be related to the deferred-clear window regression you describe.
We never hit the kernel panic itself, but the quality degradation at high concurrency could be an early symptom of the same buffer reuse-before-completion race. If IOKit hasn't finished completeMemory() callbacks and the buffer gets reallocated, the model reads stale/corrupted KV cache data → garbage output → invalid JSON.
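The reuse-before-completion race can be modeled with a toy prepare/complete counter (plain Python; the class is invented, and only the callback names mirror the IOKit terminology in the PR):

```python
class PreparedBuffer:
    """Toy model of a per-buffer prepare count, as maintained by IOKit."""

    def __init__(self):
        self.prepare_count = 0

    def prepare(self):
        self.prepare_count += 1

    def complete_memory(self):
        # A real kernel would panic on underflow; we raise instead so the
        # failure mode is observable in plain Python.
        if self.prepare_count == 0:
            raise RuntimeError("completeMemory() prepare count underflow")
        self.prepare_count -= 1
```

If the buffer is re-allocated (a fresh prepare()) before a stale completion callback from the previous owner fires, the stale callback consumes the new owner's count, and the new owner's own completion then underflows.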
The fix looks correct
The max() approach for _deferred_clear_at elegantly handles overlapping completions. Each completion extends the window rather than resetting it. Clean design.
Adding mx.synchronize() before mx.clear_cache() in the SpecPrefill path is clearly necessary — asynchronous stream semantics make a bare clear_cache() a race condition by design.
jundot
left a comment
The fix itself looks correct and well thought out. The max() approach for _deferred_clear_at cleanly handles overlapping completions without the indefinite postponement problem the old comment warned about. Each completion gets a full delay window measured from its own finish step, which is exactly the right behavior.
Replacing the bare mx.clear_cache() calls in the specprefill path with _sync_and_clear_cache() is also clearly necessary since async stream semantics make the unsynchronized version a race condition by design.
Three existing tests reference the renamed _deferred_clear_steps attribute and need updating. I'll handle that, plus add concurrent-completion test coverage, in a follow-up commit after merge.
…llow-up) update 3 tests referencing renamed _deferred_clear_steps -> _deferred_clear_at and add concurrent completion window extension test for #557 fix
Fixes #557
Reproduction environment from #557: `max_num_seqs: 2`, SSD paging 92 GB, `prefill_memory_guard: true`