fix: prevent IOKit prepare count underflow with concurrent completions (#557) #648
Conversation
fix: prevent IOKit prepare count underflow with concurrent completions (jundot#557)

Two separate bugs could each independently trigger the 'completeMemory() prepare count underflow' kernel panic on Apple Silicon, both rooted in Metal buffer lifecycle races.

Bug 1 — deferred-clear window not extended for concurrent completions
----------------------------------------------------------------------

The deferred-clear mechanism (introduced in jundot#435) used a countdown integer (_deferred_clear_steps) anchored to the *first* request that finished in a batch. The scheduling code guarded the assignment with "only set if None", so when a second request finished in the same or the immediately following step (as happens with max_num_seqs=2 and the Sisyphus multi-agent workload in jundot#557), its completion was silently ignored.

The first request's KV cache blocks were returned to the free queue immediately on completion and could be re-allocated to the next incoming request within one generation step — long before IOKit had finished the completeMemory() callbacks for those buffers. The new allocation called prepare() on the same Metal buffer, and when the stale callback finally ran, completeMemory() decremented the count below zero → kernel panic.

Fix: replace the countdown integer with an absolute target step (_deferred_clear_at). Each completion computes target = _step_counter + _DEFERRED_CLEAR_DELAY and updates _deferred_clear_at via max(), so every completion gets a full _DEFERRED_CLEAR_DELAY window measured from *its own* finish step, regardless of whether another deferred clear was already pending. The step() tail now checks whether the step counter has reached the target rather than decrementing a countdown, which is both simpler and correct.
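The new bookkeeping can be sketched in plain Python. The surrounding scheduler class is invented for illustration and the delay value is assumed; only the names _step_counter, _deferred_clear_at, and _DEFERRED_CLEAR_DELAY come from the commit message:

```python
_DEFERRED_CLEAR_DELAY = 3  # hypothetical delay, in generation steps


class Scheduler:
    """Toy stand-in for the scheduler; only the deferred-clear logic is shown."""

    def __init__(self):
        self._step_counter = 0
        self._deferred_clear_at = None  # absolute target step, not a countdown

    def on_request_finished(self):
        # Every completion gets a full delay window measured from its *own*
        # finish step; max() extends a pending window instead of ignoring
        # the second completion (the old "only set if None" guard did).
        target = self._step_counter + _DEFERRED_CLEAR_DELAY
        if self._deferred_clear_at is None:
            self._deferred_clear_at = target
        else:
            self._deferred_clear_at = max(self._deferred_clear_at, target)

    def step(self):
        self._step_counter += 1
        # Fire the clear only once the absolute target step is reached.
        if (
            self._deferred_clear_at is not None
            and self._step_counter >= self._deferred_clear_at
        ):
            self._deferred_clear_at = None
            self._do_deferred_clear()

    def _do_deferred_clear(self):
        pass  # placeholder for the actual sync-and-clear work
```

With two completions one step apart, the clear fires a full delay window after the *second* completion, which is the behavior the fix describes.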
Bug 2 — bare mx.clear_cache() calls in SpecPrefill path
---------------------------------------------------------

Two sites in the SpecPrefill code path called mx.clear_cache() directly instead of going through the _sync_and_clear_cache() helper:

• The inter-chunk clear inside the system-prompt full-prefill loop
• The draft-cache teardown after SpecPrefill scoring

Both mx.eval() calls preceding them submit work onto the generation stream asynchronously. Calling mx.clear_cache() without first calling mx.synchronize(generation_stream) can release Metal buffers that are still referenced by in-flight command buffers on that stream, producing the same kernel panic that the _sync_and_clear_cache() wrapper was introduced to prevent (jundot#435).

Fix: replace both bare mx.clear_cache() calls with _sync_and_clear_cache().

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
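The synchronize-before-clear ordering can be illustrated with a toy model in plain Python (no MLX; Stream and BufferCache are invented stand-ins for the generation stream and the Metal buffer cache):

```python
from collections import deque


class Stream:
    """Queue of pending tasks, drained explicitly — like an async GPU stream."""

    def __init__(self):
        self.pending = deque()

    def submit(self, fn):
        self.pending.append(fn)

    def synchronize(self):
        # Run everything that was submitted before this point.
        while self.pending:
            self.pending.popleft()()


class BufferCache:
    """Named buffers whose backing storage can be released in bulk."""

    def __init__(self):
        self.buffers = {}

    def alloc(self, name, data):
        self.buffers[name] = list(data)
        return self.buffers[name]

    def clear(self):
        for buf in self.buffers.values():
            buf.clear()  # simulate releasing the backing memory
        self.buffers = {}


def safe_clear(stream, cache):
    # The _sync_and_clear_cache() pattern: drain in-flight work on the
    # stream *before* releasing cached buffers it may still reference.
    stream.synchronize()
    cache.clear()
```

Clearing first and draining afterwards lets a queued task observe a freed buffer, which is the toy-model analogue of the kernel panic.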
mdevk
left a comment
This is critical for anyone running concurrent VLM inference on Apple Silicon.
Relevant data from our benchmarking
We ran concurrency sweeps (1, 2, 4, 8, 16) on oMLX v0.3.4 with Qwen3-VL-30B-A3B:
| Concurrency | JSON Valid | Note |
|---|---|---|
| 1 | 100% (20/20) | Baseline |
| 2 | 100% (20/20) | Clean |
| 4 | 100% (20/20) | Clean |
| 8 | 95% (19/20) | First quality drop |
| 16 | 40% (8/20) | Severe degradation |
We also observed that on v0.3.5.dev1 (pre-release), concurrency=2 dropped JSON validity to 20% — much worse than v0.3.4 stable. This could be related to the deferred-clear window regression you describe.
We never hit the kernel panic itself, but the quality degradation at high concurrency could be an early symptom of the same buffer reuse-before-completion race. If IOKit hasn't finished completeMemory() callbacks and the buffer gets reallocated, the model reads stale/corrupted KV cache data → garbage output → invalid JSON.
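The reuse-before-completion race can be modeled with a toy prepare/complete counter (plain Python; the class is invented, and only the callback names mirror the IOKit terminology in the PR):

```python
class PreparedBuffer:
    """Toy model of a per-buffer prepare count, as maintained by IOKit."""

    def __init__(self):
        self.prepare_count = 0

    def prepare(self):
        self.prepare_count += 1

    def complete_memory(self):
        # A real kernel would panic on underflow; we raise instead so the
        # failure mode is observable in plain Python.
        if self.prepare_count == 0:
            raise RuntimeError("completeMemory() prepare count underflow")
        self.prepare_count -= 1
```

If the buffer is re-allocated (a fresh prepare()) before a stale completion callback from the previous owner fires, the stale callback consumes the new owner's count, and the new owner's own completion then underflows.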
The fix looks correct
The max() approach for _deferred_clear_at elegantly handles overlapping completions. Each completion extends the window rather than resetting it. Clean design.
Adding mx.synchronize() before mx.clear_cache() in the SpecPrefill path is clearly necessary — asynchronous stream semantics make a bare clear_cache() a race condition by design.
jundot
left a comment
The fix itself looks correct and well thought out. The max() approach for _deferred_clear_at cleanly handles overlapping completions without the indefinite postponement problem the old comment warned about. Each completion gets a full delay window measured from its own finish step, which is exactly the right behavior.
Replacing the bare mx.clear_cache() calls in the specprefill path with _sync_and_clear_cache() is also clearly necessary since async stream semantics make the unsynchronized version a race condition by design.
Three existing tests reference the renamed _deferred_clear_steps attribute and need updating. I'll handle that, plus add concurrent-completion test coverage, in a follow-up commit after merge.
…llow-up) update 3 tests referencing renamed _deferred_clear_steps -> _deferred_clear_at and add concurrent completion window extension test for #557 fix
Fixes #557
Reproduction environment from #557: `max_num_seqs: 2`, SSD paging 92 GB, `prefill_memory_guard: true`