fix: prevent IOKit prepare count underflow with concurrent completions (#557)#648

Merged
jundot merged 1 commit into jundot:main from Chuhan1112:fix/iogpu-prepare-count-underflow-557
Apr 14, 2026

Conversation

@Chuhan1112

Fixes #557

Two separate bugs could each independently trigger the 'completeMemory() prepare count underflow' kernel panic on Apple Silicon.

Bug 1 — deferred-clear window not extended for concurrent completions

The deferred-clear mechanism introduced in #435 used a countdown integer (_deferred_clear_steps) anchored to the first request that finished in a batch. The scheduling code guarded the assignment with "only set if None", so when a second request finished in the same or immediately following step (as happens with max_num_seqs=2 and the Sisyphus multi-agent workload in #557), its completion was silently ignored.

The first request's KV cache blocks were returned to the free queue immediately on completion and could be re-allocated to the next incoming request within one generation step — long before IOKit had finished the completeMemory() callbacks for those buffers. The new allocation called prepare() on the same Metal buffer, and when the stale callback finally ran, completeMemory() decremented the count below zero → kernel panic.

Fix: replace the countdown integer with an absolute target step (_deferred_clear_at). Each completion computes target = _step_counter + _DEFERRED_CLEAR_DELAY and updates _deferred_clear_at via max(), so every completion gets a full _DEFERRED_CLEAR_DELAY window measured from its own finish step, regardless of whether another deferred clear was already pending.
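The max()-based bookkeeping described above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual code: the attribute names (`_deferred_clear_at`, `_step_counter`, `_DEFERRED_CLEAR_DELAY`) come from the PR text, but the surrounding `Scheduler` class, the delay value, and the `step()` return convention are invented for the sketch.

```python
# Hypothetical sketch of the absolute-target deferred-clear logic.
# Only the attribute names come from the PR; everything else is illustrative.
_DEFERRED_CLEAR_DELAY = 3  # assumed value for the sketch

class Scheduler:
    def __init__(self):
        self._step_counter = 0
        self._deferred_clear_at = None  # absolute step at which clearing is safe

    def on_request_finished(self):
        # Every completion extends the window via max(), so a second
        # completion in the same or next step is no longer silently ignored
        # by an "only set if None" guard.
        target = self._step_counter + _DEFERRED_CLEAR_DELAY
        if self._deferred_clear_at is None:
            self._deferred_clear_at = target
        else:
            self._deferred_clear_at = max(self._deferred_clear_at, target)

    def step(self):
        self._step_counter += 1
        # Tail check: compare against the absolute target instead of
        # decrementing a countdown anchored to the first completion.
        if (self._deferred_clear_at is not None
                and self._step_counter >= self._deferred_clear_at):
            self._deferred_clear_at = None
            return True  # full delay elapsed for every completion; safe to clear
        return False
```

With this shape, a second request finishing one step after the first pushes the target out to its own `finish_step + _DEFERRED_CLEAR_DELAY`, rather than inheriting the remainder of the first request's window.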

Bug 2 — bare mx.clear_cache() calls in SpecPrefill path

Two sites in the SpecPrefill code path called mx.clear_cache() directly instead of going through the _sync_and_clear_cache() helper:

  • The inter-chunk clear inside the system-prompt full-prefill loop
  • The draft-cache teardown after SpecPrefill scoring

The mx.eval() calls immediately preceding both sites submit work onto the generation stream asynchronously. Calling mx.clear_cache() without first calling mx.synchronize(generation_stream) can release Metal buffers that are still referenced by in-flight command buffers on that stream — the same race that _sync_and_clear_cache() was introduced to prevent in #435.

Fix: replace both bare mx.clear_cache() calls with _sync_and_clear_cache().
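The ordering requirement can be illustrated with a pure-Python toy model. `FakeStream` and `BufferCache` below are invented stand-ins for the Metal generation stream and the MLX buffer cache — this is not real mlx API, only a sketch of why synchronizing before clearing closes the race.

```python
# Toy model of the sync-before-clear rule. FakeStream/BufferCache are
# illustrative stand-ins, not mlx classes.
class FakeStream:
    def __init__(self):
        self.in_flight = set()   # buffers referenced by pending command buffers

    def submit(self, buf):
        self.in_flight.add(buf)  # models mx.eval() enqueueing async work

    def synchronize(self):
        self.in_flight.clear()   # models mx.synchronize(): wait for all work


class BufferCache:
    def __init__(self):
        self.cached = set()

    def clear(self, stream):
        # Models mx.clear_cache(). Releasing a buffer the stream still
        # references is the race: the stale completeMemory() callback
        # later runs against reused memory.
        racy = self.cached & stream.in_flight
        self.cached.clear()
        return racy              # non-empty set == potential kernel panic


stream, cache = FakeStream(), BufferCache()

# The bug: bare clear while work is still in flight.
cache.cached.add("kv_block_0")
stream.submit("kv_block_0")
assert cache.clear(stream) == {"kv_block_0"}

# The fix: synchronize first, then clear — nothing is in flight anymore.
cache.cached.add("kv_block_0")
stream.submit("kv_block_0")
stream.synchronize()
assert cache.clear(stream) == set()
```

In the real helper, the same ordering is what _sync_and_clear_cache() enforces: drain the generation stream, then release cached buffers.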

Reproduction environment from #557

  • M1 Max Mac Studio, 32 GB RAM, macOS Tahoe 26.3
  • oMLX v0.3.0 (also confirmed on v0.3.4 by Chuhan1112)
  • Model: MLX-Qwen3.5-27B (14.8 GB, pinned memory)
  • max_num_seqs: 2, SSD paging 92 GB, prefill_memory_guard: true
  • Workload: OpenCode + Sisyphus multi-agent framework

fix: prevent IOKit prepare count underflow with concurrent completions (jundot#557)

Two separate bugs could each independently trigger the
'completeMemory() prepare count underflow' kernel panic on Apple
Silicon, both rooted in Metal buffer lifecycle races.

Bug 1 — deferred-clear window not extended for concurrent completions
----------------------------------------------------------------------
The deferred-clear mechanism (introduced in jundot#435) used a countdown
integer (_deferred_clear_steps) anchored to the *first* request that
finished in a batch.  The scheduling code guarded the assignment with
"only set if None", so when a second request finished in the same or
the immediately following step (as happens with max_num_seqs=2 and
the Sisyphus multi-agent workload in jundot#557), its completion was silently
ignored.

The first request's KV cache blocks were returned to the free queue
immediately on completion and could be re-allocated to the next
incoming request within one generation step — long before IOKit had
finished the completeMemory() callbacks for those buffers.  The new
allocation called prepare() on the same Metal buffer, and when the
stale callback finally ran, completeMemory() decremented the count
below zero → kernel panic.

Fix: replace the countdown integer with an absolute target step
(_deferred_clear_at).  Each completion computes
  target = _step_counter + _DEFERRED_CLEAR_DELAY
and updates _deferred_clear_at via max(), so every completion gets a
full _DEFERRED_CLEAR_DELAY window measured from *its own* finish step,
regardless of whether another deferred clear was already pending.
The step() tail checks whether the counter has reached the target
rather than incrementing a counter, which is both simpler and correct.

Bug 2 — bare mx.clear_cache() calls in SpecPrefill path
---------------------------------------------------------
Two sites in the SpecPrefill code path called mx.clear_cache()
directly instead of going through the _sync_and_clear_cache() helper:

  • The inter-chunk clear inside the system-prompt full-prefill loop
  • The draft-cache teardown after SpecPrefill scoring

Both mx.eval() calls preceding them submit work onto the generation
stream asynchronously.  Calling mx.clear_cache() without first calling
mx.synchronize(generation_stream) can release Metal buffers that are
still referenced by in-flight command buffers on that stream, producing
the same kernel panic that the _sync_and_clear_cache() wrapper was
introduced to prevent (jundot#435).

Fix: replace both bare mx.clear_cache() calls with _sync_and_clear_cache().

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Contributor

@mdevk left a comment


This is critical for anyone running concurrent VLM inference on Apple Silicon.

Relevant data from our benchmarking

We ran concurrency sweeps (1, 2, 4, 8, 16) on oMLX v0.3.4 with Qwen3-VL-30B-A3B:

| Concurrency | JSON Valid | Note |
| --- | --- | --- |
| 1 | 100% (20/20) | Baseline |
| 2 | 100% (20/20) | Clean |
| 4 | 100% (20/20) | Clean |
| 8 | 95% (19/20) | First quality drop |
| 16 | 40% (8/20) | Severe degradation |

We also observed that on v0.3.5.dev1 (pre-release), concurrency=2 dropped JSON validity to 20% — much worse than v0.3.4 stable. This could be related to the deferred-clear window regression you describe.

We never hit the kernel panic itself, but the quality degradation at high concurrency could be an early symptom of the same buffer reuse-before-completion race. If IOKit hasn't finished completeMemory() callbacks and the buffer gets reallocated, the model reads stale/corrupted KV cache data → garbage output → invalid JSON.

The fix looks correct

The max() approach for _deferred_clear_at elegantly handles overlapping completions. Each completion extends the window rather than resetting it. Clean design.

The SpecPrefill mx.synchronize() before mx.clear_cache() is clearly necessary — asynchronous stream semantics make bare clear_cache() a race condition by design.

Owner

@jundot left a comment


The fix itself looks correct and well thought out. The max() approach for _deferred_clear_at cleanly handles overlapping completions without the indefinite postponement problem the old comment warned about. Each completion gets a full delay window measured from its own finish step, which is exactly the right behavior.

Replacing the bare mx.clear_cache() calls in the specprefill path with _sync_and_clear_cache() is also clearly necessary since async stream semantics make the unsynchronized version a race condition by design.

Three existing tests reference the renamed _deferred_clear_steps attribute and need updating. I'll handle that, plus add concurrent-completion test coverage, in a follow-up commit after merge.

@jundot jundot merged commit f7905e7 into jundot:main Apr 14, 2026
jundot added a commit that referenced this pull request Apr 14, 2026
…llow-up)

update 3 tests referencing renamed _deferred_clear_steps -> _deferred_clear_at
and add concurrent completion window extension test for #557 fix


Development

Successfully merging this pull request may close these issues.

Kernel panic: completeMemory() prepare count underflow on M1 Max 32GB with v0.3.0

3 participants