fix: memory-pressure-aware model eviction and reclaim verification #649

yeemio wants to merge 1 commit into jundot:main
Conversation
Add watermark-based pre-load checks, memory settle barriers, and active-request-safe LRU eviction to prevent unbounded memory growth during repeated model switches on Apple Silicon.

Changes:

- Add `MemoryWatermark`/`WatermarkAction` enums with four pressure levels (GREEN/YELLOW/RED/FATAL) based on projected memory utilization
- Add `pre_load_check()` that projects post-load memory including Metal cache deduction and engine-type-scaled overhead
- Add a memory settle barrier in `_unload_engine()` that polls `mx.get_active_memory()` to verify actual Metal buffer reclamation
- Add a pressure-aware LRU eviction loop in `get_engine()` that evicts cached inactive models before loading when the watermark is elevated
- Protect `_find_lru_victim()` from selecting models with active inference requests (`has_active_requests()` check)
- Guard `engine.close()` in `BatchedEngine.stop()` against a `None` reference
- Add `/api/restart-status` and `/api/restart-engine` admin endpoints for memory diagnostics and restart state management

Tested on Apple M5 Max (128GB) with a 50-round model-switching soak test showing zero memory accumulation.

Co-authored-by: Copilot <[email protected]>
Maintainer follow-up: this PR is ready for human review/merge on my side; the changes have already been verified locally on Apple Silicon. GitHub currently shows this PR as mergeable with no conflicts, and I do not see any required status checks attached to the branch. If helpful, I can also split this into a smaller safety-only subset, but I believe the current patch is coherent as one memory-hardening change.
**mdevk** left a comment:
Great PR — we've been hitting this exact problem during VLM inference benchmarking on Mac Studio M3 Ultra 96GB (oMLX v0.3.4).
Our empirical data supporting this PR
During a 308-photo sequential extraction run (Qwen3-VL-30B-A3B-Instruct-4bit, v10b prompt), we observed:
- Memory pressure escalated from level 1 → level 2 (critical) during the run
- `mx.get_active_memory()` showed model weights at 18.25 GB, with only a 0.05 GB KV cache delta per photo
- The memory growth was NOT from per-photo leaks; it came from server-level cache management (paged SSD cache, hot cache) accumulating outside the model's MLX allocations
- `ps` RSS showed ~40MB (misleading: Metal buffers are invisible to Unix process monitoring, as you correctly identified)
The watermark approach makes sense
The four-level watermark (GREEN < 65%, YELLOW < 80%, RED < 90%, FATAL ≥ 90%) is a good fit for Apple Silicon's unified memory. We measured that memory pressure hitting critical caused quality degradation in structured output — JSON validity dropped from 99.4% to 20% at concurrency=2 on an earlier (unconfigured) run. Your pressure-aware eviction would have prevented this.
One question
The settle barrier polls mx.get_active_memory() for up to 5s — have you measured the typical Metal buffer deallocation lag on M3 Ultra? In our testing, mx.clear_cache() reclaimed buffer cache (0.03 GB) instantly, but we didn't test model unload reclamation timing. Curious if 5s is tight or generous for a 20GB+ model teardown.
Keep up the great work on oMLX — we're running production photo extraction on it and the engine quality matters a lot to us. (We also have PR #688 open for a VLM decode speedup in a related area.)
Thanks, this is very useful validation. We saw the same failure mode on Apple Silicon: model unload completed logically, but memory pressure had not actually settled yet. That is the main reason this PR adds a settle barrier based on observed reclaim rather than only the estimated unload size.

On our side (M5 Max, 128 GB), the 5s window was sufficient in the tested paths; in the representative high-water switch case, we observed real reclaim on the order of ~19 GB.

That said, your question is fair: we have stronger evidence for correctness than for universal timing across all Apple Silicon variants, and I would not claim that 5s is optimal for every large-model unload on every machine. If M3 Ultra or larger VLM paths show slower Metal buffer release, the next refinement would be to make the settle timeout configurable rather than hard-coded.

If you want to test this on your side before maintainer review, I also published the hardened patch set separately so the behavior is easy to reproduce outside this PR context. That repo reflects the same practical direction as this PR.
If you try it on the M3 Ultra VLM path, I’d be very interested in whether your unload/reclaim timing still fits comfortably within the current 5s settle window.
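For anyone who wants to reason about the timing question concretely, the settle-barrier logic is, in simplified form, the loop below. This is a sketch, not the literal patch code: the memory reader is injected as a callable so the logic can be exercised without MLX (in the actual patch it is `mx.get_active_memory()`, and `clear_cache` stands in for `mx.clear_cache()`).

```python
import gc
import time

def settle_barrier(expected_freed, baseline, get_active_memory,
                   clear_cache=lambda: None, rounds=10, interval=0.5):
    """Poll active memory until at least `expected_freed` bytes below
    `baseline` have been reclaimed, or the polling window expires.
    With rounds=10 and interval=0.5 this is the 5s window from the PR."""
    for _ in range(rounds):
        if baseline - get_active_memory() >= expected_freed:
            return True  # Metal buffers actually released
        time.sleep(interval)
    # Timed out: a few aggressive GC + cache-clear rounds before giving up.
    for _ in range(3):
        gc.collect()
        clear_cache()
        if baseline - get_active_memory() >= expected_freed:
            return True
    return False
```

The key property is that success is defined by observed reclaim, not by the engine teardown returning; a configurable `rounds`/`interval` would be the natural extension point for slower hardware.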
This should be ready for maintainer review. The failure mode here is not hypothetical: we reproduced it locally on Apple Silicon under high-water model switching, and there is now external confirmation from another contributor seeing the same class of memory-pressure behavior on M3 Ultra under VLM load. I would prefer to keep this PR focused on reclaim-correctness and eviction behavior. If a configurable settle timeout becomes necessary after broader hardware validation, I can spin that out as a separate follow-up instead of expanding this review surface.
Summary
Add memory-pressure-aware model eviction and reclaim verification to the engine pool, preventing unbounded memory growth during repeated model switches on Apple Silicon.
Problem
When oMLX serves multiple models on Apple Silicon (unified memory), switching between large models (e.g. 27B → 120B → 27B) can leave Metal buffer pools, KV caches, and boundary snapshots unreleased. The current eviction logic:
1. **Doesn't verify actual memory reclamation.** `_unload_engine()` clears the engine reference and calls `gc.collect()` + `mx.clear_cache()` once, but never confirms that Metal buffers were actually freed. On Apple Silicon, Metal buffer deallocation is asynchronous and can lag significantly.
2. **Doesn't consider memory pressure before loading.** The engine pool checks whether a model fits in `max_model_memory`, but doesn't check system-wide memory utilization or project post-load pressure. This means it can load a new model while already at 80%+ utilization.
3. **Doesn't protect models with in-flight requests.** `_find_lru_victim()` can select a model that is actively serving inference requests, causing abrupt request failures.
4. **Doesn't evict cached models when memory is tight.** The hot cache can hold multiple models, but `emergency_reclaim()` only does `gc.collect()`/`mx.clear_cache()` without evicting any cached models from the pool.

Over 50+ model switches, this causes steady memory accumulation until the system hits swap pressure or needs a restart.
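To make the unverified-reclaim flaw above concrete, the current unload path amounts to a single fire-and-forget reclaim attempt. This is a minimal sketch of that behavior, not the actual `_unload_engine()` source; `clear_cache` stands in for `mx.clear_cache()` so the sketch runs without MLX.

```python
import gc

def unload_engine_current(pool, name, clear_cache=lambda: None):
    """Sketch of the existing behavior: drop the reference and hope.
    Nothing afterwards verifies that Metal buffers were released."""
    pool.pop(name, None)  # clear the engine reference
    gc.collect()          # one GC pass
    clear_cache()         # one Metal cache clear, no follow-up check
```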
Solution
**Memory settle barrier in `_unload_engine()`**

After stopping an engine, poll `mx.get_active_memory()` for up to 10 rounds (5s) to verify that the expected memory was actually freed. This catches cases where Metal buffer deallocation lags behind the Python-level engine teardown. If the barrier times out, perform 3 rounds of aggressive GC + Metal cache clearing before giving up.

**Watermark-based pre-load check**

Before loading any model, compute projected memory utilization, including the new model's estimated footprint, a deduction for reclaimable Metal cache, and engine-type-scaled overhead. Map the projected utilization to four watermark levels (GREEN < 65%, YELLOW < 80%, RED < 90%, FATAL ≥ 90%) and take the appropriate action.
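The projection-and-mapping step can be sketched as follows. Thresholds and the `MemoryWatermark` enum name come from this PR; the helper function names (`classify_watermark`, `project_post_load`) and the simple overhead model are illustrative simplifications, not the exact patch code.

```python
from enum import Enum

class MemoryWatermark(Enum):
    GREEN = "green"    # projected utilization < 65%: safe to load
    YELLOW = "yellow"  # < 80%: evict LRU models before loading
    RED = "red"        # < 90%: aggressive eviction required
    FATAL = "fatal"    # >= 90%: refuse the load

def classify_watermark(projected_bytes: int, total_bytes: int) -> MemoryWatermark:
    """Map projected post-load memory utilization to a watermark level."""
    utilization = projected_bytes / total_bytes
    if utilization < 0.65:
        return MemoryWatermark.GREEN
    if utilization < 0.80:
        return MemoryWatermark.YELLOW
    if utilization < 0.90:
        return MemoryWatermark.RED
    return MemoryWatermark.FATAL

def project_post_load(active_bytes: int, cache_bytes: int,
                      model_bytes: int, overhead_scale: float) -> int:
    """Project post-load memory: current active memory, minus reclaimable
    Metal cache, plus the new model scaled by an engine-type overhead factor."""
    return active_bytes - cache_bytes + int(model_bytes * overhead_scale)
```

In the real patch, `active_bytes` and `cache_bytes` would come from `mx.get_active_memory()` and `mx.get_cache_memory()` respectively.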
**Pressure-aware LRU eviction**
When the pre-load check returns YELLOW or higher, evict LRU non-active cached models one at a time, re-checking the watermark after each eviction, until pressure drops to an acceptable level. Only then proceed with the new model load.
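The eviction loop described above can be sketched as follows. This is a simplified stand-in for the pre-load path in `get_engine()`: the `CachedModel` record and the string watermark returned by `check_watermark` are illustrative, but the loop shape (evict one LRU non-active model, re-check, repeat) follows the PR's description.

```python
from dataclasses import dataclass

@dataclass
class CachedModel:
    name: str
    last_used: float
    active_requests: int

    def has_active_requests(self) -> bool:
        return self.active_requests > 0

def evict_until_safe(pool: dict, check_watermark) -> list:
    """Evict LRU non-active models one at a time, re-checking the
    watermark after each eviction, until pressure is GREEN or no
    safely evictable candidates remain. Returns the eviction order."""
    evicted = []
    while check_watermark() != "GREEN":
        # Candidates: cached models not serving in-flight requests.
        candidates = [m for m in pool.values() if not m.has_active_requests()]
        if not candidates:
            break  # nothing safe to evict; caller decides what to do next
        victim = min(candidates, key=lambda m: m.last_used)  # LRU
        del pool[victim.name]
        evicted.append(victim.name)
    return evicted
```

Re-checking the watermark after every eviction (rather than evicting a precomputed set) matters because each unload changes the projection, so the loop stops as early as possible.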
**Active request protection in `_find_lru_victim()`**

Skip models with active inference requests (`has_active_requests()`) during victim selection. This prevents evicting a model mid-generation.

**Protected `engine.close()` in the batched engine**

Guard the `engine.close()` call in `BatchedEngine.stop()` to handle cases where the inner engine reference is already `None`, preventing an `AttributeError` during rapid unload sequences.

**Diagnostics endpoints**
Add `/api/restart-status` and `/api/restart-engine` admin endpoints exposing the memory watermark, loaded model details, the last eviction event, and restart recommendation state.

Verification
Tested on Mac17,6 (Apple M5 Max, 128GB) with oMLX 0.3.4:

- 50-round model-switching soak test showing zero memory accumulation
- Representative unload reported `freed=18.99GB` (expected `>=17.94GB`)

Detailed write-up and test data: omlx-runtime-hardening
Files Changed
- `omlx/process_memory_enforcer.py`: `MemoryWatermark`/`WatermarkAction` enums, `pre_load_check()`, `emergency_reclaim()`, `get_memory_diagnostics()`
- `omlx/engine_pool.py`: settle barrier in `_unload_engine()`, watermark-aware pre-load eviction in `get_engine()`, active-request-safe `_find_lru_victim()`
- `omlx/engine/batched.py`: guarded `engine.close()` in `stop()`
- `omlx/admin/routes.py`: `/api/restart-status` and `/api/restart-engine` endpoints
- `tests/test_engine_pool.py`, `tests/test_process_memory_enforcer.py`: tests

Known Limitations
- `pre_load_check()` uses `mx.get_cache_memory()` to deduct reclaimable cache, which is correct for Apple Silicon but may behave differently on other backends.