
fix: memory-pressure-aware model eviction and reclaim verification#649

Open
yeemio wants to merge 1 commit into jundot:main from yeemio:fix/memory-pressure-aware-eviction

Conversation


@yeemio yeemio commented Apr 7, 2026

Summary

Add memory-pressure-aware model eviction and reclaim verification to the engine pool, preventing unbounded memory growth during repeated model switches on Apple Silicon.

Problem

When oMLX serves multiple models on Apple Silicon (unified memory), switching between large models (e.g. 27B → 120B → 27B) can leave Metal buffer pools, KV caches, and boundary snapshots unreleased. The current eviction logic:

  1. Doesn't verify actual memory reclamation — _unload_engine() clears the engine reference and calls gc.collect() + mx.clear_cache() once, but never confirms Metal buffers were actually freed. On Apple Silicon, Metal buffer deallocation is asynchronous and can lag significantly.

  2. Doesn't consider memory pressure before loading — The engine pool checks whether a model fits in max_model_memory, but doesn't check system-wide memory utilization or project post-load pressure. This means it can load a new model while already at 80%+ utilization.

  3. Doesn't protect models with in-flight requests — _find_lru_victim() can select a model that is actively serving inference requests, causing abrupt request failures.

  4. Doesn't evict cached models when memory is tight — The hot cache can hold multiple models, but emergency_reclaim() only does gc.collect() / mx.clear_cache() without evicting any cached models from the pool.

Across 50+ model switches, this causes steady memory accumulation until the system hits swap pressure or needs a restart.

Solution

Memory settle barrier in _unload_engine()

After stopping an engine, poll mx.get_active_memory() for up to 10 rounds (5s) to verify the expected memory was actually freed. This catches cases where Metal buffer deallocation lags behind the Python-level engine teardown. If the barrier times out, perform 3 rounds of aggressive GC + Metal cache clear before giving up.
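The settle barrier described above can be sketched as follows. This is an illustrative, backend-agnostic version: the memory accessors are injected as plain callables (the PR itself uses mx.get_active_memory() and mx.clear_cache()), and the round counts mirror the ones stated above:

```python
import gc
import time


def wait_for_reclaim(get_active_memory, expected_free_bytes, baseline_bytes,
                     rounds=10, interval=0.5, aggressive_rounds=3,
                     clear_cache=None):
    """Poll active memory until at least expected_free_bytes has been
    released relative to baseline_bytes, or the barrier times out.

    With rounds=10 and interval=0.5 this matches the 10-round / 5s
    window described in the PR. Returns True if the reclaim was
    observed, False if even the aggressive fallback did not settle.
    """
    target = baseline_bytes - expected_free_bytes
    for _ in range(rounds):
        if get_active_memory() <= target:
            return True
        time.sleep(interval)

    # Barrier timed out: a few aggressive GC + cache-clear passes
    # before giving up, as the PR describes.
    for _ in range(aggressive_rounds):
        gc.collect()
        if clear_cache is not None:
            clear_cache()
        if get_active_memory() <= target:
            return True
    return False
```

The caller would compute `expected_free_bytes` from the unloading model's estimated size and log the observed delta (the `freed=18.99GB (expected>=17.94GB)` lines in the Verification section below come from exactly this comparison).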

Watermark-based pre-load check

Before loading any model, compute projected memory utilization including:

  • Current active memory minus reclaimable Metal cache
  • New model's estimated size
  • Runtime overhead scaled by engine type (5% for embeddings, 10-25% for LLMs)

Map the projected utilization to four watermark levels (GREEN < 65%, YELLOW < 80%, RED < 90%, FATAL ≥ 90%) and take appropriate action.
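A minimal sketch of the projection and watermark mapping, using the thresholds and overhead ranges stated above (function and field names here are illustrative, not necessarily the exact pre_load_check() API):

```python
from enum import Enum


class MemoryWatermark(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"
    FATAL = "fatal"


def classify_utilization(projected_utilization):
    """Map projected utilization (0.0-1.0) to a watermark level
    using the PR's thresholds: GREEN < 65%, YELLOW < 80%,
    RED < 90%, FATAL >= 90%."""
    if projected_utilization < 0.65:
        return MemoryWatermark.GREEN
    if projected_utilization < 0.80:
        return MemoryWatermark.YELLOW
    if projected_utilization < 0.90:
        return MemoryWatermark.RED
    return MemoryWatermark.FATAL


def project_utilization(active_bytes, cache_bytes, model_bytes,
                        overhead_ratio, total_bytes):
    """Projected post-load utilization: current active memory minus
    reclaimable cache, plus the new model's estimated size scaled by
    runtime overhead (roughly 5% for embeddings, 10-25% for LLMs,
    per the bullets above)."""
    projected = (active_bytes - cache_bytes) + model_bytes * (1 + overhead_ratio)
    return projected / total_bytes
```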

Pressure-aware LRU eviction

When the pre-load check returns YELLOW or higher, evict LRU non-active cached models one at a time, re-checking the watermark after each eviction, until pressure drops to an acceptable level. Only then proceed with the new model load.
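The eviction loop can be expressed as a small function-injected sketch (the real get_engine() path operates on the pool directly; the callables here stand in for that machinery):

```python
def evict_until_acceptable(find_lru_victim, evict, check_watermark,
                           acceptable=("GREEN",)):
    """Evict LRU non-active cached models one at a time, re-checking
    the watermark after each eviction, until pressure is acceptable
    or nothing evictable remains. Returns the eviction order."""
    evicted = []
    while check_watermark() not in acceptable:
        victim = find_lru_victim()  # returns None when nothing is evictable
        if victim is None:
            break  # nothing safe to evict; caller decides what to do
        evict(victim)
        evicted.append(victim)
    return evicted


# Tiny in-memory demonstration: pressure clears after two evictions.
cached = ["model-a", "model-b", "model-c"]
order = evict_until_acceptable(
    lambda: cached[0] if cached else None,
    cached.remove,
    lambda: "GREEN" if len(cached) <= 1 else "YELLOW",
)
```

Re-checking the watermark after every single eviction (rather than evicting a precomputed batch) means the loop stops as soon as the settle barrier has actually returned enough memory, which avoids over-evicting warm models.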

Active request protection in _find_lru_victim()

Skip models with active inference requests (has_active_requests()) during victim selection. This prevents evicting a model mid-generation.
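The victim-selection rule mirrors what _find_lru_victim() does after this change; the sketch below assumes a simple mapping of model id to (last-used timestamp, active-request flag), which is an illustrative data shape rather than the pool's real bookkeeping:

```python
def find_lru_victim(models, exclude=frozenset()):
    """Pick the least-recently-used cached model that is safe to
    evict: not in the exclude set and not serving in-flight requests
    (the has_active_requests() check from the PR).

    models: dict of model_id -> (last_used_ts, has_active_requests)
    Returns the victim's model_id, or None if nothing is evictable.
    """
    candidates = [
        (last_used, model_id)
        for model_id, (last_used, active) in models.items()
        if model_id not in exclude and not active
    ]
    return min(candidates)[1] if candidates else None
```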

Protected engine.close() in batched engine

Guard the engine.close() call in BatchedEngine.stop() to handle cases where the inner engine reference is already None, preventing AttributeError during rapid unload sequences.
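The guard itself is small; this standalone sketch keeps only the None check described above (the real BatchedEngine carries much more state than this):

```python
class BatchedEngine:
    """Illustrative stand-in for the batched engine wrapper; only the
    guarded stop() reflects the PR's change."""

    def __init__(self, engine):
        self.engine = engine

    def stop(self):
        # Swap the reference out atomically so a racing second stop()
        # sees None instead of a half-torn-down engine.
        engine, self.engine = self.engine, None
        if engine is not None:
            engine.close()
            return True
        return False  # already stopped; nothing to close, no AttributeError
```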

Diagnostics endpoint

Add /api/restart-status and /api/restart-engine admin endpoints exposing memory watermark, loaded model details, last eviction event, and restart recommendation state.
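A sketch of the payload such an endpoint might assemble; the field names below are illustrative (the PR's get_memory_diagnostics() defines the actual shape):

```python
def get_memory_diagnostics(active_bytes, cache_bytes, total_bytes,
                           loaded_models, last_eviction=None):
    """Build a diagnostics payload covering the items the PR lists:
    memory watermark, loaded model details, last eviction event, and
    restart recommendation state."""
    utilization = (active_bytes - cache_bytes) / total_bytes
    return {
        "utilization": round(utilization, 4),
        "watermark": ("GREEN" if utilization < 0.65 else
                      "YELLOW" if utilization < 0.80 else
                      "RED" if utilization < 0.90 else "FATAL"),
        "loaded_models": loaded_models,
        "last_eviction": last_eviction,
        "restart_recommended": utilization >= 0.90,
    }
```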

Verification

Tested on Mac17,6 (Apple M5 Max, 128GB) with oMLX 0.3.4:

  • High-watermark switching (27B ↔ 120B, 2 full cycles): Memory returns to baseline after each switch. Settle barrier logs confirm freed=18.99GB (expected>=17.94GB).
  • 50-round soak test (27B ↔ 35B): Zero memory accumulation across all 50 cycles. Active memory locked at 33.1GB ± 0.0GB. No restart requests triggered.
  • Active request protection: Verified via concurrent inference + model switch — active model is never selected as eviction victim.

Detailed write-up and test data: omlx-runtime-hardening

Files Changed

  • omlx/process_memory_enforcer.py — Add MemoryWatermark/WatermarkAction enums, pre_load_check(), emergency_reclaim(), get_memory_diagnostics()
  • omlx/engine_pool.py — Memory settle barrier in _unload_engine(), watermark-aware pre-load eviction in get_engine(), active-request-safe _find_lru_victim()
  • omlx/engine/batched.py — Protected engine.close() in stop()
  • omlx/admin/routes.py — /api/restart-status and /api/restart-engine endpoints
  • tests/test_engine_pool.py — Tests for active-request skip, exclude set, restart state
  • tests/test_process_memory_enforcer.py — Tests for watermark levels and action enum

Known Limitations

  • Watermark thresholds (65/80/90%) are hardcoded. A future enhancement could make them configurable via settings.
  • The settle barrier timeout (5s) may be too short for very large models on slower storage. The emergency reclaim path handles this, but a configurable timeout would be better.
  • pre_load_check() uses mx.get_cache_memory() to deduct reclaimable cache, which is correct for Apple Silicon but may behave differently on other backends.

Add watermark-based pre-load checks, memory settle barriers, and
active-request-safe LRU eviction to prevent unbounded memory growth
during repeated model switches on Apple Silicon.

Changes:
- Add MemoryWatermark/WatermarkAction enums with four pressure levels
  (GREEN/YELLOW/RED/FATAL) based on projected memory utilization
- Add pre_load_check() that projects post-load memory including
  Metal cache deduction and engine-type-scaled overhead
- Add memory settle barrier in _unload_engine() that polls
  mx.get_active_memory() to verify actual Metal buffer reclamation
- Add pressure-aware LRU eviction loop in get_engine() that evicts
  cached inactive models before loading when watermark is elevated
- Protect _find_lru_victim() from selecting models with active
  inference requests (has_active_requests() check)
- Guard engine.close() in BatchedEngine.stop() against None reference
- Add /api/restart-status and /api/restart-engine admin endpoints
  for memory diagnostics and restart state management

Tested on Apple M5 Max (128GB) with 50-round model switching soak
test showing zero memory accumulation.

Co-authored-by: Copilot <[email protected]>

yeemio commented Apr 8, 2026

Maintainer follow-up: this PR is ready for human review/merge on my side.

What is already verified locally on Apple Silicon:

  • repeated model switching returns memory back to baseline instead of accumulating
  • active-request engines are excluded from LRU eviction
  • preload watermark checks prevent loading into already-tight memory states
  • admin diagnostics endpoints expose restart recommendation and last eviction state

GitHub currently shows this PR as mergeable with no conflicts, and I do not see any required status checks attached to the branch. If helpful, I can also split this into a smaller safety-only subset, but I believe the current patch is coherent as one memory-hardening change.


@mdevk mdevk left a comment


Great PR — we've been hitting this exact problem during VLM inference benchmarking on Mac Studio M3 Ultra 96GB (oMLX v0.3.4).

Our empirical data supporting this PR

During a 308-photo sequential extraction run (Qwen3-VL-30B-A3B-Instruct-4bit, v10b prompt), we observed:

  • Memory pressure escalated from level 1 → level 2 (critical) during the run
  • mx.get_active_memory() showed model weights at 18.25 GB with only 0.05 GB KV cache delta per photo
  • The memory growth was NOT from per-photo leaks — it was from server-level cache management (paged SSD cache, hot cache) accumulating outside the model's MLX allocations
  • ps RSS showed ~40MB (misleading — Metal buffers invisible to Unix process monitoring, as you correctly identified)

The watermark approach makes sense

The four-level watermark (GREEN < 65%, YELLOW < 80%, RED < 90%, FATAL ≥ 90%) is a good fit for Apple Silicon's unified memory. We measured that memory pressure hitting critical caused quality degradation in structured output — JSON validity dropped from 99.4% to 20% at concurrency=2 on an earlier (unconfigured) run. Your pressure-aware eviction would have prevented this.

One question

The settle barrier polls mx.get_active_memory() for up to 5s — have you measured the typical Metal buffer deallocation lag on M3 Ultra? In our testing, mx.clear_cache() reclaimed buffer cache (0.03 GB) instantly, but we didn't test model unload reclamation timing. Curious if 5s is tight or generous for a 20GB+ model teardown.

Keep up the great work on oMLX — we're running production photo extraction on it and the engine quality matters a lot to us. (We also have PR #688 open for a VLM decode speedup in a related area.)


yeemio commented Apr 10, 2026

Thanks, this is very useful validation.

We saw the same failure mode on Apple Silicon when model unload completed logically but memory pressure had not actually settled yet. That is the main reason this PR adds a settle barrier based on observed reclaim rather than only estimated unload size.

On our side (M5 Max, 128 GB), the 5s window was sufficient in the tested paths. In the representative high-water switch case, we observed real reclaim on the order of ~19 GB (freed=18.99GB), and the unload settled within the current polling window, so 5s was enough for the scenarios that motivated this patch.

That said, your question is fair: we have stronger evidence for correctness than for universal timing across all Apple Silicon variants, and I would not claim that 5s is optimal for every large-model unload on every machine. If M3 Ultra or larger VLM paths show slower Metal buffer release, the next refinement would be to make the settle timeout configurable rather than hard-coded.

If you want to test this on your side before maintainer review, I also published the hardened patch set separately so the behavior is easy to reproduce outside this PR context:
https://github.com/yeemio/omlx-runtime-hardening

That repo reflects the same practical direction as this PR:

  • memory-pressure-aware eviction
  • reclaim verification based on observed memory deltas
  • restart/status observability

If you try it on the M3 Ultra VLM path, I’d be very interested in whether your unload/reclaim timing still fits comfortably within the current 5s settle window.


yeemio commented Apr 10, 2026

This should be ready for maintainer review.

The failure mode here is not hypothetical: we reproduced it locally on Apple Silicon under high-water model switching, and there is now external confirmation from another contributor seeing the same class of memory-pressure behavior on M3 Ultra under VLM load.

I would prefer to keep this PR focused on reclaim-correctness and eviction behavior. If configurable settle timeout becomes necessary after broader hardware validation, I can spin that out as a separate follow-up instead of expanding this review surface.

