
fix: memory-pressure-aware model eviction and reclaim verification#649

Open
yeemio wants to merge 1 commit into jundot:main from yeemio:fix/memory-pressure-aware-eviction

Conversation


@yeemio yeemio commented Apr 7, 2026

Summary

Add memory-pressure-aware model eviction and reclaim verification to the engine pool, preventing unbounded memory growth during repeated model switches on Apple Silicon.

Problem

When oMLX serves multiple models on Apple Silicon (unified memory), switching between large models (e.g. 27B → 120B → 27B) can leave Metal buffer pools, KV caches, and boundary snapshots unreleased. The current eviction logic:

  1. Doesn't verify actual memory reclamation — _unload_engine() clears the engine reference and calls gc.collect() + mx.clear_cache() once, but never confirms Metal buffers were actually freed. On Apple Silicon, Metal buffer deallocation is asynchronous and can lag significantly.

  2. Doesn't consider memory pressure before loading — The engine pool checks whether a model fits in max_model_memory, but doesn't check system-wide memory utilization or project post-load pressure. This means it can load a new model while already at 80%+ utilization.

  3. Doesn't protect models with in-flight requests — _find_lru_victim() can select a model that is actively serving inference requests, causing abrupt request failures.

  4. Doesn't evict cached models when memory is tight — The hot cache can hold multiple models, but emergency_reclaim() only does gc.collect() / mx.clear_cache() without evicting any cached models from the pool.

Across 50+ model switches, this causes steady memory accumulation until the system hits swap pressure or needs a restart.

Solution

Memory settle barrier in _unload_engine()

After stopping an engine, poll mx.get_active_memory() for up to 10 rounds (5s) to verify the expected memory was actually freed. This catches cases where Metal buffer deallocation lags behind the Python-level engine teardown. If the barrier times out, perform 3 rounds of aggressive GC + Metal cache clear before giving up.
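The settle barrier described above can be sketched as follows. This is an illustrative, backend-agnostic version: the memory accessors are injected as plain callables (the PR itself uses mx.get_active_memory() and mx.clear_cache()), and the round counts mirror the ones stated above:

```python
import gc
import time


def wait_for_reclaim(get_active_memory, expected_free_bytes, baseline_bytes,
                     rounds=10, interval=0.5, aggressive_rounds=3,
                     clear_cache=None):
    """Poll active memory until at least expected_free_bytes has been
    released relative to baseline_bytes, or the barrier times out.

    With rounds=10 and interval=0.5 this matches the 10-round / 5s
    window described in the PR. Returns True if the reclaim was
    observed, False if even the aggressive fallback did not settle.
    """
    target = baseline_bytes - expected_free_bytes
    for _ in range(rounds):
        if get_active_memory() <= target:
            return True
        time.sleep(interval)

    # Barrier timed out: a few aggressive GC + cache-clear passes
    # before giving up, as the PR describes.
    for _ in range(aggressive_rounds):
        gc.collect()
        if clear_cache is not None:
            clear_cache()
        if get_active_memory() <= target:
            return True
    return False
```

The caller would compute `expected_free_bytes` from the unloading model's estimated size and log the observed delta (the `freed=18.99GB (expected>=17.94GB)` lines in the Verification section below come from exactly this comparison).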

Watermark-based pre-load check

Before loading any model, compute projected memory utilization including:

  • Current active memory minus reclaimable Metal cache
  • New model's estimated size
  • Runtime overhead scaled by engine type (5% for embeddings, 10-25% for LLMs)

Map the projected utilization to four watermark levels (GREEN < 65%, YELLOW < 80%, RED < 90%, FATAL ≥ 90%) and take appropriate action.
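A minimal sketch of the projection and watermark mapping, using the thresholds and overhead ranges stated above (function and field names here are illustrative, not necessarily the exact pre_load_check() API):

```python
from enum import Enum


class MemoryWatermark(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"
    FATAL = "fatal"


def classify_utilization(projected_utilization):
    """Map projected utilization (0.0-1.0) to a watermark level
    using the PR's thresholds: GREEN < 65%, YELLOW < 80%,
    RED < 90%, FATAL >= 90%."""
    if projected_utilization < 0.65:
        return MemoryWatermark.GREEN
    if projected_utilization < 0.80:
        return MemoryWatermark.YELLOW
    if projected_utilization < 0.90:
        return MemoryWatermark.RED
    return MemoryWatermark.FATAL


def project_utilization(active_bytes, cache_bytes, model_bytes,
                        overhead_ratio, total_bytes):
    """Projected post-load utilization: current active memory minus
    reclaimable cache, plus the new model's estimated size scaled by
    runtime overhead (roughly 5% for embeddings, 10-25% for LLMs,
    per the bullets above)."""
    projected = (active_bytes - cache_bytes) + model_bytes * (1 + overhead_ratio)
    return projected / total_bytes
```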

Pressure-aware LRU eviction

When the pre-load check returns YELLOW or higher, evict LRU non-active cached models one at a time, re-checking the watermark after each eviction, until pressure drops to an acceptable level. Only then proceed with the new model load.
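The eviction loop can be expressed as a small function-injected sketch (the real get_engine() path operates on the pool directly; the callables here stand in for that machinery):

```python
def evict_until_acceptable(find_lru_victim, evict, check_watermark,
                           acceptable=("GREEN",)):
    """Evict LRU non-active cached models one at a time, re-checking
    the watermark after each eviction, until pressure is acceptable
    or nothing evictable remains. Returns the eviction order."""
    evicted = []
    while check_watermark() not in acceptable:
        victim = find_lru_victim()  # returns None when nothing is evictable
        if victim is None:
            break  # nothing safe to evict; caller decides what to do
        evict(victim)
        evicted.append(victim)
    return evicted


# Tiny in-memory demonstration: pressure clears after two evictions.
cached = ["model-a", "model-b", "model-c"]
order = evict_until_acceptable(
    lambda: cached[0] if cached else None,
    cached.remove,
    lambda: "GREEN" if len(cached) <= 1 else "YELLOW",
)
```

Re-checking the watermark after every single eviction (rather than evicting a precomputed batch) means the loop stops as soon as the settle barrier has actually returned enough memory, which avoids over-evicting warm models.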

Active request protection in _find_lru_victim()

Skip models with active inference requests (has_active_requests()) during victim selection. This prevents evicting a model mid-generation.
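The victim-selection rule mirrors what _find_lru_victim() does after this change; the sketch below assumes a simple mapping of model id to (last-used timestamp, active-request flag), which is an illustrative data shape rather than the pool's real bookkeeping:

```python
def find_lru_victim(models, exclude=frozenset()):
    """Pick the least-recently-used cached model that is safe to
    evict: not in the exclude set and not serving in-flight requests
    (the has_active_requests() check from the PR).

    models: dict of model_id -> (last_used_ts, has_active_requests)
    Returns the victim's model_id, or None if nothing is evictable.
    """
    candidates = [
        (last_used, model_id)
        for model_id, (last_used, active) in models.items()
        if model_id not in exclude and not active
    ]
    return min(candidates)[1] if candidates else None
```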

Protected engine.close() in batched engine

Guard the engine.close() call in BatchedEngine.stop() to handle cases where the inner engine reference is already None, preventing AttributeError during rapid unload sequences.
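The guard itself is small; this standalone sketch keeps only the None check described above (the real BatchedEngine carries much more state than this):

```python
class BatchedEngine:
    """Illustrative stand-in for the batched engine wrapper; only the
    guarded stop() reflects the PR's change."""

    def __init__(self, engine):
        self.engine = engine

    def stop(self):
        # Swap the reference out atomically so a racing second stop()
        # sees None instead of a half-torn-down engine.
        engine, self.engine = self.engine, None
        if engine is not None:
            engine.close()
            return True
        return False  # already stopped; nothing to close, no AttributeError
```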

Diagnostics endpoint

Add /api/restart-status and /api/restart-engine admin endpoints exposing memory watermark, loaded model details, last eviction event, and restart recommendation state.
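A sketch of the payload such an endpoint might assemble; the field names below are illustrative (the PR's get_memory_diagnostics() defines the actual shape):

```python
def get_memory_diagnostics(active_bytes, cache_bytes, total_bytes,
                           loaded_models, last_eviction=None):
    """Build a diagnostics payload covering the items the PR lists:
    memory watermark, loaded model details, last eviction event, and
    restart recommendation state."""
    utilization = (active_bytes - cache_bytes) / total_bytes
    return {
        "utilization": round(utilization, 4),
        "watermark": ("GREEN" if utilization < 0.65 else
                      "YELLOW" if utilization < 0.80 else
                      "RED" if utilization < 0.90 else "FATAL"),
        "loaded_models": loaded_models,
        "last_eviction": last_eviction,
        "restart_recommended": utilization >= 0.90,
    }
```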

Verification

Tested on Mac17,6 (Apple M5 Max, 128GB) with oMLX 0.3.4:

  • High-watermark switching (27B ↔ 120B, 2 full cycles): Memory returns to baseline after each switch. Settle barrier logs confirm freed=18.99GB (expected>=17.94GB).
  • 50-round soak test (27B ↔ 35B): Zero memory accumulation across all 50 cycles. Active memory locked at 33.1GB ± 0.0GB. No restart requests triggered.
  • Active request protection: Verified via concurrent inference + model switch — active model is never selected as eviction victim.

Detailed write-up and test data: omlx-runtime-hardening

Files Changed

  • omlx/process_memory_enforcer.py — Add MemoryWatermark/WatermarkAction enums, pre_load_check(), emergency_reclaim(), get_memory_diagnostics()
  • omlx/engine_pool.py — Memory settle barrier in _unload_engine(), watermark-aware pre-load eviction in get_engine(), active-request-safe _find_lru_victim()
  • omlx/engine/batched.py — Protected engine.close() in stop()
  • omlx/admin/routes.py — /api/restart-status and /api/restart-engine endpoints
  • tests/test_engine_pool.py — Tests for active-request skip, exclude set, restart state
  • tests/test_process_memory_enforcer.py — Tests for watermark levels and action enum

Known Limitations

  • Watermark thresholds (65/80/90%) are hardcoded. A future enhancement could make them configurable via settings.
  • The settle barrier timeout (5s) may be too short for very large models on slower storage. The emergency reclaim path handles this, but a configurable timeout would be better.
  • pre_load_check() uses mx.get_cache_memory() to deduct reclaimable cache, which is correct for Apple Silicon but may behave differently on other backends.

Add watermark-based pre-load checks, memory settle barriers, and
active-request-safe LRU eviction to prevent unbounded memory growth
during repeated model switches on Apple Silicon.

Changes:
- Add MemoryWatermark/WatermarkAction enums with four pressure levels
  (GREEN/YELLOW/RED/FATAL) based on projected memory utilization
- Add pre_load_check() that projects post-load memory including
  Metal cache deduction and engine-type-scaled overhead
- Add memory settle barrier in _unload_engine() that polls
  mx.get_active_memory() to verify actual Metal buffer reclamation
- Add pressure-aware LRU eviction loop in get_engine() that evicts
  cached inactive models before loading when watermark is elevated
- Protect _find_lru_victim() from selecting models with active
  inference requests (has_active_requests() check)
- Guard engine.close() in BatchedEngine.stop() against None reference
- Add /api/restart-status and /api/restart-engine admin endpoints
  for memory diagnostics and restart state management

Tested on Apple M5 Max (128GB) with 50-round model switching soak
test showing zero memory accumulation.

Co-authored-by: Copilot <[email protected]>

yeemio commented Apr 8, 2026

Maintainer follow-up: this PR is ready for human review/merge on my side.

What is already verified locally on Apple Silicon:

  • repeated model switching returns memory back to baseline instead of accumulating
  • active-request engines are excluded from LRU eviction
  • preload watermark checks prevent loading into already-tight memory states
  • admin diagnostics endpoints expose restart recommendation and last eviction state

GitHub currently shows this PR as mergeable with no conflicts, and I do not see any required status checks attached to the branch. If helpful, I can also split this into a smaller safety-only subset, but I believe the current patch is coherent as one memory-hardening change.


@mdevk mdevk left a comment


Great PR — we've been hitting this exact problem during VLM inference benchmarking on Mac Studio M3 Ultra 96GB (oMLX v0.3.4).

Our empirical data supporting this PR

During a 308-photo sequential extraction run (Qwen3-VL-30B-A3B-Instruct-4bit, v10b prompt), we observed:

  • Memory pressure escalated from level 1 → level 2 (critical) during the run
  • mx.get_active_memory() showed model weights at 18.25 GB with only 0.05 GB KV cache delta per photo
  • The memory growth was NOT from per-photo leaks — it was from server-level cache management (paged SSD cache, hot cache) accumulating outside the model's MLX allocations
  • ps RSS showed ~40MB (misleading — Metal buffers invisible to Unix process monitoring, as you correctly identified)

The watermark approach makes sense

The four-level watermark (GREEN < 65%, YELLOW < 80%, RED < 90%, FATAL ≥ 90%) is a good fit for Apple Silicon's unified memory. We measured that memory pressure hitting critical caused quality degradation in structured output — JSON validity dropped from 99.4% to 20% at concurrency=2 on an earlier (unconfigured) run. Your pressure-aware eviction would have prevented this.

One question

The settle barrier polls mx.get_active_memory() for up to 5s — have you measured the typical Metal buffer deallocation lag on M3 Ultra? In our testing, mx.clear_cache() reclaimed buffer cache (0.03 GB) instantly, but we didn't test model unload reclamation timing. Curious if 5s is tight or generous for a 20GB+ model teardown.

Keep up the great work on oMLX — we're running production photo extraction on it and the engine quality matters a lot to us. (We also have PR #688 open for a VLM decode speedup in a related area.)


yeemio commented Apr 10, 2026

Thanks, this is very useful validation.

We saw the same failure mode on Apple Silicon when model unload completed logically but memory pressure had not actually settled yet. That is the main reason this PR adds a settle barrier based on observed reclaim rather than only estimated unload size.

On our side (M5 Max, 128 GB), the 5s window was sufficient in the tested paths. In the representative high-water switch case, we observed real reclaim on the order of ~19 GB (freed=18.99GB), and the unload settled within the current polling window, so 5s was enough for the scenarios that motivated this patch.

That said, your question is fair: we have stronger evidence for correctness than for universal timing across all Apple Silicon variants, and I would not claim that 5s is optimal for every large-model unload on every machine. If M3 Ultra or larger VLM paths show slower Metal buffer release, the next refinement would be to make the settle timeout configurable rather than hard-coded.

If you want to test this on your side before maintainer review, I also published the hardened patch set separately so the behavior is easy to reproduce outside this PR context:
https://github.com/yeemio/omlx-runtime-hardening

That repo reflects the same practical direction as this PR:

  • memory-pressure-aware eviction
  • reclaim verification based on observed memory deltas
  • restart/status observability

If you try it on the M3 Ultra VLM path, I’d be very interested in whether your unload/reclaim timing still fits comfortably within the current 5s settle window.


yeemio commented Apr 10, 2026

This should be ready for maintainer review.

The failure mode here is not hypothetical: we reproduced it locally on Apple Silicon under high-water model switching, and there is now external confirmation from another contributor seeing the same class of memory-pressure behavior on M3 Ultra under VLM load.

I would prefer to keep this PR focused on reclaim-correctness and eviction behavior. If configurable settle timeout becomes necessary after broader hardware validation, I can spin that out as a separate follow-up instead of expanding this review surface.

