
[Performance] add hot cache only mode and optimize memory usage#701

Open
RepublicOfKorokke wants to merge 9 commits into jundot:main from RepublicOfKorokke:feat/hot-cache-only

Conversation

@RepublicOfKorokke RepublicOfKorokke commented Apr 10, 2026

Enable 'Hot Cache' only (Disable 'Cold Cache')

Features

1. SSD Cache Control (hot_cache_only flag)

The hot_cache_only flag acts as a master switch for the tiered storage system.

When hot_cache_only = True

  • No Disk I/O: The system bypasses directory initialization and existing cache scanning at startup.
  • No Background Writing: The ssd-cache-writer thread is not started, and save_block returns False immediately without extracting tensor bytes.
  • Pure In-Memory Mode: If a hot cache is configured, the system operates as a pure in-memory cache. Blocks are stored and retrieved from RAM, but nothing is persisted to disk.

When hot_cache_only = False

  • Full Tiered Storage: The system enables the complete pipeline: RAM (Hot Cache) -> SSD (Cold Storage).
  • Persistence: The background writer thread is active, asynchronously saving KV blocks to .safetensors files.
  • Prefix Reuse: The PagedCacheManager can identify cached prefixes on disk via the PagedSSDCacheManager index, allowing "cold restores" of KV state.
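The gating described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `TieredCacheSketch` and its `_writer_loop` are hypothetical stand-ins for the real cache manager and the `ssd-cache-writer` thread.

```python
import threading
from collections import OrderedDict

class TieredCacheSketch:
    """Illustrative sketch of the hot_cache_only master switch."""

    def __init__(self, hot_cache_only: bool):
        self.hot_cache_only = hot_cache_only
        self.hot_cache = OrderedDict()  # RAM tier
        self.writer_thread = None
        if not hot_cache_only:
            # Cold tier: a background thread persists blocks to disk.
            self.writer_thread = threading.Thread(
                target=self._writer_loop, daemon=True, name="ssd-cache-writer"
            )
            self.writer_thread.start()

    def _writer_loop(self):
        pass  # would drain a queue and write .safetensors files

    def save_block(self, key, arrays) -> bool:
        # The hot tier always receives the native arrays for fast hits.
        self.hot_cache[key] = {"arrays": arrays}
        if self.hot_cache_only:
            # No byte extraction, no disk I/O.
            return False
        # ...extract tensors_raw and enqueue for the background writer...
        return True

cache = TieredCacheSketch(hot_cache_only=True)
assert cache.save_block("block-0", ["k", "v"]) is False  # nothing persisted
assert cache.writer_thread is None                       # writer never started
```

With `hot_cache_only=False`, the constructor would start the writer thread and `save_block` would return `True` after enqueueing the bytes.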

2. Hot Cache Optimization (Hybrid Storage)

The hot cache mechanism was refactored to move from a bytes-only storage model to a hybrid model that prioritizes native MLX arrays.

Old Architecture (Bytes-Only)

  • Storage: Stored only tensors_raw (raw bytes).
  • Retrieval Path: Bytes -> Numpy -> MLX Array.
  • Downside: Every hot cache hit required a full reconstruction of the tensors, causing CPU overhead and temporary RAM spikes.

New Architecture (Hybrid Array/Bytes)

  • Storage: Stores mx.array objects directly in the arrays key, while keeping tensors_raw only when necessary for SSD writing.
  • Retrieval Path: Direct access to mx.array.
  • Benefit: Bypasses the reconstruction pipeline entirely. Hot cache hits are now near-instantaneous with zero reconstruction overhead.
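The two retrieval paths can be contrasted with a small sketch. The stdlib `array` type stands in for `mx.array`, and `get_block_old`/`get_block_new` are hypothetical accessors illustrating the flow, not the project's API.

```python
from array import array

def get_block_old(entry):
    # Old path: bytes -> rebuilt tensor on every hit (CPU + temporary RAM).
    return array("f", entry["tensors_raw"])

def get_block_new(entry):
    # New path: return the native array object directly when present,
    # falling back to reconstruction only for bytes-only entries.
    if entry.get("arrays") is not None:
        return entry["arrays"]
    return array("f", entry["tensors_raw"])

arr = array("f", [0.0, 1.0, 2.0, 3.0])   # stands in for mx.array
entry = {"arrays": arr, "tensors_raw": arr.tobytes()}
assert get_block_new(entry) is arr        # zero-copy: the same object back
assert get_block_old(entry) == arr        # old path rebuilds an equal copy
assert get_block_old(entry) is not arr    # ...but it is a new allocation
```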

Tiering Logic & Memory Efficiency

  • Promotion: When a block is loaded from SSD and promoted to the hot cache (_promote_to_hot_cache), it is stored as arrays but the tensors_raw are omitted. This is because the data is already on disk, so storing bytes in RAM would be redundant.
  • Saving: In save_block, the system stores the arrays for immediate fast access and extracts tensors_raw only if hot_cache_only = False to facilitate the background disk write.

Summary Table

| Feature | hot_cache_only = True | hot_cache_only = False |
| --- | --- | --- |
| Storage Medium | RAM only | RAM -> SSD |
| Hot Cache Type | Native mx.array | Native mx.array |
| Disk I/O | Disabled | Enabled (Async) |
| Prefix Reuse | Session-only | Persistent across restarts |
| Hit Performance | Ultra-fast (Direct Array) | Ultra-fast (RAM) / Fast (SSD) |

3. Memory Leak Fix: Boundary Cache Snapshots

1. Problem Description

When processing very large prompts (e.g., > 80k tokens) with hot_cache_only=True, the system experienced a massive RAM spike immediately after the generation phase finished. This memory was not reclaimed, leading to a significant memory leak that could crash the server on subsequent requests.

Symptoms:

  • Prefill Phase: Memory increased gradually and normally.
  • Decode Phase: Memory remained stable.
  • Finish Phase: A sudden, massive spike in RAM usage occurred exactly when the request finished and the cache was being stored.

2. Root Cause Analysis

The "Reference Accumulation" Bug

The issue was located in the Boundary Cache Snapshot mechanism. To support efficient prefix reuse for non-sliceable caches (like ArraysCache or RotatingKVCache), the scheduler captures "snapshots" of the cache state at regular block boundaries (e.g., every 1024 tokens).

The flawed logic was as follows:

  1. During prefill, _emit_prefill_boundary_snapshot created a list of references to the current cache objects.
  2. In Python, these were shallow references. Because the cache objects are mutated in-place as the model processes more tokens, every single snapshot captured during the prefill eventually pointed to the final, full-sized cache.
  3. For a 94k token prompt, the system accumulated ~92 snapshots. All 92 snapshots were actually pointing to the same giant 94k-token cache.
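The failure mode is ordinary Python aliasing: appending the same mutable object to a list many times yields many views of its final state. A minimal reproduction (a plain list stands in for the in-place-mutated cache object):

```python
cache = []          # stands in for a KV cache that is mutated in place
snapshots = []

for boundary in range(3):
    cache.extend([boundary] * 1024)   # prefill mutates the cache in place
    snapshots.append(cache)           # buggy: stores a reference, not a copy

# Every "snapshot" is the same object, holding the final full-sized cache.
assert all(s is cache for s in snapshots)
assert all(len(s) == 3 * 1024 for s in snapshots)
```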

The "End-of-Request Explosion"

When the request finished, the scheduler attempted to store these snapshots into the tiered cache. It iterated through the 92 snapshots and performed "normalization" (slicing the state to fit a block size).

Because every snapshot was actually the full 94k-token cache, the scheduler performed the following 92 times:
Full 94k Cache $\rightarrow$ Slice to 1024 tokens $\rightarrow$ Create New Tensors

This created a massive burst of new tensor allocations. In hot_cache_only mode, these slices were stored in an OrderedDict in RAM, pinning them and causing the observed memory spike and leak.

3. The Solution: State Freezing

The fix involves moving from Reference Capturing to State Freezing.

Key Changes:

  1. Immediate Extraction: Modified _emit_prefill_boundary_snapshot and _extract_boundary_snapshot to call _extract_cache_states() the moment the boundary is hit. This extracts the actual tensor values into a dictionary, "freezing" the state at that specific token count.
  2. Idempotent Extraction: Updated _extract_cache_states to detect if the input is already a list of extracted states. If it is, it returns the input immediately, preventing redundant processing and unnecessary tensor copies.
  3. Reference Breaking: By storing the extracted dictionary instead of the cache object, we break the reference to the evolving full-sized cache.

Comparison:

| Feature | Old Behavior (Buggy) | New Behavior (Fixed) |
| --- | --- | --- |
| Snapshot Content | Pointers to mutable objects | Frozen tensor dictionaries |
| Memory Growth | Flat during prefill -> explosion at end | Gradual growth during prefill |
| Processing | 92 × (slice huge cache) | 92 × (store small slice) |
| RAM Impact | Massive spike at finish | Stable and predictable |

4. Verification

The fix was verified by processing a prompt of ~95k tokens.

  • Result: The massive RAM spike at the end of the request was eliminated.
  • Observation: Memory usage now scales linearly with the number of blocks stored, and the "normalization" logs no longer show huge reductions (e.g., 1789 -> 1024), as the snapshots are already the correct size.

5. Impact on SSD Cache

This fix significantly improves the stability of the SSD caching pipeline without changing the resulting data.

  • Data Integrity: No change. The final blocks stored on disk are identical to the previous version. Cache hit rates and reconstruction logic remain unaffected.
  • RAM Stability: Previously, the system experienced a RAM spike before writing to the SSD because it expanded all snapshots into huge tensors simultaneously. This spike is now eliminated.
  • Workload Distribution: The extraction work is now distributed across the prefill phase rather than being bunched at the end of the request, preventing "lag spikes" and OOM crashes during the storage phase.

Testing

  • Confirm that the number of NG results of `pytest -m "not slow"` has not increased.
    (Fixes one TestBoundarySnapshotSSDStore failure; details in the commit message.)
    • Add test_boundary_snapshot_is_frozen_state test

| Feature | test_capture_boundary_snapshot... | test_boundary_snapshot_is_frozen... |
| --- | --- | --- |
| Primary Goal | Verify the mechanism works. | Verify the fix for the memory leak works. |
| What it checks | "Did we save a snapshot at token 4?" | "Is the snapshot independent of the original?" |
| Key Assertion | `assert 4 in snapshots` | `assert snapshot_after == snapshot_before` |
| Failure Meaning | The scheduler isn't capturing boundaries. | The system is still using shallow references (leak risk). |
Before this commit
FAILED tests/test_audio_discovery.py::TestNonAudioRegressions::test_vlm_still_detected - AssertionError: assert 'llm' == 'vlm'
FAILED tests/test_boundary_snapshot_store.py::TestBoundarySnapshotSSDStore::test_cleanup_all_drains_queue - AssertionError: assert 1 == 0
FAILED tests/test_cli.py::TestServeCommandOptions::test_serve_has_scheduler_options - assert '--max-num-seqs' in 'usage: cli.py serve [-h] [--model-dir MODEL_DIR]\n          ...
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_calls_get_builtin_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_reasoning_false_when_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_patches_user_grammar_into_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_uses_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_with_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_omlx_app.py::TestServerManager::test_check_health_success - assert False is True
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_enters_suppression_after_forcing - assert False
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_trailing_tokens_forced_after_end - assert -inf == 0.0
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_natural_end_before_budget - assert False
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_multi_token_forcing - assert -inf == 0.0
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_multi_token_natural_detection - assert False
FAILED tests/test_vlm_engine.py::TestInjectToolCalling::test_skips_when_mlx_lm_not_available - AssertionError: assert True is False
========== 15 failed, 3428 passed, 29 skipped, 45 deselected in 228.72s (0:03:48) ===========
After this commit
FAILED tests/test_audio_discovery.py::TestNonAudioRegressions::test_vlm_still_detected - AssertionError: assert 'llm' == 'vlm'
FAILED tests/test_cli.py::TestServeCommandOptions::test_serve_has_scheduler_options - assert '--max-num-seqs' in 'usage: cli.py serve [-h] [--model-dir MODEL_DIR]\n                   ...
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_calls_get_builtin_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_reasoning_false_when_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_patches_user_grammar_into_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_uses_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_with_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_omlx_app.py::TestServerManager::test_check_health_success - assert False is True
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_enters_suppression_after_forcing - assert False
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_trailing_tokens_forced_after_end - assert -inf == 0.0
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_natural_end_before_budget - assert False
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_multi_token_forcing - assert -inf == 0.0
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_multi_token_natural_detection - assert False
FAILED tests/test_vlm_engine.py::TestInjectToolCalling::test_skips_when_mlx_lm_not_available - AssertionError: assert True is False
=============== 14 failed, 3430 passed, 29 skipped, 45 deselected in 225.45s (0:03:45) ===============
  • RAM Cache only mode works without storage access (send text -> prefill -> decode -> resend text -> prefill uses cache)
    • NVIDIA-Nemotron-3-Nano-4B-oQ4 (lm)
    • gemma-4-26B-A4B-it-oQ3 (vlm)
    • Qwen3.5-4B-oQ4 (vlm)
  • SSD and RAM cache mode works (send text -> prefill -> decode -> resend text -> prefill uses cache -> unload model -> reload model -> resend text -> prefill uses cache)
    • NVIDIA-Nemotron-3-Nano-4B-oQ4 (lm)
    • gemma-4-26B-A4B-it-oQ3 (vlm)
    • Qwen3.5-4B-oQ4 (vlm)
  • SSD cache mode works (send text -> prefill -> decode -> resend text -> prefill uses cache -> unload model -> reload model -> resend text -> prefill uses cache)
    • NVIDIA-Nemotron-3-Nano-4B-oQ4 (lm)
    • gemma-4-26B-A4B-it-oQ3 (vlm)
    • Qwen3.5-4B-oQ4 (vlm)
  • No cache mode works (send text -> prefill -> decode -> resend text -> prefill NOT uses cache)
    • NVIDIA-Nemotron-3-Nano-4B-oQ4 (lm)
    • gemma-4-26B-A4B-it-oQ3 (vlm)
    • Qwen3.5-4B-oQ4 (vlm)
  • Fix massive tokens causes memory leak

[Features]
- Introduce `hot_cache_only` flag for pure in-memory operation.
- Change hot cache storage from bytes-only to hybrid `mx.array` model to prevent RAM spikes and speed up retrieval.

[Fixes]
- Fix: `_hot_cache_entry_size` incorrectly calculated size for evicted entries.
- Fix: Prevented unnecessary GPU memory consumption when hot cache is disabled.
- Fix: Resolved race condition between boundary snapshot cleanup and background writer thread.

[Admin/UI]
- Add `hot_cache_only` toggle to the admin dashboard.
- Update configuration schemas for settings, routes, and local storage.
1. NameError in `GlobalSettings.to_scheduler_config`

[Issue]
Tests in `tests/test_settings.py` failed with `NameError: name 'ssd_dir' is not defined`.
In `omlx/settings.py`, the `to_scheduler_config` method was attempting to pass a variable named `ssd_dir` to the `SchedulerConfig` constructor, but this variable had not been defined within the scope of the function.

[Fix]
- Defined `ssd_dir` by calling `self.cache.get_ssd_cache_dir(self.base_path)`.
- Integrated logic for the `hot_cache_only` mode:
  - If `hot_cache_only` is enabled, `ssd_dir` is set to `None`.
  - If `hot_cache_only` is enabled, `paged_ssd_cache_max_size` is forced to `0`.
- This ensures that the scheduler is correctly informed when SSD caching is disabled in favor of a memory-only hot cache.
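The corrected flow can be sketched as follows. `SchedulerConfigSketch` and `to_scheduler_config` here are simplified stand-ins for the real `SchedulerConfig` and `GlobalSettings.to_scheduler_config`; only the `ssd_dir` definition and the `hot_cache_only` overrides are modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulerConfigSketch:
    ssd_dir: Optional[str]
    paged_ssd_cache_max_size: int

def to_scheduler_config(hot_cache_only: bool, ssd_cache_dir: str,
                        max_size: int) -> SchedulerConfigSketch:
    # Previously this name was never bound, raising NameError.
    ssd_dir = ssd_cache_dir
    if hot_cache_only:
        ssd_dir = None   # no disk tier in memory-only mode
        max_size = 0     # force the paged SSD cache off
    return SchedulerConfigSketch(ssd_dir=ssd_dir,
                                 paged_ssd_cache_max_size=max_size)

cfg = to_scheduler_config(True, "/tmp/ssd-cache", 1024)
assert cfg.ssd_dir is None and cfg.paged_ssd_cache_max_size == 0
```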

2. Race Condition in `BoundarySnapshotSSDStore`

[Issue]
The test `tests/test_boundary_snapshot_store.py::TestBoundarySnapshotSSDStore::test_cleanup_all_drains_queue` failed intermittently. The test expected the snapshot directory to be empty after `cleanup_all()`, but files were occasionally found.
A race condition existed between the background writer thread and the cleanup methods (`cleanup_request` and `cleanup_all`):
1. **Writer Thread**: Pops a write task from the queue -> Checks if the request is cancelled (it is not) -> **[Context Switch]**.
2. **Main Thread**: Calls `cleanup_all()` -> Marks requests as cancelled -> Deletes the `_boundary_snapshots` directory.
3. **Writer Thread**: Resumes -> Calls `mkdir(parents=True)` (recreating the directory just deleted) -> Writes the `.safetensors` file.
The result was a "zombie" file written to disk after the system had explicitly requested a full cleanup.

[Fix]
Introduced a synchronization primitive `_write_lock` to ensure atomicity between the check-and-write phase and the cleanup phase.
- **Writer Thread**: Now acquires `_write_lock` before checking the cancellation status and performing the disk I/O. This ensures that if a write starts, no cleanup can occur until the write finishes (or vice versa).
- **Cleanup Methods**: Now acquire `_write_lock` before calling `shutil.rmtree()`. This prevents the writer from recreating the directory while it is being deleted.
- **Error Handling**: Improved the `except` block in the writer loop to specifically target and remove temporary files (`_tmp.safetensors`) if a write fails, preventing disk clutter.
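The lock discipline can be sketched like this. `SnapshotStoreSketch` is a hypothetical reduction of `BoundarySnapshotSSDStore`: the check-and-write and the cancel-and-delete each hold `_write_lock`, so neither can interleave with the other.

```python
import shutil
import tempfile
import threading
from pathlib import Path

class SnapshotStoreSketch:
    def __init__(self, root: Path):
        self.root = root
        self._write_lock = threading.Lock()
        self._cancelled = set()

    def write(self, request_id: str, name: str, data: bytes):
        # Cancellation check and disk I/O are atomic w.r.t. cleanup.
        with self._write_lock:
            if request_id in self._cancelled:
                return
            self.root.mkdir(parents=True, exist_ok=True)
            (self.root / name).write_bytes(data)

    def cleanup_all(self, request_ids):
        # Cancel-and-delete is atomic w.r.t. in-flight writes.
        with self._write_lock:
            self._cancelled.update(request_ids)
            shutil.rmtree(self.root, ignore_errors=True)

root = Path(tempfile.mkdtemp()) / "_boundary_snapshots"
store = SnapshotStoreSketch(root)
store.write("r1", "b0.safetensors", b"...")
store.cleanup_all(["r1"])
store.write("r1", "b1.safetensors", b"...")  # cancelled: no zombie file
assert not root.exists()
```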
@Landon-Molt

Thank you! I am currently testing your branch.

[Problem]

A significant memory leak was identified in the `Scheduler` when using boundary snapshots for non-sliceable cache layers (such as `RotatingKVCache` or `ArraysCache`). Even after requests were finished or the `BatchGenerator` was reset, GPU and CPU memory remained occupied, leading to eventual Out-of-Memory (OOM) errors during long-running sessions.

[Root cause]

The `Scheduler` was storing direct references to live cache objects within the `_boundary_cache_snapshots` dictionary. These cache objects are internally linked to the `BatchGenerator` and its associated large KV tensors.
Because the `Scheduler` held these references, the Python Garbage Collector could not reclaim the `BatchGenerator` or its tensors, even after the generator was destroyed. This created a "reference chain" that kept the entire state of previous batches alive in memory.

[Fix]

The fix involves "freezing" the cache state immediately upon capture, converting live objects into static data structures.

| Feature              | Before (Buggy)                                       | After (Fixed)                                      |
| :------------------- | :--------------------------------------------------- | :------------------------------------------------- |
| **Snapshot Content** | Shallow references to live cache objects             | Extracted state dictionaries (frozen)              |
| **Reference Chain**  | `Scheduler` -> `Cache Obj` -> `BatchGenerator`       | `Scheduler` -> `Dict` -> `Tensors`                 |
| **GC Behavior**      | `BatchGenerator` pinned until all snapshots are gone | `BatchGenerator` reclaimed immediately after reset |
| **State Integrity**  | Mutating live cache changes all snapshots            | Snapshots remain immutable (frozen)                |
| **Memory Impact**    | Chronic leak of GPU/CPU tensors                      | Prompt reclamation of memory                       |

[Implementation Details]

1.  **Immediate Extraction**: Modified `_emit_prefill_boundary_snapshot` and `_extract_boundary_snapshot` to call `_extract_cache_states` at the moment of capture. This converts the live cache objects into dictionaries containing only the necessary tensors and metadata.
2.  **Reference Breaking**: By storing dictionaries instead of object references, the link to the `BatchGenerator` is broken, allowing the generator and its large tensors to be reclaimed by the GC.
3.  **Idempotency in Extraction**: Added a check in `_extract_cache_states` to detect if the input is already a list of extracted dictionaries. This ensures that subsequent calls (e.g., when saving the snapshot to the SSD cache) do not attempt to re-extract data from a dictionary.
4.  **Type Safety**: Updated `_cache_tree_has_stateful_non_sliceable` to explicitly handle dictionary types, preventing the scheduler from attempting recursive live-object inspections on frozen snapshots.
5.  **Verification**: Added a regression test `test_boundary_snapshot_is_frozen_state` to ensure that mutating a live cache object after a snapshot is taken does not affect the stored snapshot.

[Side effects]

- **SSD Cache**: No negative side effects. The data format written to the SSD remains identical, as the SSD writer still receives the same list of state dictionaries.
- **Performance**: Negligible. The extraction process is simply moved slightly earlier in the execution pipeline.
- **Memory**: Positive. Memory is now reclaimed promptly after the `BatchGenerator` is reset or a request is fully cleaned up.
[Problem]

SSD cache hits were being rejected as "partial prefix matches" for large prompts using non-sliceable caches (such as `RotatingKVCache` or `ArraysCache`), even when the prefix was actually present in the cache. This resulted in the system falling back to full prefill, eliminating the performance benefits of the prefix cache.

[Root cause]

The "Frozen" state logic (introduced to fix a memory leak) caused a regression in how the scheduler detects the need for boundary snapshots.
The method `_cache_tree_has_stateful_non_sliceable` was updated to return `False` immediately if the `cache_obj` was a dictionary. However, because boundary snapshots are now "frozen" into dictionaries _before_ being passed to the detection logic, the scheduler incorrectly concluded that no stateful non-sliceable caches were present.
Consequently, the scheduler skipped capturing boundary snapshots during prefill. When the cache was stored, intermediate blocks were saved as placeholders (tensors of shape `(1,)`). During reconstruction, the system detected these placeholders in the last matched block and rejected the cache to prevent using stale sliding-window or recurrent state.

[Fix]

Updated `_cache_tree_has_stateful_non_sliceable` to correctly handle both live cache objects and frozen state dictionaries. Instead of rejecting dictionaries, the logic now extracts the class name from the dictionary metadata and queries the `CacheTypeRegistry` to determine if that specific cache type supports block slicing.
By relying on the registry instead of hard-coded lists or simple type checks, the scheduler now correctly identifies non-sliceable caches regardless of whether they are live or frozen, ensuring boundary snapshots are captured and stored.
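A sketch of the registry-backed check (the registry contents and the `class_name` metadata key are illustrative assumptions, not the actual `CacheTypeRegistry` schema):

```python
# Hypothetical registry mapping cache class names to their capabilities.
CACHE_TYPE_REGISTRY = {
    "KVCache": {"sliceable": True},
    "RotatingKVCache": {"sliceable": False},
    "ArraysCache": {"sliceable": False},
}

def has_stateful_non_sliceable(cache_obj) -> bool:
    if isinstance(cache_obj, dict):
        # Frozen snapshot: consult the registry via stored metadata
        # instead of rejecting dictionaries outright (the regression).
        entry = CACHE_TYPE_REGISTRY.get(cache_obj.get("class_name", ""))
        return entry is not None and not entry["sliceable"]
    # Live object: look up its class name the same way.
    entry = CACHE_TYPE_REGISTRY.get(type(cache_obj).__name__,
                                    {"sliceable": True})
    return not entry["sliceable"]

frozen = {"class_name": "RotatingKVCache", "state": []}
assert has_stateful_non_sliceable(frozen)               # detected, not skipped
assert not has_stateful_non_sliceable({"class_name": "KVCache"})
```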

[Side effects]

None. This fix restores the intended behavior of the boundary snapshot mechanism while maintaining the memory leak fix (as it still operates on frozen dictionaries rather than live object references).
[Problem]

When using `hot_cache_only=True` with very large prompts, the system experienced linear memory growth during the prefill phase, eventually leading to Out-of-Memory (OOM) crashes. This occurred even though the "Frozen" state logic had already been implemented to prevent long-term memory leaks.

[Root cause]

The issue was caused by the storage mechanism for boundary snapshots when no SSD store is available.
1. **Boundary Snapshots**: For non-sliceable caches (e.g., `RotatingKVCache`, `ArraysCache`), the scheduler captures a "frozen" snapshot of the cache state at every block boundary (e.g., every 1024 tokens) to allow partial prefix reuse.
2. **Storage Path**: Normally, these snapshots are offloaded to disk via `BoundarySnapshotSSDStore`. However, in `hot_cache_only` mode, the SSD store is disabled (`self._boundary_snapshot_store = None`).
3. **RAM Accumulation**: Without an SSD store, the scheduler stores these frozen snapshots (which contain actual MLX tensors for all model layers) directly in the `_boundary_cache_snapshots` dictionary in RAM.
4. **Linear Growth**: For a prompt of 100k tokens, the system creates ~100 snapshots. Since each snapshot is a full copy of the recurrent state for all layers, the memory usage grows linearly with the prompt length, leading to OOM.

[Fix]

The boundary snapshot mechanism is now explicitly disabled if no SSD store is available.
Modified the `boundary_enabled` check in `_do_external_prefill` to require `self._boundary_snapshot_store` to be present. If the system is running in `hot_cache_only` mode (or any mode without an SSD store), boundary snapshots are no longer captured.

[Side effects]

- **Standard KVCache**: No impact. Standard caches are sliceable and do not require boundary snapshots for partial prefix reuse.
- **Non-Sliceable Caches (Rotating/Arrays)**: In `hot_cache_only` mode, partial prefix reuse is no longer supported for these specific cache types. They will now only support "Exact Matches" (where the entire prompt is cached). Any partial match will be treated as a cache miss and re-processed.
- **Stability**: Memory usage during prefill is now flat and stable, regardless of prompt length, preventing OOMs in RAM-only mode.
@Landon-Molt

Great work on this PR — your boundary snapshot freezing fix (1b87a99) resolved a TurboQuant KV cache corruption issue we were debugging in #661.

Context: We wired up _apply_turboquant_kv() (which was defined but never called) to convert KVCache -> TurboQuantKVCache via from_cache() after prefill. The conversion quantizes the prefilled KV state, saving ~4-5 GB RAM at 32K context.

The problem: Before your fix, from_cache() would read cache.state to get the KV tensors for quantization, but the stale boundary snapshot references were holding onto the same mutable cache objects. This caused gemma-4-31B to produce garbage output with TQ enabled — even though the quantization itself had <1% reconstruction error.

After your fix: The frozen snapshots break the reference chain, so cache.state returns clean unshared tensors. gemma-4-31B now works correctly with TQ at both concurrency 1 and 4.

Did you encounter similar issues with TurboQuant producing corrupt output before this fix? Curious if the snapshot reference bug was known to interact with TQ.

@RepublicOfKorokke
Author

@Landon-Molt

Thank you for testing my fix.

Did you encounter similar issues with TurboQuant producing corrupt output before this fix? Curious if the snapshot reference bug was known to interact with TQ.

I am currently using Gemma-4 (e2b, e4b, and 26b-a4b) all configured with TQ 3bit.
With the SSD cache enabled, I have occasionally encountered corrupted output.
(Primarily looping; I have not seen any instances of garbled output.)

I cannot definitively confirm whether the cache is the direct cause, but I have found that cleaning up the cache and regenerating the output resolves the issue. (This might also be related to the initial seed value.)

This specific occurrence is rare, and it was not reproduced during my debugging phase of the fix.

The models I am currently using:

  • mlx-community/gemma-4-e2b-it-4bit
  • mlx-community/gemma-4-e4b-it-4bit
  • RepublicOfKorokke/gemma-4-26B-A4B-it-oQ3

@Landon-Molt

Thanks for the reply! Good to know you're using TQ on gemma-4 models.

We dug deeper and found the root cause of the corruption on gemma-4-31B specifically. It turns out the last full-attention layer (layer 59/60) is too sensitive to KV quantization — even 8-bit TQ on that single layer alone produces garbage, while the other 9 full-attention layers quantize fine.

We confirmed this by converting layers one at a time:

  • Layers 5-53 (9 layers): ✅ correct output
  • Layer 59 alone: ❌ garbage
  • All 10 layers: ❌ garbage

The fix: skip the last KVCache layer during TQ conversion. It sits right before the output head, so any quantization error directly impacts logits with no subsequent layers to absorb the perturbation. This gives us 9/10 layers quantized with correct output.

The occasional looping you see with SSD cache might be related — if a cached TQ state for a late layer gets slightly corrupted during SSD store/restore, it could cause the same sensitivity issue. Your fix (cleaning cache) would resolve that by forcing fresh computation.

We're currently testing a PR with this skip-last-layer fix on top of upstream main. Will share results soon.



Development

Successfully merging this pull request may close these issues.

[Feature Request] Enable 'Hot Cache' only (Disable 'Cold Cache')

2 participants