
[Performance] add hot cache only mode and optimize memory usage#701

Open
RepublicOfKorokke wants to merge 9 commits into jundot:main from RepublicOfKorokke:feat/hot-cache-only

Conversation

@RepublicOfKorokke RepublicOfKorokke commented Apr 10, 2026

Enable 'Hot Cache' only (Disable 'Cold Cache')

Features

1. SSD Cache Control (hot_cache_only flag)

The hot_cache_only flag acts as a master switch for the tiered storage system.

When hot_cache_only = True

  • No Disk I/O: The system bypasses directory initialization and existing cache scanning at startup.
  • No Background Writing: The ssd-cache-writer thread is not started, and save_block returns False immediately without extracting tensor bytes.
  • Pure In-Memory Mode: If a hot cache is configured, the system operates as a pure in-memory cache. Blocks are stored and retrieved from RAM, but nothing is persisted to disk.

When hot_cache_only = False

  • Full Tiered Storage: The system enables the complete pipeline: RAM (Hot Cache) -> SSD (Cold Storage).
  • Persistence: The background writer thread is active, asynchronously saving KV blocks to .safetensors files.
  • Prefix Reuse: The PagedCacheManager can identify cached prefixes on disk via the PagedSSDCacheManager index, allowing "cold restores" of KV state.
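The gating described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `TieredCacheSketch` and its `_writer_loop` are hypothetical stand-ins for the real cache manager and the `ssd-cache-writer` thread.

```python
import threading
from collections import OrderedDict

class TieredCacheSketch:
    """Illustrative sketch of the hot_cache_only master switch."""

    def __init__(self, hot_cache_only: bool):
        self.hot_cache_only = hot_cache_only
        self.hot_cache = OrderedDict()  # RAM tier
        self.writer_thread = None
        if not hot_cache_only:
            # Cold tier: a background thread persists blocks to disk.
            self.writer_thread = threading.Thread(
                target=self._writer_loop, daemon=True, name="ssd-cache-writer"
            )
            self.writer_thread.start()

    def _writer_loop(self):
        pass  # would drain a queue and write .safetensors files

    def save_block(self, key, arrays) -> bool:
        # The hot tier always receives the native arrays for fast hits.
        self.hot_cache[key] = {"arrays": arrays}
        if self.hot_cache_only:
            # No byte extraction, no disk I/O.
            return False
        # ...extract tensors_raw and enqueue for the background writer...
        return True

cache = TieredCacheSketch(hot_cache_only=True)
assert cache.save_block("block-0", ["k", "v"]) is False  # nothing persisted
assert cache.writer_thread is None                       # writer never started
```

With `hot_cache_only=False`, the constructor would start the writer thread and `save_block` would return `True` after enqueueing the bytes.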

2. Hot Cache Optimization (Hybrid Storage)

The hot cache mechanism was refactored to move from a bytes-only storage model to a hybrid model that prioritizes native MLX arrays.

Old Architecture (Bytes-Only)

  • Storage: Stored only tensors_raw (raw bytes).
  • Retrieval Path: Bytes -> Numpy -> MLX Array.
  • Downside: Every hot cache hit required a full reconstruction of the tensors, causing CPU overhead and temporary RAM spikes.

New Architecture (Hybrid Array/Bytes)

  • Storage: Stores mx.array objects directly in the arrays key, while keeping tensors_raw only when necessary for SSD writing.
  • Retrieval Path: Direct access to mx.array.
  • Benefit: Bypasses the reconstruction pipeline entirely. Hot cache hits are now near-instantaneous with zero reconstruction overhead.
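The two retrieval paths can be contrasted with a small sketch. The stdlib `array` type stands in for `mx.array`, and `get_block_old`/`get_block_new` are hypothetical accessors illustrating the flow, not the project's API.

```python
from array import array

def get_block_old(entry):
    # Old path: bytes -> rebuilt tensor on every hit (CPU + temporary RAM).
    return array("f", entry["tensors_raw"])

def get_block_new(entry):
    # New path: return the native array object directly when present,
    # falling back to reconstruction only for bytes-only entries.
    if entry.get("arrays") is not None:
        return entry["arrays"]
    return array("f", entry["tensors_raw"])

arr = array("f", [0.0, 1.0, 2.0, 3.0])   # stands in for mx.array
entry = {"arrays": arr, "tensors_raw": arr.tobytes()}
assert get_block_new(entry) is arr        # zero-copy: the same object back
assert get_block_old(entry) == arr        # old path rebuilds an equal copy
assert get_block_old(entry) is not arr    # ...but it is a new allocation
```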

Tiering Logic & Memory Efficiency

  • Promotion: When a block is loaded from SSD and promoted to the hot cache (_promote_to_hot_cache), it is stored as arrays but the tensors_raw are omitted. This is because the data is already on disk, so storing bytes in RAM would be redundant.
  • Saving: In save_block, the system stores the arrays for immediate fast access and extracts tensors_raw only if hot_cache_only = False to facilitate the background disk write.

Summary Table

| Feature | hot_cache_only = True | hot_cache_only = False |
| --- | --- | --- |
| Storage Medium | RAM only | RAM -> SSD |
| Hot Cache Type | Native mx.array | Native mx.array |
| Disk I/O | Disabled | Enabled (Async) |
| Prefix Reuse | Session-only | Persistent across restarts |
| Hit Performance | Ultra-fast (Direct Array) | Ultra-fast (RAM) / Fast (SSD) |

3. Memory Leak Fix: Boundary Cache Snapshots

1. Problem Description

When processing very large prompts (e.g., > 80k tokens) with hot_cache_only=True, the system experienced a massive RAM spike immediately after the generation phase finished. This memory was not reclaimed, leading to a significant memory leak that could crash the server on subsequent requests.

Symptoms:

  • Prefill Phase: Memory increased gradually and normally.
  • Decode Phase: Memory remained stable.
  • Finish Phase: A sudden, massive spike in RAM usage occurred exactly when the request finished and the cache was being stored.

2. Root Cause Analysis

The "Reference Accumulation" Bug

The issue was located in the Boundary Cache Snapshot mechanism. To support efficient prefix reuse for non-sliceable caches (like ArraysCache or RotatingKVCache), the scheduler captures "snapshots" of the cache state at regular block boundaries (e.g., every 1024 tokens).

The flawed logic was as follows:

  1. During prefill, _emit_prefill_boundary_snapshot created a list of references to the current cache objects.
  2. In Python, these were shallow references. Because the cache objects are mutated in-place as the model processes more tokens, every single snapshot captured during the prefill eventually pointed to the final, full-sized cache.
  3. For a 94k token prompt, the system accumulated ~92 snapshots. All 92 snapshots were actually pointing to the same giant 94k-token cache.
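The failure mode is ordinary Python aliasing: appending the same mutable object to a list many times yields many views of its final state. A minimal reproduction (a plain list stands in for the in-place-mutated cache object):

```python
cache = []          # stands in for a KV cache that is mutated in place
snapshots = []

for boundary in range(3):
    cache.extend([boundary] * 1024)   # prefill mutates the cache in place
    snapshots.append(cache)           # buggy: stores a reference, not a copy

# Every "snapshot" is the same object, holding the final full-sized cache.
assert all(s is cache for s in snapshots)
assert all(len(s) == 3 * 1024 for s in snapshots)
```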

The "End-of-Request Explosion"

When the request finished, the scheduler attempted to store these snapshots into the tiered cache. It iterated through the 92 snapshots and performed "normalization" (slicing the state to fit a block size).

Because every snapshot was actually the full 94k-token cache, the scheduler performed the following 92 times:
Full 94k Cache $\rightarrow$ Slice to 1024 tokens $\rightarrow$ Create New Tensors

This created a massive burst of new tensor allocations. In hot_cache_only mode, these slices were stored in an OrderedDict in RAM, pinning them and causing the observed memory spike and leak.

3. The Solution: State Freezing

The fix involves moving from Reference Capturing to State Freezing.

Key Changes:

  1. Immediate Extraction: Modified _emit_prefill_boundary_snapshot and _extract_boundary_snapshot to call _extract_cache_states() the moment the boundary is hit. This extracts the actual tensor values into a dictionary, "freezing" the state at that specific token count.
  2. Idempotent Extraction: Updated _extract_cache_states to detect if the input is already a list of extracted states. If it is, it returns the input immediately, preventing redundant processing and unnecessary tensor copies.
  3. Reference Breaking: By storing the extracted dictionary instead of the cache object, we break the reference to the evolving full-sized cache.

Comparison:

| Feature | Old Behavior (Buggy) | New Behavior (Fixed) |
| --- | --- | --- |
| Snapshot Content | Pointers to mutable objects | Frozen tensor dictionaries |
| Memory Growth | Flat during prefill -> explosion at end | Gradual growth during prefill |
| Processing | 92 × (slice huge cache) | 92 × (store small slice) |
| RAM Impact | Massive spike at finish | Stable and predictable |

4. Verification

The fix was verified by processing a prompt of ~95k tokens.

  • Result: The massive RAM spike at the end of the request was eliminated.
  • Observation: Memory usage now scales linearly with the number of blocks stored, and the "normalization" logs no longer show huge reductions (e.g., 1789 -> 1024), as the snapshots are already the correct size.

5. Impact on SSD Cache

This fix significantly improves the stability of the SSD caching pipeline without changing the resulting data.

  • Data Integrity: No change. The final blocks stored on disk are identical to the previous version. Cache hit rates and reconstruction logic remain unaffected.
  • RAM Stability: Previously, the system experienced a RAM spike before writing to the SSD because it expanded all snapshots into huge tensors simultaneously. This spike is now eliminated.
  • Workload Distribution: The extraction work is now distributed across the prefill phase rather than being bunched at the end of the request, preventing "lag spikes" and OOM crashes during the storage phase.

Testing

  • Confirm that the number of NG results of `pytest -m "not slow"` has not increased.
    (Fixes one TestBoundarySnapshotSSDStore failure; details in the commit message.)
    • Add test_boundary_snapshot_is_frozen_state test

| Feature | test_capture_boundary_snapshot... | test_boundary_snapshot_is_frozen... |
| --- | --- | --- |
| Primary Goal | Verify the mechanism works. | Verify the fix for the memory leak works. |
| What it checks | "Did we save a snapshot at token 4?" | "Is the snapshot independent of the original?" |
| Key Assertion | `assert 4 in snapshots` | `assert snapshot_after == snapshot_before` |
| Failure Meaning | The scheduler isn't capturing boundaries. | The system is still using shallow references (leak risk). |
Before this commit
FAILED tests/test_audio_discovery.py::TestNonAudioRegressions::test_vlm_still_detected - AssertionError: assert 'llm' == 'vlm'
FAILED tests/test_boundary_snapshot_store.py::TestBoundarySnapshotSSDStore::test_cleanup_all_drains_queue - AssertionError: assert 1 == 0
FAILED tests/test_cli.py::TestServeCommandOptions::test_serve_has_scheduler_options - assert '--max-num-seqs' in 'usage: cli.py serve [-h] [--model-dir MODEL_DIR]\n          ...
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_calls_get_builtin_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_reasoning_false_when_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_patches_user_grammar_into_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_uses_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_with_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_omlx_app.py::TestServerManager::test_check_health_success - assert False is True
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_enters_suppression_after_forcing - assert False
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_trailing_tokens_forced_after_end - assert -inf == 0.0
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_natural_end_before_budget - assert False
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_multi_token_forcing - assert -inf == 0.0
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_multi_token_natural_detection - assert False
FAILED tests/test_vlm_engine.py::TestInjectToolCalling::test_skips_when_mlx_lm_not_available - AssertionError: assert True is False
========== 15 failed, 3428 passed, 29 skipped, 45 deselected in 228.72s (0:03:48) ===========
After this commit
FAILED tests/test_audio_discovery.py::TestNonAudioRegressions::test_vlm_still_detected - AssertionError: assert 'llm' == 'vlm'
FAILED tests/test_cli.py::TestServeCommandOptions::test_serve_has_scheduler_options - assert '--max-num-seqs' in 'usage: cli.py serve [-h] [--model-dir MODEL_DIR]\n                   ...
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_calls_get_builtin_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_reasoning_false_when_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileWithStructuralTag::test_patches_user_grammar_into_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_uses_structural_tag - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_grammar.py::TestCompileGrammarForRequest::test_reasoning_parser_with_thinking_disabled - ModuleNotFoundError: No module named 'xgrammar'
FAILED tests/test_omlx_app.py::TestServerManager::test_check_health_success - assert False is True
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_enters_suppression_after_forcing - assert False
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_trailing_tokens_forced_after_end - assert -inf == 0.0
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_natural_end_before_budget - assert False
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_multi_token_forcing - assert -inf == 0.0
FAILED tests/test_thinking_budget.py::TestThinkingBudgetProcessor::test_multi_token_natural_detection - assert False
FAILED tests/test_vlm_engine.py::TestInjectToolCalling::test_skips_when_mlx_lm_not_available - AssertionError: assert True is False
=============== 14 failed, 3430 passed, 29 skipped, 45 deselected in 225.45s (0:03:45) ===============
  • RAM Cache only mode works without storage access (send text -> prefill -> decode -> resend text -> prefill uses cache)
    • NVIDIA-Nemotron-3-Nano-4B-oQ4 (lm)
    • gemma-4-26B-A4B-it-oQ3 (vlm)
    • Qwen3.5-4B-oQ4 (vlm)
  • SSD and RAM cache mode works (send text -> prefill -> decode -> resend text -> prefill uses cache -> unload model -> reload model -> resend text -> prefill uses cache)
    • NVIDIA-Nemotron-3-Nano-4B-oQ4 (lm)
    • gemma-4-26B-A4B-it-oQ3 (vlm)
    • Qwen3.5-4B-oQ4 (vlm)
  • SSD cache mode works (send text -> prefill -> decode -> resend text -> prefill uses cache -> unload model -> reload model -> resend text -> prefill uses cache)
    • NVIDIA-Nemotron-3-Nano-4B-oQ4 (lm)
    • gemma-4-26B-A4B-it-oQ3 (vlm)
    • Qwen3.5-4B-oQ4 (vlm)
  • No cache mode works (send text -> prefill -> decode -> resend text -> prefill NOT uses cache)
    • NVIDIA-Nemotron-3-Nano-4B-oQ4 (lm)
    • gemma-4-26B-A4B-it-oQ3 (vlm)
    • Qwen3.5-4B-oQ4 (vlm)
  • Fix massive tokens causes memory leak

[Features]
- Introduce `hot_cache_only` flag for pure in-memory operation.
- Change hot cache storage from bytes-only to hybrid `mx.array` model to prevent RAM spikes and speed up retrieval.

[Fixes]
- Fix: `_hot_cache_entry_size` incorrectly calculated size for evicted entries.
- Fix: Prevented unnecessary GPU memory consumption when hot cache is disabled.
- Fix: Resolved race condition between boundary snapshot cleanup and background writer thread.

[Admin/UI]
- Add `hot_cache_only` toggle to the admin dashboard.
- Update configuration schemas for settings, routes, and local storage.
1. NameError in `GlobalSettings.to_scheduler_config`

[Issue]
Tests in `tests/test_settings.py` failed with `NameError: name 'ssd_dir' is not defined`.
In `omlx/settings.py`, the `to_scheduler_config` method was attempting to pass a variable named `ssd_dir` to the `SchedulerConfig` constructor, but this variable had not been defined within the scope of the function.

[Fix]
- Defined `ssd_dir` by calling `self.cache.get_ssd_cache_dir(self.base_path)`.
- Integrated logic for the `hot_cache_only` mode:
  - If `hot_cache_only` is enabled, `ssd_dir` is set to `None`.
  - If `hot_cache_only` is enabled, `paged_ssd_cache_max_size` is forced to `0`.
- This ensures that the scheduler is correctly informed when SSD caching is disabled in favor of a memory-only hot cache.
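The corrected flow can be sketched as follows. `SchedulerConfigSketch` and `to_scheduler_config` here are simplified stand-ins for the real `SchedulerConfig` and `GlobalSettings.to_scheduler_config`; only the `ssd_dir` definition and the `hot_cache_only` overrides are modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SchedulerConfigSketch:
    ssd_dir: Optional[str]
    paged_ssd_cache_max_size: int

def to_scheduler_config(hot_cache_only: bool, ssd_cache_dir: str,
                        max_size: int) -> SchedulerConfigSketch:
    # Previously this name was never bound, raising NameError.
    ssd_dir = ssd_cache_dir
    if hot_cache_only:
        ssd_dir = None   # no disk tier in memory-only mode
        max_size = 0     # force the paged SSD cache off
    return SchedulerConfigSketch(ssd_dir=ssd_dir,
                                 paged_ssd_cache_max_size=max_size)

cfg = to_scheduler_config(True, "/tmp/ssd-cache", 1024)
assert cfg.ssd_dir is None and cfg.paged_ssd_cache_max_size == 0
```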

2. Race Condition in `BoundarySnapshotSSDStore`

[Issue]
The test `tests/test_boundary_snapshot_store.py::TestBoundarySnapshotSSDStore::test_cleanup_all_drains_queue` failed intermittently. The test expected the snapshot directory to be empty after `cleanup_all()`, but files were occasionally found.
A race condition existed between the background writer thread and the cleanup methods (`cleanup_request` and `cleanup_all`):
1. **Writer Thread**: Pops a write task from the queue -> Checks if the request is cancelled (it is not) -> **[Context Switch]**.
2. **Main Thread**: Calls `cleanup_all()` -> Marks requests as cancelled -> Deletes the `_boundary_snapshots` directory.
3. **Writer Thread**: Resumes -> Calls `mkdir(parents=True)` (recreating the directory just deleted) -> Writes the `.safetensors` file.
The result was a "zombie" file written to disk after the system had explicitly requested a full cleanup.

[Fix]
Introduced a synchronization primitive `_write_lock` to ensure atomicity between the check-and-write phase and the cleanup phase.
- **Writer Thread**: Now acquires `_write_lock` before checking the cancellation status and performing the disk I/O. This ensures that if a write starts, no cleanup can occur until the write finishes (or vice versa).
- **Cleanup Methods**: Now acquire `_write_lock` before calling `shutil.rmtree()`. This prevents the writer from recreating the directory while it is being deleted.
- **Error Handling**: Improved the `except` block in the writer loop to specifically target and remove temporary files (`_tmp.safetensors`) if a write fails, preventing disk clutter.
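The lock discipline can be sketched like this. `SnapshotStoreSketch` is a hypothetical reduction of `BoundarySnapshotSSDStore`: the check-and-write and the cancel-and-delete each hold `_write_lock`, so neither can interleave with the other.

```python
import shutil
import tempfile
import threading
from pathlib import Path

class SnapshotStoreSketch:
    def __init__(self, root: Path):
        self.root = root
        self._write_lock = threading.Lock()
        self._cancelled = set()

    def write(self, request_id: str, name: str, data: bytes):
        # Cancellation check and disk I/O are atomic w.r.t. cleanup.
        with self._write_lock:
            if request_id in self._cancelled:
                return
            self.root.mkdir(parents=True, exist_ok=True)
            (self.root / name).write_bytes(data)

    def cleanup_all(self, request_ids):
        # Cancel-and-delete is atomic w.r.t. in-flight writes.
        with self._write_lock:
            self._cancelled.update(request_ids)
            shutil.rmtree(self.root, ignore_errors=True)

root = Path(tempfile.mkdtemp()) / "_boundary_snapshots"
store = SnapshotStoreSketch(root)
store.write("r1", "b0.safetensors", b"...")
store.cleanup_all(["r1"])
store.write("r1", "b1.safetensors", b"...")  # cancelled: no zombie file
assert not root.exists()
```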
@Landon-Molt

Thank you! I am currently testing your branch.

[Problem]

A significant memory leak was identified in the `Scheduler` when using boundary snapshots for non-sliceable cache layers (such as `RotatingKVCache` or `ArraysCache`). Even after requests were finished or the `BatchGenerator` was reset, GPU and CPU memory remained occupied, leading to eventual Out-of-Memory (OOM) errors during long-running sessions.

[Root cause]

The `Scheduler` was storing direct references to live cache objects within the `_boundary_cache_snapshots` dictionary. These cache objects are internally linked to the `BatchGenerator` and its associated large KV tensors.
Because the `Scheduler` held these references, the Python Garbage Collector could not reclaim the `BatchGenerator` or its tensors, even after the generator was destroyed. This created a "reference chain" that kept the entire state of previous batches alive in memory.

[Fix]

The fix involves "freezing" the cache state immediately upon capture, converting live objects into static data structures.

| Feature              | Before (Buggy)                                       | After (Fixed)                                      |
| :------------------- | :--------------------------------------------------- | :------------------------------------------------- |
| **Snapshot Content** | Shallow references to live cache objects             | Extracted state dictionaries (frozen)              |
| **Reference Chain**  | `Scheduler` -> `Cache Obj` -> `BatchGenerator`       | `Scheduler` -> `Dict` -> `Tensors`                 |
| **GC Behavior**      | `BatchGenerator` pinned until all snapshots are gone | `BatchGenerator` reclaimed immediately after reset |
| **State Integrity**  | Mutating live cache changes all snapshots            | Snapshots remain immutable (frozen)                |
| **Memory Impact**    | Chronic leak of GPU/CPU tensors                      | Prompt reclamation of memory                       |

[Implementation Details]

1.  **Immediate Extraction**: Modified `_emit_prefill_boundary_snapshot` and `_extract_boundary_snapshot` to call `_extract_cache_states` at the moment of capture. This converts the live cache objects into dictionaries containing only the necessary tensors and metadata.
2.  **Reference Breaking**: By storing dictionaries instead of object references, the link to the `BatchGenerator` is broken, allowing the generator and its large tensors to be reclaimed by the GC.
3.  **Idempotency in Extraction**: Added a check in `_extract_cache_states` to detect if the input is already a list of extracted dictionaries. This ensures that subsequent calls (e.g., when saving the snapshot to the SSD cache) do not attempt to re-extract data from a dictionary.
4.  **Type Safety**: Updated `_cache_tree_has_stateful_non_sliceable` to explicitly handle dictionary types, preventing the scheduler from attempting recursive live-object inspections on frozen snapshots.
5.  **Verification**: Added a regression test `test_boundary_snapshot_is_frozen_state` to ensure that mutating a live cache object after a snapshot is taken does not affect the stored snapshot.

[Side effects]

- **SSD Cache**: No negative side effects. The data format written to the SSD remains identical, as the SSD writer still receives the same list of state dictionaries.
- **Performance**: Negligible. The extraction process is simply moved slightly earlier in the execution pipeline.
- **Memory**: Positive. Memory is now reclaimed promptly after the `BatchGenerator` is reset or a request is fully cleaned up.
[Problem]

SSD cache hits were being rejected as "partial prefix matches" for large prompts using non-sliceable caches (such as `RotatingKVCache` or `ArraysCache`), even when the prefix was actually present in the cache. This resulted in the system falling back to full prefill, eliminating the performance benefits of the prefix cache.

[Root cause]

The "Frozen" state logic (introduced to fix a memory leak) caused a regression in how the scheduler detects the need for boundary snapshots.
The method `_cache_tree_has_stateful_non_sliceable` was updated to return `False` immediately if the `cache_obj` was a dictionary. However, because boundary snapshots are now "frozen" into dictionaries _before_ being passed to the detection logic, the scheduler incorrectly concluded that no stateful non-sliceable caches were present.
Consequently, the scheduler skipped capturing boundary snapshots during prefill. When the cache was stored, intermediate blocks were saved as placeholders (tensors of shape `(1,)`). During reconstruction, the system detected these placeholders in the last matched block and rejected the cache to prevent using stale sliding-window or recurrent state.

[Fix]

Updated `_cache_tree_has_stateful_non_sliceable` to correctly handle both live cache objects and frozen state dictionaries. Instead of rejecting dictionaries, the logic now extracts the class name from the dictionary metadata and queries the `CacheTypeRegistry` to determine if that specific cache type supports block slicing.
By relying on the registry instead of hard-coded lists or simple type checks, the scheduler now correctly identifies non-sliceable caches regardless of whether they are live or frozen, ensuring boundary snapshots are captured and stored.
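A sketch of the registry-backed check (the registry contents and the `class_name` metadata key are illustrative assumptions, not the actual `CacheTypeRegistry` schema):

```python
# Hypothetical registry mapping cache class names to their capabilities.
CACHE_TYPE_REGISTRY = {
    "KVCache": {"sliceable": True},
    "RotatingKVCache": {"sliceable": False},
    "ArraysCache": {"sliceable": False},
}

def has_stateful_non_sliceable(cache_obj) -> bool:
    if isinstance(cache_obj, dict):
        # Frozen snapshot: consult the registry via stored metadata
        # instead of rejecting dictionaries outright (the regression).
        entry = CACHE_TYPE_REGISTRY.get(cache_obj.get("class_name", ""))
        return entry is not None and not entry["sliceable"]
    # Live object: look up its class name the same way.
    entry = CACHE_TYPE_REGISTRY.get(type(cache_obj).__name__,
                                    {"sliceable": True})
    return not entry["sliceable"]

frozen = {"class_name": "RotatingKVCache", "state": []}
assert has_stateful_non_sliceable(frozen)               # detected, not skipped
assert not has_stateful_non_sliceable({"class_name": "KVCache"})
```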

[Side effects]

None. This fix restores the intended behavior of the boundary snapshot mechanism while maintaining the memory leak fix (as it still operates on frozen dictionaries rather than live object references).
[Problem]

When using `hot_cache_only=True` with very large prompts, the system experienced linear memory growth during the prefill phase, eventually leading to Out-of-Memory (OOM) crashes. This occurred even though the "Frozen" state logic had already been implemented to prevent long-term memory leaks.

[Root cause]

The issue was caused by the storage mechanism for boundary snapshots when no SSD store is available.
1. **Boundary Snapshots**: For non-sliceable caches (e.g., `RotatingKVCache`, `ArraysCache`), the scheduler captures a "frozen" snapshot of the cache state at every block boundary (e.g., every 1024 tokens) to allow partial prefix reuse.
2. **Storage Path**: Normally, these snapshots are offloaded to disk via `BoundarySnapshotSSDStore`. However, in `hot_cache_only` mode, the SSD store is disabled (`self._boundary_snapshot_store = None`).
3. **RAM Accumulation**: Without an SSD store, the scheduler stores these frozen snapshots (which contain actual MLX tensors for all model layers) directly in the `_boundary_cache_snapshots` dictionary in RAM.
4. **Linear Growth**: For a prompt of 100k tokens, the system creates ~100 snapshots. Since each snapshot is a full copy of the recurrent state for all layers, the memory usage grows linearly with the prompt length, leading to OOM.

[Fix]

The boundary snapshot mechanism is now explicitly disabled if no SSD store is available.
Modified the `boundary_enabled` check in `_do_external_prefill` to require `self._boundary_snapshot_store` to be present. If the system is running in `hot_cache_only` mode (or any mode without an SSD store), boundary snapshots are no longer captured.

[Side effects]

- **Standard KVCache**: No impact. Standard caches are sliceable and do not require boundary snapshots for partial prefix reuse.
- **Non-Sliceable Caches (Rotating/Arrays)**: In `hot_cache_only` mode, partial prefix reuse is no longer supported for these specific cache types. They will now only support "Exact Matches" (where the entire prompt is cached). Any partial match will be treated as a cache miss and re-processed.
- **Stability**: Memory usage during prefill is now flat and stable, regardless of prompt length, preventing OOMs in RAM-only mode.
@Landon-Molt

Great work on this PR — your boundary snapshot freezing fix (1b87a99) resolved a TurboQuant KV cache corruption issue we were debugging in #661.

Context: We wired up _apply_turboquant_kv() (which was defined but never called) to convert KVCache -> TurboQuantKVCache via from_cache() after prefill. The conversion quantizes the prefilled KV state, saving ~4-5 GB RAM at 32K context.

The problem: Before your fix, from_cache() would read cache.state to get the KV tensors for quantization, but the stale boundary snapshot references were holding onto the same mutable cache objects. This caused gemma-4-31B to produce garbage output with TQ enabled — even though the quantization itself had <1% reconstruction error.

After your fix: The frozen snapshots break the reference chain, so cache.state returns clean unshared tensors. gemma-4-31B now works correctly with TQ at both concurrency 1 and 4.

Did you encounter similar issues with TurboQuant producing corrupt output before this fix? Curious if the snapshot reference bug was known to interact with TQ.

@RepublicOfKorokke
Author

@Landon-Molt

Thank you for testing my fix.

Did you encounter similar issues with TurboQuant producing corrupt output before this fix? Curious if the snapshot reference bug was known to interact with TQ.

I am currently using Gemma-4 (e2b, e4b, and 26b-a4b) all configured with TQ 3bit.
With the SSD cache enabled, I have occasionally encountered corrupted output.
(Primarily looping; I have not seen any instances of garbled output.)

I cannot definitively confirm whether the cache is the direct cause, but I have found that cleaning up the cache and regenerating the output resolves the issue. (This might also be related to the initial seed value.)

This specific occurrence is rare, and it was not reproduced during my debugging phase of the fix.

The models I am currently using:

  • mlx-community/gemma-4-e2b-it-4bit
  • mlx-community/gemma-4-e4b-it-4bit
  • RepublicOfKorokke/gemma-4-26B-A4B-it-oQ3

@Landon-Molt

Thanks for the reply! Good to know you're using TQ on gemma-4 models.

We dug deeper and found the root cause of the corruption on gemma-4-31B specifically. It turns out the last full-attention layer (layer 59/60) is too sensitive to KV quantization — even 8-bit TQ on that single layer alone produces garbage, while the other 9 full-attention layers quantize fine.

We confirmed this by converting layers one at a time:

  • Layers 5-53 (9 layers): ✅ correct output
  • Layer 59 alone: ❌ garbage
  • All 10 layers: ❌ garbage

The fix: skip the last KVCache layer during TQ conversion. It sits right before the output head, so any quantization error directly impacts logits with no subsequent layers to absorb the perturbation. This gives us 9/10 layers quantized with correct output.

The occasional looping you see with SSD cache might be related — if a cached TQ state for a late layer gets slightly corrupted during SSD store/restore, it could cause the same sensitivity issue. Your fix (cleaning cache) would resolve that by forcing fresh computation.

We're currently testing a PR with this skip-last-layer fix on top of upstream main. Will share results soon.



Development

Successfully merging this pull request may close these issues.

[Feature Request] Enable 'Hot Cache' only (Disable 'Cold Cache')

2 participants