feat(admin): cache probe endpoint for prompt prefix lookup #720
Open
thornad wants to merge 1 commit into jundot:main
Conversation
POST /admin/api/cache/probe accepts {model_id, messages, tools?,
chat_template_kwargs?} and reports how much of the rendered prompt is
already resident in the loaded model's SSD cache, broken down by tier
(hot cache / disk index / cold). The walk chain-hashes each block the
same way the scheduler does at prefill so the answer matches what a
real request would see.
The motivating use case is a cache-aware chat UI: when a user is about to
send (or when branching), show whether the prefill will hit the cache
or pay the full cost. The endpoint works for both batched and VLM engines
and requires admin auth.
Ground truth for "cached" is retrievability via the paged SSD cache
manager — hot_cache for RAM-resident blocks or _index for on-disk
files. BlockAwarePrefixCache._prefix_index is intentionally not
consulted because it survives clear_ssd_cache() and would report
false positives after a manual wipe.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds `POST /admin/api/cache/probe` — an admin-authenticated endpoint that reports how much of a rendered chat prompt is already resident in the loaded model's SSD cache, broken down by tier (hot cache / disk index / cold).

Request body (`CacheProbeRequest`):

```json
{
  "model_id": "GLM-4.7-Flash-PRISM-mlx-8bit",
  "messages": [...],
  "tools": null,
  "chat_template_kwargs": null
}
```

Response:

```json
{
  "model_id": "...",
  "model_loaded": true,
  "total_tokens": 19154,
  "block_size": 256,
  "total_blocks": 75,
  "blocks_ssd_hot": 3,
  "blocks_ssd_disk": 71,
  "blocks_cold": 1,
  "ssd_hit_tokens": 18944,
  "cold_tokens": 210,
  ...
}
```

How it works
The probe renders the prompt via the engine's own `_apply_chat_template` + tokenizer so the token sequence matches what `generate()` would see, then walks the chain-hashed block sequence using `compute_block_hash` — the same hashing the scheduler uses at prefill. For each block it checks the paged SSD cache manager's `_hot_cache` (RAM copy) and `_index` (on-disk files), stopping at the first block that is retrievable from neither (the cache is a contiguous prefix, so everything after that point is necessarily cold).

Ground truth for "cached" is deliberately not `BlockAwarePrefixCache._prefix_index` — that dict survives `clear_ssd_cache()` and would report false positives after a manual wipe. Only the hot cache and disk index are consulted.

Works for both batched and VLM engines (guards `_preprocess_messages` / `_apply_chat_template`, since the VLM engine only implements the latter).

Motivation
I've been building a cache-aware chat client on top of this — canopy — that shows a per-node cache dot on every conversation tree node so you can see which branches / regenerations would hit the cache before committing to them. Branching is nearly free when the shared prefix is cached, and exposing that state visually has been useful for iterating on prompts and system messages.
The probe is independently useful (any tool can ask "will this prompt prefill fast?") but canopy is the reference integration if you want to try it: https://github.com/thornad/canopy/releases/tag/v0.1.0 (DMG in the release).
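The block walk described under "How it works" can be sketched as below. This is an illustrative reimplementation, not the PR's code: `block_hash` stands in for `compute_block_hash` (the real hash may cover more than raw token IDs), and two plain sets stand in for `_hot_cache` and `_index`.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; the PR's response reports block_size 256

def block_hash(parent_hash: str, tokens: list) -> str:
    # Chain hash: folding in the parent's hash means a match implies the
    # entire prefix up to and including this block matches too.
    data = parent_hash.encode() + b"|" + ",".join(map(str, tokens)).encode()
    return hashlib.sha256(data).hexdigest()

def walk_prefix(token_ids, hot_cache, disk_index):
    hot = disk = 0
    parent = ""
    n_full = len(token_ids) // BLOCK_SIZE  # only full blocks are cacheable
    for i in range(n_full):
        h = block_hash(parent, token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        if h in hot_cache:
            hot += 1
        elif h in disk_index:
            disk += 1
        else:
            break  # cache is a contiguous prefix: everything after is cold
        parent = h
    return {"blocks_hot": hot, "blocks_disk": disk,
            "blocks_cold": n_full - hot - disk}

# Simulate: first block resident in RAM, second on disk, third nowhere.
toks = list(range(12))
h1 = block_hash("", toks[0:4])
h2 = block_hash(h1, toks[4:8])
print(walk_prefix(toks, hot_cache={h1}, disk_index={h2}))
# -> {'blocks_hot': 1, 'blocks_disk': 1, 'blocks_cold': 1}
```

Note that because the hash chains through the parent, a stale block in mid-prompt can never be counted as a hit: its hash would differ, the walk breaks, and everything downstream is reported cold.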
Test plan
- Warm-cache probe → nonzero `blocks_ssd_disk` count, `blocks_cold` ≈ 0
- `POST /admin/api/ssd-cache/clear` → all blocks cold (no false positives from `_prefix_index`)
- Model not loaded → returns `{"model_loaded": false, "reason": ...}` gracefully
- VLM engine (no `_preprocess_messages`) → falls through to direct `tokenizer.apply_chat_template`
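For a client integration like canopy, the useful derived quantity from a probe response is the token hit ratio. A minimal sketch of interpreting the response (`summarize_probe` is a hypothetical client-side helper, not part of this PR; field names follow the response schema above):

```python
def summarize_probe(resp: dict) -> dict:
    """Reduce a /admin/api/cache/probe response to the numbers a
    cache-aware UI would display next to a send button."""
    total = resp["total_tokens"]
    hit = resp["ssd_hit_tokens"]
    return {
        "hit_ratio": hit / total if total else 0.0,
        "cold_tokens": resp["cold_tokens"],
        "cached_blocks": resp["blocks_ssd_hot"] + resp["blocks_ssd_disk"],
        "total_blocks": resp["total_blocks"],
    }

# Values from the example response in the Summary section.
example = {
    "total_tokens": 19154, "ssd_hit_tokens": 18944, "cold_tokens": 210,
    "blocks_ssd_hot": 3, "blocks_ssd_disk": 71, "total_blocks": 75,
}
print(summarize_probe(example))  # hit_ratio ≈ 0.989, i.e. near-free prefill
```

A UI can threshold `hit_ratio` (say, green above 0.9) to decide what the per-node cache dot should show before the user commits to a branch.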