
feat(admin): cache probe endpoint for prompt prefix lookup#720

Open
thornad wants to merge 1 commit into jundot:main from thornad:feat/cache-probe-api

Conversation


@thornad thornad commented Apr 11, 2026

Summary

Adds POST /admin/api/cache/probe — an admin-authenticated endpoint that reports how much of a rendered chat prompt is already resident in the loaded model's SSD cache, broken down by tier (hot cache / disk index / cold).

Request body (CacheProbeRequest):

{
  "model_id": "GLM-4.7-Flash-PRISM-mlx-8bit",
  "messages": [...],
  "tools": null,
  "chat_template_kwargs": null
}

Response:

{
  "model_id": "...",
  "model_loaded": true,
  "total_tokens": 19154,
  "block_size": 256,
  "total_blocks": 75,
  "blocks_ssd_hot": 3,
  "blocks_ssd_disk": 71,
  "blocks_cold": 1,
  "ssd_hit_tokens": 18944,
  "cold_tokens": 210,
  ...
}
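The tier counts in the response are internally consistent: blocks partition the prompt, and the token tallies follow from whole-block accounting. A small sketch checking that arithmetic against the example values above (the field relationships are inferred from those numbers, not stated explicitly in the code):

```python
import math

# Sample probe response (values from the example above)
resp = {
    "total_tokens": 19154,
    "block_size": 256,
    "total_blocks": 75,
    "blocks_ssd_hot": 3,
    "blocks_ssd_disk": 71,
    "blocks_cold": 1,
    "ssd_hit_tokens": 18944,
    "cold_tokens": 210,
}

# total_blocks covers the prompt in fixed-size blocks (last block may be partial)
assert resp["total_blocks"] == math.ceil(resp["total_tokens"] / resp["block_size"])

# hot + disk + cold partitions the block list
assert (resp["blocks_ssd_hot"] + resp["blocks_ssd_disk"] + resp["blocks_cold"]
        == resp["total_blocks"])

# ssd_hit_tokens counts whole cached blocks; the remainder of the prompt is cold
hit = (resp["blocks_ssd_hot"] + resp["blocks_ssd_disk"]) * resp["block_size"]
assert resp["ssd_hit_tokens"] == hit
assert resp["cold_tokens"] == resp["total_tokens"] - hit
```

Note the partial trailing block (210 tokens here) can never be a full cached block, so it lands in the cold count.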

How it works

The probe renders the prompt via the engine's own _apply_chat_template + tokenizer so the token sequence matches what generate() would see, then walks the chain-hashed block sequence using compute_block_hash — the same hashing the scheduler uses at prefill. For each block it checks the paged SSD cache manager's _hot_cache (RAM copy) and _index (on-disk files), stopping at the first block that is retrievable from neither (the cache is a contiguous prefix, so everything after that point is necessarily cold).
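The walk above can be sketched as follows. `compute_block_hash`, `_hot_cache`, and `_index` are the engine internals named in this PR; the hash function, signatures, and container types here are illustrative stand-ins, not the real API:

```python
import math

BLOCK_SIZE = 256  # matches block_size in the example response

def probe_blocks(token_ids, hot_cache, disk_index):
    """Walk chain-hashed full blocks; stop at the first block retrievable
    from neither tier (the cache is a contiguous prefix, so everything
    after the first miss is necessarily cold)."""
    total_blocks = math.ceil(len(token_ids) / BLOCK_SIZE)
    full_blocks = len(token_ids) // BLOCK_SIZE  # only full blocks are cacheable
    hot = disk = 0
    prev_hash = None
    for i in range(full_blocks):
        block = tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        # Chain hash: each block's key depends on all blocks before it,
        # mirroring what the scheduler computes at prefill.
        block_hash = hash((prev_hash, block))
        prev_hash = block_hash
        if block_hash in hot_cache:
            hot += 1
        elif block_hash in disk_index:
            disk += 1
        else:
            break  # first miss: stop, the rest is cold
    return {
        "total_blocks": total_blocks,
        "blocks_ssd_hot": hot,
        "blocks_ssd_disk": disk,
        "blocks_cold": total_blocks - hot - disk,
    }
```

Because the key chains through `prev_hash`, a block only counts as cached when its entire prefix matches, which is what makes the early `break` sound.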

Ground truth for "cached" is deliberately not BlockAwarePrefixCache._prefix_index — that dict survives clear_ssd_cache() and would report false positives after a manual wipe. Only hot_cache + disk index are consulted.

Works for both batched and VLM engines (guards _preprocess_messages / _apply_chat_template since the VLM engine only implements the latter).
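A rough sketch of that guard, assuming the method names from this PR (`_preprocess_messages`, `_apply_chat_template`) and a plain `tokenizer` attribute; the exact signatures are assumptions for illustration:

```python
def render_prompt_tokens(engine, messages, tools=None, chat_template_kwargs=None):
    """Render messages to token ids the same way generate() would,
    guarding for engines that implement only some of the hooks."""
    kwargs = chat_template_kwargs or {}
    if hasattr(engine, "_preprocess_messages"):
        messages = engine._preprocess_messages(messages)
    if hasattr(engine, "_apply_chat_template"):
        text = engine._apply_chat_template(messages, tools=tools, **kwargs)
        return engine.tokenizer.encode(text)
    # Fall through to the tokenizer's own chat template (the VLM path
    # exercised in the test plan below)
    return engine.tokenizer.apply_chat_template(
        messages, tools=tools, tokenize=True, **kwargs
    )
```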

Motivation

I've been building a cache-aware chat client on top of this — canopy — that shows a per-node cache dot on every conversation tree node so you can see which branches / regenerations would hit the cache before committing to them. Branching is nearly free when the shared prefix is cached, and exposing that state visually has been useful for iterating on prompts and system messages.

The probe is independently useful (any tool can ask "will this prompt prefill fast?") but canopy is the reference integration if you want to try it: https://github.com/thornad/canopy/releases/tag/v0.1.0 (DMG in the release).

Test plan

  • Probe on a chat with all blocks on SSD → correct blocks_ssd_disk count, blocks_cold ≈ 0
  • Probe after POST /admin/api/ssd-cache/clear → all blocks cold (no false positives from _prefix_index)
  • Probe with a model that's not loaded → returns {"model_loaded": false, "reason": ...} gracefully
  • Probe with a VLM model (no _preprocess_messages) → falls through to direct tokenizer.apply_chat_template
  • Probe with no messages → returns zero-block payload cleanly
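For anyone reproducing the test plan by hand, a minimal client sketch: the endpoint path and request shape come from this PR, while the host, port, and bearer-token header are assumptions about the deployment:

```python
import json
import urllib.request

def build_probe_body(model_id, messages, tools=None, chat_template_kwargs=None):
    # Request shape from the CacheProbeRequest example above
    return json.dumps({
        "model_id": model_id,
        "messages": messages,
        "tools": tools,
        "chat_template_kwargs": chat_template_kwargs,
    }).encode()

def probe(model_id, messages, base_url="http://localhost:8080", token="ADMIN_TOKEN"):
    req = urllib.request.Request(
        f"{base_url}/admin/api/cache/probe",
        data=build_probe_body(model_id, messages),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # admin auth scheme assumed
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. right after POST /admin/api/ssd-cache/clear, a probe should report
# blocks_ssd_hot == 0 and blocks_ssd_disk == 0 (everything cold).
```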

POST /admin/api/cache/probe accepts {model_id, messages, tools?,
chat_template_kwargs?} and reports how much of the rendered prompt is
already resident in the loaded model's SSD cache, broken down by tier
(hot cache / disk index / cold). The walk chain-hashes each block the
same way the scheduler does at prefill so the answer matches what a
real request would see.

Motivating use case is a cache-aware chat UI: when a user is about to
send (or when branching), show whether the prefill will hit the cache
or pay the full cost. Works for both batched and VLM engines and
requires admin auth.

Ground truth for "cached" is retrievability via the paged SSD cache
manager — hot_cache for RAM-resident blocks or _index for on-disk
files. BlockAwarePrefixCache._prefix_index is intentionally not
consulted because it survives clear_ssd_cache() and would report
false positives after a manual wipe.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
