feat(admin): cache probe endpoint for prompt prefix lookup #720
Open
thornad wants to merge 1 commit into jundot:main
Conversation
POST /admin/api/cache/probe accepts {model_id, messages, tools?,
chat_template_kwargs?} and reports how much of the rendered prompt is
already resident in the loaded model's SSD cache, broken down by tier
(hot cache / disk index / cold). The walk chain-hashes each block the
same way the scheduler does at prefill so the answer matches what a
real request would see.
The motivating use case is a cache-aware chat UI: when a user is about to
send (or when branching), show whether the prefill will hit the cache
or pay the full cost. The endpoint works for both batched and VLM engines
and requires admin auth.
Ground truth for "cached" is retrievability via the paged SSD cache
manager — hot_cache for RAM-resident blocks or _index for on-disk
files. BlockAwarePrefixCache._prefix_index is intentionally not
consulted because it survives clear_ssd_cache() and would report
false positives after a manual wipe.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds `POST /admin/api/cache/probe` — an admin-authenticated endpoint that reports how much of a rendered chat prompt is already resident in the loaded model's SSD cache, broken down by tier (hot cache / disk index / cold).

Request body (`CacheProbeRequest`):

```json
{
  "model_id": "GLM-4.7-Flash-PRISM-mlx-8bit",
  "messages": [...],
  "tools": null,
  "chat_template_kwargs": null
}
```

Response:

```json
{
  "model_id": "...",
  "model_loaded": true,
  "total_tokens": 19154,
  "block_size": 256,
  "total_blocks": 75,
  "blocks_ssd_hot": 3,
  "blocks_ssd_disk": 71,
  "blocks_cold": 1,
  "ssd_hit_tokens": 18944,
  "cold_tokens": 210,
  ...
}
```

How it works
The probe renders the prompt via the engine's own `_apply_chat_template` + tokenizer so the token sequence matches what `generate()` would see, then walks the chain-hashed block sequence using `compute_block_hash` — the same hashing the scheduler uses at prefill. For each block it checks the paged SSD cache manager's `_hot_cache` (RAM copy) and `_index` (on-disk files), stopping at the first block that is retrievable from neither (the cache is a contiguous prefix, so everything after that point is necessarily cold).

Ground truth for "cached" is deliberately not `BlockAwarePrefixCache._prefix_index` — that dict survives `clear_ssd_cache()` and would report false positives after a manual wipe. Only the hot cache and disk index are consulted.

Works for both batched and VLM engines (guards `_preprocess_messages` / `_apply_chat_template`, since the VLM engine only implements the latter).

Motivation
I've been building a cache-aware chat client on top of this — canopy — that shows a per-node cache dot on every conversation tree node so you can see which branches / regenerations would hit the cache before committing to them. Branching is nearly free when the shared prefix is cached, and exposing that state visually has been useful for iterating on prompts and system messages.
The probe is independently useful (any tool can ask "will this prompt prefill fast?") but canopy is the reference integration if you want to try it: https://github.com/thornad/canopy/releases/tag/v0.1.0 (DMG in the release).
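The block walk described under "How it works" can be sketched as below. This is an illustrative reimplementation, not the PR's code: `block_hash` stands in for `compute_block_hash` (the real hash may cover more than raw token IDs), and two plain sets stand in for `_hot_cache` and `_index`.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; the PR's response reports block_size 256

def block_hash(parent_hash: str, tokens: list) -> str:
    # Chain hash: folding in the parent's hash means a match implies the
    # entire prefix up to and including this block matches too.
    data = parent_hash.encode() + b"|" + ",".join(map(str, tokens)).encode()
    return hashlib.sha256(data).hexdigest()

def walk_prefix(token_ids, hot_cache, disk_index):
    hot = disk = 0
    parent = ""
    n_full = len(token_ids) // BLOCK_SIZE  # only full blocks are cacheable
    for i in range(n_full):
        h = block_hash(parent, token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        if h in hot_cache:
            hot += 1
        elif h in disk_index:
            disk += 1
        else:
            break  # cache is a contiguous prefix: everything after is cold
        parent = h
    return {"blocks_hot": hot, "blocks_disk": disk,
            "blocks_cold": n_full - hot - disk}

# Simulate: first block resident in RAM, second on disk, third nowhere.
toks = list(range(12))
h1 = block_hash("", toks[0:4])
h2 = block_hash(h1, toks[4:8])
print(walk_prefix(toks, hot_cache={h1}, disk_index={h2}))
# -> {'blocks_hot': 1, 'blocks_disk': 1, 'blocks_cold': 1}
```

Note that because the hash chains through the parent, a stale block in mid-prompt can never be counted as a hit: its hash would differ, the walk breaks, and everything downstream is reported cold.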
Test plan
- Warm-cache probe → nonzero `blocks_ssd_disk` count, `blocks_cold` ≈ 0
- `POST /admin/api/ssd-cache/clear` → all blocks cold (no false positives from `_prefix_index`)
- Model not loaded → returns `{"model_loaded": false, "reason": ...}` gracefully
- VLM engine (no `_preprocess_messages`) → falls through to direct `tokenizer.apply_chat_template`
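For a client integration like canopy, the useful derived quantity from a probe response is the token hit ratio. A minimal sketch of interpreting the response (`summarize_probe` is a hypothetical client-side helper, not part of this PR; field names follow the response schema above):

```python
def summarize_probe(resp: dict) -> dict:
    """Reduce a /admin/api/cache/probe response to the numbers a
    cache-aware UI would display next to a send button."""
    total = resp["total_tokens"]
    hit = resp["ssd_hit_tokens"]
    return {
        "hit_ratio": hit / total if total else 0.0,
        "cold_tokens": resp["cold_tokens"],
        "cached_blocks": resp["blocks_ssd_hot"] + resp["blocks_ssd_disk"],
        "total_blocks": resp["total_blocks"],
    }

# Values from the example response in the Summary section.
example = {
    "total_tokens": 19154, "ssd_hit_tokens": 18944, "cold_tokens": 210,
    "blocks_ssd_hot": 3, "blocks_ssd_disk": 71, "total_blocks": 75,
}
print(summarize_probe(example))  # hit_ratio ≈ 0.989, i.e. near-free prefill
```

A UI can threshold `hit_ratio` (say, green above 0.9) to decide what the per-node cache dot should show before the user commits to a branch.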