fix: Improve VLM prefix cache reuse when adding images#637
latent-variable wants to merge 7 commits into jundot:main
Conversation
I had the same experience: Claude Code sessions with text only achieved great caching, but adding images invalidated the cache on every new turn. (I didn't test this branch, but based on the results I would imagine this PR would solve it.)
Can you provide a concrete example of a flaky conversation cache?
Without images:
With images: Step T+1
With proposed change: Step T+1
@pjay-io yeah, that's basically the failure mode this is targeting. Text-only turns were reusing well, then adding images later would invalidate too much of the prefix. If you get a chance to try this branch on your Claude Code flow, I'd be curious whether it lines up with what you were seeing.
@Ark-kun yeah, concretely it’s closer to this:
Before this PR, once a new image showed up later in the conversation, cache reuse could fall back much farther than necessary. So in a case like this, adding or changing image 2 in block E could force recomputing from around block B onward, even though blocks C and D were just text and had not changed. That means the cost of a later image was not localized to that point in the conversation: it could wipe out reuse for a much larger suffix after the first image boundary. With this change, the cache key is segmented by image-turn boundary.

So in the example above, if block E introduces or changes image 2, we can still reuse A/B/C/D and only invalidate from that later multimodal boundary onward. There was also a separate first-image issue where the prompt-rendering path changed once the first image appeared, which could invalidate an earlier text prefix. The latest commit in this branch keeps that path stable too.
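The boundary-segmented keying idea can be sketched roughly like this (a minimal illustration with invented names, not the PR's actual code; the block representation is an assumption for this example):

```python
import hashlib

def segment_cache_keys(blocks):
    """blocks: list of ("text", str) or ("image", bytes) tuples.

    Emits one rolling key per segment, closing a segment at every
    image-turn boundary. Later keys include all earlier context, so
    changing a late image changes only the keys from its boundary on.
    """
    keys = []
    h = hashlib.sha256()
    for kind, payload in blocks:
        if kind == "image":
            # Close the current segment at the image boundary.
            keys.append(h.hexdigest())
            h = h.copy()  # keep accumulating earlier context
        data = payload if isinstance(payload, bytes) else payload.encode()
        h.update(kind.encode() + b"\x00" + data)
    keys.append(h.hexdigest())
    return keys

# Blocks A..E as in the example: changing image 2 leaves the earlier
# segment keys identical, so the A-D prefix stays reusable.
conv_1 = [("text", "A"), ("text", "B"), ("image", b"img1"),
          ("text", "C"), ("text", "D"), ("image", b"img2")]
conv_2 = conv_1[:-1] + [("image", b"img2-changed")]
k1, k2 = segment_cache_keys(conv_1), segment_cache_keys(conv_2)
assert k1[:-1] == k2[:-1] and k1[-1] != k2[-1]
```

With a single whole-request image key, by contrast, the two conversations would produce entirely different keys and share nothing.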
mdevk left a comment
Interesting — we tested prefix caching for VLMs on oMLX back in March 2026 (v0.2.19 at the time) and found it degraded under real workloads:
| Request | Latency | Notes |
|---|---|---|
| Photo 1 (warm) | 7s | Promising |
| Photo 2 (different image) | 12s | Cache miss |
| Photo 3 (different dimensions) | 72s | Cache pressure regression |
The problem was exactly what you describe: "once new images were added, too much previously reusable context would get invalidated." In our photo extraction workload, every photo has different dimensions and pixel content, so the cache filled, evicted constantly, and the management overhead made it slower than no caching at all.
At the time, we filed it as "prefix caching for VLMs on Apple Silicon isn't ready" (now documented in our lab notebooks). This PR looks like it addresses the core issue — treating image tokens as a separable block rather than invalidating the entire prefix.
Question: For our use case (sequential photo extraction with a shared system prompt but unique images), would this PR enable reusing the ~500 system prompt tokens across photos while only re-processing the image tokens? That would save ~0.5s TTFT per photo (our measured warm system prompt prefill time).
If so, this would be a significant improvement for batch processing workflows. Happy to benchmark it once it's merged.
@mdevk thanks, really helpful context. Yeah, this PR is aimed at that failure mode. For that kind of workflow, the shared system/text prefix should reuse much better, and a new image should only invalidate from the relevant multimodal boundary onward instead of knocking out a much larger suffix. I'd phrase it as better segmented prefix reuse rather than literally only reprocessing image tokens. If you end up benchmarking it on that workload, I'd be curious to see the results.
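For the photo-extraction case specifically, a toy model of segmented keys shows why the shared system prompt should survive across photos (`segments` and `shared_prefix_segments` are made-up names for illustration, not oMLX's API):

```python
def segments(blocks):
    """Split a conversation at image turns; each segment's key here is
    just the cumulative block tuple (a stand-in for a real hash)."""
    segs, current = [], []
    for block in blocks:
        if block[0] == "image":
            segs.append(tuple(current))
        current.append(block)
    segs.append(tuple(current))
    return segs

def shared_prefix_segments(a, b):
    """Count the leading segments two requests have in common."""
    n = 0
    for x, y in zip(segments(a), segments(b)):
        if x != y:
            break
        n += 1
    return n

# Two extraction requests: same system prompt, different photo.
req_1 = [("text", "SYSTEM PROMPT"), ("image", "photo_1.jpg")]
req_2 = [("text", "SYSTEM PROMPT"), ("image", "photo_2.jpg")]
print(shared_prefix_segments(req_1, req_2))  # 1: the system-prompt segment reuses
```

Under this model, each new photo misses only on its own image segment, so the per-photo saving should approach the warm system-prompt prefill time.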
…clean

# Conflicts:
#	omlx/request.py

Why
I've been using oMLX daily for local multimodal agent workflows with Qwen3.5-122B-A10B, and I kept running into a specific cache issue in longer image-heavy sessions: once new images were added, too much previously reusable context would get invalidated, leading to large avoidable re-prefills.

While digging into that behavior, the main issue I found was that VLM image state was being keyed too coarsely for multi-image conversations.
Change
This updates VLM prefix cache keying to be segmented by image turn instead of treating image state as one whole-request cache key.
That means a new or changed image only invalidates the cache from its own image-turn boundary onward; earlier text-only segments remain reusable.
Result
This keeps exact cache behavior, but significantly improves reuse for long multimodal conversations where images are added over time.
Before
After
Methodology
Observed
14.1% cache efficiency → 76.3% cache efficiency
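The before/after invalidation behavior described in this PR can be sketched as follows (helper names are invented for illustration; the real invalidation logic in oMLX is more involved):

```python
def first_diff(old, new):
    """Index of the first block that differs between two conversations."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return min(len(old), len(new))

def coarse_invalidate(old, new):
    # Pre-PR behavior as described above: a later image change falls
    # back to the first image boundary, wiping out the text blocks
    # in between even though they did not change.
    d = first_diff(old, new)
    first_img = next((i for i, (k, _) in enumerate(old) if k == "image"), d)
    return min(d, first_img)

def segmented_invalidate(old, new):
    # Post-PR behavior: recompute only from the changed block onward.
    return first_diff(old, new)

old = [("text", "A"), ("text", "B"), ("image", "img1"),
       ("text", "C"), ("text", "D"), ("image", "img2")]
new = old[:-1] + [("image", "img2-changed")]

print(coarse_invalidate(old, new))     # 2: recompute from the first image
print(segmented_invalidate(old, new))  # 5: recompute only the changed image turn
```

The gap between those two recompute points is exactly the reuse the reported cache-efficiency numbers reflect.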