fix: Improve VLM prefix cache reuse when adding images#637
latent-variable wants to merge 7 commits into jundot:main
Conversation
I had the same experience: Claude Code sessions with text only achieved great caching, but adding images invalidated the cache on every new turn. (I didn't test this branch, but based on the results I would imagine this PR would solve it.)
Can you provide a concrete example of a flaky conversation cache?
Without images:
With images: Step T+1
With proposed change: Step T+1
@pjay-io yeah, that's basically the failure mode this is targeting. Text-only turns were reusing well, then adding images later would invalidate too much of the prefix. If you get a chance to try this branch on your Claude Code flow, I'd be curious whether it lines up with what you were seeing.
@Ark-kun yeah, concretely it’s closer to this:
Before this PR, once a new image showed up later in the conversation, cache reuse could fall back much farther than necessary. So in a case like this, adding or changing image 2 in block E could force recomputing from around block B onward, even though blocks C and D were just text and had not changed. That means the cost of a later image was not localized to that point in the conversation: it could wipe out reuse for a much larger suffix after the first image boundary. With this change, the cache key is segmented by image-turn boundary.

So in the example above, if block E introduces or changes image 2, we can still reuse A/B/C/D and only invalidate from that later multimodal boundary onward. There was also a separate first-image issue where the prompt-rendering path changed once the first image appeared, which could invalidate an earlier text prefix. The latest commit in this branch keeps that path stable too.
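The boundary-segmented keying idea can be sketched roughly like this (a minimal illustration with invented names, not the PR's actual code; the block representation is an assumption for this example):

```python
import hashlib

def segment_cache_keys(blocks):
    """blocks: list of ("text", str) or ("image", bytes) tuples.

    Emits one rolling key per segment, closing a segment at every
    image-turn boundary. Later keys include all earlier context, so
    changing a late image changes only the keys from its boundary on.
    """
    keys = []
    h = hashlib.sha256()
    for kind, payload in blocks:
        if kind == "image":
            # Close the current segment at the image boundary.
            keys.append(h.hexdigest())
            h = h.copy()  # keep accumulating earlier context
        data = payload if isinstance(payload, bytes) else payload.encode()
        h.update(kind.encode() + b"\x00" + data)
    keys.append(h.hexdigest())
    return keys

# Blocks A..E as in the example: changing image 2 leaves the earlier
# segment keys identical, so the A-D prefix stays reusable.
conv_1 = [("text", "A"), ("text", "B"), ("image", b"img1"),
          ("text", "C"), ("text", "D"), ("image", b"img2")]
conv_2 = conv_1[:-1] + [("image", b"img2-changed")]
k1, k2 = segment_cache_keys(conv_1), segment_cache_keys(conv_2)
assert k1[:-1] == k2[:-1] and k1[-1] != k2[-1]
```

With a single whole-request image key, by contrast, the two conversations would produce entirely different keys and share nothing.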
mdevk left a comment
Interesting — we tested prefix caching for VLMs on oMLX back in March 2026 (v0.2.19 at the time) and found it degraded under real workloads:
| Request | Latency | Notes |
|---|---|---|
| Photo 1 (warm) | 7s | Promising |
| Photo 2 (different image) | 12s | Cache miss |
| Photo 3 (different dimensions) | 72s | Cache pressure regression |
The problem was exactly what you describe: "once new images were added, too much previously reusable context would get invalidated." In our photo extraction workload, every photo has different dimensions and pixel content, so the cache filled, evicted constantly, and the management overhead made it slower than no caching at all.
At the time, we filed it as "prefix caching for VLMs on Apple Silicon isn't ready" (now documented in our lab notebooks). This PR looks like it addresses the core issue — treating image tokens as a separable block rather than invalidating the entire prefix.
Question: For our use case (sequential photo extraction with a shared system prompt but unique images), would this PR enable reusing the ~500 system prompt tokens across photos while only re-processing the image tokens? That would save ~0.5s TTFT per photo (our measured warm system prompt prefill time).
If so, this would be a significant improvement for batch processing workflows. Happy to benchmark it once it's merged.
@mdevk thanks, really helpful context. Yeah, this PR is aimed at that failure mode. For that kind of workflow, the shared system/text prefix should reuse much better, and a new image should only invalidate from the relevant multimodal boundary onward instead of knocking out a much larger suffix. I'd phrase it as better segmented prefix reuse rather than literally only reprocessing image tokens. If you end up benchmarking it on that workload, I'd be curious to see the results.
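For the photo-extraction case specifically, a toy model of segmented keys shows why the shared system prompt should survive across photos (`segments` and `shared_prefix_segments` are made-up names for illustration, not oMLX's API):

```python
def segments(blocks):
    """Split a conversation at image turns; each segment's key here is
    just the cumulative block tuple (a stand-in for a real hash)."""
    segs, current = [], []
    for block in blocks:
        if block[0] == "image":
            segs.append(tuple(current))
        current.append(block)
    segs.append(tuple(current))
    return segs

def shared_prefix_segments(a, b):
    """Count the leading segments two requests have in common."""
    n = 0
    for x, y in zip(segments(a), segments(b)):
        if x != y:
            break
        n += 1
    return n

# Two extraction requests: same system prompt, different photo.
req_1 = [("text", "SYSTEM PROMPT"), ("image", "photo_1.jpg")]
req_2 = [("text", "SYSTEM PROMPT"), ("image", "photo_2.jpg")]
print(shared_prefix_segments(req_1, req_2))  # 1: the system-prompt segment reuses
```

Under this model, each new photo misses only on its own image segment, so the per-photo saving should approach the warm system-prompt prefill time.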
…clean

# Conflicts:
#	omlx/request.py

Why
I've been using oMLX daily for local multimodal agent workflows with Qwen3.5-122B-A10B, and I kept running into a specific cache issue in longer image-heavy sessions: once new images were added, too much previously reusable context would get invalidated, leading to large avoidable re-prefills.

While digging into that behavior, the main issue I found was that VLM image state was being keyed too coarsely for multi-image conversations.
Change
This updates VLM prefix cache keying to be segmented by image turn instead of treating image state as one whole-request cache key.
That means a new or changed image only invalidates the cache from its own image-turn boundary onward; earlier text-only segments remain reusable.
Result
This keeps exact cache behavior, but significantly improves reuse for long multimodal conversations where images are added over time.
Before
After
Methodology
Observed
14.1% cache efficiency → 76.3% cache efficiency
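The before/after invalidation behavior described in this PR can be sketched as follows (helper names are invented for illustration; the real invalidation logic in oMLX is more involved):

```python
def first_diff(old, new):
    """Index of the first block that differs between two conversations."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return min(len(old), len(new))

def coarse_invalidate(old, new):
    # Pre-PR behavior as described above: a later image change falls
    # back to the first image boundary, wiping out the text blocks
    # in between even though they did not change.
    d = first_diff(old, new)
    first_img = next((i for i, (k, _) in enumerate(old) if k == "image"), d)
    return min(d, first_img)

def segmented_invalidate(old, new):
    # Post-PR behavior: recompute only from the changed block onward.
    return first_diff(old, new)

old = [("text", "A"), ("text", "B"), ("image", "img1"),
       ("text", "C"), ("text", "D"), ("image", "img2")]
new = old[:-1] + [("image", "img2-changed")]

print(coarse_invalidate(old, new))     # 2: recompute from the first image
print(segmented_invalidate(old, new))  # 5: recompute only the changed image turn
```

The gap between those two recompute points is exactly the reuse the reported cache-efficiency numbers reflect.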