fix: Improve VLM prefix cache reuse when adding images#637

Open
latent-variable wants to merge 7 commits into jundot:main from latent-variable:codex/vlm-cache-pr-clean

Conversation

@latent-variable
Contributor

@latent-variable latent-variable commented Apr 7, 2026

Why

I’ve been using oMLX daily for local multimodal agent workflows with Qwen3.5-122B-A10B, and I kept running into a specific cache issue in longer image-heavy sessions: once new images were added, too much previously reusable context would get invalidated, leading to large avoidable re-prefills.

While digging into that behavior, the main issue I found was that VLM image state was being keyed too coarsely for multi-image conversations.

Change

This updates VLM prefix cache keying to be segmented by image turn instead of treating image state as one whole-request cache key.

That means:

  • earlier text-only prefixes can still be reused
  • earlier image turns can still be reused
  • adding a later image invalidates only from the relevant multimodal boundary onward, rather than unnecessarily invalidating a larger earlier prefix
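The keying idea above can be illustrated with a minimal, self-contained sketch (this is not oMLX's actual code; `segment_keys` and the block layout are hypothetical). Each block's cache key is salted with a hash of the cumulative image state up to and including that block, so text-only blocks before the first image keep a stable, unsalted key, and a later image only perturbs keys from its own boundary onward:

```python
import hashlib

def segment_keys(blocks):
    """Compute one prefix-cache key per conversation block.

    `blocks` is a list of (text, image_bytes_or_None) tuples. The salt is a
    running hash over all images seen so far, so adding a later image only
    changes keys from that multimodal boundary onward.
    """
    keys = []
    image_state = hashlib.sha256()  # cumulative hash over images seen so far
    for text, image in blocks:
        if image is not None:
            image_state.update(image)  # advance to a new image-state segment
        salt = image_state.hexdigest()
        keys.append(hashlib.sha256((salt + "\x00" + text).encode()).hexdigest())
    return keys
```

Extending a conversation with a new image then leaves every earlier block's key, and therefore its cached prefix, untouched.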

Result

This preserves exact cache semantics, but significantly improves reuse for long multimodal conversations where images are added over time.

Before

Screenshot 2026-04-07 at 12 45 39

After

Screenshot 2026-04-07 at 12 49 46

Methodology

  • Replayed the same local multimodal agent workflow on both builds: upstream baseline and this branch.
  • Used the same model, same harness, same turn order, and the same mix of text requests, tool/repo inspection, and local PNG resume reads.
  • The sequence intentionally alternated normal text turns with appended image reads to measure how much previously processed context remained reusable after later image inputs.
  • Each run used an isolated oMLX base path/cache directory so the results were not contaminated by prior cache state.

Observed

  • Baseline: 14.1% cache efficiency
  • This branch: 76.3% cache efficiency

@latent-variable latent-variable changed the title Segment VLM prefix cache keys by image turn Improve VLM prefix cache reuse when adding later image Apr 7, 2026
@latent-variable latent-variable deleted the codex/vlm-cache-pr-clean branch April 7, 2026 19:20
@latent-variable latent-variable restored the codex/vlm-cache-pr-clean branch April 7, 2026 19:20
@latent-variable
Contributor Author

Screenshot 2026-04-08 at 08 52 08

Latest results after keeping the VLM chat rendering path stable before the first image, so the first image can no longer cause a large cache miss.

@latent-variable latent-variable changed the title Improve VLM prefix cache reuse when adding later image fix: Improve VLM prefix cache reuse when adding later image Apr 8, 2026
@latent-variable latent-variable changed the title fix: Improve VLM prefix cache reuse when adding later image fix: Improve VLM prefix cache reuse when adding images Apr 8, 2026
@pjay-io

pjay-io commented Apr 8, 2026

I had the same experience: Claude Code sessions with text only achieved great caching, but adding images made the cache go invalid every new turn. (Didn't test this branch btw, but based on the results I would imagine this would be solved with this PR.)

@Ark-kun

Ark-kun commented Apr 8, 2026

Can you provide a concrete example of the flaky conversation cache?
Example:

Without images:
Written to cache: xxx
Reading from cache: xxx
Reused prefix: xxx

With images:
Step T:
Written to cache: x1
Reading from cache: xq1q2
Reused prefix: x

Step T+1
Written to cache: x12
Reading from cache: xq1q2q3
Reused prefix: x

With proposed change:
Step T:
Written to cache: x1
Reading from cache: x12
Reused prefix: x1

Step T+1
Written to cache: x12
Reading from cache: x123
Reused prefix: x12

@latent-variable
Contributor Author

I had the same experience: Claude Code sessions with text only achieved great caching, but adding images made the cache go invalid every new turn. (Didn't test this branch btw, but based on the results I would imagine this would be solved with this PR.)

@pjay-io yeah, that’s basically the failure mode this is targeting. Text-only turns were reusing well, then adding images later would invalidate too much of the prefix.

If you get a chance to try this branch on your Claude Code flow, I'd be curious whether it lines up with what you were seeing.

@latent-variable
Contributor Author

@Ark-kun yeah, concretely it’s closer to this:

  • block A = system + early text-only turns
  • block B = the turn where image 1 is first introduced
  • block C = later text-only turns
  • block D = more later text-only turns
  • block E = a later turn that introduces image 2

Before this PR, once a new image showed up later in the conversation, cache reuse could fall back much farther than necessary.

So in a case like this, adding or changing image 2 in block E could force recomputing from around block B onward, even though blocks C and D were just text and had not changed.

That means the cost of a later image was not localized to that point in the conversation; it could wipe out reuse for a much larger suffix after the first image boundary.

With this change, the cache key is segmented by image-turn boundary:

  • blocks before the first image boundary stay unsalted
  • blocks after image 1 are keyed by cumulative state through image 1
  • later text-only blocks after image 1 reuse under that same segment
  • only when image 2 appears do we advance to a new cumulative image-state key

So in the example above, if block E introduces or changes image 2, we can still reuse A/B/C/D and only invalidate from that later multimodal boundary onward.

There was also a separate first-image issue where the prompt rendering path changed once the first image showed up, which could invalidate an earlier text prefix. The latest commit in this branch keeps that path stable too.
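Given per-block keys like the A–E layout above, finding the reuse boundary reduces to a longest-common-prefix over the old and new key sequences. A minimal sketch (the function name and block labels are illustrative, not oMLX's actual code):

```python
def reusable_prefix(old_keys, new_keys):
    """Count leading segments whose cache keys match.

    Only segments past this boundary need to be re-prefilled; everything
    before it is served from the prefix cache.
    """
    n = 0
    for a, b in zip(old_keys, new_keys):
        if a != b:
            break
        n += 1
    return n

# Blocks A-D are unchanged; block E introduces a changed image 2, so only
# block E is invalidated and A-D reuse:
old = ["A", "B:img1", "C", "D", "E:img2-v1"]
new = ["A", "B:img1", "C", "D", "E:img2-v2"]
```

Under the pre-PR whole-request keying, a changed image in block E would instead perturb every key from block B onward, collapsing the common prefix to just block A.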

Contributor

@mdevk mdevk left a comment


Interesting — we tested prefix caching for VLMs on oMLX back in March 2026 (v0.2.19 at the time) and found it degraded under real workloads:

| Request | Latency | Notes |
| --- | --- | --- |
| Photo 1 (warm) | 7s | Promising |
| Photo 2 (different image) | 12s | Cache miss |
| Photo 3 (different dimensions) | 72s | Cache pressure regression |

The problem was exactly what you describe: "once new images were added, too much previously reusable context would get invalidated." In our photo extraction workload, every photo has different dimensions and pixel content, so the cache filled, evicted constantly, and the management overhead made it slower than no caching at all.

At the time, we filed it as "prefix caching for VLMs on Apple Silicon isn't ready" (now documented in our lab notebooks). This PR looks like it addresses the core issue — treating image tokens as a separable block rather than invalidating the entire prefix.

Question: For our use case (sequential photo extraction with a shared system prompt but unique images), would this PR enable reusing the ~500 system prompt tokens across photos while only re-processing the image tokens? That would save ~0.5s TTFT per photo (our measured warm system prompt prefill time).

If so, this would be a significant improvement for batch processing workflows. Happy to benchmark it once it's merged.

@latent-variable
Contributor Author

@mdevk thanks, really helpful context.

Yeah, this PR is aimed at exactly that failure mode.

For that kind of workflow, the shared system/text prefix should reuse much better, and a new image should only invalidate from the relevant multimodal boundary onward instead of knocking out a much larger suffix.

I'd just phrase it as better segmented prefix reuse rather than literally only reprocessing image tokens.

If you end up benchmarking it on that workload, I'd be curious to see the results.

