
Conversation

@charitarthchugh (Contributor) commented Sep 25, 2025

The goal of this PR is to supplement the default vLLM arguments to increase inference speed. I have tested this on an L40S and it has been working fine, with more than two weeks of combined uptime running PDF conversion.

Changes proposed in this pull request:

  • --limit-mm-per-prompt lets vLLM skip allocating GPU memory for the video encoder, leaving more room for the KV cache.
  • --enable-chunked-prefill splits prefill of the prompt tokens into chunks instead of processing them all at once, allowing for more efficient memory use. (Both settings are illustrated in the sketch after this list.)
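
For reference, here is a minimal sketch of how these two settings map onto vLLM's offline Python API; the model name and the exact multimodal limits are illustrative assumptions, not values taken from this PR:

```python
# Minimal sketch (assumed model name and limits; adjust for your deployment).
# --limit-mm-per-prompt   -> limit_mm_per_prompt: cap multimodal inputs per prompt;
#                            setting "video" to 0 means no GPU memory is reserved
#                            for the video encoder, freeing space for the KV cache.
# --enable-chunked-prefill -> enable_chunked_prefill: prefill long prompts in chunks.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",             # assumed VLM; substitute the real model
    limit_mm_per_prompt={"image": 1, "video": 0},  # one image, no video encoder allocation
    enable_chunked_prefill=True,                   # stream prompt tokens through prefill
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate("Describe the layout of this page.", params)
print(outputs[0].outputs[0].text)
```

The server-side flags proposed in this PR correspond to the same engine arguments; the constructor form is shown here only because its value types (dict and bool) make the intent explicit.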

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

@jakep-allenai (Collaborator) commented:

Sorry, been sick, just getting to this now. Let me run it locally and see what the numbers are looking like.

@jakep-allenai (Collaborator) commented:

--limit-mm-per-prompt sounds like a great idea, but it seems like vLLM V1 is going to have chunked prefill always on by default. I'm waiting for benchmarks to run on our cluster here.

@jakep-allenai (Collaborator) commented:

OK, nice find on disabling the video encoder; it really makes a difference on smaller GPU cards! There's a small syntax error, but I'm going to merge this in, as I want to push a PR today that moves things over to vLLM 0.11 officially as well.

@jakep-allenai merged commit 2b70b50 into allenai:main on Oct 6, 2025
8 of 10 checks passed