
Conversation

@ClementeP

I updated vllm.py to support the V1 engine. By default, the V0 engine is still used unless FORCE_V0 is set to False and the V1 engine is available in the current version (envs.VLLM_USE_V1 is True). Since accessing the logprobs is slow in V1, we only retrieve the logprobs for the most probable tokens; the exact number of retrieved tokens can be controlled with LOGPROBS_PER_REQUEST (256 by default). The remaining probability mass is distributed among the remaining tokens.
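
To illustrate the truncated-logprob handling described above, here is a minimal sketch of how the leftover mass could be redistributed; the function name, the uniform spread over the unretrieved tokens, and the tensor layout are illustrative assumptions, not the exact code in this PR:

```python
import math

import torch

LOGPROBS_PER_REQUEST = 256  # number of top logprobs actually retrieved from the V1 engine


def expand_top_logprobs(top_logprobs: dict[int, float], vocab_size: int) -> torch.Tensor:
    """Build a full-vocabulary log-probability vector from the top-k logprobs
    returned by V1, spreading the leftover mass over the tokens we didn't retrieve."""
    out = torch.full((vocab_size,), float("-inf"), dtype=torch.float32)
    covered = 0.0
    for token_id, lp in top_logprobs.items():
        out[token_id] = lp
        covered += math.exp(lp)
    n_rest = vocab_size - len(top_logprobs)
    remaining = max(1.0 - covered, 0.0)
    if n_rest > 0 and remaining > 0.0:
        # Assumption: the leftover probability mass is shared uniformly by the tail tokens.
        out[torch.isinf(out)] = math.log(remaining / n_rest)
    return out
```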

IMPORTANT NOTE: in order to support the gpt-oss mxfp4 quantization (vllm has worked out a way to make it work on the A100, and mxfp4 is currently the only supported quantization), I had to update the dependencies in pyproject.toml to allow vllm 0.10.2. However, the current V0 implementation does not support vllm > 0.10.0 (the "disable_log_requests" option is no longer supported), which means that to use the V0 engine you need vllm <= 0.10.0.
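
For context, a hedged sketch of how one might check at runtime whether the installed vllm still supports the V0 path; the helper name and the exact version cutoff come from the note above, not from code in this PR:

```python
from importlib.metadata import version

from packaging.version import Version


def vllm_supports_v0() -> bool:
    # The V0 path relies on the disable_log_requests engine option,
    # which is no longer accepted by vllm > 0.10.0 (see note above).
    return Version(version("vllm")) <= Version("0.10.0")
```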

@ClementeP ClementeP requested a review from DRMacIver October 27, 2025 20:38

@DRMacIver DRMacIver left a comment


As well as the comments I've added about random details, I'm afraid I'm not keen on the way you've added this.

I would like to see some tests demonstrating that this works, which also requires that the code actually be runnable (which it currently isn't).

In order to do that I think we need to be able to run both V0 and V1 in the same process, which requires not having this hard-coded at import time, e.g. by providing a flag to the constructor for AsyncVirtualLM. (Unless there's some reason why we can't run both in one process? In that case we might need different testing environments for different versions of vllm, and we should talk about how to set that up.)
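
To make that suggestion concrete, a hedged sketch of one possible shape; the keyword names (use_v1, logprobs_per_request) are hypothetical, not the PR's actual API:

```python
class AsyncVirtualLM:
    """Sketch of the reviewer's suggestion: the engine choice becomes an
    instance-level argument instead of a module-level constant read at import time."""

    def __init__(self, async_llm_engine, *, use_v1=False, logprobs_per_request=256, **kwargs):
        # use_v1 and logprobs_per_request are illustrative names only.
        self.async_llm_engine = async_llm_engine
        self.use_v1 = use_v1
        self.logprobs_per_request = logprobs_per_request
```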


What's this doing here?


Same question. This seems entirely empty.

@@ -1,3 +1,7 @@
FORCE_V0 = True #Currently, we force thw model to use V0, to switch to V1 simply set this to False


Nitpicky typo comments: thw. Also conventionally there's a space after the #


More importantly... I'm not thrilled about this hardcoded constant where you have to change the source code for any of the code you've added to be reachable.

@@ -1,3 +1,7 @@
FORCE_V0 = True #Currently, we force thw model to use V0, to switch to V1 simply set this to False
LOGPROBS_PER_REQUEST = 256 #These are th elogprobs that are retrieved currently in V1


Same comment as above RE #, also th e.
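
For clarity, here is what those two constants would look like with the nitpicks applied (same values and meaning as in the diff, only the comments cleaned up):

```python
FORCE_V0 = True  # Currently we force the model to use V0; to switch to V1, set this to False.
LOGPROBS_PER_REQUEST = 256  # Number of top logprobs retrieved per request in V1.
```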

return cls(mod, tok, **kwargs)

# @classmethod
# def from_name(cls, model_id, bitsandbytes_opts=None, hf_opts=None, **kwargs):


What's with all this commented out code?

…select either V1 or V0 by passing a flag to the constructor.
@ClementeP ClementeP marked this pull request as draft November 3, 2025 09:33
@ClementeP

I added some tests and also changed the structure: now we can switch between V0 and V1 by passing a variable to the constructor.
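
A hedged usage sketch of that change (the classmethod and flag names are assumed for illustration; the actual names are in the diff):

```python
# Both engines can now coexist in the same process (flag name assumed).
lm_v0 = AsyncVirtualLM.from_name("gpt2", use_v1=False)  # classic V0 engine
lm_v1 = AsyncVirtualLM.from_name("gpt2", use_v1=True)   # new V1 engine
```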

@DRMacIver DRMacIver changed the title from "Clemente" to "Update vllm.py to support the V1 engine" Nov 3, 2025
@DRMacIver DRMacIver marked this pull request as ready for review November 3, 2025 11:23
@codecov

codecov bot commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 97.22222% with 1 line in your changes missing coverage. Please review.

| Files with missing lines  | Patch % | Lines        |
|---------------------------|---------|--------------|
| genlm/backend/llm/vllm.py | 97.22%  | 1 Missing ⚠️ |


@ClementeP

@DRMacIver the issues should be fixed now
