Update vllm.py to support the V1 engine #53
base: main
Conversation
…hch is necessary for gpt-oss.
DRMacIver left a comment
As well as the comments I've added about random details, I'm afraid I'm not keen on the way you've added this.
I would like to see some tests demonstrating that this works, which also requires that the code actually be runnable (which it currently isn't).
In order to do that, I think we need to be able to run both V0 and V1 in the same process, which requires not having this hard-coded at import time, e.g. by providing a flag to the constructor for AsyncVirtualLM. (Unless there's some reason why we can't do that? In which case we might need different testing environments for different versions of vllm, and we should talk about how we might set that up.)
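To make the suggestion concrete, here is a minimal sketch of the kind of constructor flag I have in mind (AsyncVirtualLM is the class in this repo; the parameter names and the engine-construction stub are illustrative, not the actual API):

```python
import os


class AsyncVirtualLM:
    """Sketch only: choose the vllm engine per instance instead of at import time."""

    def __init__(self, model_name, use_v1=False, logprobs_per_request=256, **engine_opts):
        # use_v1 and logprobs_per_request are hypothetical names; the point is that
        # the choice is made per instance, so V0 and V1 can both be exercised in tests
        # without editing a module-level constant.
        self.use_v1 = use_v1
        self.logprobs_per_request = logprobs_per_request
        # VLLM_USE_V1 is the environment variable behind vllm's envs.VLLM_USE_V1 flag
        # mentioned in the PR description.
        os.environ["VLLM_USE_V1"] = "1" if use_v1 else "0"
        self._engine = self._build_engine(model_name, **engine_opts)

    def _build_engine(self, model_name, **engine_opts):
        # Placeholder: real engine construction (vllm's async engine) would go here.
        return None
```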
notes/Untitled.ipynb
Outdated
What's this doing here?
notes/playground.ipynb
Outdated
Same question. This seems entirely empty.
genlm/backend/llm/vllm.py
Outdated
@@ -1,3 +1,7 @@
FORCE_V0 = True #Currently, we force thw model to use V0, to switch to V1 simply set this to False
Nitpicky typo comments: thw. Also conventionally there's a space after the #
More importantly... I'm not thrilled about this hardcoded constant where you have to change the source code for any of the code you've added to be reachable.
genlm/backend/llm/vllm.py
Outdated
@@ -1,3 +1,7 @@
FORCE_V0 = True #Currently, we force thw model to use V0, to switch to V1 simply set this to False
LOGPROBS_PER_REQUEST = 256 #These are th elogprobs that are retrieved currently in V1
Same comment as above RE #, also th e.
genlm/backend/llm/hf.py
Outdated
return cls(mod, tok, **kwargs)

# @classmethod
# def from_name(cls, model_id, bitsandbytes_opts=None, hf_opts=None, **kwargs):
What's with all this commented out code?
…select either V1 or V0 by passing a flag to the constructor.
I added some tests, and also changed the structure: now we can switch between V0 and V1 by passing a variable to the constructor.
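For example (parameter names here follow the sketch above and are assumptions, not the exact signature), the tests can now build both variants in the same process:

```python
# Hypothetical usage; only AsyncVirtualLM is taken from the repo, the rest is assumed.
llm_v0 = AsyncVirtualLM("gpt2", use_v1=False)
llm_v1 = AsyncVirtualLM("gpt2", use_v1=True, logprobs_per_request=256)
```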
Codecov Report: ❌ Patch coverage is
Merge with Ben's merge fix on pyproject.toml
… input to next_token_logprobs in vllm.py (there is no need to consider the case where we pass a string as input)
@DRMacIver the issues should be fixed now.
I updated vllm.py to support the V1 engine. By default, the V0 engine is still used, unless FORCE_V0 is set to False and the V1 engine is available in the installed version of vllm (i.e., envs.VLLM_USE_V1 is True). Since accessing the logprobs is slow in V1, we only retrieve the logprobs for the most probable tokens; the exact number of retrieved tokens can be controlled with LOGPROBS_PER_REQUEST (256 by default). The remaining probability mass is distributed among the other tokens.
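For illustration, filling in the non-retrieved tokens could look roughly like this (a sketch, not the code in vllm.py; I'm assuming the leftover mass is spread uniformly, and the function and variable names are mine):

```python
import torch


def expand_topk_logprobs(top_ids, top_logprobs, vocab_size):
    """Turn the top-K logprobs returned by V1 into a full-vocabulary log-distribution.

    top_ids: LongTensor of K token ids; top_logprobs: Tensor of their logprobs.
    The probability mass not covered by the K retained tokens is shared
    (here: uniformly) among the other vocab_size - K tokens.
    """
    top_probs = torch.exp(top_logprobs)                # probabilities of retained tokens
    leftover = (1.0 - top_probs.sum()).clamp(min=0.0)  # mass not covered by the top K
    fill = (leftover / (vocab_size - top_ids.numel())).item()
    probs = torch.full((vocab_size,), fill, dtype=top_probs.dtype)
    probs[top_ids] = top_probs
    return torch.log(probs)
```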
IMPORTANT NOTE: in order to support the gpt-oss mxfp4 quantization (vllm has worked out a way to make it work on the A100, and currently mxfp4 is the only supported quantization), I had to update the dependencies in pyproject.toml to allow vllm 0.10.2. However, the current V0 implementation does not support vllm > 0.10.0 (the "disable_log_requests" option is not supported), which means that to use the V0 engine you should have vllm <= 0.10.0.
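If it helps, the vllm <= 0.10.0 constraint for the V0 path could also be enforced at runtime with a guard along these lines (just a sketch, not part of the PR; it assumes the packaging library is available):

```python
from importlib.metadata import version
from packaging.version import Version


def check_v0_compatible():
    # The V0 path relies on the disable_log_requests option, which is not
    # supported in vllm > 0.10.0, so fail early with a clear message.
    installed = Version(version("vllm"))
    if installed > Version("0.10.0"):
        raise RuntimeError(
            f"vllm {installed} detected: the V0 engine path requires vllm <= 0.10.0."
        )
```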