[Feature]: Avoid KV Cache and offload Model weights in RL workloads #11638
Comments
@PeterSH6 I think this is solved by the sleep mode, right? Any remaining issue here?
@youkaichao Yes, this is solved by the sleep mode. It provides verl with a significant rollout speedup. Feel free to close this issue :) Recently, we found that vLLM == 0.7.3 could cause some CPU memory leakage (probably related to sleep mode), while vLLM 0.6.3 works fine. We shall open a new issue to discuss this problem.
Feel free to open a new issue for it!
🚀 The feature, motivation and pitch
Thanks for the awesome inference library! I'm writing to request two features that would be beneficial to RL post-training workloads.
In online PPO (or GRPO, online DPO), the policy model performs auto-regressive generation (using vLLM or other inference engines) and forward + backward computation with the training infrastructure. Therefore, in the training stage, we hope to free the KV cache and even offload the model parameters stored in vLLM (as the model-parallel strategies during generation and training could be different).
Therefore, we propose two sets of APIs in the `Worker`, `GPUExecutor`, `LLMEngine`, and `LLM` classes, plus one model-init option:

`free_cache_engine()` and `init_cache_engine()`: Users can call `free_cache_engine()` on an instance of `LLM`, and the calling chain could be `LLM.free_cache_engine() -> LLMEngine.free_cache_engine() -> GPUExecutor.free_cache_engine() -> Worker.free_cache_engine()`. A similar calling chain applies to `init_cache_engine()`, where `Worker.init_cache_engine()` simply calls the existing `_init_cache_engine()` in the Worker class. After generation, the RL framework can call `llm.free_cache_engine()` to release the KV cache, and after `update_policy` it will call `llm.init_cache_engine()`. We have implemented an example in the veRL framework, which utilizes an SPMD version of vLLM ([RFC]: Fully SPMD Execution for Offline Inference #11400). A rough sketch of the proposed calling chain is shown below.
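To make the proposal concrete, here is a minimal sketch of how the two methods could be plumbed through the existing classes. It is not the actual vLLM implementation; internal attribute names such as `self.worker`, `self.model_executor`, `self.llm_engine`, `self.cache_engine`, and `self.gpu_cache` are assumptions about how the classes delegate to each other:

```python
# Hypothetical plumbing for the proposed APIs; internal attribute names are assumptions.
import gc
import torch


class Worker:
    def free_cache_engine(self):
        # Drop the KV cache tensors so their GPU memory can be reclaimed for training.
        self.cache_engine = None
        self.gpu_cache = None
        gc.collect()
        torch.cuda.empty_cache()

    def init_cache_engine(self):
        # Re-create the KV cache with the block configuration profiled at startup.
        self._init_cache_engine()


class GPUExecutor:
    def free_cache_engine(self):
        self.worker.free_cache_engine()

    def init_cache_engine(self):
        self.worker.init_cache_engine()


class LLMEngine:
    def free_cache_engine(self):
        self.model_executor.free_cache_engine()

    def init_cache_engine(self):
        self.model_executor.init_cache_engine()


class LLM:
    def free_cache_engine(self):
        self.llm_engine.free_cache_engine()

    def init_cache_engine(self):
        self.llm_engine.init_cache_engine()
```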
`offload_model_weights()`: We maintain a `self.cpu_model` in the `Worker`, and the calling chain is similar to the one above. After generation, the RL framework calls `llm.offload_model_weights()` to offload the weights to CPU and reloads them back onto the GPU in the next iteration. A sketch of the worker-side logic is given below.
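A minimal sketch of the worker-side offload/reload, assuming the worker keeps the module referenced by `self.cpu_model` while it lives in host memory; `self.model_runner.model` and the reload method name are illustrative assumptions, not vLLM's actual internals:

```python
import torch


class Worker:
    def offload_model_weights(self):
        # Move the rollout model's parameters to host memory and free their GPU storage.
        self.cpu_model = self.model_runner.model.to("cpu")
        torch.cuda.empty_cache()

    def load_model_weights(self):
        # Hypothetical counterpart: move the parameters back before the next rollout.
        self.model_runner.model = self.cpu_model.to("cuda")
        torch.cuda.empty_cache()
```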
Model init choice: Currently, the model weights are always loaded from a pre-trained checkpoint (similar to `AutoModel.from_pretrained()`). However, in RL workloads, we hope vLLM can provide an option that only initializes the model without downloading the pre-trained weights. Instead, we will later synchronize the model with an HF model outside the vLLM Engine.
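Putting the pieces together, the intended per-iteration flow from the RL framework's side could look roughly like this. It is a sketch of the workflow, not actual veRL or vLLM code; the model name, prompts, and iteration count are placeholders, and `update_policy()` / `sync_weights_to_vllm()` stand in for the training framework's own routines:

```python
# Sketch of one online-RL iteration using the proposed APIs.
# `update_policy()` and `sync_weights_to_vllm()` are hypothetical placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="my-policy-model")  # see "Potential Issues" below for the enforce_eager workaround
sampling_params = SamplingParams(temperature=1.0, max_tokens=512)
prompts = ["..."]      # rollout prompts for this iteration (placeholder)
num_iterations = 10    # placeholder

for iteration in range(num_iterations):
    # 1) Rollout: auto-regressive generation with vLLM.
    outputs = llm.generate(prompts, sampling_params)

    # 2) Free the GPU memory held by vLLM before training starts.
    llm.free_cache_engine()        # proposed API: release the KV cache
    llm.offload_model_weights()    # proposed API: move rollout weights to CPU

    # 3) Training: forward/backward with the training framework (FSDP, Megatron, ...).
    update_policy(outputs)

    # 4) Bring vLLM back for the next rollout.
    llm.init_cache_engine()        # proposed API: re-allocate the KV cache
    sync_weights_to_vllm(llm)      # push the updated HF weights into the vLLM model
```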
Potential Issues:

When using `free_cache_engine` and `offload_model_weights`, we have to disable CUDAGraph, which could reduce the generation throughput. An issue in SGLang observes a similar problem: sgl-project/sglang#2542

Currently, in veRL, we simply set `enforce_eager=True` in all settings (see the snippet below). It would be better if we could use CUDAGraph during generation while still freeing the KV cache and model weights during training!
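For reference, the current workaround amounts to constructing the engine in eager mode; the model name and memory fraction below are just example values:

```python
from vllm import LLM

# Current workaround: run vLLM in eager mode so the KV cache and weights
# can be freed/offloaded between rollouts (at the cost of CUDAGraph speedups).
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # example model
    enforce_eager=True,               # disable CUDA graph capture
    gpu_memory_utilization=0.6,       # example: leave headroom for the training framework
)
```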
Looking forward to your responses and thanks for any help!
CC @comaniac @WoosukKwon @youkaichao @happierpig
Alternatives
No response
Additional context
No response