[Feature]: Avoid KV Cache and offload Model weights in RL workloads #11638
Comments
@PeterSH6 I think this is solved by the sleep mode, right? Any remaining issue here?
@youkaichao Yes, this is solved by the sleep mode. It provides verl with a significant rollout speedup. Feel free to close this issue :) Recently, we found that vLLM == 0.7.3 could cause some CPU memory leakage (probably related to sleep mode), while vLLM 0.6.3 works fine. We shall open a new issue to discuss this problem.
Feel free to open a new issue for it!
🚀 The feature, motivation and pitch
Thanks for the awesome inference library! I'm writing to request two features that would be beneficial to RL post-training workloads.
In online PPO (or GRPO, online DPO), the policy model performs auto-regressive generation (using vLLM or other inference engines) and forward + backward computation with the training infrastructure. Therefore, in the training stage, we hope to free the KV cache and even offload the model parameters stored in vLLM (as the model-parallel strategies during generation and training could be different).
Therefore, we propose two sets of APIs in the `Worker`, `GPUExecutor`, `LLMEngine`, and `LLM` classes, plus one model-init option:

`free_cache_engine()` and `init_cache_engine()`: Users can call `free_cache_engine()` on an instance of `LLM`, and the calling chain could be `LLM.free_cache_engine() -> LLMEngine.free_cache_engine() -> GPUExecutor.free_cache_engine() -> Worker.free_cache_engine()`. A similar calling chain applies to `init_cache_engine()`, where `Worker.init_cache_engine()` simply calls the existing `_init_cache_engine()` in the Worker class. After generation, the RL framework can call `llm.free_cache_engine()` to release the KV cache, and after `update_policy` it will call `llm.init_cache_engine()`. We have implemented an example in the veRL framework, which utilizes an SPMD version of vLLM ([RFC]: Fully SPMD Execution for Offline Inference #11400). A rough sketch of the proposed calling chain is shown below.
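To make the proposal concrete, here is a minimal sketch of how the two methods could be plumbed through the existing classes. It is not the actual vLLM implementation; internal attribute names such as `self.worker`, `self.model_executor`, `self.llm_engine`, `self.cache_engine`, and `self.gpu_cache` are assumptions about how the classes delegate to each other:

```python
# Hypothetical plumbing for the proposed APIs; internal attribute names are assumptions.
import gc
import torch


class Worker:
    def free_cache_engine(self):
        # Drop the KV cache tensors so their GPU memory can be reclaimed for training.
        self.cache_engine = None
        self.gpu_cache = None
        gc.collect()
        torch.cuda.empty_cache()

    def init_cache_engine(self):
        # Re-create the KV cache with the block configuration profiled at startup.
        self._init_cache_engine()


class GPUExecutor:
    def free_cache_engine(self):
        self.worker.free_cache_engine()

    def init_cache_engine(self):
        self.worker.init_cache_engine()


class LLMEngine:
    def free_cache_engine(self):
        self.model_executor.free_cache_engine()

    def init_cache_engine(self):
        self.model_executor.init_cache_engine()


class LLM:
    def free_cache_engine(self):
        self.llm_engine.free_cache_engine()

    def init_cache_engine(self):
        self.llm_engine.init_cache_engine()
```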
`offload_model_weights()`: We maintain a `self.cpu_model` in the `Worker`, and the calling chain is similar to the one above. After generation, the RL framework calls `llm.offload_model_weights()` to offload the weights to CPU and reloads them back onto the GPU in the next iteration. A sketch of the worker-side logic is given below.
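A minimal sketch of the worker-side offload/reload, assuming the worker keeps the module referenced by `self.cpu_model` while it lives in host memory; `self.model_runner.model` and the reload method name are illustrative assumptions, not vLLM's actual internals:

```python
import torch


class Worker:
    def offload_model_weights(self):
        # Move the rollout model's parameters to host memory and free their GPU storage.
        self.cpu_model = self.model_runner.model.to("cpu")
        torch.cuda.empty_cache()

    def load_model_weights(self):
        # Hypothetical counterpart: move the parameters back before the next rollout.
        self.model_runner.model = self.cpu_model.to("cuda")
        torch.cuda.empty_cache()
```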
Model init choice: Currently, the model weights are always loaded from a pre-trained checkpoint (similar to `AutoModel.from_pretrained()`). However, in RL workloads, we hope vLLM can provide an option that only initializes the model without downloading the pre-trained weights. Instead, we will later synchronize the model with an HF model outside the vLLM Engine.
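Putting the pieces together, the intended per-iteration flow from the RL framework's side could look roughly like this. It is a sketch of the workflow, not actual veRL or vLLM code; the model name, prompts, and iteration count are placeholders, and `update_policy()` / `sync_weights_to_vllm()` stand in for the training framework's own routines:

```python
# Sketch of one online-RL iteration using the proposed APIs.
# `update_policy()` and `sync_weights_to_vllm()` are hypothetical placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="my-policy-model")  # see "Potential Issues" below for the enforce_eager workaround
sampling_params = SamplingParams(temperature=1.0, max_tokens=512)
prompts = ["..."]      # rollout prompts for this iteration (placeholder)
num_iterations = 10    # placeholder

for iteration in range(num_iterations):
    # 1) Rollout: auto-regressive generation with vLLM.
    outputs = llm.generate(prompts, sampling_params)

    # 2) Free the GPU memory held by vLLM before training starts.
    llm.free_cache_engine()        # proposed API: release the KV cache
    llm.offload_model_weights()    # proposed API: move rollout weights to CPU

    # 3) Training: forward/backward with the training framework (FSDP, Megatron, ...).
    update_policy(outputs)

    # 4) Bring vLLM back for the next rollout.
    llm.init_cache_engine()        # proposed API: re-allocate the KV cache
    sync_weights_to_vllm(llm)      # push the updated HF weights into the vLLM model
```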
Potential Issues:

When using `free_cache_engine` and `offload_model_weights`, we have to disable CUDAGraph, which could reduce the generation throughput. An issue in SGLang observes a similar problem: sgl-project/sglang#2542

Currently, in veRL, we simply set `enforce_eager=True` in all settings (see the snippet below). It would be better if we could use CUDAGraph during generation while still freeing the KV cache and model weights during training!
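For reference, the current workaround amounts to constructing the engine in eager mode; the model name and memory fraction below are just example values:

```python
from vllm import LLM

# Current workaround: run vLLM in eager mode so the KV cache and weights
# can be freed/offloaded between rollouts (at the cost of CUDAGraph speedups).
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # example model
    enforce_eager=True,               # disable CUDA graph capture
    gpu_memory_utilization=0.6,       # example: leave headroom for the training framework
)
```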
Looking forward to your responses and thanks for any help!
CC @comaniac @WoosukKwon @youkaichao @happierpig
Alternatives
No response
Additional context
No response