
[Feature] (Willing to PR) Avoid KV cache occupying GPU memory when not used #2542

fzyzcjy opened this issue Dec 22, 2024 · 35 comments

@fzyzcjy
Contributor

fzyzcjy commented Dec 22, 2024

Checklist

Motivation

Hi, thank you for the library! The use case is that, when doing online PPO, I want to use SGLang to generate LLM completions and then use RL to do gradient descent on those completions.

The problem is that, to do this on a single GPU, the timeline is "SGLang generate - Torch backward - repeat". So while Torch is doing backprop, I would like SGLang to free its KV cache memory; otherwise Torch will not have enough memory.

Thanks for any suggestions!
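
For concreteness, the desired single-GPU loop would look roughly like this; llm.pause() / llm.resume() are purely hypothetical placeholders for the requested feature (not existing SGLang APIs), and the trainer side is stubbed out:

# Hypothetical single-GPU PPO loop. pause()/resume() are placeholders for the
# requested feature, NOT existing SGLang APIs; trainer pieces are stubs.
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
sampling_params = {"temperature": 1.0, "max_new_tokens": 256}

for step in range(num_ppo_steps):       # num_ppo_steps comes from the trainer config
    prompts = sample_prompts()          # stub: draw a batch of prompts
    completions = llm.generate(prompts, sampling_params)

    llm.pause()                         # hypothetical: free KV cache so Torch can backprop

    ppo_update(completions)             # stub: Torch backward + optimizer step

    llm.resume()                        # hypothetical: re-claim the GPU memory
    sync_weights(llm)                   # stub: push the updated weights into SGLang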

Related resources

No response

@fzyzcjy fzyzcjy changed the title [Feature] Clear KV cache when not used [Feature] Avoid KV cache occupying GPU memory when not used Dec 22, 2024
@fzyzcjy fzyzcjy changed the title [Feature] Avoid KV cache occupying GPU memory when not used [Feature] (Willing to PR) Avoid KV cache occupying GPU memory when not used Dec 22, 2024
@zhaochenyang20
Collaborator

zhaochenyang20 commented Dec 22, 2024

@fzyzcjy

Thanks for pointing this out. I am working on similar things on OpenRLHF.

https://github.com/OpenRLHF/OpenRLHF

Offloading the KV cache (in fact, you would need to shut down the engine and runtime entirely) has a clear trade-off:

  • Pros:
    • It saves a lot of VRAM.
  • Cons:
    • Relaunching the engine takes a lot of time. This time comes from two parts:
      1. Saving the updated weights to disk.
      2. Relaunching the engine.

In the current design of OpenRLHF, they choose not to offload, in order to save time. In this case, the weights can be broadcast directly from the training engine to the inference engine, as shown in:

api:

@app.post("/init_weights_update_group")
async def init_weights_update_group(
    obj: InitWeightsUpdateGroupReqInput, request: Request
):
    """Initialize the parameter update group."""
    success, message = await tokenizer_manager.init_weights_update_group(obj, request)
    content = {"success": success, "message": message}
    if success:
        return ORJSONResponse(content, status_code=200)
    else:
        return ORJSONResponse(content, status_code=HTTPStatus.BAD_REQUEST)


@app.post("/update_weights_from_distributed")
async def update_weights_from_distributed(
    obj: UpdateWeightsFromDistributedReqInput, request: Request
):
    """Update model parameter from distributed online."""
    success, message = await tokenizer_manager.update_weights_from_distributed(
        obj, request
    )
    content = {"success": success, "message": message}
    if success:
        return ORJSONResponse(content, status_code=200)
    else:
        return ORJSONResponse(content, status_code=HTTPStatus.BAD_REQUEST)

test case / usage: https://github.com/sgl-project/sglang/blob/main/test/srt/test_update_weights_from_distributed.py
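
For illustration, a trainer-side client for these two endpoints might look roughly like the following; the payload fields mirror the request dataclasses above as exercised in the linked test, but treat the exact names and formats as assumptions that may differ across SGLang versions:

# Hedged sketch of a trainer-side client for the two endpoints above.
# Field names mirror InitWeightsUpdateGroupReqInput / UpdateWeightsFromDistributedReqInput
# as used in the linked test; exact names and formats may differ by version.
import requests

BASE_URL = "http://127.0.0.1:30000"

# 1. Ask the inference engine to join a weight-update process group.
resp = requests.post(f"{BASE_URL}/init_weights_update_group", json={
    "master_address": "127.0.0.1",
    "master_port": 29500,
    "rank_offset": 1,                  # training ranks come first in the group
    "world_size": 2,
    "group_name": "weight_update_group",
    "backend": "nccl",
})
assert resp.json()["success"], resp.json()["message"]

# 2. For each parameter, the training process broadcasts the tensor over the
#    group, while this request tells the engine which tensor to receive.
resp = requests.post(f"{BASE_URL}/update_weights_from_distributed", json={
    "name": "model.embed_tokens.weight",
    "dtype": "bfloat16",
    "shape": [128256, 2048],           # illustrative shape (Llama-3.2-1B embedding)
})
assert resp.json()["success"], resp.json()["message"]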

If you're interested, you are welcome to continue discussing this with me on GitHub or on our SGLang Slack.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 22, 2024

@zhaochenyang20 Hi, thank you so much! My current thought is that maybe we do not need to shut down the engine at all, so we avoid the slow engine restart. Instead, some naive thoughts:

  1. We somehow stop the KV cache from occupying GPU memory. For example, can we del the_kv_cache_torch_tensors and re-create them later? This can be done entirely inside the cache component, and no other part of sglang needs to know about it: when there are zero requests to the sglang engine, all other components should be idle and not touching the KV cache, so they will never notice that the cache tensors were removed and later re-created. (See the sketch after this list.)

2a. As for model weights, is it even possible to keep exactly one copy of the model in memory, shared by both SGLang and the Transformers model (used by TRL / OpenRLHF to do the weight updates)?

2b. Or, for model weights, can we do the same thing as proposed in 1., i.e. temporarily delete the tensors but keep everything else, and later re-create the tensors from OpenRLHF's new weights?
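
A minimal sketch of idea 1, using a toy pool class (the real SGLang pool class, attribute names, and shapes are different; this only illustrates the drop-and-recreate pattern):

# Toy sketch: drop the KV-cache buffers while the engine is idle and re-create
# them before the next generation phase. Class/attribute names are illustrative.
import torch

class ToyKVPool:
    def __init__(self, num_tokens: int, head_dim: int, dtype=torch.float16):
        self.num_tokens = num_tokens
        self.head_dim = head_dim
        self.dtype = dtype
        self.create_buffers()

    def create_buffers(self):
        self.k_buffer = torch.empty(self.num_tokens, self.head_dim,
                                    dtype=self.dtype, device="cuda")
        self.v_buffer = torch.empty(self.num_tokens, self.head_dim,
                                    dtype=self.dtype, device="cuda")

    def clear_buffers(self):
        # Nothing else should touch the pool while the engine is idle,
        # so it is safe to drop the tensors and return the memory.
        del self.k_buffer
        del self.v_buffer
        torch.cuda.empty_cache()

pool = ToyKVPool(num_tokens=1 << 20, head_dim=128)
pool.clear_buffers()     # frees ~0.5 GB (both buffers) for the trainer
pool.create_buffers()    # re-allocate before the next generation phase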

@zhaochenyang20
Collaborator

In my experience, I do not think 2 is possible. But maybe 1 is okay. Let me discuss this with my teammates. @fzyzcjy

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 23, 2024

@zhaochenyang20 Thank you! I am happy to PR and try to hack it as well.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 23, 2024

Btw, I saw the PR series OpenRLHF/OpenRLHF#614 and it looks great :) I am mostly interested in PPO/REINFORCE training with a fast inference engine.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 23, 2024

@zhaochenyang20 Quickly glancing at the code, it seems that calling

def _clear_buffers(self):

when we want to free memory (and calling _create_buffers again when we want to use sglang later) may be the relevant hook.

I will do more experiments later :)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 23, 2024

Quick experiments about devices:

import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('...')
print(model.device, model)  # cpu

model.to('cuda')
print(model.device, model)  # cuda, and GPU memory usage becomes high

model.to('meta')
torch.cuda.empty_cache()
print(model.device, model)  # meta, and GPU memory usage becomes low again

model.to_empty(device='cuda')
print(model.device, model)  # cuda, and GPU memory usage is high again

Thus, for point "2b", maybe we do not need to actually delete the tensors; we only need to do llama_model.to('meta') and later llama_model.to_empty(device='cuda') + llama_model.update_weights().

For point "1", doing this instead of deleting the whole tensors may also be an alternative approach.

@zhaochenyang20
Collaborator

@fzyzcjy Hey there. I'll connect with you through WeChat so we can have a quick discussion.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 24, 2024

I made a quick hack and it seems to work.

Code

Note: I hacked the function get_weights_by_name so that I can call it across processes for this quick experiment. In a real PR, we should of course create new, dedicated functions. Also, in a real PR, we should make _clear_buffers public, etc.

Change scheduler.py

    def get_weights_by_name(self, recv_req: GetWeightsByNameReqInput):
        print(f'hi hacky change get_weights_by_name!!! {recv_req=}')
        match recv_req.name:
            case 'hack_pause':
                self.flush_cache()
                self.token_to_kv_pool._clear_buffers()
                torch.cuda.empty_cache()
            case 'hack_resume':
                self.token_to_kv_pool._create_buffers()
            case _:
                raise NotImplementedError

        return None

Test code

import time

import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    # model_path='Qwen/Qwen2.5-0.5B-Instruct',
    enable_torch_compile=True, 
    disable_cuda_graph=True,
)
print(llm)

prompts = [
    "1+1=",
]

sampling_params = {"temperature": 0}

print('llm.generate #1')
outputs = llm.generate(prompts, sampling_params)
print(outputs)

print('pause')
llm.get_weights_by_name('hack_pause')

print('sleep for seconds...')
time.sleep(3)

print('resume')
llm.get_weights_by_name('hack_resume')

print('llm.generate #2')
outputs = llm.generate(prompts, sampling_params)
print(outputs)

Result

GPU memory:

The red color is the GPU memory occupied. As we can see, during the 3-second sleep the GPU memory usage is much lower, so we seem to have successfully freed the KV cache pool.

image

Logs:

llm.generate #1
[{'text': '2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n', 'meta_info': {'id': 'a76094bca39042658d59c2b74a15485c', 'finish_reason': {'type': 'length', 'length': 128}, 'prompt_tokens': 5, 'completion_tokens': 128, 'cached_tokens': 0}}]
pause
hi hacky change get_weights_by_name!!! recv_req=GetWeightsByNameReqInput(name='hack_pause', truncate_size=100)
sleep for seconds...
resume
hi hacky change get_weights_by_name!!! recv_req=GetWeightsByNameReqInput(name='hack_resume', truncate_size=100)
llm.generate #2
[{'text': '2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n', 'meta_info': {'id': 'bbea82eaca044204b78d57255272b5be', 'finish_reason': {'type': 'length', 'length': 128}, 'prompt_tokens': 5, 'completion_tokens': 128, 'cached_tokens': 0}}]

Caveats

  1. Cannot use CUDA graph (probably because CUDA graph requires the tensors to stay at exactly the same memory addresses).
  2. But torch.compile can still be used to speed things up.

I have not tested whether this "enable compile, disable CUDA graph" combination causes a slowdown.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 24, 2024

Another proposal: #2569

@zhaochenyang20
Collaborator

@fzyzcjy Hey, amazing experiments! Funny that get_weights_by_name was implemented by me, to check the correctness of the weight update.

(The api snippet and the test case / usage link are the same as in my earlier comment above.)

@zhaochenyang20
Collaborator

Also, regarding your proposed method: in the real implementation, we should definitely have dedicated functions like release_gpu_occupation and resume_gpu_occupation, like what you hacked into get_weights_by_name. I think the current stage is pretty cool. If you'd like to PR, I am very willing to review!!!

@fzyzcjy
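
For concreteness, the user-facing flow could look roughly like this (release_gpu_occupation / resume_gpu_occupation are the names proposed in this comment, i.e. hypothetical, not an existing SGLang API):

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
outputs = llm.generate(["1+1="], {"temperature": 0})

llm.release_gpu_occupation()   # hypothetical: free KV cache (and possibly weights)
# ... the RL trainer runs its backward pass / optimizer step here ...
llm.resume_gpu_occupation()    # hypothetical: re-allocate buffers and reload weights

outputs = llm.generate(["1+1="], {"temperature": 0})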

@zhaochenyang20
Collaborator

#2569

@fzyzcjy I also replied on this. Also, do you have time for a Zoom/Tencent meeting soon? We can have a discussion at your convenience. I assume you are in China, so morning or noon in China should work for both of us.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 24, 2024

Sure! We should have separate function names like that instead of hacking an unrelated function. I am happy to PR :)

@zhaochenyang20
Collaborator

@fzyzcjy We will PR this into OpenRLHF. Do you think TRL is also needed? We (SGLang and lmsys.org) are willing to collaborate with Hugging Face as well 😂

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 25, 2024

I guess it is up to your (SGLang and lmsys) choice. My personal thought is that many people seem to be using TRL, so it may be great to PR to it as well.

Btw, #2569 is not restricted to OpenRLHF and TRL; it is also useful for Transformers itself, since the proposal is about speeding up PreTrainedModel.generate using SGLang.

@zhaochenyang20
Collaborator

I guess it is up to your (SGLang and lmsys) choice. My personal thought is that many people seem to be using TRL, so it may be great to PR to it as well.

Btw, #2569 is not restricted to OpenRLHF and TRL; it is also useful for Transformers itself, since the proposal is about speeding up PreTrainedModel.generate using SGLang.

Pretty good! Let's discuss this later. @fzyzcjy

@merrymercy
Contributor

The major concern with this approach is that CUDA graph will be disabled. torch.compile is only useful when combined with CUDA graph; this is the current design in sglang. If you disable CUDA graph, --enable-torch-compile won't do anything.

@zhaochenyang20
Collaborator

ray job submit --address="172.31.59.18:4567" \
   --runtime-env-json='{"working_dir": "/opt/dlami/nvme/chenyang/rlhf-ckpt"}' \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --ref_num_nodes 1 \
   --ref_num_gpus_per_node 1 \
   --reward_num_nodes 1 \
   --reward_num_gpus_per_node 1 \
   --critic_num_nodes 1 \
   --critic_num_gpus_per_node 1 \
   --actor_num_nodes 1 \
   --actor_num_gpus_per_node 1 \
   --vllm_num_engines 1 \
   --vllm_tensor_parallel_size 1 \
   --colocate_critic_reward \
   --colocate_actor_ref \
   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
   --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
   --save_path /opt/dlami/nvme/chenyang/rlhf-ckpt/examples/checkpoint/llama3-8b-rlhf \
   --save_steps 100 \
   --micro_train_batch_size 16 \
   --train_batch_size 128 \
   --micro_rollout_batch_size 32 \
   --rollout_batch_size 128 \
   --max_samples 512 \
   --max_epochs 1 \
   --prompt_max_len 1024 \
   --generate_max_len 1024 \
   --zero_stage 3 \
   --bf16 \
   --actor_learning_rate 5e-7 \
   --critic_learning_rate 9e-6 \
   --init_kl_coef 0.01 \
   --prompt_data OpenRLHF/prompt-collection-v0.1 \
   --input_key context_messages \
   --apply_chat_template \
   --packing_samples \
   --normalize_reward \
   --adam_offload \
   --flash_attn \
   --gradient_checkpointing

@fzyzcjy For your reference.

When sampling with vllm:

[5] NVIDIA H100 80GB HBM3 | 29°C,   0 % | 37858 / 81559 MB | chenyang(21218M) chenyang(16530M)
[6] NVIDIA H100 80GB HBM3 | 32°C,   0 % | 47336 / 81559 MB | chenyang(31702M) chenyang(15528M)
[7] NVIDIA H100 80GB HBM3 | 29°C,   0 % | 74861 / 81559 MB | chenyang(74844M)

When doing weights update:

[5] NVIDIA H100 80GB HBM3 | 30°C,   0 % | 57074 / 81559 MB | chenyang(40434M) chenyang(16530M)
[6] NVIDIA H100 80GB HBM3 | 33°C,   0 % | 45832 / 81559 MB | chenyang(30196M) chenyang(15530M)
[7] NVIDIA H100 80GB HBM3 | 29°C,   0 % | 74861 / 81559 MB | chenyang(74844M)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

The major concern with this approach is that CUDA graph will be disabled.
torch.compile is only useful when combined with CUDA graph; this is the current design in sglang

Ah, I see. CUDA graph seems unsupported because our re-created tensors have different addresses. Some naive thoughts:

  • How much slower is it if we just disable CUDA graph?
    • We save memory but lose speed, so maybe we are still on the Pareto frontier, or maybe still saving money. For example, suppose in the 8xH100 example in [Feature] Proposal: Releasing SGLang memory when idle #2583 we manage to use only 4xH100; then as long as we are not 2x slower, we still save money overall.
  • Is it possible to do torch.compile without using CUDA graph?
    • For example, when using an HF model for SFT training, I always use torch.compile and never capture CUDA graphs, and it works well.
  • Brainstorm: Is it possible to change the torch memory allocator as follows? (See the allocator sketch after this list.)
    • Base assumption (please correct me if I am wrong!): as long as our tensors stay at the same GPU addresses, CUDA graph will be happy.
    • Let's create a custom torch memory allocator for both the HF trainer and SGLang.
    • When the SGLang generation phase ends:
      • We never release the KV cache memory. Instead, we simply tell our allocator that "this region is OK to be reused temporarily by others", and the allocator in other processes (e.g. the HF trainer) can happily use that part of GPU memory as if it had never been used.
    • When the HF training phase ends and we are back in the generation phase:
      • The memory allocator in the HF trainer must free all tensors that overlap with this physical memory region. This is possible, because the allocated activation/gradient tensors are only temporary. We can simply check and throw if that is not the case.
      • Now SGLang can continue running directly. The memory block that the KV cache tensor points to may contain garbage written by the HF trainer during the last phase, but that is no problem, since we do not care about that content.
    • P.S. This looks quite general, and we would get many things for free, e.g. proposal 2 in [Feature] Proposal: Releasing SGLang memory when idle #2583 is covered automatically.
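
As a reference point for the allocator route, PyTorch already lets you swap in a custom allocator built as a shared library (the same mechanism appears, commented out, in the example.py later in this thread); my_alloc.so and its exported my_malloc/my_free functions are assumptions standing in for an allocator that implements the reuse policy above:

import torch

# my_alloc.so is assumed to exist and to export my_malloc/my_free, which would
# implement the "lend this region to the trainer while idle" policy.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    './my_alloc.so', 'my_malloc', 'my_free'
)
# Must be called before this process makes any CUDA allocation.
torch.cuda.memory.change_current_allocator(new_alloc)

x = torch.empty(1024, device='cuda')  # now served by my_malloc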

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

(The comment above was just updated to add the "brainstorm" point.)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

Oh man I am saying almost the same thing as @merrymercy in #2588 (review)!

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 27, 2024

More brainstorms:

  • One way to implement such a memory allocator is to allocate one large contiguous chunk of memory and then hand out sub-chunks from it. For this we may need to hack torch's current caching allocator, or RMM's allocator, etc.
  • Another potential way may be to use cuMemCreate/cuMemMap/...
    • More concretely, if the only requirement of CUDA graph is that "the tensor's GPU virtual memory address must stay the same" (I have not done experiments to verify this), then when we want to temporarily release the memory, maybe we can cuMemUnmap + cuMemRelease: the virtual address is retained while the physical memory is de-allocated. Now we have physical memory available for other allocations.
    • P.S. Extra potential pros: for an unmapped block, hopefully (not tested yet) CUDA throws an error on access, so we get some error messages for free if we get something wrong. We also do not need to worry about multiprocessing issues. And it seems we do not need to touch the HF trainer at all; it can keep complex code with both long-lived tensors and temporary tensors.
  • Both ways may share a structure like: there are two sub-allocators, one for normal memory and another for "reusable" memory. We can provide APIs like with alloc_mode('normal'): ... for users to control it (see the toy sketch below), or derive it automatically from user intent.

Just quick and naive brainstorms; I have not done experiments yet.
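
A toy sketch of the with alloc_mode(...) idea mentioned above: a thread-local mode flag that a custom allocator could consult when deciding which sub-allocator to use (purely illustrative; nothing like this exists in sglang or torch today):

import contextlib
import threading

_alloc_state = threading.local()

@contextlib.contextmanager
def alloc_mode(mode: str):
    # "normal": long-lived allocations; "reusable": may be lent out while idle.
    assert mode in ('normal', 'reusable')
    prev = getattr(_alloc_state, 'mode', 'normal')
    _alloc_state.mode = mode
    try:
        yield
    finally:
        _alloc_state.mode = prev

def current_alloc_mode() -> str:
    # A custom allocator would check this to route the allocation.
    return getattr(_alloc_state, 'mode', 'normal')

with alloc_mode('reusable'):
    # e.g. allocate the KV cache here
    assert current_alloc_mode() == 'reusable'
assert current_alloc_mode() == 'normal'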

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 27, 2024

Found a little bit of time to quickly hack up an experiment for the last brainstorm.

Design

  • Use LD_PRELOAD to hijack cudaMalloc and make it do cuMemCreate + cuMemMap.
  • Then, when we want to release the memory occupation (hack_release_occupation in the code below), we simply cuMemUnmap + cuMemRelease the physical memory while keeping the virtual memory address.
  • When we want to resume the memory occupation (hack_resume_occupation below), we do cuMemCreate + cuMemMap onto the existing virtual address.

This is a very quick, hacky experiment, since it releases all physical memory. In a real design, we should of course track which exact allocations need to be released and which do not. For example, maybe we just write a custom (wrapper) allocator and let users write with alloc_mode('normal'): ... to control that.

Results

  • CUDA graph seems pretty happy with this (since our virtual addresses are unchanged even though the physical memory is temporarily released and re-allocated). There is no error.
  • Memory consumption (measured by netdata, which calls nvidia-smi IIRC) looks like this (x is time, red is consumed memory; the low-memory region in the center is caused by hack_release_occupation): image Thus it seems we did release the physical memory.

Code

example.py

import ctypes
import time
from typing import Callable

import torch

# print('change_current_allocator')
# new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
#     str(Path(__file__).parent / 'my_alloc.so'), 'my_malloc', 'my_free')
# torch.cuda.memory.change_current_allocator(new_alloc)


# TODO ok?
cdll_my_preload_so = ctypes.CDLL('./my_preload.so')
print(f'{cdll_my_preload_so=}')


def _ptr(x):
    assert isinstance(x, torch.Tensor)
    return hex(x.data_ptr())


class KVCache:
    def __init__(self):
        self.create_buffers(1)

    def create_buffers(self, value):
        # or model weights, etc
        self.kv_buffer = torch.tensor([value, value, value], dtype=torch.int32, device='cuda')
        print(f'create_buffers {_ptr(self.kv_buffer)=}')

    def clear_buffers(self):
        del self.kv_buffer

    def execute(self, arg: torch.Tensor) -> torch.Tensor:
        # print(f'KVCache.execute {arg=} {self.kv_buffer=}')
        return arg + self.kv_buffer


# https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
def create_cuda_graph(fn: Callable):
    # warmup
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        print('with torch.cuda.stream(s) execute fn')
        fn()
    torch.cuda.current_stream().wait_stream(s)

    # capture
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        print('with torch.cuda.graph(g) execute fn')
        fn()

    return g


def run():
    cache = KVCache()
    static_input = torch.zeros((3,), dtype=torch.int32, device='cuda')
    static_output = torch.zeros((3,), dtype=torch.int32, device='cuda')
    print(f'{_ptr(static_input)=} {_ptr(static_output)=}')

    def fn():
        nonlocal static_output
        static_output = cache.execute(static_input)

    g = create_cuda_graph(fn)

    print('replay #1')
    static_input[...] = 100
    g.replay()
    print(f'{static_output=}')
    assert torch.all(
        static_output == torch.tensor([101, 101, 101], dtype=torch.int32, device='cuda')), f'{static_output=}'

    # cache.clear_buffers()

    big_tensor = torch.zeros((2_000_000_000,), dtype=torch.uint8, device='cuda')
    print(f'{big_tensor=}')

    print('torch.cuda.empty_cache()')
    torch.cuda.empty_cache()

    print('sleep...')
    time.sleep(3)

    print('call hack_release_occupation')
    cdll_my_preload_so.hack_release_occupation()

    print('sleep...')
    time.sleep(3)

    # this should fail
    # print(f'{cache.kv_buffer=}')

    print('call hack_resume_occupation')
    cdll_my_preload_so.hack_resume_occupation()

    dummy = torch.zeros((3,), device='cuda')
    print(f'{_ptr(dummy)=}')

    # cache.create_buffers(2)

    cache.kv_buffer[...] = 2

    print('replay #2')
    static_input[...] = 200
    g.replay()
    print(f'{static_output=}')
    assert torch.all(
        static_output == torch.tensor([202, 202, 202], dtype=torch.int32, device='cuda')), f'{static_output=}'

    print('sleep...')
    time.sleep(3)

    print(f'{big_tensor=}')
    print(f'{dummy=}')


if __name__ == '__main__':
    run()

my_preload.cc

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <dlfcn.h>
#include <map>

// https://stackoverflow.com/questions/6083337/overriding-malloc-using-the-ld-preload-mechanism
static cudaError_t (*real_cudaMalloc)(void**, size_t) = NULL;
static cudaError_t (*real_cudaFree)(void*) = NULL;

struct MyInfo {
    CUmemGenericAllocationHandle allocHandle;
    size_t size;
};

int currentDev = 0; // HACK

std::map<void*, MyInfo> info_of_ptr_map;

static void my_init(void) {
    real_cudaMalloc = (cudaError_t (*)(void**, size_t)) dlsym(RTLD_NEXT, "cudaMalloc");
    if (NULL == real_cudaMalloc) {
        fprintf(stderr, "Error in `dlsym`: %s\n", dlerror());
    }

    real_cudaFree = (cudaError_t (*)(void*)) dlsym(RTLD_NEXT, "cudaFree");
    if (NULL == real_cudaFree) {
        fprintf(stderr, "Error in `dlsym`: %s\n", dlerror());
    }
}

void mem_create(CUmemGenericAllocationHandle *allocHandle, size_t size) {
//    size_t granularity = 0;
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = currentDev;
//    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
//    padded_size = ROUND_UP(size, granularity);
    cuMemCreate(allocHandle, size, &prop, 0);
}

void mem_set_access(void* devPtr, size_t size) {
    CUmemAccessDesc accessDesc = {};
    accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    accessDesc.location.id = currentDev;
    accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess((CUdeviceptr)devPtr, size, &accessDesc, 1);
}

cudaError_t cudaMalloc(void** devPtr, size_t size) {
    if (real_cudaMalloc == NULL) {
        my_init();
    }

//    cudaError_t ret = real_cudaMalloc(devPtr, size);
//    std::cout << "[my_preload.cc] cudaMalloc" << " devPtr=" << devPtr << " size=" << size << " ret=" << ret << std::endl;
//    return ret;

    CUmemGenericAllocationHandle allocHandle;
    mem_create(&allocHandle, size);

    /* Reserve a virtual address range */
    cuMemAddressReserve((CUdeviceptr*)devPtr, size, 0, 0, 0);
    /* Map the virtual address range
     * to the physical allocation */
    cuMemMap((CUdeviceptr)*devPtr, size, 0, allocHandle, 0);

    mem_set_access(*devPtr, size);

    info_of_ptr_map[*devPtr] = MyInfo { allocHandle, size };

    std::cout << "[my_preload.cc] cudaMalloc"
        << " devPtr=" << devPtr << " size=" << size
        << " allocHandle=" << allocHandle
        << std::endl;

    return cudaSuccess;
}

cudaError_t cudaFree(void* devPtr) {
    if (real_cudaFree == NULL) {
        my_init();
    }

//    cudaError_t ret = real_cudaFree(devPtr);
//    std::cout << "[my_preload.cc] cudaFree" << " devPtr=" << devPtr << " ret=" << ret << std::endl;
//    return ret;

    MyInfo info = info_of_ptr_map[devPtr];
    info_of_ptr_map.erase(devPtr);

    cuMemUnmap((CUdeviceptr)devPtr, info.size);
    cuMemRelease(info.allocHandle);
    cuMemAddressFree((CUdeviceptr)devPtr, info.size);

    std::cout << "[my_preload.cc] cudaFree"
        << " devPtr=" << devPtr << " info.size=" << info.size
        << " info.allocHandle=" << info.allocHandle
        << std::endl;

    return cudaSuccess;
}

extern "C" {
    void hack_release_occupation() {
        for (auto it = info_of_ptr_map.begin(); it != info_of_ptr_map.end(); ++it) {
            void* devPtr = it->first;
            MyInfo info = it->second;
            cuMemUnmap((CUdeviceptr)devPtr, info.size);
            cuMemRelease(info.allocHandle);

            std::cout << "[my_preload.cc] hack_release_occupation"
                << " devPtr=" << devPtr << " info.size=" << info.size << " info.allocHandle=" << info.allocHandle
                << std::endl;
        }
    }

    void hack_resume_occupation() {
        for (auto it = info_of_ptr_map.begin(); it != info_of_ptr_map.end(); ++it) {
            void* devPtr = it->first;
            MyInfo &info = it->second;

            CUmemGenericAllocationHandle newAllocHandle;
            mem_create(&newAllocHandle, info.size);

            cuMemMap((CUdeviceptr)devPtr, info.size, 0, newAllocHandle, 0);

            mem_set_access(devPtr, info.size);

            std::cout << "[my_preload.cc] hack_resume_occupation"
                << " devPtr=" << devPtr << " info.size=" << info.size << " (old)info.allocHandle=" << info.allocHandle
                << " (new)newAllocHandle=" << newAllocHandle
                << std::endl;

            info.allocHandle = newAllocHandle;
        }
    }
}

build command

g++ my_preload.cc -o my_preload.so -shared -fPIC -lcuda -I/usr/local/cuda/include

run command

LD_PRELOAD=./my_preload.so python3 example.py

@zhaochenyang20
Collaborator

@fzyzcjy Sorry for the late reply; I have had some things to deal with these days. If I do not reply within one day, please ping me again on WeChat. Have a good day!

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 28, 2024

@zhaochenyang20 No worries! I spent a bit more time digging into the approach above and it seems to work well for SGLang: the KV cache is released and CUDA graph is still enabled. I will do more experiments to check it further when I have another bit of time.

Also, theoretically speaking, this approach should make "release model weights" (again, with CUDA graph still usable) very easy to implement (only two lines of code to add). But I will need to implement an "update model weights from the same card" feature to test that.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 28, 2024

Found another bit of time to submit the PR: #2630.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 28, 2024

Update: By adding a single line, the memory occupied when SGLang is in paused mode further drops from 4.5GB to 3.4GB for the 1B demo model. I have not checked how large the saving is for other models, but it is almost a free lunch since it is only one line of code.

image

(Debugging the logic for model weights now.)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

Update: Model weights seem to be successfully released. Now, when in paused mode, SGLang only takes 1.1GB of memory for the Llama 1B model.

image

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

Summary: When in paused mode,

  • KV cache is released: 22GB -> 4.5GB (expect the saving to be much larger on an 80GB card)
  • ReqToTokenPool.req_to_token is released: 4.5GB -> 3.4GB
  • Model weights are released: 3.4GB -> 1.1GB (expect ~16GB for an 8B model)

And a (simple) correctness check is done: https://github.com/sgl-project/sglang/pull/2630/files#diff-4f475f1badc32fc2578207bded162aac1c915c5f9d28a8e281c1c7d20cb6dd87 - after release/resume, the generated content is correct.

(The numbers above are from my 4090D card, but the idea is similar; I will do more experiments later.)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

Btw, some scenarios come to mind where this may be especially powerful for SGLang users: when the generation phase in RL is especially heavy, for example doing a lot of rollouts or MCTS searches at generation time.

@merrymercy
Contributor

merrymercy commented Dec 29, 2024

This is pretty cool! If possible, we can offload both weights and the KV cache.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

@merrymercy Yes, that has already been done!

(EDIT: I mean just throwing them away instead of moving them to CPU, since moving is slower.)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 30, 2024

Quick update: On a standard 3xH100 setting (1xH100 for actor+ref, 1xH100 for critic+reward, 1xH100 for sglang), SGLang's memory can be released when paused and drops to 2.5GB (instead of 72.2GB). I will check the 2xH100 setting (not giving SGLang an extra GPU) as well as comparison tests later.

image

Metrics for very short runs are as follows: (Again, correctness has not been verified)

image

@zhaochenyang20
Collaborator

Quick update: On a standard 3xH100 setting (1xH100 for actor+ref, 1xH100 for critic+reward, 1xH100 for sglang), SGLang's memory can be released when paused and drops to 2.5GB (instead of 72.2GB). I will check the 2xH100 setting (not giving SGLang an extra GPU) as well as comparison tests later.

Metrics for very short runs are as follows: (Again, correctness has not been verified)

Great job. Ping me on Slack if needed!
