
[Feature] (Willing to PR) Avoid KV cache occupying GPU memory when not used #2542

fzyzcjy opened this issue Dec 22, 2024 · 35 comments

@fzyzcjy
Contributor

fzyzcjy commented Dec 22, 2024

Checklist

Motivation

Hi, thank you for the library! The use case is that, when doing online PPO, I want to use SGLang to generate LLM completions and then use RL to do gradient descent on those completions.

The problem is that, to do this on a single GPU, the timeline is "SGLang generate - Torch backward - repeat". So while Torch is doing backprop, I would like SGLang to free its KV cache memory; otherwise Torch will not have enough memory.

Thanks for any suggestions!
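
For concreteness, the desired single-GPU loop would look roughly like this; llm.pause() / llm.resume() are purely hypothetical placeholders for the requested feature (not existing SGLang APIs), and the trainer side is stubbed out:

# Hypothetical single-GPU PPO loop. pause()/resume() are placeholders for the
# requested feature, NOT existing SGLang APIs; trainer pieces are stubs.
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
sampling_params = {"temperature": 1.0, "max_new_tokens": 256}

for step in range(num_ppo_steps):       # num_ppo_steps comes from the trainer config
    prompts = sample_prompts()          # stub: draw a batch of prompts
    completions = llm.generate(prompts, sampling_params)

    llm.pause()                         # hypothetical: free KV cache so Torch can backprop

    ppo_update(completions)             # stub: Torch backward + optimizer step

    llm.resume()                        # hypothetical: re-claim the GPU memory
    sync_weights(llm)                   # stub: push the updated weights into SGLang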

Related resources

No response

@fzyzcjy fzyzcjy changed the title [Feature] Clear KV cache when not used [Feature] Avoid KV cache occupying GPU memory when not used Dec 22, 2024
@fzyzcjy fzyzcjy changed the title [Feature] Avoid KV cache occupying GPU memory when not used [Feature] (Willing to PR) Avoid KV cache occupying GPU memory when not used Dec 22, 2024
@zhaochenyang20
Collaborator

zhaochenyang20 commented Dec 22, 2024

@fzyzcjy

Thanks for pointing this out. I am working on similar things on OpenRLHF.

https://github.com/OpenRLHF/OpenRLHF

Offloading the KV cache (in fact, you would need to shut down the engine and runtime entirely) has a clear trade-off:

  • Pros:
    • It saves a lot of VRAM.
  • Cons:
    • Relaunching the engine takes a lot of time. This time comes from two parts:
      1. Saving the updated weights to disk.
      2. Relaunching the engine.

In the current design of OpenRLHF, they choose not to offload, in order to save time. In this case, the weights can be broadcast directly from the training engine to the inference engine, as shown in:

api:

@app.post("/init_weights_update_group")
async def init_weights_update_group(
    obj: InitWeightsUpdateGroupReqInput, request: Request
):
    """Initialize the parameter update group."""
    success, message = await tokenizer_manager.init_weights_update_group(obj, request)
    content = {"success": success, "message": message}
    if success:
        return ORJSONResponse(content, status_code=200)
    else:
        return ORJSONResponse(content, status_code=HTTPStatus.BAD_REQUEST)


@app.post("/update_weights_from_distributed")
async def update_weights_from_distributed(
    obj: UpdateWeightsFromDistributedReqInput, request: Request
):
    """Update model parameter from distributed online."""
    success, message = await tokenizer_manager.update_weights_from_distributed(
        obj, request
    )
    content = {"success": success, "message": message}
    if success:
        return ORJSONResponse(content, status_code=200)
    else:
        return ORJSONResponse(content, status_code=HTTPStatus.BAD_REQUEST)

test case / usage: https://github.com/sgl-project/sglang/blob/main/test/srt/test_update_weights_from_distributed.py
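
For illustration, a trainer-side client for these two endpoints might look roughly like the following; the payload fields mirror the request dataclasses above as exercised in the linked test, but treat the exact names and formats as assumptions that may differ across SGLang versions:

# Hedged sketch of a trainer-side client for the two endpoints above.
# Field names mirror InitWeightsUpdateGroupReqInput / UpdateWeightsFromDistributedReqInput
# as used in the linked test; exact names and formats may differ by version.
import requests

BASE_URL = "http://127.0.0.1:30000"

# 1. Ask the inference engine to join a weight-update process group.
resp = requests.post(f"{BASE_URL}/init_weights_update_group", json={
    "master_address": "127.0.0.1",
    "master_port": 29500,
    "rank_offset": 1,                  # training ranks come first in the group
    "world_size": 2,
    "group_name": "weight_update_group",
    "backend": "nccl",
})
assert resp.json()["success"], resp.json()["message"]

# 2. For each parameter, the training process broadcasts the tensor over the
#    group, while this request tells the engine which tensor to receive.
resp = requests.post(f"{BASE_URL}/update_weights_from_distributed", json={
    "name": "model.embed_tokens.weight",
    "dtype": "bfloat16",
    "shape": [128256, 2048],           # illustrative shape (Llama-3.2-1B embedding)
})
assert resp.json()["success"], resp.json()["message"]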

If you're interested, you are welcome to continue discussing this with me on GitHub or on our SGLang Slack.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 22, 2024

@zhaochenyang20 Hi, thank you so much! My current thought is that maybe we do not need to shut down the engine at all, so we avoid the slow engine restart. Instead, some naive thoughts:

  1. We somehow stop the KV cache from occupying GPU memory. For example, can we del the_kv_cache_torch_tensors and re-create them later? This can be done entirely inside the cache component, and no other part of sglang needs to know about it: when there are zero requests to the sglang engine, all other components should be idle and not touching the KV cache, so they will never notice that the cache tensors were removed and later re-created. (See the sketch after this list.)

2a. As for model weights, is it even possible to keep exactly one copy of the model in memory, shared by both SGLang and the Transformers model (used by TRL / OpenRLHF to do the weight updates)?

2b. Or, for model weights, can we do the same thing as proposed in 1., i.e. temporarily delete the tensors but keep everything else, and later re-create the tensors from OpenRLHF's new weights?
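
A minimal sketch of idea 1, using a toy pool class (the real SGLang pool class, attribute names, and shapes are different; this only illustrates the drop-and-recreate pattern):

# Toy sketch: drop the KV-cache buffers while the engine is idle and re-create
# them before the next generation phase. Class/attribute names are illustrative.
import torch

class ToyKVPool:
    def __init__(self, num_tokens: int, head_dim: int, dtype=torch.float16):
        self.num_tokens = num_tokens
        self.head_dim = head_dim
        self.dtype = dtype
        self.create_buffers()

    def create_buffers(self):
        self.k_buffer = torch.empty(self.num_tokens, self.head_dim,
                                    dtype=self.dtype, device="cuda")
        self.v_buffer = torch.empty(self.num_tokens, self.head_dim,
                                    dtype=self.dtype, device="cuda")

    def clear_buffers(self):
        # Nothing else should touch the pool while the engine is idle,
        # so it is safe to drop the tensors and return the memory.
        del self.k_buffer
        del self.v_buffer
        torch.cuda.empty_cache()

pool = ToyKVPool(num_tokens=1 << 20, head_dim=128)
pool.clear_buffers()     # frees ~0.5 GB (both buffers) for the trainer
pool.create_buffers()    # re-allocate before the next generation phase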

@zhaochenyang20
Collaborator

In my experience, I do not think 2 is possible. But maybe 1 is okay. Let me discuss this with my teammates. @fzyzcjy

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 23, 2024

@zhaochenyang20 Thank you! I am happy to PR and try to hack it as well.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 23, 2024

Btw, I saw the PR series OpenRLHF/OpenRLHF#614 and it looks great :) I am mostly interested in PPO/REINFORCE training with a fast inference engine.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 23, 2024

@zhaochenyang20 Quickly glancing at the code, it seems that calling

def _clear_buffers(self):

when we want to free memory (and calling _create_buffers again when we want to use sglang later) may be the relevant hook.

I will do more experiments later :)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 23, 2024

Quick experiments about devices:

import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('...')
print(model.device, model)  # cpu

model.to('cuda')
print(model.device, model)  # cuda, and GPU memory usage becomes high

model.to('meta')
torch.cuda.empty_cache()
print(model.device, model)  # meta, and GPU memory usage becomes low again

model.to_empty(device='cuda')
print(model.device, model)  # cuda, and GPU memory usage is high again

Thus, for point "2b", maybe we do not need to actually delete the tensors; we only need to do llama_model.to('meta') and later llama_model.to_empty(device='cuda') + llama_model.update_weights().

For point "1", doing this instead of deleting the whole tensors may also be an alternative approach.

@zhaochenyang20
Collaborator

@fzyzcjy Hey there. I'll connect with you through WeChat so we can have a quick discussion.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 24, 2024

I made a quick hack and it seems to work.

Code

Note: I hacked the function get_weights_by_name so that I can call it across processes for this quick experiment. In a real PR, we should of course create new, dedicated functions. Also, in a real PR, we should make _clear_buffers public, etc.

Change scheduler.py

    def get_weights_by_name(self, recv_req: GetWeightsByNameReqInput):
        print(f'hi hacky change get_weights_by_name!!! {recv_req=}')
        match recv_req.name:
            case 'hack_pause':
                self.flush_cache()
                self.token_to_kv_pool._clear_buffers()
                torch.cuda.empty_cache()
            case 'hack_resume':
                self.token_to_kv_pool._create_buffers()
            case _:
                raise NotImplementedError

        return None

Test code

import time

import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    # model_path='Qwen/Qwen2.5-0.5B-Instruct',
    enable_torch_compile=True, 
    disable_cuda_graph=True,
)
print(llm)

prompts = [
    "1+1=",
]

sampling_params = {"temperature": 0}

print('llm.generate #1')
outputs = llm.generate(prompts, sampling_params)
print(outputs)

print('pause')
llm.get_weights_by_name('hack_pause')

print('sleep for seconds...')
time.sleep(3)

print('resume')
llm.get_weights_by_name('hack_resume')

print('llm.generate #2')
outputs = llm.generate(prompts, sampling_params)
print(outputs)

Result

GPU memory:

The red color is the GPU memory occupied. As we can see, during the 3-second sleep the GPU memory usage is much lower, so we seem to have successfully freed the KV cache pool.

image

Logs:

llm.generate #1
[{'text': '2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n', 'meta_info': {'id': 'a76094bca39042658d59c2b74a15485c', 'finish_reason': {'type': 'length', 'length': 128}, 'prompt_tokens': 5, 'completion_tokens': 128, 'cached_tokens': 0}}]
pause
hi hacky change get_weights_by_name!!! recv_req=GetWeightsByNameReqInput(name='hack_pause', truncate_size=100)
sleep for seconds...
resume
hi hacky change get_weights_by_name!!! recv_req=GetWeightsByNameReqInput(name='hack_resume', truncate_size=100)
llm.generate #2
[{'text': '2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n1+1=2\n', 'meta_info': {'id': 'bbea82eaca044204b78d57255272b5be', 'finish_reason': {'type': 'length', 'length': 128}, 'prompt_tokens': 5, 'completion_tokens': 128, 'cached_tokens': 0}}]

Caveats

  1. Cannot use CUDA graph (probably because CUDA graph requires the tensors to stay at exactly the same memory addresses).
  2. But torch.compile can still be used to speed things up.

I have not tested whether this "enable compile, disable CUDA graph" combination causes a slowdown.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 24, 2024

Another proposal: #2569

@zhaochenyang20
Collaborator

@fzyzcjy Hey, amazing experiments! Funny that get_weights_by_name was implemented by me, to check the correctness of the weight update.

(The api snippet and the test case / usage link are the same as in my earlier comment above.)

@zhaochenyang20
Collaborator

Also, regarding your proposed method: in the real implementation, we should definitely have dedicated functions like release_gpu_occupation and resume_gpu_occupation, like what you hacked into get_weights_by_name. I think the current stage is pretty cool. If you'd like to PR, I am very willing to review!!!

@fzyzcjy
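
For concreteness, the user-facing flow could look roughly like this (release_gpu_occupation / resume_gpu_occupation are the names proposed in this comment, i.e. hypothetical, not an existing SGLang API):

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
outputs = llm.generate(["1+1="], {"temperature": 0})

llm.release_gpu_occupation()   # hypothetical: free KV cache (and possibly weights)
# ... the RL trainer runs its backward pass / optimizer step here ...
llm.resume_gpu_occupation()    # hypothetical: re-allocate buffers and reload weights

outputs = llm.generate(["1+1="], {"temperature": 0})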

@zhaochenyang20
Collaborator

#2569

@fzyzcjy I also replied on this. Also, do you have time for a Zoom/Tencent meeting soon? We can have a discussion at your convenience. I assume you are in China, so morning or noon in China should work for both of us.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 24, 2024

Sure! We should have separate function names like that instead of hacking an unrelated function. I am happy to PR :)

@zhaochenyang20
Collaborator

@fzyzcjy We will PR this into OpenRLHF. Do you think TRL is also needed? We (SGLang and lmsys.org) are willing to collaborate with Hugging Face as well 😂

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 25, 2024

I guess it is up to your (SGLang and lmsys) choice. My personal thought is that many people seem to be using TRL, so it may be great to PR to it as well.

Btw, #2569 is not restricted to OpenRLHF and TRL; it is also useful for Transformers itself, since the proposal is about speeding up PreTrainedModel.generate using SGLang.

@zhaochenyang20
Collaborator

I guess it is up to your (SGLang and lmsys) choice. My personal thought is that many people seem to be using TRL, so it may be great to PR to it as well.

Btw, #2569 is not restricted to OpenRLHF and TRL; it is also useful for Transformers itself, since the proposal is about speeding up PreTrainedModel.generate using SGLang.

Pretty good! Let's discuss this later. @fzyzcjy

@merrymercy
Contributor

The major concern with this approach is that CUDA graph will be disabled. torch.compile is only useful when combined with CUDA graph; this is the current design in sglang. If you disable CUDA graph, --enable-torch-compile won't do anything.

@zhaochenyang20
Collaborator

ray job submit --address="172.31.59.18:4567" \
   --runtime-env-json='{"working_dir": "/opt/dlami/nvme/chenyang/rlhf-ckpt"}' \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --ref_num_nodes 1 \
   --ref_num_gpus_per_node 1 \
   --reward_num_nodes 1 \
   --reward_num_gpus_per_node 1 \
   --critic_num_nodes 1 \
   --critic_num_gpus_per_node 1 \
   --actor_num_nodes 1 \
   --actor_num_gpus_per_node 1 \
   --vllm_num_engines 1 \
   --vllm_tensor_parallel_size 1 \
   --colocate_critic_reward \
   --colocate_actor_ref \
   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
   --reward_pretrain OpenRLHF/Llama-3-8b-rm-mixture \
   --save_path /opt/dlami/nvme/chenyang/rlhf-ckpt/examples/checkpoint/llama3-8b-rlhf \
   --save_steps 100 \
   --micro_train_batch_size 16 \
   --train_batch_size 128 \
   --micro_rollout_batch_size 32 \
   --rollout_batch_size 128 \
   --max_samples 512 \
   --max_epochs 1 \
   --prompt_max_len 1024 \
   --generate_max_len 1024 \
   --zero_stage 3 \
   --bf16 \
   --actor_learning_rate 5e-7 \
   --critic_learning_rate 9e-6 \
   --init_kl_coef 0.01 \
   --prompt_data OpenRLHF/prompt-collection-v0.1 \
   --input_key context_messages \
   --apply_chat_template \
   --packing_samples \
   --normalize_reward \
   --adam_offload \
   --flash_attn \
   --gradient_checkpointing

@fzyzcjy For your reference.

When sampling with vllm:

[5] NVIDIA H100 80GB HBM3 | 29°C,   0 % | 37858 / 81559 MB | chenyang(21218M) chenyang(16530M)
[6] NVIDIA H100 80GB HBM3 | 32°C,   0 % | 47336 / 81559 MB | chenyang(31702M) chenyang(15528M)
[7] NVIDIA H100 80GB HBM3 | 29°C,   0 % | 74861 / 81559 MB | chenyang(74844M)

When doing weights update:

[5] NVIDIA H100 80GB HBM3 | 30°C,   0 % | 57074 / 81559 MB | chenyang(40434M) chenyang(16530M)
[6] NVIDIA H100 80GB HBM3 | 33°C,   0 % | 45832 / 81559 MB | chenyang(30196M) chenyang(15530M)
[7] NVIDIA H100 80GB HBM3 | 29°C,   0 % | 74861 / 81559 MB | chenyang(74844M)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

The major concern with this approach is that CUDA graph will be disabled.
torch.compile is only useful when combined with CUDA graph; this is the current design in sglang

Ah, I see. CUDA graph seems unsupported because our re-created tensors have different addresses. Some naive thoughts:

  • How much slower is it if we just disable CUDA graph?
    • We save memory but lose speed, so maybe we are still on the Pareto frontier, or maybe still saving money. For example, suppose in the 8xH100 example in [Feature] Proposal: Releasing SGLang memory when idle #2583 we manage to use only 4xH100; then as long as we are not 2x slower, we still save money overall.
  • Is it possible to do torch.compile without using CUDA graph?
    • For example, when using an HF model for SFT training, I always use torch.compile and never capture CUDA graphs, and it works well.
  • Brainstorm: Is it possible to change the torch memory allocator as follows? (See the allocator sketch after this list.)
    • Base assumption (please correct me if I am wrong!): as long as our tensors stay at the same GPU addresses, CUDA graph will be happy.
    • Let's create a custom torch memory allocator for both the HF trainer and SGLang.
    • When the SGLang generation phase ends:
      • We never release the KV cache memory. Instead, we simply tell our allocator that "this region is OK to be reused temporarily by others", and the allocator in other processes (e.g. the HF trainer) can happily use that part of GPU memory as if it had never been used.
    • When the HF training phase ends and we are back in the generation phase:
      • The memory allocator in the HF trainer must free all tensors that overlap with this physical memory region. This is possible, because the allocated activation/gradient tensors are only temporary. We can simply check and throw if that is not the case.
      • Now SGLang can continue running directly. The memory block that the KV cache tensor points to may contain garbage written by the HF trainer during the last phase, but that is no problem, since we do not care about that content.
    • P.S. This looks quite general, and we would get many things for free, e.g. proposal 2 in [Feature] Proposal: Releasing SGLang memory when idle #2583 is covered automatically.
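
As a reference point for the allocator route, PyTorch already lets you swap in a custom allocator built as a shared library (the same mechanism appears, commented out, in the example.py later in this thread); my_alloc.so and its exported my_malloc/my_free functions are assumptions standing in for an allocator that implements the reuse policy above:

import torch

# my_alloc.so is assumed to exist and to export my_malloc/my_free, which would
# implement the "lend this region to the trainer while idle" policy.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    './my_alloc.so', 'my_malloc', 'my_free'
)
# Must be called before this process makes any CUDA allocation.
torch.cuda.memory.change_current_allocator(new_alloc)

x = torch.empty(1024, device='cuda')  # now served by my_malloc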

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

(The comment above was just updated to add the "brainstorm" point.)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 26, 2024

Oh man I am saying almost the same thing as @merrymercy in #2588 (review)!

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 27, 2024

More brainstorms:

  • One way to implement such a memory allocator is to allocate one large contiguous chunk of memory and then hand out sub-chunks from it. For this we may need to hack torch's current caching allocator, or RMM's allocator, etc.
  • Another potential way may be to use cuMemCreate/cuMemMap/...
    • More concretely, if the only requirement of CUDA graph is that "the tensor's GPU virtual memory address must stay the same" (I have not done experiments to verify this), then when we want to temporarily release the memory, maybe we can cuMemUnmap + cuMemRelease: the virtual address is retained while the physical memory is de-allocated. Now we have physical memory available for other allocations.
    • P.S. Extra potential pros: for an unmapped block, hopefully (not tested yet) CUDA throws an error on access, so we get some error messages for free if we get something wrong. We also do not need to worry about multiprocessing issues. And it seems we do not need to touch the HF trainer at all; it can keep complex code with both long-lived tensors and temporary tensors.
  • Both ways may share a structure like: there are two sub-allocators, one for normal memory and another for "reusable" memory. We can provide APIs like with alloc_mode('normal'): ... for users to control it (see the toy sketch below), or derive it automatically from user intent.

Just quick and naive brainstorms; I have not done experiments yet.
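
A toy sketch of the with alloc_mode(...) idea mentioned above: a thread-local mode flag that a custom allocator could consult when deciding which sub-allocator to use (purely illustrative; nothing like this exists in sglang or torch today):

import contextlib
import threading

_alloc_state = threading.local()

@contextlib.contextmanager
def alloc_mode(mode: str):
    # "normal": long-lived allocations; "reusable": may be lent out while idle.
    assert mode in ('normal', 'reusable')
    prev = getattr(_alloc_state, 'mode', 'normal')
    _alloc_state.mode = mode
    try:
        yield
    finally:
        _alloc_state.mode = prev

def current_alloc_mode() -> str:
    # A custom allocator would check this to route the allocation.
    return getattr(_alloc_state, 'mode', 'normal')

with alloc_mode('reusable'):
    # e.g. allocate the KV cache here
    assert current_alloc_mode() == 'reusable'
assert current_alloc_mode() == 'normal'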

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 27, 2024

Found a little bit of time to quickly hack up an experiment for the last brainstorm.

Design

  • Use LD_PRELOAD to hijack cudaMalloc and make it do cuMemCreate + cuMemMap.
  • Then, when we want to release the memory occupation (hack_release_occupation in the code below), we simply cuMemUnmap + cuMemRelease the physical memory while keeping the virtual memory address.
  • When we want to resume the memory occupation (hack_resume_occupation below), we do cuMemCreate + cuMemMap onto the existing virtual address.

This is a very quick, hacky experiment, since it releases all physical memory. In a real design, we should of course track which exact allocations need to be released and which do not. For example, maybe we just write a custom (wrapper) allocator and let users write with alloc_mode('normal'): ... to control that.

Results

  • CUDA graph seems pretty happy with this (since our virtual addresses are unchanged even though the physical memory is temporarily released and re-allocated). There is no error.
  • Memory consumption (measured by netdata, which calls nvidia-smi IIRC) looks like this (x is time, red is consumed memory; the low-memory region in the center is caused by hack_release_occupation): image Thus it seems we did release the physical memory.

Code

example.py

import ctypes
import time
from typing import Callable

import torch

# print('change_current_allocator')
# new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
#     str(Path(__file__).parent / 'my_alloc.so'), 'my_malloc', 'my_free')
# torch.cuda.memory.change_current_allocator(new_alloc)


# TODO ok?
cdll_my_preload_so = ctypes.CDLL('./my_preload.so')
print(f'{cdll_my_preload_so=}')


def _ptr(x):
    assert isinstance(x, torch.Tensor)
    return hex(x.data_ptr())


class KVCache:
    def __init__(self):
        self.create_buffers(1)

    def create_buffers(self, value):
        # or model weights, etc
        self.kv_buffer = torch.tensor([value, value, value], dtype=torch.int32, device='cuda')
        print(f'create_buffers {_ptr(self.kv_buffer)=}')

    def clear_buffers(self):
        del self.kv_buffer

    def execute(self, arg: torch.Tensor) -> torch.Tensor:
        # print(f'KVCache.execute {arg=} {self.kv_buffer=}')
        return arg + self.kv_buffer


# https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
def create_cuda_graph(fn: Callable):
    # warmup
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        print('with torch.cuda.stream(s) execute fn')
        fn()
    torch.cuda.current_stream().wait_stream(s)

    # capture
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        print('with torch.cuda.graph(g) execute fn')
        fn()

    return g


def run():
    cache = KVCache()
    static_input = torch.zeros((3,), dtype=torch.int32, device='cuda')
    static_output = torch.zeros((3,), dtype=torch.int32, device='cuda')
    print(f'{_ptr(static_input)=} {_ptr(static_output)=}')

    def fn():
        nonlocal static_output
        static_output = cache.execute(static_input)

    g = create_cuda_graph(fn)

    print('replay #1')
    static_input[...] = 100
    g.replay()
    print(f'{static_output=}')
    assert torch.all(
        static_output == torch.tensor([101, 101, 101], dtype=torch.int32, device='cuda')), f'{static_output=}'

    # cache.clear_buffers()

    big_tensor = torch.zeros((2_000_000_000,), dtype=torch.uint8, device='cuda')
    print(f'{big_tensor=}')

    print('torch.cuda.empty_cache()')
    torch.cuda.empty_cache()

    print('sleep...')
    time.sleep(3)

    print('call hack_release_occupation')
    cdll_my_preload_so.hack_release_occupation()

    print('sleep...')
    time.sleep(3)

    # this should fail
    # print(f'{cache.kv_buffer=}')

    print('call hack_resume_occupation')
    cdll_my_preload_so.hack_resume_occupation()

    dummy = torch.zeros((3,), device='cuda')
    print(f'{_ptr(dummy)=}')

    # cache.create_buffers(2)

    cache.kv_buffer[...] = 2

    print('replay #2')
    static_input[...] = 200
    g.replay()
    print(f'{static_output=}')
    assert torch.all(
        static_output == torch.tensor([202, 202, 202], dtype=torch.int32, device='cuda')), f'{static_output=}'

    print('sleep...')
    time.sleep(3)

    print(f'{big_tensor=}')
    print(f'{dummy=}')


if __name__ == '__main__':
    run()

my_preload.cc

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <dlfcn.h>
#include <map>

// https://stackoverflow.com/questions/6083337/overriding-malloc-using-the-ld-preload-mechanism
static cudaError_t (*real_cudaMalloc)(void**, size_t) = NULL;
static cudaError_t (*real_cudaFree)(void*) = NULL;

struct MyInfo {
    CUmemGenericAllocationHandle allocHandle;
    size_t size;
};

int currentDev = 0; // HACK

std::map<void*, MyInfo> info_of_ptr_map;

static void my_init(void) {
    real_cudaMalloc = (cudaError_t (*)(void**, size_t)) dlsym(RTLD_NEXT, "cudaMalloc");
    if (NULL == real_cudaMalloc) {
        fprintf(stderr, "Error in `dlsym`: %s\n", dlerror());
    }

    real_cudaFree = (cudaError_t (*)(void*)) dlsym(RTLD_NEXT, "cudaFree");
    if (NULL == real_cudaFree) {
        fprintf(stderr, "Error in `dlsym`: %s\n", dlerror());
    }
}

void mem_create(CUmemGenericAllocationHandle *allocHandle, size_t size) {
//    size_t granularity = 0;
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = currentDev;
//    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
//    padded_size = ROUND_UP(size, granularity);
    cuMemCreate(allocHandle, size, &prop, 0);
}

void mem_set_access(void* devPtr, size_t size) {
    CUmemAccessDesc accessDesc = {};
    accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    accessDesc.location.id = currentDev;
    accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess((CUdeviceptr)devPtr, size, &accessDesc, 1);
}

cudaError_t cudaMalloc(void** devPtr, size_t size) {
    if (real_cudaMalloc == NULL) {
        my_init();
    }

//    cudaError_t ret = real_cudaMalloc(devPtr, size);
//    std::cout << "[my_preload.cc] cudaMalloc" << " devPtr=" << devPtr << " size=" << size << " ret=" << ret << std::endl;
//    return ret;

    CUmemGenericAllocationHandle allocHandle;
    mem_create(&allocHandle, size);

    /* Reserve a virtual address range */
    cuMemAddressReserve((CUdeviceptr*)devPtr, size, 0, 0, 0);
    /* Map the virtual address range
     * to the physical allocation */
    cuMemMap((CUdeviceptr)*devPtr, size, 0, allocHandle, 0);

    mem_set_access(*devPtr, size);

    info_of_ptr_map[*devPtr] = MyInfo { allocHandle, size };

    std::cout << "[my_preload.cc] cudaMalloc"
        << " devPtr=" << devPtr << " size=" << size
        << " allocHandle=" << allocHandle
        << std::endl;

    return cudaSuccess;
}

cudaError_t cudaFree(void* devPtr) {
    if (real_cudaFree == NULL) {
        my_init();
    }

//    cudaError_t ret = real_cudaFree(devPtr);
//    std::cout << "[my_preload.cc] cudaFree" << " devPtr=" << devPtr << " ret=" << ret << std::endl;
//    return ret;

    MyInfo info = info_of_ptr_map[devPtr];
    info_of_ptr_map.erase(devPtr);

    cuMemUnmap((CUdeviceptr)devPtr, info.size);
    cuMemRelease(info.allocHandle);
    cuMemAddressFree((CUdeviceptr)devPtr, info.size);

    std::cout << "[my_preload.cc] cudaFree"
        << " devPtr=" << devPtr << " info.size=" << info.size
        << " info.allocHandle=" << info.allocHandle
        << std::endl;

    return cudaSuccess;
}

extern "C" {
    void hack_release_occupation() {
        for (auto it = info_of_ptr_map.begin(); it != info_of_ptr_map.end(); ++it) {
            void* devPtr = it->first;
            MyInfo info = it->second;
            cuMemUnmap((CUdeviceptr)devPtr, info.size);
            cuMemRelease(info.allocHandle);

            std::cout << "[my_preload.cc] hack_release_occupation"
                << " devPtr=" << devPtr << " info.size=" << info.size << " info.allocHandle=" << info.allocHandle
                << std::endl;
        }
    }

    void hack_resume_occupation() {
        for (auto it = info_of_ptr_map.begin(); it != info_of_ptr_map.end(); ++it) {
            void* devPtr = it->first;
            MyInfo &info = it->second;

            CUmemGenericAllocationHandle newAllocHandle;
            mem_create(&newAllocHandle, info.size);

            cuMemMap((CUdeviceptr)devPtr, info.size, 0, newAllocHandle, 0);

            mem_set_access(devPtr, info.size);

            std::cout << "[my_preload.cc] hack_resume_occupation"
                << " devPtr=" << devPtr << " info.size=" << info.size << " (old)info.allocHandle=" << info.allocHandle
                << " (new)newAllocHandle=" << newAllocHandle
                << std::endl;

            info.allocHandle = newAllocHandle;
        }
    }
}

build command

g++ my_preload.cc -o my_preload.so -shared -fPIC -lcuda -I/usr/local/cuda/include

run command

LD_PRELOAD=./my_preload.so python3 example.py

@zhaochenyang20
Collaborator

@fzyzcjy Sorry for the late reply; I have had some things to deal with these days. If I do not reply within one day, please ping me again on WeChat. Have a good day!

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 28, 2024

@zhaochenyang20 No worries! I spent a bit more time digging into the approach above and it seems to work well for SGLang: the KV cache is released and CUDA graph is still enabled. I will do more experiments to check it further when I have another bit of time.

Also, theoretically speaking, this approach should make "release model weights" (again, with CUDA graph still usable) very easy to implement (only two lines of code to add). But I will need to implement an "update model weights from the same card" feature to test that.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 28, 2024

Found another bit of time to submit the PR: #2630.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 28, 2024

Update: By adding a single line, the memory occupied when SGLang is in paused mode further drops from 4.5GB to 3.4GB for the 1B demo model. I have not checked how large the saving is for other models, but it is almost a free lunch since it is only one line of code.

image

(Debugging the logic for model weights now.)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

Update: Model weights seem to be successfully released. Now, when in paused mode, SGLang only takes 1.1GB of memory for the Llama 1B model.

image

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

Summary: When in paused mode,

  • KV cache is released: 22GB -> 4.5GB (expect the saving to be much larger on an 80GB card)
  • ReqToTokenPool.req_to_token is released: 4.5GB -> 3.4GB
  • Model weights are released: 3.4GB -> 1.1GB (expect ~16GB for an 8B model)

And a (simple) correctness check is done: https://github.com/sgl-project/sglang/pull/2630/files#diff-4f475f1badc32fc2578207bded162aac1c915c5f9d28a8e281c1c7d20cb6dd87 - after release/resume, the generated content is correct.

(The numbers above are from my 4090D card, but the idea is similar; I will do more experiments later.)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

Btw, some scenarios come to mind where this may be especially powerful for SGLang users: when the generation phase in RL is especially heavy, for example doing a lot of rollouts or MCTS searches at generation time.

@merrymercy
Contributor

merrymercy commented Dec 29, 2024

This is pretty cool! If possible, we can offload both weights and the KV cache.

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 29, 2024

@merrymercy Yes, that has already been done!

(EDIT: I mean just throwing them away instead of moving them to CPU, since moving is slower.)

@fzyzcjy
Contributor Author

fzyzcjy commented Dec 30, 2024

Quick update: On a standard 3xH100 setting (1xH100 for actor+ref, 1xH100 for critic+reward, 1xH100 for sglang), SGLang's memory can be released when paused and drops to 2.5GB (instead of 72.2GB). I will check the 2xH100 setting (not giving SGLang an extra GPU) as well as comparison tests later.

image

Metrics for very short runs are as follows: (Again, correctness has not been verified)

image

@zhaochenyang20
Collaborator

Quick update: On a standard 3xH100 setting (1xH100 for actor+ref, 1xH100 for critic+reward, 1xH100 for sglang), SGLang's memory can be released when paused and drops to 2.5GB (instead of 72.2GB). I will check the 2xH100 setting (not giving SGLang an extra GPU) as well as comparison tests later.

Metrics for very short runs are as follows: (Again, correctness has not been verified)

Great job. Ping me on Slack if needed!
