
Bounded peak memory in Top-P-Top-K with chunked sorting #11544

Open · wants to merge 7 commits into main

Conversation

@yangalan123 (Contributor) commented Dec 27, 2024

This PR, developed in collaboration with @youkaichao, implements chunked sorting to reduce peak memory during sampling (aiming to solve the OOM issue in computing logits reported in Issue #5907).

The issue we want to address is that PyTorch sorting on logits incurs significant peak memory usage, which turns out to be a major memory bottleneck at decoding time, especially when users request logits. Our approach: instead of sorting the large logits tensor as a whole, we first split it into chunks along the token dimension and sort each smaller chunk. This way, the intermediate buffers created during sorting are much smaller and their memory can be recycled promptly.
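
For illustration, a minimal, self-contained sketch of the idea (not the actual PR diff; the function and variable names here are placeholders):

import torch

def sort_logits_in_chunks(logits: torch.Tensor, chunk_size: int):
    # Split along the token (row) dimension and sort each chunk on its own.
    # Rows are sorted independently along the vocab dimension, so the result
    # is identical to sorting the whole tensor at once; only the temporary
    # buffers allocated by torch.sort become smaller.
    sorted_chunks, index_chunks = [], []
    for chunk in torch.split(logits, chunk_size, dim=0):
        sorted_chunk, sorted_idx = chunk.sort(dim=-1, descending=True)
        sorted_chunks.append(sorted_chunk)
        index_chunks.append(sorted_idx)
    return torch.cat(sorted_chunks, dim=0), torch.cat(index_chunks, dim=0)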

Effectiveness:
We conducted a simple verification. On a standard 4x A40 GPU server, we ran vllm serve meta-llama/Meta-Llama-3-8B --tensor_parallel_size 4 and monitored memory usage. By setting chunk_size=64 (1/4 of max_num_seqs), we reduced peak memory usage on Rank 0 (where sampling and sorting happen) from 5.0 GB to 4.5 GB. Considering the roughly 3.7 GB of irreducible model weights (plus some minor usage by NCCL and CUDA graphs), and the relatively small code change, we see this peak memory reduction as a promising direction: it cuts roughly 50% of the peak memory beyond the weights and these fixed overheads.
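
As a rough illustration of how such a per-rank peak can be observed (a sketch only, not the exact monitoring we used):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run a decoding step that samples and sorts logits ...
peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak allocated on this rank: {peak_gib:.2f} GiB")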

FIX #5907 (link existing issues this PR will resolve)

Commit: …ecoding to reduce peak memory (try to solve OOM issue in computing logits in Issue vllm-project#5907)

Signed-off-by: Chenghao Yang <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@youkaichao youkaichao marked this pull request as draft December 27, 2024 04:33
@youkaichao youkaichao marked this pull request as ready for review December 27, 2024 05:29
@youkaichao (Member) commented:

@yangalan123 thanks for the investigation! I changed the chunk_size budget to always be max-num-seqs; this keeps the current behavior and only fixes the OOM issue when people ask for prompt logprobs.

We can investigate in the future whether people ever want to control the chunk_size directly; if so, we can expose a flag to users.

cc @robertgshaw2-neuralmagic I think it can solve most of the issues linked in #5907.
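
For illustration, the budgeting described above amounts to something like this (a minimal sketch with hypothetical names; the actual wiring in the PR may differ):

def pick_chunk_size(num_logit_rows: int, max_num_seqs: int) -> int:
    # Cap the sorting chunk at the max_num_seqs budget: ordinary decode
    # batches (at most max_num_seqs rows of logits) are sorted in a single
    # chunk exactly as before, while oversized batches (e.g. prompt logprobs
    # over a long prompt) get split into several smaller sorts.
    return min(num_logit_rows, max_num_seqs)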

@youkaichao youkaichao changed the title [WIP] Enable Chunked Sorting to Reduce Peak Memory Usage in Top-P-Top-K Sorting Reduce peak memory in Top-P-Top-K with chunked sorting Dec 27, 2024
@youkaichao youkaichao changed the title Reduce peak memory in Top-P-Top-K with chunked sorting Bounded peak memory in Top-P-Top-K with chunked sorting Dec 27, 2024
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
@mergify mergify bot added the ci/build label Dec 27, 2024
@youkaichao (Member) commented:

For the record, the memory cost of torch.sort seems to be a known issue; see pytorch/pytorch#77049 (comment).

@robertgshaw2-redhat robertgshaw2-redhat self-assigned this Dec 28, 2024
@robertgshaw2-redhat (Collaborator) commented:

Nice work, will review on my flight today

@@ -186,6 +186,12 @@ def __init__(self):
# speculative decoding.
self.include_gpu_probs_tensor = False
self.should_modify_greedy_probs_inplace = False
from vllm.config import get_current_vllm_config
@robertgshaw2-redhat (Collaborator) commented on this diff, Dec 29, 2024:

I think that we should pipe this parameter down through the constructor rather than using a global config
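
Something along these lines, for example (illustrative only; the real Sampler signature and parameter name may differ):

from typing import Optional

class Sampler:
    def __init__(self, logits_sort_chunk_size: Optional[int] = None):
        # The chunk size is wired in from the engine config at construction
        # time instead of calling get_current_vllm_config() inside __init__,
        # which keeps the sampler easy to construct and unit-test in isolation.
        self.logits_sort_chunk_size = logits_sort_chunk_size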

for logits_chunk in logits_chunks:
current_chunk_size = logits_chunk.shape[0]
end_idx = start_idx + current_chunk_size
output_logits[start_idx:end_idx].copy_(
Review comment (Collaborator):

QQ - why .copy_ rather than just setting the value?
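
For reference, the two spellings being compared (sorted_chunk is a stand-in for the per-chunk result; both write into the preallocated output tensor's storage):

# explicit in-place copy into the preallocated slice
output_logits[start_idx:end_idx].copy_(sorted_chunk)
# equivalent slice assignment; this also writes into output_logits in place
output_logits[start_idx:end_idx] = sorted_chunk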

end_idx = start_idx + chunk_size
cmp = x[start_idx:end_idx] > vals[start_idx:end_idx, None]
# cmp.sum(dim=1, dtype=torch.int32) is the peak memory usage.
ranks = cmp.sum(dim=1, dtype=torch.int32).add_(1)
Review comment (Collaborator):

Can you add a comment that explains the logic?

Specifically, something that says:

  • select all tokens whose logprobs are greater than the selected index's logprob, as booleans
  • summing over the booleans gives the count of Trues
  • add 1 (for example, as annotated below)
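
An annotated version of the lines from the diff above (comments paraphrase the bullets; exact variable semantics come from the surrounding code):

# vals[i] is the value (logprob) of the selected token for row i.
# The comparison builds a boolean [chunk, vocab] mask that selects every
# entry strictly greater than the selected token's value.
cmp = x[start_idx:end_idx] > vals[start_idx:end_idx, None]
# Summing the booleans counts how many tokens outrank the selected one;
# adding 1 converts that count into a 1-based rank.
ranks = cmp.sum(dim=1, dtype=torch.int32).add_(1)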

@@ -0,0 +1,20 @@
from vllm import LLM, SamplingParams

Review comment (Collaborator):

  • Please move this to the logprobs test directory rather than entrypoints
  • Do we have a test that this is giving the right answer?

@robertgshaw2-redhat (Collaborator) commented:

Thanks! Left some minor comments.

  • Can you run this through an lm-eval-harness test that uses prompt logprobs as a sanity check for correctness?
  • Is there any impact on speed?

result = (x > vals[:, None])
del vals
return result.sum(1).add_(1)
N, M = x.shape
Review comment (Collaborator):

Looks like M is unused, was that intentional?


mergify bot commented Dec 30, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @yangalan123.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 30, 2024
@yangalan123 (Contributor, Author) commented Dec 31, 2024

Thanks! Left some minor comments.

  • Can you run this through an lm-eval-harness test that uses prompt logprobs as a sanity check for correctness?
  • Is there any impact on speed?

Thanks for the review and comments (and Happy New Year!). As the initial author of this PR, I can provide some initial insights on the correctness of this chunked sorting approach:

  1. Correctness: I think prompt_logprobs is not a direct test for this PR, even though the PR is motivated by the OOM issues resulting from computing prompt_logprobs. A better and more straightforward correctness test is to directly compare naive torch.sort against our chunked sorting. I ran the following simple benchmarking code to compare the two (partial credit goes to Claude, as I am kind of lazy during the holiday season :-) ):
# benchmark_runner.py
import torch
import time
import json
import argparse
from pathlib import Path
from tqdm import tqdm

def naive_sort(tensor: torch.Tensor) -> torch.Tensor:
    return torch.sort(tensor, dim=-1)[0]

def chunked_sort(tensor: torch.Tensor, chunk_size: int) -> torch.Tensor:
    chunks = torch.split(tensor, chunk_size, dim=0)
    sorted_chunks = [torch.sort(chunk, dim=-1)[0] for chunk in chunks]
    return torch.cat(sorted_chunks, dim=0)

def run_benchmark(args):
    results = {
        'time': [],
        'accuracy': [],
        'type': 'naive' if args.chunk_size is None else f'chunked_{args.chunk_size}'
    }

    for _ in tqdm(range(args.num_rounds)):
        tensor = torch.randn(args.num_tokens, args.vocab_size, device='cuda')
        torch.cuda.synchronize()  # make sure setup is finished before timing
        start_time = time.perf_counter()
        if args.chunk_size is None:
            result = naive_sort(tensor)
        else:
            result = chunked_sort(tensor, args.chunk_size)
        torch.cuda.synchronize()  # wait for the (async) sort kernels to finish
        results['time'].append(time.perf_counter() - start_time)

        # Compare against the naive sort outside the timed region so the
        # accuracy check does not inflate the chunked-sort timing.
        if args.chunk_size is not None and args.check_accuracy:
            naive_result = naive_sort(tensor)
            accuracy = float(torch.allclose(naive_result, result, rtol=1e-5))
            results['accuracy'].append(accuracy)

        del tensor, result
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    Path(args.output_dir).mkdir(exist_ok=True)
    output_file = Path(args.output_dir) / f"results_{results['type']}.json"
    with open(output_file, 'w') as f:
        json.dump(results, f)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_tokens', type=int, default=8192)
    parser.add_argument('--vocab_size', type=int, default=200000)
    parser.add_argument('--chunk_size', type=int, default=None)
    parser.add_argument('--num_rounds', type=int, default=5)
    parser.add_argument('--check_accuracy', action='store_true')
    parser.add_argument('--output_dir', type=str, default='benchmark_results')
    args = parser.parse_args()

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA not available")

    run_benchmark(args)

if __name__ == "__main__":
    main()

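For reference, a typical pair of invocations (assuming the script is saved as benchmark_runner.py, matching its header comment; all flags come from the argparse definitions above):

python benchmark_runner.py --num_tokens 8192 --vocab_size 128256 --num_rounds 5
python benchmark_runner.py --num_tokens 8192 --vocab_size 128256 --num_rounds 5 --chunk_size 32 --check_accuracy
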
I ran with num_tokens=8192 and vocab_size=128256 on an A40 GPU to simulate running Llama-3 models. The results show that no matter what chunk_size we choose, after 5 rounds the sorted outputs match the naive baseline exactly (100% accuracy), which verifies the correctness of the implementation. This is expected, because the chunking and merging happen along the token dimension, not the vocabulary dimension.

  2. Efficiency: We definitely incur some overhead here, because we need to do the chunking first, and we may also lose some concurrency: sorting smaller chunks of logits may not fully saturate the GPU. Nevertheless, reusing the code above and discarding extreme outliers (likely due to shared cluster usage), I find that setting chunk_size to 32 only adds around 0.2 seconds (1.21 s vs. 1.02 s, about 18% slower). Note that this is already a stress test using the full input window (num_tokens=8192); in real applications, with smaller logit tensors to sort, the overhead should be even more negligible. For users hitting OOM (e.g., when computing prompt_logprobs), running a bit slower is much better than getting a CUDA OOM error and having the whole run fail.

    Please note that although we could also profile memory here, both the running-time and memory numbers reported by this simple simulation should be taken with a grain of salt, since the distributed environment and setup of the vLLM runtime are not reflected here. I include this simulation only for quick reference.

As for the other code comments: @youkaichao is already polishing my initial commits, so perhaps he can provide more thoughts and details.

Happy New Year again to everyone, and thanks to all for reviewing and polishing this PR!


github-actions bot commented Apr 1, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Apr 1, 2025
Labels: ci/build, needs-rebase, stale (Over 90 days of inactivity)

Successfully merging this pull request may close these issues:

[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs

4 participants