[NVIDIA] flashinfer TRTLLM attention prefill token limit #25998
base: main
Conversation
…ction
- Remove inappropriate 256 token limit for prefill sequences
- Keep existing 256 token limit for decode batches
- Add context-specific logging to distinguish prefill vs decode
- Fixes issue where prefill sequences > 256 tokens incorrectly fell back to FlashInfer

Signed-off-by: jasonlizhengjian <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request updates the heuristic for using the FlashInfer TRTLLM attention kernel during prefill, removing the previous token limit of 256. This change is well-supported by the provided benchmarks, which show significant performance improvements. My review focuses on refining this heuristic to prevent potential out-of-memory errors. I've suggested adding a new token limit based on the successful benchmarked configurations to ensure stability while retaining the performance benefits.
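As a rough illustration of that suggestion, a guard of the following shape could cap prefill at the largest benchmarked size. The constant, its value, and the function name below are hypothetical and are not taken from this PR:

```python
# Hypothetical sketch of the suggested safeguard: use the TRTLLM prefill
# kernel beyond 256 tokens, but only up to the largest token count that was
# benchmarked successfully, to avoid untested (potentially OOM-prone) shapes.
MAX_BENCHMARKED_PREFILL_TOKENS = 8192  # placeholder value, not from the PR


def trtllm_prefill_allowed(num_prefill_tokens: int) -> bool:
    """Return True if the TRTLLM prefill path should be taken."""
    return num_prefill_tokens <= MAX_BENCHMARKED_PREFILL_TOKENS
```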
Thank you for the benchmarks; that seems pretty clear. Have you looked at the decode case as well? We are likely to see large decode batches on B200 given the default max_num_seqs is 1024.
You are going to enable the prefill kernel for BS > 256, but the benchmark only shows BS < 256. I believe the perf should be good, but can you show them?
Here are results with batch sizes up to 2048:
Signed-off-by: jasonlizhengjian <[email protected]>
Purpose
Change the heuristic so that the flashinfer TRTLLM attention gets used more for prefill. Previously it was only used for <= 256 tokens despite being faster (benchmark below) in all cases tested.
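For illustration, here is a minimal sketch of the described heuristic, using hypothetical names rather than the actual vLLM backend code:

```python
# Illustrative sketch only; the real selection logic in vLLM's FlashInfer
# backend takes more inputs into account than shown here.
def use_trtllm_kernel(num_tokens: int, is_prefill: bool) -> bool:
    if is_prefill:
        # After this PR: no 256-token cap for prefill, since TRTLLM measured
        # faster at every prefill size tested.
        return True
    # Decode keeps the existing 256-token limit.
    return num_tokens <= 256
```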
Test Plan
Benchmark using benchmarks/kernels/benchmark_trtllm_prefill_attention.py; datapoints that caused OOM were left out.
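A hedged example of running that script from the repository root; its command-line options (if any) are not shown in this PR, so it is invoked without arguments here:

```python
# Run the prefill attention benchmark referenced above. Assumes a
# CUDA-capable machine with vLLM and FlashInfer installed; any flags the
# script accepts are an assumption and may differ.
import subprocess

subprocess.run(
    ["python", "benchmarks/kernels/benchmark_trtllm_prefill_attention.py"],
    check=True,
)
```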
Test Result
Benchmark results below. speedup_% > 0 always, meaning TRTLLM attention is always faster for prefill.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.