Enable FlashInfer Hopper FP8 attention #29661
Purpose
Enable the best-performing Hopper FP8 attention kernels in the FlashInfer backend.
In the prefill stage, the FlashInfer FA3 backend will be used. In the decode stage, the FlashInfer XQA backend (via the trtllm interface) will be used.
However, things are complicated by the fact that different backends have different requirements. The most notable difference is that when the kv-cache dtype is FP8, the FA3 backend requires the query to also be in FP8, while the XQA backend requires the query to be in FP16/BF16. Therefore, we cannot apply query quantization outside of the attention custom op; instead, we must apply it inside the attention custom op's forward(), as sketched below.
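As a rough illustration of this constraint, here is a minimal sketch of per-phase query handling inside `forward()`; the helper name `quantize_query_for_backend`, its signature, and the plain cast are hypothetical rather than the code in this PR:

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn  # FP8 format commonly used for KV caches on Hopper

def quantize_query_for_backend(query: torch.Tensor, kv_cache_dtype: str,
                               is_prefill: bool) -> torch.Tensor:
    """Pick the query dtype each kernel expects (hypothetical helper).

    FA3 (prefill) consumes an FP8 query when the KV cache is FP8;
    XQA (decode) expects the query left in FP16/BF16.
    """
    if kv_cache_dtype.startswith("fp8") and is_prefill:
        # Plain cast for illustration only; a real FP8 path would also
        # compute and apply a quantization scale.
        return query.to(FP8_DTYPE)
    return query
```

Because the two phases disagree on the query dtype, no single quantization decision can be made at the call site; it has to happen inside `forward()`, where the phase is known.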
Changes
In `vllm/utils/flashinfer.py`:
- `is_sm100f_supported()`
- Renamed `supports_trtllm_attention()` to `check_trtllm_attention_support()`, which now returns the support decision together with a reason.
- Updated `use_trtllm_attention()` so that it combines `check_trtllm_attention_support()` and `force_use_trtllm_attention()` (a rough sketch of the resulting control flow follows this list).
- If TRTLLM attention cannot be used but is forced via `force_use_trtllm_attention()`, print a warning with the reason. Otherwise, print an info with the reason.
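A minimal sketch of how these pieces might fit together, assuming `check_trtllm_attention_support()` returns a `(supported, reason)` pair and `force_use_trtllm_attention()` reads an override flag; the exact signatures in the PR may differ:

```python
import logging
import os

logger = logging.getLogger(__name__)

def check_trtllm_attention_support() -> tuple[bool, str]:
    # Placeholder check; the real helper inspects the GPU and the
    # FlashInfer build. Assumed return shape: (supported, reason).
    return False, "TRTLLM attention is unsupported on this platform"

def force_use_trtllm_attention() -> bool:
    # Assumed override, e.g. driven by an environment variable.
    return os.environ.get("VLLM_USE_TRTLLM_ATTENTION") == "1"

def use_trtllm_attention() -> bool:
    supported, reason = check_trtllm_attention_support()
    if supported:
        logger.info("Using TRTLLM attention (%s)", reason)
        return True
    if force_use_trtllm_attention():
        # Forced despite the failed check: warn with the reason.
        logger.warning("Forcing TRTLLM attention although %s", reason)
        return True
    logger.info("Not using TRTLLM attention: %s", reason)
    return False
```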
In `vllm/v1/attention/backends/flashinfer.py`:
- Split `q_data_type` in the metadata into `q_data_type_prefill` and `q_data_type_decode` (illustrated in the sketch below).
- Updated `__init__()` and `get_cudagraph_support()` (the latter is a classmethod).

Requires flashinfer-ai/flashinfer#2148 from FlashInfer.
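For the metadata split, an illustrative subset (the field names come from the PR; the class shape and defaults are assumptions here):

```python
from dataclasses import dataclass

import torch

@dataclass
class FlashInferMetadata:
    # Illustrative subset only. Formerly a single q_data_type; split so
    # that prefill (FA3) and decode (XQA) can disagree when the KV
    # cache is FP8.
    q_data_type_prefill: torch.dtype = torch.float8_e4m3fn  # FA3 takes FP8 queries
    q_data_type_decode: torch.dtype = torch.bfloat16        # XQA keeps FP16/BF16 queries
```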
Test Plan
Run accuracy + performance tests on the cross-product of the following attributes:
Script to run
TBA
Test Result
TBA
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.