fix: [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support #3877

hlu1 · 2025-04-26T00:20:34Z

Description

~~Make sure that quant_config in attention kernel is not modified later when we reset the quant_config in modules that are excluded from quantization.~~
Add exclude_kv_cache to QuantMode.has_any_quant, which is used in Linear and FusedMoE. The missing option was causing issues when enabling fp8 kvcache with fp16/bf16 checkpoints.
Remove the check under if use_paged_context_fmha and self.has_fp8_kv_cache from trtllm attention. The check doesn't make sense because Linear quant_configs are independent of the kvcache quant config.
When resetting quant_config for modules in quant_config.exclude_modules, make sure to set kv_cache_quant_algo to the original value.
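
For readers without the diff at hand, the two behavioral changes above can be sketched roughly as follows. This is a simplified illustration, not the TensorRT-LLM code: the `QuantMode` and `QuantConfig` classes here are stand-ins with made-up flag values, and `config_for_excluded_module` is a hypothetical helper name. Only the names `has_any_quant`, `exclude_kv_cache`, `kv_cache_quant_algo` and `exclude_modules` come from the description.

```python
from dataclasses import dataclass
from enum import IntFlag
from typing import Optional


class QuantMode(IntFlag):
    """Stand-in flag set; the real class has more flags and different values."""
    NONE = 0
    FP8_QDQ = 1          # weight/activation quantization (illustrative)
    INT8_KV_CACHE = 2    # KV cache quantization flags (illustrative)
    FP8_KV_CACHE = 4

    def has_any_quant(self, exclude_kv_cache: bool = False) -> bool:
        bits = int(self)
        if exclude_kv_cache:
            # Mask out KV cache flags so that layers such as Linear or FusedMoE
            # only see quantization that applies to their own weights.
            bits &= ~int(QuantMode.INT8_KV_CACHE | QuantMode.FP8_KV_CACHE)
        return bits != 0


@dataclass(frozen=True)
class QuantConfig:
    """Stand-in config holding only the two fields discussed here."""
    quant_algo: Optional[str] = None
    kv_cache_quant_algo: Optional[str] = None


def config_for_excluded_module(original: QuantConfig) -> QuantConfig:
    # Hypothetical helper: drop weight quantization for an excluded module
    # but carry over the original KV cache setting.
    return QuantConfig(quant_algo=None,
                       kv_cache_quant_algo=original.kv_cache_quant_algo)


# A bf16 checkpoint with only the fp8 KV cache enabled.
mode = QuantMode.FP8_KV_CACHE
print(mode.has_any_quant())                       # KV cache counts as quantization
print(mode.has_any_quant(exclude_kv_cache=True))  # linear layers see no quantization
print(config_for_excluded_module(QuantConfig("FP8", "FP8")))
```

In this sketch, a layer that asks `has_any_quant(exclude_kv_cache=True)` no longer treats an unquantized checkpoint as quantized just because the KV cache is fp8, and an excluded module still reports the KV cache algorithm that attention needs.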

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

hlu1 · 2025-04-26T00:22:56Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-04-26T00:28:46Z

PR_Github #3434 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-26T04:00:34Z

PR_Github #3434 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2411 completed with status: 'FAILURE'

hlu1 · 2025-04-26T06:08:47Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-04-26T06:14:25Z

PR_Github #3444 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-26T11:06:13Z

PR_Github #3444 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2420 completed with status: 'SUCCESS'

hlu1 · 2025-04-26T19:11:33Z

/bot reuse-pipeline

tensorrt-cicd · 2025-04-26T20:17:52Z

PR_Github #3461 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-04-26T20:23:14Z

PR_Github #3461 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #3444 for commit 096fc40

tensorrt_llm/_torch/attention_backend/trtllm.py

hlu1 · 2025-04-28T22:14:06Z

/bot reuse-pipeline

tensorrt-cicd · 2025-04-28T22:19:37Z

PR_Github #3652 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-04-28T22:25:37Z

PR_Github #3652 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #3444 for commit 809b907

Signed-off-by: Hao Lu <[email protected]@users.noreply.github.com>

hlu1 · 2025-04-29T02:17:56Z

/bot reuse-pipeline

tensorrt-cicd · 2025-04-29T02:24:54Z

PR_Github #3665 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-04-29T02:31:08Z

PR_Github #3665 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #3444 for commit 46869d6

hlu1 requested review from QiJune, bobboli and kaiyux April 26, 2025 00:20

hlu1 force-pushed the fix_fp8_kvcache branch from 2ed0487 to 966af71 Compare April 26, 2025 00:22

hlu1 changed the title ~~[fix] Fix fp8 kvcache support~~ [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support Apr 26, 2025

juney-nvidia changed the title ~~[https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support~~ fix： [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support Apr 26, 2025

juney-nvidia changed the title ~~fix： [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support~~ fix: [https://nvbugspro.nvidia.com/bug/5242406][fix] Fix fp8 kvcache support Apr 26, 2025

hlu1 force-pushed the fix_fp8_kvcache branch from 966af71 to 4d91c40 Compare April 26, 2025 06:06

hlu1 force-pushed the fix_fp8_kvcache branch from 4d91c40 to 096fc40 Compare April 26, 2025 19:11

hlu1 enabled auto-merge (squash) April 26, 2025 19:11

hlu1 disabled auto-merge April 26, 2025 19:29

hlu1 requested a review from yuxianq April 28, 2025 05:18

bobboli reviewed Apr 28, 2025

tensorrt_llm/_torch/attention_backend/trtllm.py (comment resolved)

bobboli approved these changes Apr 28, 2025

hlu1 force-pushed the fix_fp8_kvcache branch from 096fc40 to 809b907 Compare April 28, 2025 22:12

hlu1 mentioned this pull request Apr 28, 2025

refactor: (part1) Add contraints doc for fusedMoe module. #3882

Merged

hlu1 requested a review from litaotju April 29, 2025 02:12

litaotju approved these changes Apr 29, 2025

Fix fp8 kvcache

46869d6

Signed-off-by: Hao Lu <[email protected]@users.noreply.github.com>

hlu1 force-pushed the fix_fp8_kvcache branch from 809b907 to 46869d6 Compare April 29, 2025 02:17

hlu1 enabled auto-merge (squash) April 29, 2025 02:18

hlu1 merged commit d2f312b into NVIDIA:main Apr 29, 2025
3 checks passed

hlu1 deleted the fix_fp8_kvcache branch May 1, 2025 07:03
