
Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Nov 19, 2025

What does this PR do?

Type of change: New Feature

Overview:

Support for evaluating vLLM fake-quantized QAT/QAD checkpoints. This PR adds the ability to export a checkpoint as BF16 weights plus amax values via `export_hf_checkpoint` for HF and `export_mcore_gpt_to_hf` for MCore, using the new `export_bf16_weights_amax` option. The exported weights and amax values can then be used with the `vllm_serve_fakequant.py` script to serve the saved checkpoint.
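
For context, a minimal sketch of the HF export flow described above. The model name, calibration loop, and the exact placement of the `export_bf16_weights_amax` keyword are illustrative assumptions based on this description, not a confirmed final API:

```python
# Minimal sketch, not a confirmed API: assumes export_hf_checkpoint accepts the
# export_bf16_weights_amax option added in this PR. Model name and calibration
# data are placeholders.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # hypothetical example model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="bfloat16", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)


def forward_loop(m):
    # Tiny calibration pass; a real PTQ/QAT flow would iterate a proper dataset.
    batch = tokenizer("Hello, world!", return_tensors="pt").to(m.device)
    m(**batch)


model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export BF16 weights plus per-quantizer amax values (instead of real-quantized
# weights) so vllm_serve_fakequant.py can rebuild the fake-quantized model.
export_hf_checkpoint(model, export_dir="qwen2.5-1.5b-bf16-amax", export_bf16_weights_amax=True)
```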

Usage

Refer to README.md

Testing

  • Tested the HF path by exporting a BF16 model with the QAT script and running the vLLM server; verified that the amax values match (see the comparison sketch after this list)
  • Tested the MCore path by quantizing and exporting a BF16 model with the quantize.sh and export.sh scripts and running the vLLM server; verified that the amax values match
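
For illustration, one way to spot-check that the amax values match is to compare the exported tensors directly. The checkpoint paths and the `_amax` key suffix below are assumptions for the example, not taken from this PR:

```python
# Hedged sketch: compare amax tensors between two safetensors checkpoints,
# e.g. the QAT export and a dump taken from the vLLM side. File paths and the
# "_amax" key suffix are illustrative assumptions.
import torch
from safetensors.torch import load_file

ref = load_file("qat_export/model.safetensors")
test = load_file("vllm_dump/model.safetensors")

amax_keys = [k for k in ref if k.endswith("_amax")]
for k in amax_keys:
    assert k in test, f"missing amax key: {k}"
    assert torch.allclose(ref[k].float(), test[k].float()), f"amax mismatch: {k}"
print(f"checked {len(amax_keys)} amax tensors")
```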

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: Yes

Additional Information

The MCore export script does not currently expose an option to enable this export.

@copy-pr-bot

copy-pr-bot bot commented Nov 19, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@codecov

codecov bot commented Nov 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.47%. Comparing base (1d0ee04) to head (13f6bcd).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #579      +/-   ##
==========================================
+ Coverage   74.43%   74.47%   +0.04%     
==========================================
  Files         182      182              
  Lines       18234    18255      +21     
==========================================
+ Hits        13572    13596      +24     
+ Misses       4662     4659       -3     

☔ View full report in Codecov by Sentry.

@kinjalpatel27 kinjalpatel27 force-pushed the kinjal/bf16_weight_amax_export branch from 9946463 to 560dfc7 Compare November 19, 2025 22:09
@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review November 19, 2025 22:10
@kinjalpatel27 kinjalpatel27 requested review from a team as code owners November 19, 2025 22:10
@kinjalpatel27 kinjalpatel27 self-assigned this Nov 19, 2025
## Known Problems

1. AWQ is not yet supported in vLLM.
2. PTQ/QAT checkpoints don't work with KV cache quantization enabled.
Contributor


Thanks, Kinjal, for documenting this. Created a Jira ticket to address it: https://jirasw.nvidia.com/browse/OMNIML-3051

Contributor Author


Thank you for creating the ticket

Contributor

@realAsma realAsma left a comment


Is it possible to export the entire TensorQuantizer state? That way we could seamlessly support PTQ/QAT and fake quantization.

The current support does not work for cases such as mixed-precision quantization (some layers in FP4, some in FP8, some disabled, etc.); we would need manual workarounds there. It also does not work for other quantization methods such as AWQ.

We rely on the fact that the model is quantized during vllm_serve with the same quantization format that was used for PTQ/QAT.

@kinjalpatel27 kinjalpatel27 force-pushed the kinjal/bf16_weight_amax_export branch from e5a095d to 13f6bcd Compare November 21, 2025 18:49
@kinjalpatel27
Contributor Author

kinjalpatel27 commented Nov 21, 2025

> Is it possible to export the entire TensorQuantizer state? That way we could seamlessly support PTQ/QAT and fake quantization.
>
> The current support does not work for cases such as mixed-precision quantization (some layers in FP4, some in FP8, some disabled, etc.); we would need manual workarounds there. It also does not work for other quantization methods such as AWQ.
>
> We rely on the fact that the model is quantized during vllm_serve with the same quantization format that was used for PTQ/QAT.

I have created two tickets to explore mixed-precision and other quantization algorithm support. Exporting and loading the TensorQuantizer state may require additional effort, since the vLLM model also fuses multiple layers.
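
For illustration only, a rough sketch of what collecting the full TensorQuantizer state could look like. This is a hypothetical alternative flow, not what this PR implements, and loading it into vLLM would still require remapping names for fused layers:

```python
# Hypothetical sketch: gather the complete state of every TensorQuantizer so
# that quantization format, enabled/disabled status, and calibrated amax values
# all travel with the checkpoint. Not part of this PR.
import torch
from modelopt.torch.quantization.nn import TensorQuantizer


def collect_quantizer_state(model: torch.nn.Module) -> dict:
    return {
        name: module.state_dict()  # includes the _amax buffer when calibrated
        for name, module in model.named_modules()
        if isinstance(module, TensorQuantizer)
    }


# torch.save(collect_quantizer_state(model), "tensor_quantizer_state.pt")
```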

Contributor

@meenchen meenchen left a comment


Can you also add unit tests for `modelopt.torch.export.unified_export_hf.export_hf_checkpoint` and `modelopt.torch.export.unified_export_megatron.export_mcore_gpt_to_hf`?

Comment on lines +70 to +74
gate_up_match = "mixer" not in key and re.search(r"(.*\.)(gate|up)_proj(\..+_amax)$", key)
if gate_up_match:
    base_pattern = gate_up_match.group(1) + "gate_up_proj" + gate_up_match.group(3)
    merge_groups[base_pattern].append((key, value))
    continue
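
For readers skimming this hunk, a small self-contained illustration of how the regex merges gate_proj/up_proj amax keys under a shared gate_up_proj key (the key names below are made up for the example):

```python
# Illustration only: the regex from the hunk above groups gate_proj/up_proj
# amax keys under one gate_up_proj key, matching vLLM's fused gate_up_proj
# layer. Key names here are hypothetical.
import re
from collections import defaultdict

merge_groups = defaultdict(list)
keys = {
    "model.layers.0.mlp.gate_proj.input_quantizer._amax": 1.0,
    "model.layers.0.mlp.up_proj.input_quantizer._amax": 2.0,
}
for key, value in keys.items():
    m = "mixer" not in key and re.search(r"(.*\.)(gate|up)_proj(\..+_amax)$", key)
    if m:
        merged = m.group(1) + "gate_up_proj" + m.group(3)
        merge_groups[merged].append((key, value))

print(dict(merge_groups))
# Both amax entries end up under
# "model.layers.0.mlp.gate_up_proj.input_quantizer._amax".
```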
Contributor


Does this work with MoE models that use this quant module:

class _QuantFusedMoEBase(QuantModule):

Contributor Author


Can you give an example of which model you are referring to?

Contributor


Sure, you can try Qwen/Qwen3-30B-A3B-Instruct-2507

@kinjalpatel27 kinjalpatel27 force-pushed the kinjal/bf16_weight_amax_export branch from 32968c9 to 32f7ae8 Compare November 21, 2025 21:43
@kinjalpatel27
Contributor Author

kinjalpatel27 commented Nov 21, 2025

> Can you also add unit tests for `modelopt.torch.export.unified_export_hf.export_hf_checkpoint` and `modelopt.torch.export.unified_export_megatron.export_mcore_gpt_to_hf`?

@meenchen Added tests for both in a separate file: `tests/gpu/torch/export/test_vllm_fakequant_export.py`

Contributor

@meenchen meenchen left a comment


LGTM. Could you also check whether Qwen/Qwen3-30B-A3B-Instruct-2507 works with this change?

@kinjalpatel27
Contributor Author

> LGTM. Could you also check whether Qwen/Qwen3-30B-A3B-Instruct-2507 works with this change?

@meenchen Thank you. I checked Qwen/Qwen3-30B-A3B-Instruct-2507; it doesn't work with FusedMoE yet. I am looking into fixing it.
