Feat: Pre-quantized LLM model support #3740
Conversation
Force-pushed from 62cb0f2 to 1f1cf7f
return model

class TensorRTQuantizedLinear(torch.nn.Module):
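For context, here is a minimal sketch of what a wrapper module like this typically holds for a pre-quantized checkpoint. The field names below (weight_scale, input_amax) only mirror names visible elsewhere in this diff; the actual TensorRTQuantizedLinear in the PR may be structured differently.

import torch

class QuantizedLinearSketch(torch.nn.Module):
    # Illustrative only: wraps a dequantized nn.Linear and keeps the scale
    # metadata recovered from the checkpoint so a later quantization/lowering
    # pass can use it.
    def __init__(self, linear: torch.nn.Linear, weight_scale: torch.Tensor, input_amax: torch.Tensor):
        super().__init__()
        self.linear = linear  # weights already dequantized to the model's precision
        self.register_buffer("weight_scale", weight_scale)
        self.register_buffer("input_amax", input_amax)  # activation range hint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # In eager mode this just runs the dequantized linear; the buffers
        # exist for the downstream quantization path.
        return self.linear(x)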
@peri044 Is this something we might want to upstream to ModelOpt in the future?
Or pull into main torch-tensorrt as a pass?
I guess it's somewhat HF-specific, so remaining in this tool would make sense, but are there some parts we could make generic for any sort of quantization workflow (e.g. torchao)?
Thanks. I think quantize_model() can be moved to a function like torch_tensorrt.dynamo.quantize(). I'm currently investigating how to separate the calibration data path from the quantization logic.
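As a rough illustration of that split (the signature is purely hypothetical; torch_tensorrt.dynamo.quantize() does not exist today), the calibration data path could be handed in as a callable while the quantization logic stays generic:

import torch
import modelopt.torch.quantization as mtq

def quantize(model: torch.nn.Module, quant_cfg: dict, forward_loop=None) -> torch.nn.Module:
    # forward_loop is supplied by the caller, e.g.
    #   lambda m: [m(batch) for batch in calib_dataloader]
    # so the calibration data path lives outside the quantization logic.
    return mtq.quantize(model, quant_cfg, forward_loop=forward_loop)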
tools/llm/quantize_utils.py (Outdated)
hf_quant_algo = hf_quant_config.pop("quant_algo", None)
if hf_quant_algo != "FP8" and hf_quant_algo != "NVFP4":
    raise RuntimeError("Only FP8 or NVFP4 quantization is supported")
How would it be different for MXFP4?
I looked at the quantization configs in ModelOpt:
NVFP4_DEFAULT_CFG: NVFP4 has E4M3 scales and a block size of 16.
MXFP4_DEFAULT_CFG: MXFP4 has E8M0 scales and a block size of 32.
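A small sketch summarizing how the two formats differ, based on the values above; the helper below is hypothetical and not code from this PR.

# Per the ModelOpt defaults referenced above (NVFP4_DEFAULT_CFG / MXFP4_DEFAULT_CFG)
BLOCK_FORMATS = {
    "NVFP4": {"scale_dtype": "E4M3", "block_size": 16},
    "MXFP4": {"scale_dtype": "E8M0", "block_size": 32},
}

def block_params(quant_algo: str):
    # Extending the FP8/NVFP4 check to MXFP4 would mostly mean handling a
    # different scale encoding and block size when expanding the scales.
    if quant_algo not in BLOCK_FORMATS:
        raise RuntimeError(f"Unsupported block quantization format: {quant_algo}")
    cfg = BLOCK_FORMATS[quant_algo]
    return cfg["scale_dtype"], cfg["block_size"]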
Force-pushed from 19a4070 to 303775e
ModelOpt has changed its code structure in 0.35.0:
tools/llm/quantize_utils.py (Outdated)
input_amax = tensors.pop(input_scale_name) * 448.0

# Dequantize the weight using the scale factor
dequantized_weight_data = module.weight.to(torch.float32) * weight_scale
Should we check the precision and use .to(torch.float16) if it's fp16, otherwise float32?
Thanks, that makes sense. I've updated it to use the same model precision.
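A minimal sketch of that change; the function name and the explicit model_dtype argument are assumptions, not the PR's actual code.

import torch

def dequantize_weight(weight: torch.Tensor, weight_scale: torch.Tensor,
                      model_dtype: torch.dtype = torch.float16) -> torch.Tensor:
    # Dequantize in the precision the rest of the model runs in (fp16 vs fp32)
    # instead of hard-coding float32.
    return weight.to(model_dtype) * weight_scale.to(model_dtype)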
Functionality looks good to me. Posted some comments on code restructuring
tools/llm/run_llm.py (Outdated)
hf_quant_config = load_quantization_config(args.model)
if hf_quant_config:
    model = convert_linear_to_tensorrt_quantized(model, hf_quant_config).cuda()
    print(f"Model converted to TensorRT quantized")
Consider changing this to a more informative message
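For example, something along these lines; the wording is only a suggestion, and it assumes quant_algo is read from the config before the conversion consumes it.

quant_algo = hf_quant_config.get("quant_algo", "unknown")
model = convert_linear_to_tensorrt_quantized(model, hf_quant_config).cuda()
print(f"Converted linear layers to TensorRT-quantized modules (quant_algo={quant_algo})")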
Force-pushed from 7575120 to efa63db
Force-pushed from efa63db to 3670829
LGTM pending CI failures
Description
Support pre-quantized HF models and a post-training quantization (PTQ) option for run_llm.py.
Fixes # (issue)
Type of change
Checklist: