examples/llm_ptq/README.md (2 additions, 0 deletions)
@@ -131,8 +131,10 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http
| QwQ | ✅ | - | - | - | ✅ |
| T5 | ✅ | ✅ | ✅ | ✅ | - |
| Whisper | ✅ | ❌ | ❌ | ❌ | - |
| Kimi-K2-Thinking-BF16 | ✅ | ❌ | ❌ | ❌ | ✅ |

> *This is a subset of the models supported. For the full list please check the [TensorRT-LLM support matrix](https://nvidia.github.io/TensorRT-LLM/reference/precision.html#support-matrix)*
> We recommend upcasting Kimi-K2-Thinking from INT4 to BF16 before running quantization.
Collaborator:
Is it a recommendation, or is it something we have to do? An alternative is to upcast the INT4 to BF16 during calibration, like we did with DS.

Contributor Author:
But there's no INT4 support in PyTorch, as we discussed. People have to use vLLM if they want INT4. Zhiyu and I are looking into vLLM calibration for this model.
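
For context, a minimal sketch of what the recommended INT4-to-BF16 upcast amounts to for a single block-quantized weight tensor. The tensor layout, names, and block size below are assumptions for illustration only, not the actual Kimi-K2-Thinking checkpoint format:

```python
# Illustrative only: expand block-quantized INT4 values back to BF16 so that
# quantization can start from a BF16 baseline. Real checkpoints may pack two
# INT4 values per byte and store scales differently.
import torch

def dequantize_int4_block(w_int4: torch.Tensor, scales: torch.Tensor,
                          block_size: int = 32) -> torch.Tensor:
    # w_int4: [out_features, in_features] int8 holding signed INT4 values in [-8, 7]
    # scales: [out_features, in_features // block_size] per-block scales
    w = w_int4.to(torch.float32)
    w = w.view(w.shape[0], -1, block_size) * scales.unsqueeze(-1)
    return w.reshape(w_int4.shape).to(torch.bfloat16)
```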


> *<sup>1.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
> *<sup>2.</sup>For some models, there is only support for exporting quantized checkpoints.* \
examples/llm_ptq/hf_ptq.py (2 additions, 0 deletions)
@@ -80,6 +80,7 @@
"w4a8_nvfp4_fp8": mtq.W4A8_NVFP4_FP8_CFG,
"w4a8_mxfp4_fp8": mtq.W4A8_MXFP4_FP8_CFG,
"nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG,
"nvfp4_mlp_experts_only": mtq.NVFP4_MLP_EXPERTS_ONLY_CFG,
Collaborator:
I'm still against adding more configs here. I think in this MR we should just stick with MLP_only if we have to. People can tune the recipe themselves if they want experts only.

If you really want to add this config, let's name it experts_only for short; experts are always in the MLP.

Contributor Author:
Renamed it to nvfp4_experts_only. Let’s keep this config for now; once the YAML config system is released, we can avoid using these recipe dictionaries.

}

KV_QUANT_CFG_CHOICES = {
@@ -121,6 +122,7 @@ def auto_quantize(
"fp8_pb_wo",
"w4a8_mxfp4_fp8",
"nvfp4_mlp_only",
"nvfp4_mlp_exports_only",
]
for qformat in qformat_list
), "One or more quantization formats provided are not supported for unified checkpoint export"
modelopt/torch/quantization/config.py (20 additions, 0 deletions)
@@ -623,6 +623,25 @@
"algorithm": "max",
}

NVFP4_MLP_EXPERTS_ONLY_CFG = {
    "quant_cfg": {
        "*mlp.experts*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "pass_through_bwd": True,
        },
        "*mlp.experts*input_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "pass_through_bwd": True,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": "max",
}

Contributor:
We already have this config. See NVFP4_MLP_ONLY_CFG

Suggested change: delete the NVFP4_MLP_EXPERTS_ONLY_CFG block above.

Contributor Author (@jingyu-ml, Dec 5, 2025):
That will quantize mlp.shared_experts as well (see the pattern-matching sketch below).
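
To make the distinction concrete, here is a small sketch of how the wildcard keys in quant_cfg select quantizers by module name, assuming shell-style (fnmatch-like) matching against fully qualified quantizer names. The module paths below are hypothetical MoE examples, not taken from an actual model:

```python
# Sketch: quant_cfg keys act as shell-style wildcard patterns over quantizer names.
# Hypothetical module paths for a DeepSeek/Kimi-style MoE layer.
import fnmatch

pattern = "*mlp.experts*weight_quantizer"
candidates = [
    "model.layers.3.mlp.experts.17.down_proj.weight_quantizer",      # routed expert
    "model.layers.3.mlp.shared_experts.down_proj.weight_quantizer",  # shared expert
    "model.layers.3.self_attn.q_proj.weight_quantizer",              # attention
]
for name in candidates:
    print(f"{name}: {'match' if fnmatch.fnmatch(name, pattern) else 'no match'}")
# Only the routed-expert path matches: "mlp.shared_experts" does not contain the
# substring "mlp.experts", so shared experts are left out by this config.
```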

Collaborator:
This is great. Let's not create more cfgs.

choices: set[str] = {
"FP8_2D_BLOCKWISE_WEIGHT_ONLY_CFG",
"FP8_AFFINE_KV_CFG",
@@ -652,6 +671,7 @@
"NVFP4_MLP_WEIGHT_ONLY_CFG",
"MXFP4_MLP_WEIGHT_ONLY_CFG",
"NVFP4_MLP_ONLY_CFG",
"NVFP4_MLP_EXPERTS_ONLY_CFG",
}

BiasType = Literal["static", "dynamic"]