[OMNIML-2244] Add support for auto quantizing a model (#571)

ajrasane · web-flow · commit 1aaa77d1af66 · 2025-11-20T19:15:15.000Z
## What does this PR do?

**Type of change:** Example update

**Overview:**

- Added option to quantize a model with `mtq.auto_quantize()`

## Usage

```bash
python torch_quant_to_onnx.py \
    --timm_model_name vit_small_patch16_224 \
    --quantize_mode auto \
    --onnx_save_path models/vit_auto_quant.onnx \
    --calibration_data_size 512 \
    --batch_size 8 \
    --auto_quantization_formats NVFP4_AWQ_LITE_CFG FP8_DEFAULT_CFG INT8_DEFAULT_CFG \
    --effective_bits 4.8 \
    --num_score_steps 128
```

## Testing

Able to auto quantize ViT model

```
AutoQuantize best recipe for patch_embed.proj: NONE(effective-bits: 16.0)
AutoQuantize best recipe for blocks.0.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.0.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.0.mlp.fc1: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.0.mlp.fc2: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.1.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.1.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.1.mlp.fc1: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.1.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.2.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.2.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.2.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.2.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.3.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.3.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.3.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.3.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.4.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.4.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.4.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.4.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.5.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.5.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.5.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.5.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.6.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.6.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.6.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.6.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.7.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.7.attn.proj: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.7.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.7.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.8.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.8.attn.proj: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.8.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.8.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.9.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.9.attn.proj: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.9.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.9.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.10.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.10.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.10.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.10.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.11.attn.qkv: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.11.attn.proj: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize best recipe for blocks.11.mlp.fc1: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for blocks.11.mlp.fc2: NVFP4_AWQ_LITE_CFG(effective-bits: 4.0)
AutoQuantize best recipe for head: FP8_DEFAULT_CFG(effective-bits: 8.0)
AutoQuantize effective bits from search: 4.80
```

Accuracy comparison for the ViT model

| | Top-1 accuracy | Top-5 accuracy |
|------------------------------------------------------|----------------|----------------|
| Original model (FP32) | 85.102% | 97.526% |
| Auto Quantized (FP8 + NVFP4, 4.78 effective bits) | 84.726% | 97.434% |
| MXFP8 Quantized | 85.02% | 97.53% |
| NVFP4 Quantized | 84.558% | 97.36% |
| INT4 Quantized | 84.23% | 97.22% |

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: Yes
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
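
A note on reading the `--effective_bits` target: the sketch below assumes the reported figure behaves like a parameter-count-weighted average of the per-layer bit widths. That interpretation is an assumption and is not confirmed by this PR. `weighted_effective_bits` is a hypothetical helper, not a ModelOpt API, and the layer sizes are made up for illustration rather than taken from the ViT run.

```python
def weighted_effective_bits(layers):
    """Return the parameter-weighted average bit width.

    layers: list of (num_params, bits_per_param) pairs.
    Hypothetical helper for illustration only; not part of ModelOpt.
    """
    total_params = sum(num_params for num_params, _ in layers)
    if total_params == 0:
        raise ValueError("layers must contain at least one parameter")
    total_bits = sum(num_params * bits for num_params, bits in layers)
    return total_bits / total_params


# Made-up layer sizes (assumption), one entry per recipe kind seen in the log.
example_layers = [
    (1_000, 16.0),  # layer left unquantized (NONE)
    (3_000, 8.0),   # layer assigned an 8-bit recipe
    (6_000, 4.0),   # layer assigned a 4-bit recipe
]

print(f"{weighted_effective_bits(example_layers):.2f}")
```

Under this reading, large layers dominate the average, so a search could plausibly keep a few small, sensitive layers at higher precision while still meeting a target near 4.8.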
diff --git a/examples/onnx_ptq/torch_quant_to_onnx.py b/examples/onnx_ptq/torch_quant_to_onnx.py
@@ -19,18 +19,20 @@
 import timm
 import torch
 import torch.multiprocessing as mp
+import torch.nn.functional as F
 from datasets import load_dataset
 from download_example_onnx import export_to_onnx
 from evaluation import evaluate
 
 import modelopt.torch.quantization as mtq
 
 """
-This script is used to quantize a timm model using dynamic quantization like MXFP8 or NVFP4.
+This script is used to quantize a timm model using dynamic quantization like MXFP8 or NVFP4,
+or using auto quantization for optimal per-layer quantization.
 
 The script will:
 1. Given the model name, create a timm torch model.
-2. Quantize the torch model in MXFP8 or NVFP4 mode.
+2. Quantize the torch model in MXFP8, NVFP4, INT4_AWQ, or AUTO mode.
 3. Export the quantized torch model to ONNX format.
 """
 
@@ -55,8 +57,17 @@ def filter_func(name):
     return pattern.match(name) is not None
 
 
-def load_calibration_data(model_name, data_size, batch_size, device):
-    """Load and prepare calibration data."""
+def load_calibration_data(model_name, data_size, batch_size, device, with_labels=False):
+    """Load and prepare calibration data.
+
+    Args:
+        model_name: Name of the timm model
+        data_size: Number of samples to load
+        batch_size: Batch size for data loader
+        device: Device to load data to
+        with_labels: If True, return dict with 'image' and 'label' keys (for auto_quantize)
+                    If False, return just the images (for standard quantize)
+    """
     dataset = load_dataset("zh-plus/tiny-imagenet")
     model = timm.create_model(model_name, pretrained=True, num_classes=1000)
     data_config = timm.data.resolve_model_data_config(model)
@@ -65,9 +76,18 @@ def load_calibration_data(model_name, data_size, batch_size, device):
     images = dataset["train"][:data_size]["image"]
     calib_tensor = [transforms(img) for img in images]
     calib_tensor = [t.to(device) for t in calib_tensor]
-    return torch.utils.data.DataLoader(
-        calib_tensor, batch_size=batch_size, shuffle=True, num_workers=4
-    )
+
+    if with_labels:
+        labels = dataset["train"][:data_size]["label"]
+        labels = torch.tensor(labels, device=device)
+        calib_dataset = [{"image": img, "label": lbl} for img, lbl in zip(calib_tensor, labels)]
+        return torch.utils.data.DataLoader(
+            calib_dataset, batch_size=batch_size, shuffle=True, num_workers=4
+        )
+    else:
+        return torch.utils.data.DataLoader(
+            calib_tensor, batch_size=batch_size, shuffle=True, num_workers=4
+        )
 
 
 def quantize_model(model, config, data_loader=None):
@@ -86,16 +106,80 @@ def forward_loop(model):
     return quantized_model
 
 
-def get_model_input_shape(model_name, batch_size):
+def forward_step(model, batch):
+    """Forward step function for auto_quantize scoring."""
+    return model(batch["image"])
+
+
+def loss_func(output, batch):
+    """Loss function for auto_quantize gradient computation."""
+    return F.cross_entropy(output, batch["label"])
+
+
+def auto_quantize_model(
+    model,
+    data_loader,
+    quantization_formats,
+    effective_bits=4.8,
+    num_calib_steps=512,
+    num_score_steps=128,
+):
+    """Auto-quantize the model using optimal per-layer quantization search.
+
+    Args:
+        model: PyTorch model to quantize
+        data_loader: DataLoader with image-label dict batches
+        quantization_formats: List of quantization format config names or dicts
+        effective_bits: Target effective bits constraint
+        num_calib_steps: Number of calibration steps
+        num_score_steps: Number of scoring steps for sensitivity analysis
+
+    Returns:
+        Tuple of (quantized_model, search_state_dict)
+    """
+    constraints = {"effective_bits": effective_bits}
+
+    # Convert string format names to actual config objects
+    format_configs = []
+    for fmt in quantization_formats:
+        if isinstance(fmt, str):
+            format_configs.append(getattr(mtq, fmt))
+        else:
+            format_configs.append(fmt)
+
+    print(f"Starting auto-quantization search with {len(format_configs)} formats...")
+    print(f"Effective bits constraint: {effective_bits}")
+    print(f"Calibration steps: {num_calib_steps}, Scoring steps: {num_score_steps}")
+
+    quantized_model, search_state = mtq.auto_quantize(
+        model,
+        constraints=constraints,
+        quantization_formats=format_configs,
+        data_loader=data_loader,
+        forward_step=forward_step,
+        loss_func=loss_func,
+        num_calib_steps=num_calib_steps,
+        num_score_steps=num_score_steps,
+        verbose=True,
+    )
+
+    # Disable quantization for specified layers
+    mtq.disable_quantizer(quantized_model, filter_func)
+
+    return quantized_model, search_state
+
+
+def get_model_input_shape(model):
     """Get the input shape from timm model configuration."""
-    model = timm.create_model(model_name, pretrained=True, num_classes=1000)
     data_config = timm.data.resolve_model_data_config(model)
     input_size = data_config["input_size"]
-    return (batch_size, *tuple(input_size))  # Add batch dimension
+    return tuple(input_size)
 
 
 def main():
-    parser = argparse.ArgumentParser(description="Quantize timm models to MXFP8 or NVFP4")
+    parser = argparse.ArgumentParser(
+        description="Quantize timm models to FP8, MXFP8, INT8, NVFP4, INT4_AWQ, or use AUTO quantization"
+    )
 
     # Model hyperparameters
     parser.add_argument(
@@ -106,14 +190,14 @@ def main():
     )
     parser.add_argument(
         "--quantize_mode",
-        choices=["fp8", "mxfp8", "int8", "nvfp4", "int4_awq"],
+        choices=["fp8", "mxfp8", "int8", "nvfp4", "int4_awq", "auto"],
         default="mxfp8",
-        help="Type of quantization to apply (mxfp8, nvfp4, int4_awq)",
+        help="Type of quantization to apply. Default is MXFP8.",
     )
     parser.add_argument(
         "--onnx_save_path",
         required=True,
-        help="The path to save the ONNX model.",
+        help="The save path to save the ONNX model.",
         type=str,
     )
     parser.add_argument(
@@ -140,15 +224,43 @@ def main():
         help="Number of samples to use for evaluation. If None, use entire validation set.",
     )
 
-    args = parser.parse_args()
+    # Auto quantization specific arguments
+    parser.add_argument(
+        "--auto_quantization_formats",
+        nargs="+",
+        choices=[
+            "NVFP4_AWQ_LITE_CFG",
+            "FP8_DEFAULT_CFG",
+            "MXFP8_DEFAULT_CFG",
+            "INT8_DEFAULT_CFG",
+            "INT4_AWQ_CFG",
+        ],
+        default=["NVFP4_AWQ_LITE_CFG", "FP8_DEFAULT_CFG"],
+        help="Quantization formats to search from for auto mode (e.g., NVFP4_AWQ_LITE_CFG FP8_DEFAULT_CFG)",
+    )
+    parser.add_argument(
+        "--effective_bits",
+        type=float,
+        default=4.8,
+        help="Target effective bits for auto quantization constraint. Default is 4.8.",
+    )
+    parser.add_argument(
+        "--num_score_steps",
+        type=int,
+        default=128,
+        help="Number of scoring steps for auto quantization. Default is 128.",
+    )
 
-    # Get input shape from model config
-    input_shape = get_model_input_shape(args.timm_model_name, args.batch_size)
+    args = parser.parse_args()
 
     # Create model and move to appropriate device
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     model = timm.create_model(args.timm_model_name, pretrained=True, num_classes=1000).to(device)
 
+    # Get input shape from model config
+    input_size = get_model_input_shape(model)
+    input_shape = (args.batch_size, *input_size)
+
     # Evaluate base model if requested
     if args.evaluate:
         print("\n=== Evaluating Base Model ===")
@@ -159,21 +271,44 @@ def main():
         )
         print(f"Base Model - Top-1 Accuracy: {top1:.2f}%, Top-5 Accuracy: {top5:.2f}%")
 
-    # Select quantization config
-    config = QUANT_CONFIG_DICT[args.quantize_mode]
-    data_loader = (
-        None
-        if args.quantize_mode == "mxfp8"
-        else load_calibration_data(
+    # Quantize model based on mode
+    if args.quantize_mode == "auto":
+        # Auto quantization requires labels for loss computation
+        data_loader = load_calibration_data(
             args.timm_model_name,
             args.calibration_data_size,
-            input_shape[0],  # batch size
+            args.batch_size,
             device,
+            with_labels=True,
         )
-    )
 
-    # Quantize model
-    quantized_model = quantize_model(model, config, data_loader)
+        quantized_model, _ = auto_quantize_model(
+            model,
+            data_loader,
+            args.auto_quantization_formats,
+            args.effective_bits,
+            args.calibration_data_size,
+            args.num_score_steps,
+        )
+    else:
+        # Standard quantization - only load calibration data if needed
+        config = QUANT_CONFIG_DICT[args.quantize_mode]
+        if args.quantize_mode == "mxfp8":
+            data_loader = None
+        else:
+            data_loader = load_calibration_data(
+                args.timm_model_name,
+                args.calibration_data_size,
+                args.batch_size,
+                device,
+                with_labels=False,
+            )
+
+        quantized_model = quantize_model(model, config, data_loader)
+
+    # Print quantization summary
+    print("\nQuantization Summary:")
+    mtq.print_quant_summary(quantized_model)
 
     # Evaluate quantized model if requested
     if args.evaluate:
@@ -188,8 +323,10 @@ def main():
         )
         print(f"Quantized Model - Top-1 Accuracy: {top1:.2f}%, Top-5 Accuracy: {top5:.2f}%")
 
-    if args.quantize_mode in ["fp8", "int8"]:
-        print(f"Exporting to {args.quantize_mode} ONNX model is not supported yet.")
+    if args.quantize_mode in ["fp8", "int8", "auto"]:
+        print(
+            f"The selected quantization mode {args.quantize_mode} is not supported for ONNX export yet."
+        )
         return
 
     # Export to ONNX