:::{seealso}
The
[TensorRT developer guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#working-with-int8)
- explains how quantization works in more detail.
+ explains quantization in more detail.
:::

Scaling factors can be loaded into Tripy models similar to how weights are loaded.
@@ -21,8 +21,8 @@ If the model was not trained with quantization-aware training (QAT), we can use
to do **calibration** to determine scaling factors.

:::{admonition} Info
- Calibration runs a model with a small set of input data to determine the distribution
- of values for each tensor.
+ **Calibration** runs a model with a small set of input data to determine the
+ numerical distribution of each tensor.

The most important range of this distribution is called the **dynamic range**.
:::
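To make these quantities concrete, here is a small, self-contained sketch of the standard symmetric `int8` scheme (an illustration of the general idea, not the exact code used by Tripy or Model Optimizer): the scale is derived from the dynamic range, and values are divided by it, rounded, and clamped into the `int8` range.

```py
# doc: no-output
import numpy as np

def int8_scale_from_amax(amax: float) -> float:
    # The dynamic range is [-amax, amax]; mapping it onto int8 gives scale = amax / 127.
    return amax / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    # Divide by the scale, round, and clamp to the representable int8 range.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Dequantization recovers an approximation of the original values.
    return q.astype(np.float32) * scale

# Suppose calibration observed values roughly in [-4, 4], i.e. amax ~= 4.0:
x = np.array([0.1, -3.9, 2.5], dtype=np.float32)
scale = int8_scale_from_amax(4.0)
print(dequantize(quantize_int8(x, scale), scale))  # close to x, up to quantization error
```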
@@ -46,37 +46,47 @@ Let's calibrate a GPT model:

3. Calibrate for `int8` precision:

- ```py
- # doc: no-output
- from transformers import AutoTokenizer
- import modelopt.torch.quantization as mtq
- from modelopt.torch.utils.dataset_utils import create_forward_loop
-
- quant_cfg = mtq.INT8_DEFAULT_CFG
-
- # Define the forward pass
- MAX_SEQ_LEN = 512
- tokenizer = AutoTokenizer.from_pretrained(
-     "gpt2",
-     use_fast=True,
-     model_max_length=MAX_SEQ_LEN,
-     padding_side="left",
-     trust_remote_code=True,
- )
- tokenizer.pad_token = tokenizer.eos_token
-
- forward_loop = create_forward_loop(
-     model=model,
-     dataset_name="cnn_dailymail",
-     tokenizer=tokenizer,
-     device=model.device,
-     num_samples=8,
- )
-
- # Run calibration to replace linear layers with `QuantLinear`, which
- # include calibrated parameters.
- mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
- ```
+ 1. Define the forward pass:
+
+ ```py
+ # doc: no-output
+ from transformers import AutoTokenizer
+ from modelopt.torch.utils.dataset_utils import create_forward_loop
+
+ MAX_SEQ_LEN = 512
+ tokenizer = AutoTokenizer.from_pretrained(
+     "gpt2",
+     use_fast=True,
+     model_max_length=MAX_SEQ_LEN,
+     padding_side="left",
+     trust_remote_code=True,
+ )
+ tokenizer.pad_token = tokenizer.eos_token
+
+ forward_loop = create_forward_loop(
+     model=model,
+     dataset_name="cnn_dailymail",
+     tokenizer=tokenizer,
+     device=model.device,
+     num_samples=8,
+ )
+ ```
+
+ 2. Set up quantization configuration:
+
+ ```py
+ import modelopt.torch.quantization as mtq
+
+ quant_cfg = mtq.INT8_DEFAULT_CFG
+ ```
+
+ 3. Run calibration to replace linear layers with `QuantLinear`, which contain calibration information (a quick sanity check is sketched after these steps):
+
+ ```py
+ # doc: no-output
+ mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
+ ```
+
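As a quick sanity check after calibration, we can list a few of the replaced layers and their calibrated dynamic ranges. This is a sketch under an assumption: Model Optimizer's replaced modules are assumed to expose a `weight_quantizer` object with an `amax` attribute (its usual convention, but the exact attribute name is not verified here).

```py
# doc: no-output
# Print up to three replaced linear layers and their calibrated `amax` values.
count = 0
for name, module in model.named_modules():
    if hasattr(module, "weight_quantizer") and getattr(module.weight_quantizer, "amax", None) is not None:
        print(f"{name}: {type(module).__name__}, weight amax = {module.weight_quantizer.amax}")
        count += 1
        if count >= 3:
            break
```

If nothing is printed, the attribute names differ in your Model Optimizer version; inspect one of the quantized layers directly to find the equivalent fields.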

The `amax` attributes of `QuantLinear`'s quantizers specify the **dynamic ranges**:

@@ -115,8 +125,8 @@ qlinear.input_scale = tp.Tensor(input_scale)
qlinear.weight_scale = tp.Tensor(weight_scale)
```

- We can run the module just like a non-quantized `float32` module.
- The inputs and weights are quantized internally:
+ We run the module just like a non-quantized `float32` module.
+ Inputs and weights are quantized internally:

```py
dummy_input = tp.ones((1, 768), dtype=tp.float32)