
Commit 6b6960a

breaks up calibration steps
1 parent 2a24a57 commit 6b6960a

1 file changed: +46 -36 lines changed


tripy/docs/pre0_user_guides/01-quantization.md

@@ -8,7 +8,7 @@
 :::{seealso}
 The
 [TensorRT developer guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#working-with-int8)
-explains how quantization works in more detail.
+explains quantization in more detail.
 :::

 Scaling factors can be loaded into Tripy models similar to how weights are loaded.
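The scaling factors being loaded here are what map `float32` values onto the `int8` grid. As a minimal illustration of how a symmetric per-tensor scale is used (a plain NumPy sketch, not the Tripy or TensorRT API; the values are hypothetical):

```py
import numpy as np

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    # Symmetric per-tensor quantization: real value ~= scale * int8 value.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.2, 3.0], dtype=np.float32)
scale = 3.0 / 127  # hypothetical scale derived from a dynamic range of 3.0
x_hat = dequantize(quantize_int8(x, scale), scale)  # approximately recovers x
```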
@@ -21,8 +21,8 @@ If the model was not trained with quantization-aware training (QAT), we can use
 to do **calibration** to determine scaling factors.

 :::{admonition} Info
-Calibration runs a model with a small set of input data to determine the distribution
-of values for each tensor.
+**Calibration** runs a model with a small set of input data to determine the
+numerical distribution of each tensor.

 The most important range of this distribution is called the **dynamic range**.
 :::
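Conceptually, max calibration records the largest absolute value (`amax`) each tensor takes over the calibration data; that dynamic range then determines the `int8` scale. A minimal NumPy sketch under the assumption of symmetric max calibration (the batches are synthetic stand-ins for real calibration data):

```py
import numpy as np

# Synthetic stand-ins for a few batches of real calibration activations.
calibration_batches = [np.random.randn(8, 768).astype(np.float32) for _ in range(4)]

# Max calibration: the dynamic range is the largest absolute value observed.
amax = max(float(np.abs(batch).max()) for batch in calibration_batches)

# For symmetric int8 quantization, [-amax, amax] maps onto [-127, 127].
scale = amax / 127.0
print(f"dynamic range: [-{amax:.3f}, {amax:.3f}], int8 scale: {scale:.5f}")
```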
@@ -46,37 +46,47 @@ Let's calibrate a GPT model:

 3. Calibrate for `int8` precision:

-    ```py
-    # doc: no-output
-    from transformers import AutoTokenizer
-    import modelopt.torch.quantization as mtq
-    from modelopt.torch.utils.dataset_utils import create_forward_loop
-
-    quant_cfg = mtq.INT8_DEFAULT_CFG
-
-    # Define the forward pass
-    MAX_SEQ_LEN = 512
-    tokenizer = AutoTokenizer.from_pretrained(
-        "gpt2",
-        use_fast=True,
-        model_max_length=MAX_SEQ_LEN,
-        padding_side="left",
-        trust_remote_code=True,
-    )
-    tokenizer.pad_token = tokenizer.eos_token
-
-    forward_loop = create_forward_loop(
-        model=model,
-        dataset_name="cnn_dailymail",
-        tokenizer=tokenizer,
-        device=model.device,
-        num_samples=8,
-    )
-
-    # Run calibration to replace linear layers with `QuantLinear`, which
-    # include calibrated parameters.
-    mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
-    ```
+    1. Define the forward pass:
+
+        ```py
+        # doc: no-output
+        from transformers import AutoTokenizer
+        from modelopt.torch.utils.dataset_utils import create_forward_loop
+
+        MAX_SEQ_LEN = 512
+        tokenizer = AutoTokenizer.from_pretrained(
+            "gpt2",
+            use_fast=True,
+            model_max_length=MAX_SEQ_LEN,
+            padding_side="left",
+            trust_remote_code=True,
+        )
+        tokenizer.pad_token = tokenizer.eos_token
+
+        forward_loop = create_forward_loop(
+            model=model,
+            dataset_name="cnn_dailymail",
+            tokenizer=tokenizer,
+            device=model.device,
+            num_samples=8,
+        )
+        ```
+
+    2. Set up quantization configuration:
+
+        ```py
+        import modelopt.torch.quantization as mtq
+
+        quant_cfg = mtq.INT8_DEFAULT_CFG
+        ```
+
+    3. Run calibration to replace linear layers with `QuantLinear`, which contain calibration information:
+
+        ```py
+        # doc: no-output
+        mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
+        ```
+

 The `amax` attribute(s) of `QuantLinear`'s quantizers specify **dynamic range**(s):

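As a rough sketch of how those calibrated ranges can be inspected after `mtq.quantize` (this assumes the `model` from the earlier steps; the `input_quantizer`/`weight_quantizer` attribute names are assumptions about ModelOpt's quantized layers, and the module paths depend on the model):

```py
# Walk the calibrated model and print each quantized layer's dynamic ranges.
for name, module in model.named_modules():
    if hasattr(module, "input_quantizer") and hasattr(module, "weight_quantizer"):
        print(name, module.input_quantizer.amax, module.weight_quantizer.amax)
```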
@@ -115,8 +125,8 @@ qlinear.input_scale = tp.Tensor(input_scale)
 qlinear.weight_scale = tp.Tensor(weight_scale)
 ```

-We can run the module just like a non-quantized `float32` module.
-The inputs and weights are quantized internally:
+We run the module just like a non-quantized `float32` module.
+Inputs and weights are quantized internally:

 ```py
 dummy_input = tp.ones((1, 768), dtype=tp.float32)
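A hypothetical usage sketch of the pattern this hunk leads into (it assumes the `tp` import and the `qlinear` module set up earlier in the guide):

```py
# Call the quantized module like any other Tripy module; quantization of
# inputs and weights happens internally.
dummy_input = tp.ones((1, 768), dtype=tp.float32)
output = qlinear(dummy_input)
print(output)
```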
