:::{seealso}
The
[TensorRT developer guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#working-with-int8)
- explains how quantization works in more detail.
+ explains quantization in more detail.
:::

Scaling factors can be loaded into Tripy models similar to how weights are loaded.
@@ -21,8 +21,8 @@ If the model was not trained with quantization-aware training (QAT), we can use
to do **calibration** to determine scaling factors.

:::{admonition} Info
- Calibration runs a model with a small set of input data to determine the distribution
- of values for each tensor.
+ **Calibration** runs a model with a small set of input data to determine the
+ numerical distribution of each tensor.

The most important range of this distribution is called the **dynamic range**.
:::
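To make these quantities concrete, here is a small, self-contained sketch of the standard symmetric `int8` scheme (an illustration of the general idea, not the exact code used by Tripy or Model Optimizer): the scale is derived from the dynamic range, and values are divided by it, rounded, and clamped into the `int8` range.

```py
# doc: no-output
import numpy as np

def int8_scale_from_amax(amax: float) -> float:
    # The dynamic range is [-amax, amax]; mapping it onto int8 gives scale = amax / 127.
    return amax / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    # Divide by the scale, round, and clamp to the representable int8 range.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Dequantization recovers an approximation of the original values.
    return q.astype(np.float32) * scale

# Suppose calibration observed values roughly in [-4, 4], i.e. amax ~= 4.0:
x = np.array([0.1, -3.9, 2.5], dtype=np.float32)
scale = int8_scale_from_amax(4.0)
print(dequantize(quantize_int8(x, scale), scale))  # close to x, up to quantization error
```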
@@ -46,37 +46,47 @@ Let's calibrate a GPT model:

3. Calibrate for `int8` precision:

- ```py
- # doc: no-output
- from transformers import AutoTokenizer
- import modelopt.torch.quantization as mtq
- from modelopt.torch.utils.dataset_utils import create_forward_loop
-
- quant_cfg = mtq.INT8_DEFAULT_CFG
-
- # Define the forward pass
- MAX_SEQ_LEN = 512
- tokenizer = AutoTokenizer.from_pretrained(
-     "gpt2",
-     use_fast=True,
-     model_max_length=MAX_SEQ_LEN,
-     padding_side="left",
-     trust_remote_code=True,
- )
- tokenizer.pad_token = tokenizer.eos_token
-
- forward_loop = create_forward_loop(
-     model=model,
-     dataset_name="cnn_dailymail",
-     tokenizer=tokenizer,
-     device=model.device,
-     num_samples=8,
- )
-
- # Run calibration to replace linear layers with `QuantLinear`, which
- # include calibrated parameters.
- mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
- ```
+ 1. Define the forward pass:
+
+ ```py
+ # doc: no-output
+ from transformers import AutoTokenizer
+ from modelopt.torch.utils.dataset_utils import create_forward_loop
+
+ MAX_SEQ_LEN = 512
+ tokenizer = AutoTokenizer.from_pretrained(
+     "gpt2",
+     use_fast=True,
+     model_max_length=MAX_SEQ_LEN,
+     padding_side="left",
+     trust_remote_code=True,
+ )
+ tokenizer.pad_token = tokenizer.eos_token
+
+ forward_loop = create_forward_loop(
+     model=model,
+     dataset_name="cnn_dailymail",
+     tokenizer=tokenizer,
+     device=model.device,
+     num_samples=8,
+ )
+ ```
+
+ 2. Set up quantization configuration:
+
+ ```py
+ import modelopt.torch.quantization as mtq
+
+ quant_cfg = mtq.INT8_DEFAULT_CFG
+ ```
+
+ 3. Run calibration to replace linear layers with `QuantLinear`, which contain calibration information (a quick sanity check is sketched after these steps):
+
+ ```py
+ # doc: no-output
+ mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
+ ```
+
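As a quick sanity check after calibration, we can list a few of the replaced layers and their calibrated dynamic ranges. This is a sketch under an assumption: Model Optimizer's replaced modules are assumed to expose a `weight_quantizer` object with an `amax` attribute (its usual convention, but the exact attribute name is not verified here).

```py
# doc: no-output
# Print up to three replaced linear layers and their calibrated `amax` values.
count = 0
for name, module in model.named_modules():
    if hasattr(module, "weight_quantizer") and getattr(module.weight_quantizer, "amax", None) is not None:
        print(f"{name}: {type(module).__name__}, weight amax = {module.weight_quantizer.amax}")
        count += 1
        if count >= 3:
            break
```

If nothing is printed, the attribute names differ in your Model Optimizer version; inspect one of the quantized layers directly to find the equivalent fields.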

The `amax` attributes of `QuantLinear`'s quantizers specify the **dynamic ranges**:

@@ -115,8 +125,8 @@ qlinear.input_scale = tp.Tensor(input_scale)
qlinear.weight_scale = tp.Tensor(weight_scale)
```

- We can run the module just like a non-quantized `float32` module.
- The inputs and weights are quantized internally:
+ We run the module just like a non-quantized `float32` module.
+ Inputs and weights are quantized internally:

```py
dummy_input = tp.ones((1, 768), dtype=tp.float32)