- `--tokenizer`: (Optional) Tokenizer name; defaults to the model name.
- `--prompt`: Input prompt for generation.
- `--image_path`: (Optional) Path to an input image file for VLM models. If not provided, a sample image is used.
- `--model_precision`: Precision of model weights and buffers (`FP16`, `BF16`, `FP32`).
- `--quant_format`: (Optional) Quantization format (`fp8`, `nvfp4`) to apply.
- `--num_tokens`: Number of output tokens to generate.
- `--cache`: KV cache type (`static_v1`, `static_v2`, or empty for no KV caching).
- `--benchmark`: Enable benchmarking mode.
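
For concreteness, a sketch of a typical invocation combining these flags is shown below. The script name (`run_vlm.py`), model id, and image path are assumptions for illustration and may differ in this repository:

```bash
# Sketch only: the script name, model id, and image path are assumptions.
python run_vlm.py \
  --model Qwen/Qwen2.5-VL-3B-Instruct \
  --prompt "Describe this image." \
  --image_path ./sample.jpg \
  --model_precision FP16 \
  --quant_format fp8 \
  --num_tokens 128 \
  --cache static_v1 \
  --benchmark
```

Omitting `--cache` disables KV caching, and omitting `--quant_format` runs the model at the precision given by `--model_precision`.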
Torch-TensorRT supports quantization to reduce model memory footprint and improve performance.
#### Using Pre-quantized Models
To use pre-quantized models from HuggingFace:
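
The sketch below shows one possible invocation; the script name (`run_llm.py`) and model id are assumptions for illustration, not taken from this README:

```bash
# Sketch only: the script name and model id are assumptions.
# No --quant_format flag is passed; the checkpoint's quantization
# config is detected automatically, as described below.
python run_llm.py \
  --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  --prompt "What is parallel programming?" \
  --num_tokens 64
```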
If a model contains quantization configuration (detected automatically), the model's linear layers are converted to TensorRT quantized versions using the specified quantization algorithm (e.g., FP8, NVFP4). The quantization algorithm type is displayed during conversion.
**Note:** The `--quant_format` option raises an error when used with pre-quantized models, since quantization cannot be applied to a model that is already quantized.