@@ -5,6 +5,7 @@ author: "AMD and Embedded LLM"
 image: /assets/figures/ptpc/PTPC-tumbnail.png
 thumbnail-img: /assets/figures/ptpc/PTPC-tumbnail.png
 share-img: /assets/figures/ptpc/PTPC-tumbnail.png
+math: true
 ---
 
 **TL;DR**: vLLM on AMD ROCm now has better FP8 performance!
@@ -57,15 +58,15 @@ This insight led to a dual-granularity approach:
 The illustration shows two quantization approaches:
 
 **Tensor Dimensions (Both Methods):**
-- **X**: Input activation tensor (T×Ci)
-- **W**: Weight tensor (Ci×Co)
-- **T**: Token sequence length
-- **Ci/Co**: Input/output channels
-- **\***: Matrix multiplication
+- **$X$**: Input activation tensor ($T \times C_i$)
+- **$W$**: Weight tensor ($C_i \times C_o$)
+- **$T$**: Token sequence length
+- **$C_i/C_o$**: Input/output channels
+- **$*$**: Matrix multiplication
 
 **Scaling Factors:**
-- **Top (Per-Tensor)**: Single scalars ΔX[1] and ΔW[1] for entire tensors
-- **Bottom (PTPC)**: Vector ΔX[T×1] with one scale per token and ΔW[1×Co] with one scale per input channel
+- **Top (Per-Tensor)**: Single scalars $\Delta_X[1]$ and $\Delta_W[1]$ for entire tensors
+- **Bottom (PTPC)**: Vector $\Delta_X[T \times 1]$ with one scale per token and $\Delta_W[1 \times C_o]$ with one scale per output channel
 
 This granular scaling approach allows PTPC-FP8 to achieve accuracy close to BF16 while maintaining the speed and memory benefits of 8-bit computation.
 
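To make the difference between the two granularities concrete, here is a minimal PyTorch sketch of how the scales could be computed and applied. It is an illustration only, not vLLM's actual ROCm kernels: the helper names are ours, `FP8_MAX` assumes the E4M3 format, and the `torch.float8_e4m3fn` cast needs a reasonably recent PyTorch.

```python
import torch

FP8_MAX = 448.0  # largest finite value of the FP8 E4M3 format (assumption for this sketch)

def per_tensor_scales(X, W):
    """Top panel: one scalar scale for the whole activation tensor and one for the weights."""
    return X.abs().max() / FP8_MAX, W.abs().max() / FP8_MAX

def ptpc_scales(X, W):
    """Bottom panel: one scale per token (row of X) and one per output channel (column of W)."""
    dX = X.abs().amax(dim=1, keepdim=True) / FP8_MAX  # shape [T, 1]
    dW = W.abs().amax(dim=0, keepdim=True) / FP8_MAX  # shape [1, Co]
    return dX, dW

T, Ci, Co = 4, 8, 16
X, W = torch.randn(T, Ci), torch.randn(Ci, Co)

dX_t, dW_t = per_tensor_scales(X, W)  # scalar scales for the per-tensor baseline
dX, dW = ptpc_scales(X, W)            # [T, 1] and [1, Co] scales for PTPC

Xq = (X / dX).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
Wq = (W / dW).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

# Emulate the FP8 GEMM in float, then rescale: the [T, 1] x [1, Co] outer product
# of scales restores each (token, output-channel) element of the result.
Y = (Xq.float() @ Wq.float()) * (dX * dW)
print((Y - X @ W).abs().max())  # small quantization error vs. the unquantized reference
```

The essential difference is only the shape of the scale tensors: per-tensor quantization stretches every element by the same factor, while PTPC lets each token and each output channel use the full FP8 range, which is what keeps accuracy close to BF16.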