
Commit 1609c89

jomitchellnv, Jonathan Mitchell, and claude authored
Adds GEMM Profiling Guide to TE (#2863)
* adds blog post

  Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* Address review comments on GEMM profiling guide

  Benchmark tool:
  - Always benchmark Dgrad separately (remove --verify-dgrad flag)
  - Pass measured Dgrad data to plot instead of 2x Fprop approximation
  - Add FP8 CurrentScaling and DelayedScaling benchmark support
  - Add FP8Block to shape mode (was missing, only in model-config mode)
  - Add --no-fp8-current and --no-fp8-delayed CLI flags

  Documentation:
  - Restructure: concise speedups.rst in features/, full tutorial in examples/
  - Add device-specific precision recipes (Hopper vs Blackwell)
  - Add Hopper (H200) benchmark results alongside Blackwell (B300)
  - Remove misleading FP8 Block vs MXFP8 comparison (different target devices)
  - Rename "How Shapes Are Derived" to appendix, promote key sections
  - Convert benchmark tool references to GitHub links
  - Refresh all benchmark numbers with FP8 Current/Delayed columns

  Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* fixes failing test

  Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>

* cleanup per comments

  Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>

* greptile

  Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>

* Address review comments on speedups.rst

  - Define autocast vs pre-quantized modes upfront before the figures
  - Remove the --pre-quantize flag reference and the standalone note
  - Replace unclear quantization-overhead jargon with plain language
  - Condense the verbose "Speedup Is Shape-Dependent" section
  - Reword "Fprop vs Dgrad comparisons" to per-operation breakdowns
  - Fix benchmark_gemm.py: skip FP8 DelayedScaling in pre-quantized mode
    (it has no pre-quantized variant and silently fell back to the autocast
    path, producing a misleading bar in the pre-quantized plots)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
  Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

* Regenerate GEMM speedup figures with DelayedScaling fix

  Re-ran the model-config benchmark on B300 (SM100) and H200 (SM90) with the
  pre-quantized DelayedScaling fix applied, and synced the numbers in
  speedups.rst:
  - B300 autocast: now includes FP8Block (1.30x); FP8Current 1.41x,
    FP8Delayed 1.61x, MXFP8 1.44x, NVFP4 2.03x
  - B300 pre-quantized: FP8Delayed bar removed, FP8Block (1.82x) added;
    NVFP4 3.55x
  - H200 autocast: FP8Current 1.57x, FP8Delayed 1.69x, FP8Block 1.41x
  - H200 pre-quantized: FP8Delayed removed; FP8Block dropped (no Hopper
    prequant support); raw FP8 1.92x

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
  Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

* Apply suggestion from @pggPL

  Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

---------

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
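The commit message mentions benchmarking Dgrad separately rather than deriving it from Fprop. As background, the sketch below lists the three GEMMs of a single linear layer and their FLOP counts. It is an illustrative sketch only, not code from `benchmark_gemm.py`: the `Gemm` class, the `linear_layer_gemms` helper, and the example sizes are all hypothetical. It assumes the usual convention `y = x @ W.T` with `x` of shape `(tokens, in_features)` and `W` of shape `(out_features, in_features)`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Gemm:
    """Hypothetical container for one GEMM problem: (M x K) @ (K x N)."""

    name: str
    m: int  # rows of the output
    n: int  # columns of the output
    k: int  # reduction dimension

    @property
    def flops(self) -> int:
        # One multiply and one add per (m, n, k) triple.
        return 2 * self.m * self.n * self.k


def linear_layer_gemms(tokens: int, in_features: int, out_features: int) -> list[Gemm]:
    """Return the forward and backward GEMMs of y = x @ W.T."""
    return [
        Gemm("fprop", tokens, out_features, in_features),  # y  = x  @ W.T
        Gemm("dgrad", tokens, in_features, out_features),  # dx = dy @ W
        Gemm("wgrad", out_features, in_features, tokens),  # dW = dy.T @ x
    ]


# Example sizes are arbitrary and chosen only for illustration.
gemms = linear_layer_gemms(tokens=8192, in_features=4096, out_features=16384)
for g in gemms:
    print(f"{g.name}: M={g.m} N={g.n} K={g.k} flops={g.flops:.3e}")
```

All three GEMMs perform the same number of floating-point operations, but their output and reduction dimensions differ. Equal FLOPs therefore do not imply equal runtime, which is one plausible reason to time each operation directly instead of scaling a single measurement.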
1 parent 920a7db commit 1609c89

13 files changed

Lines changed: 2589 additions & 1 deletion

benchmarks/gemm/benchmark_gemm.py

Lines changed: 1883 additions & 0 deletions
Large diffs are not rendered by default.
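Since the diff for this file is not rendered here, the following is only a rough sketch of the behavior the commit message describes: dropping FP8 DelayedScaling from the pre-quantized run because it has no pre-quantized variant, and reporting each recipe as a speedup over a baseline. The function names, the `"BF16"` baseline label, and the timing values are assumptions made for illustration; they are not taken from `benchmark_gemm.py`, and the timings are placeholders rather than measurements.

```python
# Recipes that, per the commit message, have no pre-quantized variant.
NO_PRE_QUANTIZED_VARIANT = frozenset({"FP8Delayed"})


def recipes_for_mode(recipes, pre_quantized):
    """Hypothetical helper: filter recipes instead of silently falling back."""
    if not pre_quantized:
        return list(recipes)
    return [r for r in recipes if r not in NO_PRE_QUANTIZED_VARIANT]


def speedup_over_baseline(times_ms, baseline):
    """Hypothetical helper: speedup = baseline time / recipe time."""
    base = times_ms[baseline]
    return {name: base / t for name, t in times_ms.items() if name != baseline}


recipes = ["FP8Current", "FP8Delayed", "FP8Block", "MXFP8", "NVFP4"]
autocast = recipes_for_mode(recipes, pre_quantized=False)
prequant = recipes_for_mode(recipes, pre_quantized=True)
print(autocast)
print(prequant)

# Placeholder timings in milliseconds, NOT measured results.
placeholder_times_ms = {"BF16": 2.0, "MXFP8": 1.0}
ratios = speedup_over_baseline(placeholder_times_ms, baseline="BF16")
print(ratios)
```

Filtering the recipe list up front keeps a recipe from appearing in a plot for a mode it cannot actually run in, which is the misleading bar the commit message refers to.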

docs/examples/gemm_profiling/gemm_profiling.rst

Lines changed: 589 additions & 0 deletions
Large diffs are not rendered by default.
8 binary files (file names not shown, contents not rendered): 91.7 KB, 92.5 KB, 87.3 KB, 83.8 KB, 94.4 KB, 91.4 KB, 86.8 KB, 80.5 KB

0 commit comments
