Commit 1609c89
Adds GEMM Profiling Guide to TE (#2863)
* adds blog post
Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
* Address review comments on GEMM profiling guide
Benchmark tool:
- Always benchmark Dgrad separately (remove --verify-dgrad flag)
- Pass measured Dgrad data to plot instead of 2x Fprop approximation
- Add FP8 CurrentScaling and DelayedScaling benchmark support
- Add FP8Block to shape mode (was missing, only in model-config mode)
- Add --no-fp8-current and --no-fp8-delayed CLI flags
Documentation:
- Restructure: concise speedups.rst in features/, full tutorial in examples/
- Add device-specific precision recipes (Hopper vs Blackwell)
- Add Hopper (H200) benchmark results alongside Blackwell (B300)
- Remove misleading FP8 Block vs MXFP8 comparison (different target devices)
- Rename "How Shapes Are Derived" to appendix, promote key sections
- Convert benchmark tool references to GitHub links
- Refresh all benchmark numbers with FP8 Current/Delayed columns
Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
* fixes failing test
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>
* cleanup per comments
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
* greptile
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
* Address review comments on speedups.rst
- Define autocast vs pre-quantized modes upfront before the figures
- Remove the --pre-quantize flag reference and the standalone note
- Replace unclear quantization-overhead jargon with plain language
- Condense the verbose "Speedup Is Shape-Dependent" section
- Reword "Fprop vs Dgrad comparisons" to per-operation breakdowns
- Fix benchmark_gemm.py: skip FP8 DelayedScaling in pre-quantized mode
(it has no pre-quantized variant and silently fell back to the
autocast path, producing a misleading bar in the pre-quantized plots)
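A minimal sketch of the skip logic described in the last bullet, written as an illustration rather than the actual code in benchmark_gemm.py. The function name `select_recipes`, the mode string, and the support table are all hypothetical. The idea is to drop recipes that have no pre-quantized variant explicitly, instead of letting them fall back to the autocast path:

```python
# Illustrative only: names and the support table are assumptions,
# not the real benchmark_gemm.py implementation.

def select_recipes(recipes, mode, has_prequant):
    """Return (kept, skipped) recipe names for the given benchmark mode.

    In pre-quantized mode, any recipe without a pre-quantized variant is
    skipped outright so it cannot silently run through the autocast path.
    """
    if mode != "pre-quantized":
        # Autocast mode: every recipe is benchmarked.
        return list(recipes), []
    kept = [r for r in recipes if has_prequant.get(r, False)]
    skipped = [r for r in recipes if not has_prequant.get(r, False)]
    return kept, skipped


# Hypothetical support table; DelayedScaling has no pre-quantized variant.
has_prequant = {"FP8Delayed": False, "FP8Block": True, "NVFP4": True}
recipes = ["FP8Delayed", "FP8Block", "NVFP4"]

kept, skipped = select_recipes(recipes, "pre-quantized", has_prequant)
print("benchmarking:", kept)
print("skipped (no pre-quantized variant):", skipped)
```

Reporting the skipped recipes, rather than dropping them quietly, keeps the plot legend consistent with what was actually measured.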
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
* Regenerate GEMM speedup figures with DelayedScaling fix
Re-ran the model-config benchmark on B300 (SM100) and H200 (SM90) with the
pre-quantized DelayedScaling fix applied, and synced the numbers in speedups.rst:
- B300 autocast: now includes FP8Block (1.30x); FP8Current 1.41x, FP8Delayed
1.61x, MXFP8 1.44x, NVFP4 2.03x
- B300 pre-quantized: FP8Delayed bar removed, FP8Block (1.82x) added; NVFP4 3.55x
- H200 autocast: FP8Current 1.57x, FP8Delayed 1.69x, FP8Block 1.41x
- H200 pre-quantized: FP8Delayed removed; FP8Block dropped (no Hopper prequant
support); raw FP8 1.92x
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
* Apply suggestion from @pggPL
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
---------
Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@dl325g11-1979.ipp2a2.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@dl325g11-0771.ipp4a1.colossus.nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
1 parent 920a7db commit 1609c89
13 files changed
Lines changed: 2589 additions & 1 deletion
File tree
- benchmarks/gemm
- docs
- examples/gemm_profiling
- img
- features/low_precision_training
- gemm_profiling/img
Large diffs are not rendered by default.