[PyTorch Debug] Add scale_inv_std stat and skip NVFP4 layers in LogFp8TensorStats#3044
pggPL wants to merge 10 commits into
…8TensorStats (NVIDIA#2801)
- Register scale_inv_std (plus helper variance/numel/sum buffers using Welford reduction) for all FP8 recipes and NVFP4 in add_scale_inv_stats. Population variance keeps std=0 for delayed/current scaling where scale_inv is a single scalar.
- Also wire scale_inv_min/max/std for NVFP4 (was previously only FP8 recipes).
- LogFp8TensorStats.inspect_tensor now filters bare stats on NVFP4 layers with a warning instead of raising, so dual LogFp8TensorStats + LogNvfp4TensorStats configs work with overlapping (or catch-all) layer regexes. Recipe-prefixed FP8 stats (e.g. mxfp8_mse) are preserved for what-if comparisons.
- Numerics test extended to validate scale_inv_min/max/std against torch.std(scale_inv, unbiased=False).
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
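The choice of population variance in the commit message above can be illustrated with a small sketch. It uses plain Python lists rather than tensors, and the helper name `std` is hypothetical, not a function from the PR. It only shows why a single-element scale_inv gives a standard deviation of 0 with the population convention, while the sample convention has no defined value.

```python
import math


def std(values, unbiased):
    """Standard deviation of a list; divides by n - 1 when unbiased, else by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)  # sum of squared deviations
    denom = n - 1 if unbiased else n
    # A single element has no sample variance; report NaN instead of dividing by zero.
    return math.sqrt(m2 / denom) if denom > 0 else float("nan")


# Delayed/current scaling: scale_inv is a single scalar.
single_scale_inv = [0.125]
population_std = std(single_scale_inv, unbiased=False)
sample_std = std(single_scale_inv, unbiased=True)

# Block-scaled recipes carry many scale_inv values, so the spread is non-zero.
blockwise_std = std([0.5, 1.0, 1.5, 2.0], unbiased=False)
```

With the population convention the scalar case reports a clean 0 instead of an undefined value.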
for more information, see https://pre-commit.ci
/te-ci pytorch
Greptile Summary
This PR addresses two items from issue #2801: adding
Confidence Score: 4/5
Safe to merge after the open discussions from earlier review rounds are resolved; the new Welford implementation and zero-filtering are correct.
The Welford combination fix and the real_scale_inv zero-filtering are both mathematically correct and well-tested by the new microbatch-reduction tests. The scale_inv_max compute still bypasses zero-filtering (harmless in practice since real values are positive, but inconsistent with every other scale_inv stat). Both test helpers that build the reference expected value do so from unfiltered scale_inv tensors, which matches the implementation only because the chosen shapes need no padding, making the tests brittle to future shape changes.
stats_computation.py deserves a second look for the scale_inv_max filtering inconsistency; test_log.py reference-value construction in the two new microbatch-reduction tests should be verified against zero-padded shapes.
Important Files Changed
Reviews (6): Last reviewed commit: "[PyTorch Debug] Drop redundant variance/..."
Parallel-group variance is Sigma_i n_i * (var_i + (mean_i - mean)^2) / N, so the between-group term must be added, not subtracted. Single-group buffers hide the bug (mean_i equals the global mean, so the term is 0); it surfaces when scale_inv_std is reduced across microbatches/ranks, where negative variance flows into sqrt() and yields NaN.
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
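The sign error described in this commit message can be reproduced with a minimal sketch. All names here (`population_stats`, `combine_population_variance`, the `sign` parameter) are hypothetical and exist only to contrast the correct and buggy combine; the real code operates on tensors and reduction buffers.

```python
def population_stats(values):
    """Return (n, mean, population variance) for one group."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return n, mean, var


def combine_population_variance(groups, sign=1.0):
    """Combine per-group (n, mean, var) into one population variance.

    sign=+1.0 adds the between-group term (correct);
    sign=-1.0 subtracts it, reproducing the bug described above.
    """
    total = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / total
    return sum(n * (var + sign * (m - mean) ** 2) for n, m, var in groups) / total


# Two microbatches with zero spread inside each group but different means.
group_a = [1.0, 1.0, 1.0, 1.0]
group_b = [3.0, 3.0, 3.0, 3.0]
groups = [population_stats(group_a), population_stats(group_b)]

correct = combine_population_variance(groups)            # 1.0
buggy = combine_population_variance(groups, sign=-1.0)   # -1.0, sqrt would be NaN on tensors
reference = population_stats(group_a + group_b)[2]       # 1.0, variance of the concatenation
```

A single group never exposes the problem, because its mean equals the global mean and the between-group term vanishes regardless of sign.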
def check_if_stat_is_supported(self, stat: str):
    """Returns True if stat is supported, raises ValueError otherwise."""
    bare = stat[: -len("_columnwise")] if stat.endswith("_columnwise") else stat
mse_columnwise and underflows%_columnwise should not pass this function, but with the logic as written they would.
I added them as exceptions. We should probably have stats for them eventually, but that is outside the scope of this PR.
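One way to express the restriction discussed in this thread is sketched below. It is a standalone function, not the method from the PR, and the set of stat names in `BASE_STATS` is an assumption made for illustration. The idea is that the `_columnwise` suffix is stripped only for `scale_inv_*` stats and rejected for everything else.

```python
# Assumed stat names, for illustration only; the real feature defines its own list.
BASE_STATS = {"underflows%", "mse", "scale_inv_min", "scale_inv_max", "scale_inv_std"}
COLUMNWISE_SUFFIX = "_columnwise"


def check_if_stat_is_supported(stat: str) -> bool:
    """Return True if stat is supported, raise ValueError otherwise."""
    bare = stat
    if stat.endswith(COLUMNWISE_SUFFIX):
        bare = stat[: -len(COLUMNWISE_SUFFIX)]
        # Only scale_inv_* stats have a columnwise variant.
        if not bare.startswith("scale_inv_"):
            raise ValueError(f"Stat {stat!r} has no columnwise variant")
    if bare not in BASE_STATS:
        raise ValueError(f"Stat {stat!r} is not supported")
    return True


def is_rejected(stat: str) -> bool:
    """Helper for the usage example: True when the check raises ValueError."""
    try:
        check_if_stat_is_supported(stat)
    except ValueError:
        return True
    return False
```

Under these assumptions, `scale_inv_std_columnwise` and `mse` pass, while `mse_columnwise` and `underflows%_columnwise` raise.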
…validation
compute_variance/compute_std now combine on M2 with an explicit `unbiased` flag. Previously sample variances (torch.var default, used by variance/std) were fed into a population-style combine, so reductions across microbatches/ranks were silently wrong whenever group means differed. scale_inv_* combinators now pass unbiased=False to match their population feeders.
Adds test_stats_computation_microbatch_reduction which sweeps the stats registry and checks every aux-free combinator against its own compute fn over the concatenation.
NVFP4 check_if_stat_is_supported no longer accepts mse_columnwise / underflows%_columnwise: only scale_inv_* have a columnwise variant, the others are computed from the single quantized tensor and would crash downstream.
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
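Combining on M2 (the sum of squared deviations from the mean) can be sketched as follows. This is the standard pairwise merge formula applied to plain Python lists; the function names are hypothetical and not taken from stats_computation.py. The `unbiased` flag is applied only at the very end, which is what lets one combine serve both conventions.

```python
import math


def group_summary(values):
    """Return (n, mean, M2) for one microbatch."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def combine_m2(a, b):
    """Merge two (n, mean, M2) summaries into the summary of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def variance_from_m2(summary, unbiased):
    """Divide M2 by n - 1 (sample) or n (population)."""
    n, _, m2 = summary
    return m2 / (n - 1 if unbiased else n)


microbatch_1 = [1.0, 2.0, 3.0]
microbatch_2 = [10.0, 12.0]

merged = combine_m2(group_summary(microbatch_1), group_summary(microbatch_2))
direct = group_summary(microbatch_1 + microbatch_2)
```

Checking the merged result against the statistic of the concatenation, as done here with `direct`, mirrors the kind of check the commit message describes for the new reduction test.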
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> # Conflicts: # transformer_engine/debug/features/utils/stats_computation.py
…tyle Collapse import/list/call reflows that were split by an accidental black run at the default line length (88) instead of the repo's 100; no functional change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…d reduction test Route every scale_inv stat (min/max/std/variance/numel/sum) through a shared real_scale_inv() helper that strips the zero padding MXFP8/NVFP4 quantizers add, so padding no longer deflates the mean or corrupts the numel-weighted variance reduction across microbatches/ranks. Add a parametrized test driving the aux-dependent parallel-variance combine for scale_inv_std and checking it against std over the concatenated scale_inv values. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
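The effect of zero padding on these statistics can be shown with a small sketch. The function below is a list-based stand-in named after the helper mentioned in the commit message, not its actual implementation, and it assumes that padding entries are exactly zero while real scale_inv values are positive.

```python
def real_scale_inv(scale_inv):
    """Drop zero padding entries; assumes real scale_inv values are never zero."""
    return [v for v in scale_inv if v != 0.0]


def mean(values):
    return sum(values) / len(values)


# Four real block scales followed by four padding zeros.
padded = [0.5, 0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0]
real = real_scale_inv(padded)

padded_mean = mean(padded)  # deflated by the padding
real_mean = mean(real)      # mean over actual scales only
padded_numel = len(padded)  # would over-weight this group in a numel-weighted reduction
real_numel = len(real)
```

Because the cross-microbatch variance reduction weights each group by its element count, both the deflated mean and the inflated count would distort the combined result if padding were left in.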
for more information, see https://pre-commit.ci
…unbiased convention Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…ch reduction test Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
/te-ci pytorch L1
Description