Update DeepSeek v4 precision checks #237
Conversation
- Align QKV and SWA goldens with kernel BF16 rounding and tiled accumulation
- Add fixed seeds and BF16 outlier-budget comparators for lower tolerances
📝 Walkthrough

This PR refines the numerical precision of four DeepSeek v4 decode golden reference implementations by introducing explicit BF16 rounding helpers, replacing einsum/single-call patterns with tiled matmul and RMSNorm helpers, splitting RoPE computations into half-tensor intermediates, and tightening test harness tolerances with outlier-aware BF16 comparators.

Changes: DeepSeek V4 Numerical Precision Improvements
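The "tiled matmul" and "half-tensor RoPE" changes restructure the golden math so it rounds and accumulates where the kernel does. A minimal sketch of the pattern, assuming helper names, tile size, and cos/sin layout that are illustrative rather than the PR's actual code:

```python
import torch

def tiled_matmul(a: torch.Tensor, b: torch.Tensor, tile: int = 128) -> torch.Tensor:
    # Accumulate the contraction dimension tile by tile, forming partial sums
    # in the same order a blocked device kernel would.
    m, k = a.shape
    out = torch.zeros(m, b.shape[1], dtype=torch.float32)
    for start in range(0, k, tile):
        stop = min(start + tile, k)
        out += a[:, start:stop].float() @ b[start:stop, :].float()
    return out

def rope_halves(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotate the two halves of the head dimension as explicit intermediates
    # instead of one fused expression; cos/sin are half-width and broadcast.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
```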
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Code Review
This pull request implements tiled versions of matrix multiplication and RMS normalization across several DeepSeek v4 golden reference implementations so that they accumulate in the same order as the device kernels. It also introduces a rounding-based bfloat16 conversion helper and a custom outlier-budget comparison utility for testing. The review feedback suggests refactoring these newly added helper functions into shared utility modules to avoid code duplication and improve maintainability.
```python
def _to_device_bf16(value):
    rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
    return rounded.view(torch.float32).to(torch.bfloat16)
```
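For context: adding 0x8000 to the FP32 bit pattern and clearing the low 16 bits rounds mantissa ties upward in magnitude, whereas a direct `.to(torch.bfloat16)` cast uses round-to-nearest-even, so the two paths differ exactly on halfway values. A quick check with an illustrative midpoint (not taken from the PR's test data):

```python
import torch

# 0x3F808000 is 1.00390625 in FP32: exactly halfway between BF16 1.0
# (bits 0x3F80) and the next representable BF16 value 1.0078125 (0x3F81).
x = torch.tensor([0x3F808000], dtype=torch.int32).view(torch.float32)

nearest_even = x.to(torch.bfloat16)  # ties-to-even keeps 1.0
bits = (x.view(torch.int32) + 0x8000) & -0x10000
half_up = bits.view(torch.float32).to(torch.bfloat16)  # tie rounds up

print(nearest_even.item(), half_up.item())  # 1.0 1.0078125
```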
```python
def bf16_outlier_compare(actual, expected, actual_outputs, expected_outputs, inputs, rtol, atol):
    import torch

    close = torch.isclose(actual, expected, rtol=rtol, atol=atol)
    mismatch = int((~close).sum().item())
    max_mismatch = int(actual.numel() * 0.005)
    if mismatch <= max_mismatch:
        return True, f"mismatch={mismatch}/{actual.numel()} <= {max_mismatch}"

    diff = (actual.float() - expected.float()).abs()
    max_idx = int(diff.flatten().argmax().item())
    return False, (
        f"  BF16 outlier budget exceeded: mismatch={mismatch}/{actual.numel()} "
        f"limit={max_mismatch} rtol={rtol} atol={atol}\n"
        f"  max_abs={diff.max().item():.8g} idx={max_idx} "
        f"actual={actual.flatten()[max_idx].item()} expected={expected.flatten()[max_idx].item()}"
    )
```
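The comparator passes as long as at most 0.5% of elements fall outside the rtol/atol band. A hypothetical standalone invocation (shapes and tolerances are made up; `actual_outputs`, `expected_outputs`, and `inputs` are unused by the function and passed as `None`):

```python
import torch

torch.manual_seed(20260508)  # a fixed seed, as the PR adds via --seed
actual = torch.randn(4, 128)
expected = actual.clone()
expected.view(-1)[0] += 1.0  # one injected outlier; int(512 * 0.005) = 2 allowed

ok, msg = bf16_outlier_compare(actual, expected, None, None, None, rtol=1e-2, atol=1e-2)
print(ok, msg)  # True mismatch=1/512 <= 2
```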
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
models/deepseek/v4/deepseek_v4_decode_hc_post.py (1)
130-155: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

`__main__` is missing a `--seed` argument and `torch.manual_seed` call, unlike the other files in this PR.

`build_tensor_specs()` initialises all input tensors with bare `torch.randn`/`torch.rand` calls (lines 111–119), so test runs are non-deterministic. `deepseek_v4_decode_swa.py` and `deepseek_v4_decode_qkv_proj_rope.py` both add `--seed`/`torch.manual_seed` as part of this PR; `hc_post` was apparently missed.

🛡️ Proposed fix
```diff
+import torch
 from golden import RunConfig, run_jit

 parser = argparse.ArgumentParser()
 parser.add_argument("-p", "--platform", ...)
 parser.add_argument("-d", "--device", ...)
+parser.add_argument("--seed", type=int, default=20260508)
 parser.add_argument("--runtime-profiling", ...)
 args = parser.parse_args()
+torch.manual_seed(args.seed)
+
 result = run_jit(
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py` around lines 130-155: Add a deterministic seed option and apply it before generating tensors: add a `--seed` argparse argument in the `__main__` block, parse it, and call `torch.manual_seed(args.seed)` (and `torch.cuda.manual_seed_all` if appropriate) before invoking `build_tensor_specs()` / `run_jit` so `deepseek_v4_decode_hc_post_test` and `golden_deepseek_v4_decode_hc_post` use reproducible inputs; ensure the new arg mirrors the other PR files' behavior and keeps the rest of the RunConfig/runtime logic unchanged.
🧹 Nitpick comments (2)
models/deepseek/v4/deepseek_v4_decode_hc_post.py (1)
97-99: ⚡ Quick win

Extract `_to_device_bf16` to a shared utility — it is defined identically in `deepseek_v4_decode_sparse_attn.py`. Both golden functions define the same nested helper. Any future change to the rounding logic would require parallel edits. A shared `golden_bf16_utils.py` in this directory (or a top-level `golden_utils.py`) would eliminate the duplication and give `deepseek_v4_decode_swa.py` a clean import path as well.

♻️ Suggested extraction
Create `models/deepseek/v4/golden_bf16_utils.py`:

```diff
+import torch
+
+def to_device_bf16(value: "torch.Tensor") -> "torch.Tensor":
+    """Round-half-up FP32→BF16 cast, matching hardware kernel rounding behaviour."""
+    rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
+    return rounded.view(torch.float32).to(torch.bfloat16)
```

Then in both golden functions:

```diff
+from golden_bf16_utils import to_device_bf16
 ...
-    def _to_device_bf16(value):
-        rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
-        return rounded.view(torch.float32).to(torch.bfloat16)
-
-    y = _to_device_bf16(y_fp32)
+    y = to_device_bf16(y_fp32)
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py` around lines 97-99: Extract the duplicated nested helper `_to_device_bf16` into a shared utility module (e.g., `models/deepseek/v4/golden_bf16_utils.py`) and import it from both `deepseek_v4_decode_hc_post.py` and `deepseek_v4_decode_sparse_attn.py` so the rounding logic is maintained in one place; specifically, move the body of `_to_device_bf16` (the contiguous/view/int32 add 0x8000 & -0x10000, then view float32 → bfloat16 conversion) into a function named `to_device_bf16` (or similar), update both files to call that shared function instead of defining the nested helper, and ensure the function accepts and returns a `torch.Tensor` with the same semantics as the original `_to_device_bf16`.

models/deepseek/v4/deepseek_v4_decode_swa.py (1)
413-429: ⚡ Quick win

`bf16_outlier_compare` is a byte-for-byte copy of the same function in `deepseek_v4_decode_qkv_proj_rope.py`. Move it to the shared utility module suggested above (or a standalone `golden_compare_utils.py`) and import it in both files. Also, the `import torch` on line 414 is redundant — `torch` is already imported at line 410.

♻️ Suggested extraction (in `golden_bf16_utils.py` or similar)
The extraction drops the function-local `import torch` in favour of a module-level import, per the note above:

```diff
+import torch
+
+def bf16_outlier_compare(actual, expected, actual_outputs, expected_outputs, inputs, rtol, atol):
+    close = torch.isclose(actual, expected, rtol=rtol, atol=atol)
+    mismatch = int((~close).sum().item())
+    max_mismatch = int(actual.numel() * 0.005)
+    if mismatch <= max_mismatch:
+        return True, f"mismatch={mismatch}/{actual.numel()} <= {max_mismatch}"
+    diff = (actual.float() - expected.float()).abs()
+    max_idx = int(diff.flatten().argmax().item())
+    return False, (
+        f"  BF16 outlier budget exceeded: mismatch={mismatch}/{actual.numel()} "
+        f"limit={max_mismatch} rtol={rtol} atol={atol}\n"
+        f"  max_abs={diff.max().item():.8g} idx={max_idx} "
+        f"actual={actual.flatten()[max_idx].item()} expected={expected.flatten()[max_idx].item()}"
+    )
```

Then in both files:

```diff
+from golden_bf16_utils import bf16_outlier_compare

-def bf16_outlier_compare(...):
-    ...
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/deepseek_v4_decode_swa.py` around lines 413-429: The `bf16_outlier_compare` function is duplicated; extract it into a shared utility module (e.g., `golden_bf16_utils.py`) and have both modules import and use it instead of their local copies: move the `bf16_outlier_compare` implementation (preserving its signature and behavior) into the new module, remove the redundant local definitions in `deepseek_v4_decode_swa.py` and `deepseek_v4_decode_qkv_proj_rope.py`, delete the extra `import torch` inside the function bodies (rely on the module-level import), and update both files to import `bf16_outlier_compare` from the new utility.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bc761d97-4484-4164-80c5-5a5267245fe9
📒 Files selected for processing (4)
- models/deepseek/v4/deepseek_v4_decode_hc_post.py
- models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py
- models/deepseek/v4/deepseek_v4_decode_sparse_attn.py
- models/deepseek/v4/deepseek_v4_decode_swa.py
Summary
Related Issues
None