
Update DeepSeek v4 precision checks#237

Open
high-cloud wants to merge 1 commit into hw-native-sys:main from high-cloud:fix/deepseek-v4-precision-tol

Conversation

@high-cloud
Contributor

Summary

  • Tighten QKV and SWA BF16 precision checks with fixed seeds and outlier-budget comparators
  • Align DeepSeek v4 golden references with kernel BF16 rounding and tiled accumulation

Related Issues

None

- Align QKV and SWA goldens with kernel BF16 rounding and tiled accumulation

- Add fixed seeds and BF16 outlier-budget comparators for lower tolerances
@coderabbitai

coderabbitai Bot commented May 8, 2026

📝 Walkthrough

This PR refines the numerical precision of four DeepSeek v4 decode golden reference implementations by introducing explicit BF16 rounding helpers, replacing einsum/single-call patterns with tiled matmul and RMSNorm helpers, splitting RoPE computations into half-tensor intermediates, and tightening test harness tolerances with outlier-aware BF16 comparators.
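The split-half RoPE pattern mentioned above can be illustrated in isolation. This is a hedged sketch, not code from the PR: `rope_split_half` and its arguments are hypothetical names, and it applies the standard rotate-by-pairs formulation to a single vector while keeping the two halves as separate intermediates.

```python
import math

def rope_split_half(x, pos, theta=10000.0):
    """Apply RoPE to one vector by splitting it into halves (x1, x2)
    and rotating each (x1[i], x2[i]) pair by a position-dependent angle.
    Keeping the halves as explicit intermediates pins the operation order."""
    d = len(x)
    half = d // 2
    x1, x2 = x[:half], x[half:]
    out1, out2 = [], []
    for i in range(half):
        freq = pos / (theta ** (2 * i / d))  # per-pair rotation angle
        c, s = math.cos(freq), math.sin(freq)
        out1.append(x1[i] * c - x2[i] * s)
        out2.append(x1[i] * s + x2[i] * c)
    return out1 + out2

# At position 0 every angle is zero, so the rotation is the identity.
print(rope_split_half([1.0, 2.0, 3.0, 4.0], 0))  # [1.0, 2.0, 3.0, 4.0]
```

Because each pair undergoes a pure rotation, the vector norm is preserved at any position, which makes the helper easy to sanity-check against a fused implementation.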

Changes

DeepSeek V4 Numerical Precision Improvements

  • BF16 Rounding Helpers (models/deepseek/v4/deepseek_v4_decode_hc_post.py, models/deepseek/v4/deepseek_v4_decode_sparse_attn.py): Local _to_device_bf16 helpers perform explicit int32/int16 rounding before bfloat16 conversion, replacing direct PyTorch casting.
  • Tiled Computation Helpers (models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py): New tiled variants (matmul_wqa_tiled, rms_norm_q_tiled, matmul_wkv_tiled, rms_norm_kv_tiled) chunk along the Q_LORA and KV dimensions.
  • Golden Reference Computations (models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py, models/deepseek/v4/deepseek_v4_decode_sparse_attn.py): qr_out and kv_full switch to tiled matmul+norm; sparse_attn inverse-RoPE uses split-half intermediates and explicit batched projection loops for wo_a and wo_b.
  • Test Fixture Initialization (models/deepseek/v4/deepseek_v4_decode_swa.py): Weight initializers for wq_a, wq_b, and wkv apply a (randn(...) - 0.5) offset before scaling/quantization.
  • Test Harness & Comparators (models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py, models/deepseek/v4/deepseek_v4_decode_swa.py): New bf16_outlier_compare enforces a 0.5% mismatch budget; tolerances tightened (rtol/atol reduced from 5e-3/6e-3 to 2e-3/3e-3); deterministic seeding and custom comparators wired for q, kv, and x_out.
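The tiled pattern those helpers follow (chunking the reduction dimension so partial products accumulate tile by tile, in the same order the kernel uses) can be sketched in plain Python. `matmul_tiled` is a hypothetical stand-in for illustration, not the PR's actual helper:

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), accumulating the k dimension
    in chunks of `tile` so the summation order matches a tiled kernel."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, tile):          # iterate over K tiles
        k1 = min(k0 + tile, k)
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for kk in range(k0, k1):  # partial product for this tile
                    acc += a[i][kk] * b[kk][j]
                out[i][j] += acc          # tile-ordered accumulation
    return out
```

In exact arithmetic the tile size is irrelevant, but in BF16/FP32 mixed precision the accumulation order changes the rounding, which is why a golden reference that mimics the kernel's tiling can hold tighter tolerances.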

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#230: Touches the same DeepSeek v4 files and makes directly related code-level changes to golden references, BF16 rounding, and tiled implementations.
  • hw-native-sys/pypto-lib#225: Modifies the DeepSeek v4 sparse-attention golden implementations with related inverse RoPE, BF16 rounding, and explicit batched projection accumulation.
  • hw-native-sys/pypto-lib#205: Modifies the same DeepSeek v4 QKV projection and RoPE golden/reference implementation with overlapping q/kv/qr computations and test harness changes.

Poem

🐰 With rounding precise and helpers so tiled,
The golden references now are refined,
BF16 outliers budgeted and filed,
Tolerance tight, numerics aligned—
DeepSeek v4 decoding, now better designed!

🚥 Pre-merge checks | ✅ 5 passed
  • Title check: ✅ Passed. The title accurately summarizes the main objective of the PR (tightening precision checks for DeepSeek v4), which aligns with all four file changes.
  • Description check: ✅ Passed. The description clearly relates to the changeset, explaining both the precision check improvements and the alignment of golden references with BF16 rounding and tiled accumulation.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.





@gemini-code-assist Bot left a comment


Code Review

This pull request implements tiled versions of matrix multiplication and RMS normalization across several DeepSeek v4 model components to optimize processing. It also introduces a specific rounding-based bfloat16 conversion helper and a custom outlier comparison utility for testing. The review feedback suggests refactoring these newly added helper functions into shared utility modules to avoid code duplication and improve maintainability.

Comment on lines +97 to +99

```python
def _to_device_bf16(value):
    rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
    return rounded.view(torch.float32).to(torch.bfloat16)
```


Severity: medium

The _to_device_bf16 helper function is duplicated in models/deepseek/v4/deepseek_v4_decode_sparse_attn.py. Consider refactoring this into a shared utility module to adhere to DRY principles and ensure consistent rounding behavior across the project.
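For readers unfamiliar with the bit trick in the quoted helper: on a two's-complement int32 view, the mask -0x10000 equals 0xFFFF0000, so the helper adds half a BF16 ULP and truncates the low 16 mantissa bits (round-half-up, not round-to-nearest-even). A torch-free sketch of the same bit manipulation using struct (function names hypothetical, and it ignores NaN/sign edge cases the real helper would inherit from the int32 view):

```python
import struct

def round_to_bf16_bits(x: float) -> int:
    """Float32 bit pattern of x after round-half-up truncation to BF16 precision."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return (bits + 0x8000) & 0xFFFF0000  # add half a ULP, keep top 16 bits

def as_float(bits: int) -> float:
    (f,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))
    return f

# 1 + 2**-8 sits exactly halfway between two BF16 values; half-up rounds it up.
print(hex(round_to_bf16_bits(1.00390625)))  # 0x3f810000 -> 1.0078125
```

A direct `to(torch.bfloat16)` cast uses round-to-nearest-even, so ties can land differently; matching the kernel's rounding mode exactly is what lets the tolerances shrink.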

Comment on lines +508 to +524

```python
def bf16_outlier_compare(actual, expected, actual_outputs, expected_outputs, inputs, rtol, atol):
    import torch

    close = torch.isclose(actual, expected, rtol=rtol, atol=atol)
    mismatch = int((~close).sum().item())
    max_mismatch = int(actual.numel() * 0.005)
    if mismatch <= max_mismatch:
        return True, f"mismatch={mismatch}/{actual.numel()} <= {max_mismatch}"

    diff = (actual.float() - expected.float()).abs()
    max_idx = int(diff.flatten().argmax().item())
    return False, (
        f"    BF16 outlier budget exceeded: mismatch={mismatch}/{actual.numel()} "
        f"limit={max_mismatch} rtol={rtol} atol={atol}\n"
        f"    max_abs={diff.max().item():.8g} idx={max_idx} "
        f"actual={actual.flatten()[max_idx].item()} expected={expected.flatten()[max_idx].item()}"
    )
```


Severity: medium

The bf16_outlier_compare function is duplicated in models/deepseek/v4/deepseek_v4_decode_swa.py. To improve maintainability, this custom comparator should be moved to a shared validation utility (e.g., in golden/validation.py).
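The duplicated comparator's budget logic is easy to demonstrate. This hypothetical pure-Python analogue mirrors torch.isclose's criterion, |actual - expected| <= atol + rtol * |expected|, with the same 0.5% mismatch budget:

```python
def outlier_budget_ok(actual, expected, rtol=2e-3, atol=3e-3, budget=0.005):
    """Pass if at most a `budget` fraction of elements violate the
    isclose-style criterion |a - e| <= atol + rtol * |e|."""
    mismatch = sum(
        1 for a, e in zip(actual, expected)
        if abs(a - e) > atol + rtol * abs(e)
    )
    return mismatch <= int(len(actual) * budget)

expected = [1.0] * 1000
print(outlier_budget_ok([1.0] * 996 + [9.0] * 4, expected))  # True:  4 outliers <= budget of 5
print(outlier_budget_ok([1.0] * 994 + [9.0] * 6, expected))  # False: 6 outliers > budget of 5
```

An outlier budget like this accepts a handful of BF16 tie-breaking differences while still catching systematic divergence, which is what allows the rtol/atol pair to drop from 5e-3/6e-3 to 2e-3/3e-3.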


@coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
models/deepseek/v4/deepseek_v4_decode_hc_post.py (1)

130-155: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

__main__ is missing a --seed argument and torch.manual_seed call, unlike the other files in this PR.

build_tensor_specs() initialises all input tensors with bare torch.randn / torch.rand calls (lines 111–119), so test runs are non-deterministic. deepseek_v4_decode_swa.py and deepseek_v4_decode_qkv_proj_rope.py both add --seed / torch.manual_seed as part of this PR; hc_post was apparently missed.

🛡️ Proposed fix

```diff
+    import torch
     from golden import RunConfig, run_jit

     parser = argparse.ArgumentParser()
     parser.add_argument("-p", "--platform", ...)
     parser.add_argument("-d", "--device", ...)
+    parser.add_argument("--seed", type=int, default=20260508)
     parser.add_argument("--runtime-profiling", ...)
     args = parser.parse_args()

+    torch.manual_seed(args.seed)
+
     result = run_jit(
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py` around lines 130 - 155, Add
a deterministic seed option and apply it before generating tensors: add a
"--seed" argparse argument in the __main__ block, parse it, and call
torch.manual_seed(args.seed) (and torch.cuda.manual_seed_all if appropriate)
before invoking build_tensor_specs() / run_jit so
deepseek_v4_decode_hc_post_test and golden_deepseek_v4_decode_hc_post use
reproducible inputs; ensure the new arg mirrors the other PR files' behavior and
keeps the rest of the RunConfig/runtime logic unchanged.
🧹 Nitpick comments (2)
models/deepseek/v4/deepseek_v4_decode_hc_post.py (1)

97-99: ⚡ Quick win

Extract _to_device_bf16 to a shared utility — it is defined identically in deepseek_v4_decode_sparse_attn.py.

Both golden functions define the same nested helper. Any future change to the rounding logic would require parallel edits. A shared golden_bf16_utils.py in this directory (or a top-level golden_utils.py) would eliminate the duplication and give deepseek_v4_decode_swa.py a clean import path as well.

♻️ Suggested extraction

Create models/deepseek/v4/golden_bf16_utils.py:

```diff
+import torch
+
+def to_device_bf16(value: "torch.Tensor") -> "torch.Tensor":
+    """Round-half-up FP32→BF16 cast, matching hardware kernel rounding behaviour."""
+    rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
+    return rounded.view(torch.float32).to(torch.bfloat16)
```

Then in both golden functions:

```diff
+from golden_bf16_utils import to_device_bf16
 ...
-    def _to_device_bf16(value):
-        rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
-        return rounded.view(torch.float32).to(torch.bfloat16)
-
-    y = _to_device_bf16(y_fp32)
+    y = to_device_bf16(y_fp32)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py` around lines 97 - 99,
Extract the duplicated nested helper _to_device_bf16 into a shared utility
module (e.g., models/deepseek/v4/golden_bf16_utils.py) and import it from both
deepseek_v4_decode_hc_post.py and deepseek_v4_decode_sparse_attn.py so the
rounding logic is maintained in one place; specifically, move the body of
_to_device_bf16 (the contiguous/view/int32 add 0x8000 & -0x10000 then view
float32 → bfloat16 conversion) into a function named to_device_bf16 (or
similarly), update both files to call that shared function instead of defining
the nested helper, and ensure the function accepts and returns a torch.Tensor
with the same semantics as the original _to_device_bf16.
models/deepseek/v4/deepseek_v4_decode_swa.py (1)

413-429: ⚡ Quick win

bf16_outlier_compare is a byte-for-byte copy of the same function in deepseek_v4_decode_qkv_proj_rope.py.

Move it to the shared utility module suggested above (or a standalone golden_compare_utils.py) and import it in both files. Also, the import torch on line 414 is redundant — torch is already imported at line 410.

♻️ Suggested extraction (in golden_bf16_utils.py or similar)

```diff
+def bf16_outlier_compare(actual, expected, actual_outputs, expected_outputs, inputs, rtol, atol):
+    import torch
+    close = torch.isclose(actual, expected, rtol=rtol, atol=atol)
+    mismatch = int((~close).sum().item())
+    max_mismatch = int(actual.numel() * 0.005)
+    if mismatch <= max_mismatch:
+        return True, f"mismatch={mismatch}/{actual.numel()} <= {max_mismatch}"
+    diff = (actual.float() - expected.float()).abs()
+    max_idx = int(diff.flatten().argmax().item())
+    return False, (
+        f"    BF16 outlier budget exceeded: mismatch={mismatch}/{actual.numel()} "
+        f"limit={max_mismatch} rtol={rtol} atol={atol}\n"
+        f"    max_abs={diff.max().item():.8g} idx={max_idx} "
+        f"actual={actual.flatten()[max_idx].item()} expected={expected.flatten()[max_idx].item()}"
+    )
```

Then in both files:

```diff
+from golden_bf16_utils import bf16_outlier_compare
-    def bf16_outlier_compare(...):
-        ...
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/deepseek_v4_decode_swa.py` around lines 413 - 429, The
bf16_outlier_compare function is duplicated; extract this function into a shared
utility module (e.g., golden_bf16_utils.py) and have both modules import and use
it instead of their local copies: move the bf16_outlier_compare implementation
(preserving its signature and behavior) into the new module, remove the
redundant local definitions in deepseek_v4_decode_swa.py and
deepseek_v4_decode_qkv_proj_rope.py, delete the extra import torch inside the
function bodies (rely on the module-level import), and update both files to
import bf16_outlier_compare from the new utility.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py`:
- Around line 130-155: Add a deterministic seed option and apply it before
generating tensors: add a "--seed" argparse argument in the __main__ block,
parse it, and call torch.manual_seed(args.seed) (and torch.cuda.manual_seed_all
if appropriate) before invoking build_tensor_specs() / run_jit so
deepseek_v4_decode_hc_post_test and golden_deepseek_v4_decode_hc_post use
reproducible inputs; ensure the new arg mirrors the other PR files' behavior and
keeps the rest of the RunConfig/runtime logic unchanged.

---

Nitpick comments:
In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py`:
- Around line 97-99: Extract the duplicated nested helper _to_device_bf16 into a
shared utility module (e.g., models/deepseek/v4/golden_bf16_utils.py) and import
it from both deepseek_v4_decode_hc_post.py and deepseek_v4_decode_sparse_attn.py
so the rounding logic is maintained in one place; specifically, move the body of
_to_device_bf16 (the contiguous/view/int32 add 0x8000 & -0x10000 then view
float32 → bfloat16 conversion) into a function named to_device_bf16 (or
similarly), update both files to call that shared function instead of defining
the nested helper, and ensure the function accepts and returns a torch.Tensor
with the same semantics as the original _to_device_bf16.

In `@models/deepseek/v4/deepseek_v4_decode_swa.py`:
- Around line 413-429: The bf16_outlier_compare function is duplicated; extract
this function into a shared utility module (e.g., golden_bf16_utils.py) and have
both modules import and use it instead of their local copies: move the
bf16_outlier_compare implementation (preserving its signature and behavior) into
the new module, remove the redundant local definitions in
deepseek_v4_decode_swa.py and deepseek_v4_decode_qkv_proj_rope.py, delete the
extra import torch inside the function bodies (rely on the module-level import),
and update both files to import bf16_outlier_compare from the new utility.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bc761d97-4484-4164-80c5-5a5267245fe9

📥 Commits

Reviewing files that changed from the base of the PR and between 23fe87c and 9cea4e9.

📒 Files selected for processing (4)
  • models/deepseek/v4/deepseek_v4_decode_hc_post.py
  • models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py
  • models/deepseek/v4/deepseek_v4_decode_sparse_attn.py
  • models/deepseek/v4/deepseek_v4_decode_swa.py
