
Update DeepSeek v4 precision checks#237

Open
high-cloud wants to merge 1 commit into hw-native-sys:main from high-cloud:fix/deepseek-v4-precision-tol

Conversation

@high-cloud
Contributor

Summary

  • Tighten QKV and SWA BF16 precision checks with fixed seeds and outlier-budget comparators
  • Align DeepSeek v4 golden references with kernel BF16 rounding and tiled accumulation

Related Issues

None

- Align QKV and SWA goldens with kernel BF16 rounding and tiled accumulation

- Add fixed seeds and BF16 outlier-budget comparators for lower tolerances
@coderabbitai

coderabbitai Bot commented May 8, 2026

📝 Walkthrough

This PR refines the numerical precision of four DeepSeek v4 decode golden reference implementations by introducing explicit BF16 rounding helpers, replacing einsum/single-call patterns with tiled matmul and RMSNorm helpers, splitting RoPE computations into half-tensor intermediates, and tightening test harness tolerances with outlier-aware BF16 comparators.
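The split-half RoPE pattern mentioned above can be illustrated in isolation. This is a hedged sketch, not code from the PR: `rope_split_half` and its arguments are hypothetical names, and it applies the standard rotate-by-pairs formulation to a single vector while keeping the two halves as separate intermediates.

```python
import math

def rope_split_half(x, pos, theta=10000.0):
    """Apply RoPE to one vector by splitting it into halves (x1, x2)
    and rotating each (x1[i], x2[i]) pair by a position-dependent angle.
    Keeping the halves as explicit intermediates pins the operation order."""
    d = len(x)
    half = d // 2
    x1, x2 = x[:half], x[half:]
    out1, out2 = [], []
    for i in range(half):
        freq = pos / (theta ** (2 * i / d))  # per-pair rotation angle
        c, s = math.cos(freq), math.sin(freq)
        out1.append(x1[i] * c - x2[i] * s)
        out2.append(x1[i] * s + x2[i] * c)
    return out1 + out2

# At position 0 every angle is zero, so the rotation is the identity.
print(rope_split_half([1.0, 2.0, 3.0, 4.0], 0))  # [1.0, 2.0, 3.0, 4.0]
```

Because each pair undergoes a pure rotation, the vector norm is preserved at any position, which makes the helper easy to sanity-check against a fused implementation.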

Changes

DeepSeek V4 Numerical Precision Improvements

  • BF16 Rounding Helpers (models/deepseek/v4/deepseek_v4_decode_hc_post.py, models/deepseek/v4/deepseek_v4_decode_sparse_attn.py): Local _to_device_bf16 helpers perform explicit int32/int16 rounding before bfloat16 conversion, replacing direct PyTorch casting.
  • Tiled Computation Helpers (models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py): New tiled variants (matmul_wqa_tiled, rms_norm_q_tiled, matmul_wkv_tiled, rms_norm_kv_tiled) chunk along the Q_LORA and KV dimensions.
  • Golden Reference Computations (models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py, models/deepseek/v4/deepseek_v4_decode_sparse_attn.py): qr_out and kv_full switch to tiled matmul+norm; sparse_attn inverse-RoPE uses split-half intermediates and explicit batched projection loops for wo_a and wo_b.
  • Test Fixture Initialization (models/deepseek/v4/deepseek_v4_decode_swa.py): Weight initializers for wq_a, wq_b, and wkv apply a (randn(...) - 0.5) offset before scaling/quantization.
  • Test Harness & Comparators (models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py, models/deepseek/v4/deepseek_v4_decode_swa.py): New bf16_outlier_compare enforces a 0.5% mismatch budget; tolerances tightened (rtol/atol reduced from 5e-3/6e-3 to 2e-3/3e-3); deterministic seeding and custom comparators wired for q, kv, and x_out.
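The tiled pattern those helpers follow (chunking the reduction dimension so partial products accumulate tile by tile, in the same order the kernel uses) can be sketched in plain Python. `matmul_tiled` is a hypothetical stand-in for illustration, not the PR's actual helper:

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), accumulating the k dimension
    in chunks of `tile` so the summation order matches a tiled kernel."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, tile):          # iterate over K tiles
        k1 = min(k0 + tile, k)
        for i in range(m):
            for j in range(n):
                acc = 0.0
                for kk in range(k0, k1):  # partial product for this tile
                    acc += a[i][kk] * b[kk][j]
                out[i][j] += acc          # tile-ordered accumulation
    return out
```

In exact arithmetic the tile size is irrelevant, but in BF16/FP32 mixed precision the accumulation order changes the rounding, which is why a golden reference that mimics the kernel's tiling can hold tighter tolerances.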

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#230: Touches the same DeepSeek v4 files and makes directly related code-level changes to golden references, BF16 rounding, and tiled implementations.
  • hw-native-sys/pypto-lib#225: Modifies the DeepSeek v4 sparse-attention golden implementations with related inverse RoPE, BF16 rounding, and explicit batched projection accumulation.
  • hw-native-sys/pypto-lib#205: Modifies the same DeepSeek v4 QKV projection and RoPE golden/reference implementation with overlapping q/kv/qr computations and test harness changes.

Poem

🐰 With rounding precise and helpers so tiled,
The golden references now are refined,
BF16 outliers budgeted and filed,
Tolerance tight, numerics aligned—
DeepSeek v4 decoding, now better designed!

🚥 Pre-merge checks | ✅ 5 passed
  • Title check: ✅ Passed. The title accurately summarizes the main objective of the PR (tightening precision checks for DeepSeek v4), which aligns with all four file changes.
  • Description check: ✅ Passed. The description clearly relates to the changeset, explaining both the precision check improvements and the alignment of golden references with BF16 rounding and tiled accumulation.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.





@gemini-code-assist Bot left a comment


Code Review

This pull request implements tiled versions of matrix multiplication and RMS normalization across several DeepSeek v4 model components to optimize processing. It also introduces a specific rounding-based bfloat16 conversion helper and a custom outlier comparison utility for testing. The review feedback suggests refactoring these newly added helper functions into shared utility modules to avoid code duplication and improve maintainability.

Comment on lines +97 to +99

```python
def _to_device_bf16(value):
    rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
    return rounded.view(torch.float32).to(torch.bfloat16)
```


Severity: medium

The _to_device_bf16 helper function is duplicated in models/deepseek/v4/deepseek_v4_decode_sparse_attn.py. Consider refactoring this into a shared utility module to adhere to DRY principles and ensure consistent rounding behavior across the project.
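For readers unfamiliar with the bit trick in the quoted helper: on a two's-complement int32 view, the mask -0x10000 equals 0xFFFF0000, so the helper adds half a BF16 ULP and truncates the low 16 mantissa bits (round-half-up, not round-to-nearest-even). A torch-free sketch of the same bit manipulation using struct (function names hypothetical, and it ignores NaN/sign edge cases the real helper would inherit from the int32 view):

```python
import struct

def round_to_bf16_bits(x: float) -> int:
    """Float32 bit pattern of x after round-half-up truncation to BF16 precision."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return (bits + 0x8000) & 0xFFFF0000  # add half a ULP, keep top 16 bits

def as_float(bits: int) -> float:
    (f,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))
    return f

# 1 + 2**-8 sits exactly halfway between two BF16 values; half-up rounds it up.
print(hex(round_to_bf16_bits(1.00390625)))  # 0x3f810000 -> 1.0078125
```

A direct `to(torch.bfloat16)` cast uses round-to-nearest-even, so ties can land differently; matching the kernel's rounding mode exactly is what lets the tolerances shrink.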

Comment on lines +508 to +524

```python
def bf16_outlier_compare(actual, expected, actual_outputs, expected_outputs, inputs, rtol, atol):
    import torch

    close = torch.isclose(actual, expected, rtol=rtol, atol=atol)
    mismatch = int((~close).sum().item())
    max_mismatch = int(actual.numel() * 0.005)
    if mismatch <= max_mismatch:
        return True, f"mismatch={mismatch}/{actual.numel()} <= {max_mismatch}"

    diff = (actual.float() - expected.float()).abs()
    max_idx = int(diff.flatten().argmax().item())
    return False, (
        f"    BF16 outlier budget exceeded: mismatch={mismatch}/{actual.numel()} "
        f"limit={max_mismatch} rtol={rtol} atol={atol}\n"
        f"    max_abs={diff.max().item():.8g} idx={max_idx} "
        f"actual={actual.flatten()[max_idx].item()} expected={expected.flatten()[max_idx].item()}"
    )
```


Severity: medium

The bf16_outlier_compare function is duplicated in models/deepseek/v4/deepseek_v4_decode_swa.py. To improve maintainability, this custom comparator should be moved to a shared validation utility (e.g., in golden/validation.py).
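The duplicated comparator's budget logic is easy to demonstrate. This hypothetical pure-Python analogue mirrors torch.isclose's criterion, |actual - expected| <= atol + rtol * |expected|, with the same 0.5% mismatch budget:

```python
def outlier_budget_ok(actual, expected, rtol=2e-3, atol=3e-3, budget=0.005):
    """Pass if at most a `budget` fraction of elements violate the
    isclose-style criterion |a - e| <= atol + rtol * |e|."""
    mismatch = sum(
        1 for a, e in zip(actual, expected)
        if abs(a - e) > atol + rtol * abs(e)
    )
    return mismatch <= int(len(actual) * budget)

expected = [1.0] * 1000
print(outlier_budget_ok([1.0] * 996 + [9.0] * 4, expected))  # True:  4 outliers <= budget of 5
print(outlier_budget_ok([1.0] * 994 + [9.0] * 6, expected))  # False: 6 outliers > budget of 5
```

An outlier budget like this accepts a handful of BF16 tie-breaking differences while still catching systematic divergence, which is what allows the rtol/atol pair to drop from 5e-3/6e-3 to 2e-3/3e-3.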


@coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
models/deepseek/v4/deepseek_v4_decode_hc_post.py (1)

130-155: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

__main__ is missing a --seed argument and torch.manual_seed call, unlike the other files in this PR.

build_tensor_specs() initialises all input tensors with bare torch.randn / torch.rand calls (lines 111–119), so test runs are non-deterministic. deepseek_v4_decode_swa.py and deepseek_v4_decode_qkv_proj_rope.py both add --seed / torch.manual_seed as part of this PR; hc_post was apparently missed.

🛡️ Proposed fix

```diff
+    import torch
     from golden import RunConfig, run_jit

     parser = argparse.ArgumentParser()
     parser.add_argument("-p", "--platform", ...)
     parser.add_argument("-d", "--device", ...)
+    parser.add_argument("--seed", type=int, default=20260508)
     parser.add_argument("--runtime-profiling", ...)
     args = parser.parse_args()

+    torch.manual_seed(args.seed)
+
     result = run_jit(
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py` around lines 130 - 155, Add
a deterministic seed option and apply it before generating tensors: add a
"--seed" argparse argument in the __main__ block, parse it, and call
torch.manual_seed(args.seed) (and torch.cuda.manual_seed_all if appropriate)
before invoking build_tensor_specs() / run_jit so
deepseek_v4_decode_hc_post_test and golden_deepseek_v4_decode_hc_post use
reproducible inputs; ensure the new arg mirrors the other PR files' behavior and
keeps the rest of the RunConfig/runtime logic unchanged.
🧹 Nitpick comments (2)
models/deepseek/v4/deepseek_v4_decode_hc_post.py (1)

97-99: ⚡ Quick win

Extract _to_device_bf16 to a shared utility — it is defined identically in deepseek_v4_decode_sparse_attn.py.

Both golden functions define the same nested helper. Any future change to the rounding logic would require parallel edits. A shared golden_bf16_utils.py in this directory (or a top-level golden_utils.py) would eliminate the duplication and give deepseek_v4_decode_swa.py a clean import path as well.

♻️ Suggested extraction

Create models/deepseek/v4/golden_bf16_utils.py:

```diff
+import torch
+
+def to_device_bf16(value: "torch.Tensor") -> "torch.Tensor":
+    """Round-half-up FP32→BF16 cast, matching hardware kernel rounding behaviour."""
+    rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
+    return rounded.view(torch.float32).to(torch.bfloat16)
```

Then in both golden functions:

```diff
+from golden_bf16_utils import to_device_bf16
 ...
-    def _to_device_bf16(value):
-        rounded = (value.contiguous().view(torch.int32) + 0x8000) & -0x10000
-        return rounded.view(torch.float32).to(torch.bfloat16)
-
-    y = _to_device_bf16(y_fp32)
+    y = to_device_bf16(y_fp32)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py` around lines 97 - 99,
Extract the duplicated nested helper _to_device_bf16 into a shared utility
module (e.g., models/deepseek/v4/golden_bf16_utils.py) and import it from both
deepseek_v4_decode_hc_post.py and deepseek_v4_decode_sparse_attn.py so the
rounding logic is maintained in one place; specifically, move the body of
_to_device_bf16 (the contiguous/view/int32 add 0x8000 & -0x10000 then view
float32 → bfloat16 conversion) into a function named to_device_bf16 (or
similarly), update both files to call that shared function instead of defining
the nested helper, and ensure the function accepts and returns a torch.Tensor
with the same semantics as the original _to_device_bf16.
models/deepseek/v4/deepseek_v4_decode_swa.py (1)

413-429: ⚡ Quick win

bf16_outlier_compare is a byte-for-byte copy of the same function in deepseek_v4_decode_qkv_proj_rope.py.

Move it to the shared utility module suggested above (or a standalone golden_compare_utils.py) and import it in both files. Also, the import torch on line 414 is redundant — torch is already imported at line 410.

♻️ Suggested extraction (in golden_bf16_utils.py or similar)

```diff
+def bf16_outlier_compare(actual, expected, actual_outputs, expected_outputs, inputs, rtol, atol):
+    import torch
+    close = torch.isclose(actual, expected, rtol=rtol, atol=atol)
+    mismatch = int((~close).sum().item())
+    max_mismatch = int(actual.numel() * 0.005)
+    if mismatch <= max_mismatch:
+        return True, f"mismatch={mismatch}/{actual.numel()} <= {max_mismatch}"
+    diff = (actual.float() - expected.float()).abs()
+    max_idx = int(diff.flatten().argmax().item())
+    return False, (
+        f"    BF16 outlier budget exceeded: mismatch={mismatch}/{actual.numel()} "
+        f"limit={max_mismatch} rtol={rtol} atol={atol}\n"
+        f"    max_abs={diff.max().item():.8g} idx={max_idx} "
+        f"actual={actual.flatten()[max_idx].item()} expected={expected.flatten()[max_idx].item()}"
+    )
```

Then in both files:

```diff
+from golden_bf16_utils import bf16_outlier_compare
-    def bf16_outlier_compare(...):
-        ...
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/deepseek_v4_decode_swa.py` around lines 413 - 429, The
bf16_outlier_compare function is duplicated; extract this function into a shared
utility module (e.g., golden_bf16_utils.py) and have both modules import and use
it instead of their local copies: move the bf16_outlier_compare implementation
(preserving its signature and behavior) into the new module, remove the
redundant local definitions in deepseek_v4_decode_swa.py and
deepseek_v4_decode_qkv_proj_rope.py, delete the extra import torch inside the
function bodies (rely on the module-level import), and update both files to
import bf16_outlier_compare from the new utility.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py`:
- Around line 130-155: Add a deterministic seed option and apply it before
generating tensors: add a "--seed" argparse argument in the __main__ block,
parse it, and call torch.manual_seed(args.seed) (and torch.cuda.manual_seed_all
if appropriate) before invoking build_tensor_specs() / run_jit so
deepseek_v4_decode_hc_post_test and golden_deepseek_v4_decode_hc_post use
reproducible inputs; ensure the new arg mirrors the other PR files' behavior and
keeps the rest of the RunConfig/runtime logic unchanged.

---

Nitpick comments:
In `@models/deepseek/v4/deepseek_v4_decode_hc_post.py`:
- Around line 97-99: Extract the duplicated nested helper _to_device_bf16 into a
shared utility module (e.g., models/deepseek/v4/golden_bf16_utils.py) and import
it from both deepseek_v4_decode_hc_post.py and deepseek_v4_decode_sparse_attn.py
so the rounding logic is maintained in one place; specifically, move the body of
_to_device_bf16 (the contiguous/view/int32 add 0x8000 & -0x10000 then view
float32 → bfloat16 conversion) into a function named to_device_bf16 (or
similarly), update both files to call that shared function instead of defining
the nested helper, and ensure the function accepts and returns a torch.Tensor
with the same semantics as the original _to_device_bf16.

In `@models/deepseek/v4/deepseek_v4_decode_swa.py`:
- Around line 413-429: The bf16_outlier_compare function is duplicated; extract
this function into a shared utility module (e.g., golden_bf16_utils.py) and have
both modules import and use it instead of their local copies: move the
bf16_outlier_compare implementation (preserving its signature and behavior) into
the new module, remove the redundant local definitions in
deepseek_v4_decode_swa.py and deepseek_v4_decode_qkv_proj_rope.py, delete the
extra import torch inside the function bodies (rely on the module-level import),
and update both files to import bf16_outlier_compare from the new utility.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bc761d97-4484-4164-80c5-5a5267245fe9

📥 Commits

Reviewing files that changed from the base of the PR and between 23fe87c and 9cea4e9.

📒 Files selected for processing (4)
  • models/deepseek/v4/deepseek_v4_decode_hc_post.py
  • models/deepseek/v4/deepseek_v4_decode_qkv_proj_rope.py
  • models/deepseek/v4/deepseek_v4_decode_sparse_attn.py
  • models/deepseek/v4/deepseek_v4_decode_swa.py
