
Align DeepSeek V4 INT8/BF16 cast to round-to-nearest-even #279

Merged

zhangqi-chen merged 1 commit into hw-native-sys:main from zhangqi-chen:feat/align-cast-mode-rtne on May 14, 2026

Conversation

@zhangqi-chen
Collaborator

Summary

Aligns all kernel narrow casts in v4 to round-to-nearest-even (RTNE), matching the upstream reference cast strategy and the golden side's torch defaults.

  • INT8 quant FP32 → INT32: mode="round" (half-away-from-zero) → mode="rint" (RTNE). Affected call sites: qkv_proj_rope (1), sparse_attn (1), indexer (3), moe_expert (4).
  • FP32 → BF16: added an explicit mode="rint" at every call site (44 in total, across hc_pre, hc_post, qkv_proj_rope, sparse_attn, indexer, indexer_compressor, compressor_ratio4/128, moe_expert, attention_swa, attention_hca). These sites previously inherited pl.cast's default mode="round" (half-away-from-zero); the explicit mode="rint" matches torch's default RTNE, which the goldens use via .to(torch.bfloat16).
  • Golden helpers: replaced the local _round_half_away_from_zero / round_half_away_from_zero with torch.round, which rounds half to even. The helpers are removed from qkv_proj_rope, sparse_attn, indexer, moe_expert, attention_swa, attention_hca; the snippet below shows the behavioral difference.
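
For context, the two rounding rules disagree only on exact .5 ties. A minimal torch snippet; the helper body is a hypothetical reconstruction of the removed one:

```python
import torch

x = torch.tensor([0.5, 1.5, 2.5, -0.5, -1.5])

# torch.round uses IEEE 754 round-half-to-even (RTNE)
print(torch.round(x))                # 0., 2., 2., -0., -2.  (ties to even)

# Hypothetical reconstruction of the removed helper:
def round_half_away_from_zero(x):
    return torch.sign(x) * torch.floor(torch.abs(x) + 0.5)

print(round_half_away_from_zero(x))  # 1., 2., 3., -1., -2.  (ties away from zero)
```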

The INT32 → FP16 step (still mode="round") and the FP16 → INT8 trunc step are unchanged: they already match the reference's CAST_ROUND + CAST_TRUNC semantics.
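
Taken together, the quant chain's cast modes line up as follows. A minimal sketch, assuming illustrative tensor names and dtype spellings (and a "trunc" mode string for the truncating cast, which this PR does not touch); only the mode="rint" and mode="round" arguments are taken from the change itself:

```python
# FP32 activation -> INT32: switched by this PR from mode="round" to RTNE
q_i32 = pl.cast(act_f32 * inv_scale, "int32", mode="rint")

# INT32 -> FP16: unchanged; matches the reference's CAST_ROUND
q_f16 = pl.cast(q_i32, "float16", mode="round")

# FP16 -> INT8: unchanged truncation; matches the reference's CAST_TRUNC
q_i8 = pl.cast(q_f16, "int8", mode="trunc")

# FP32 -> BF16 on all other paths: explicit RTNE instead of the implicit default
y_bf16 = pl.cast(y_f32, "bfloat16", mode="rint")
```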

Validation (device 8, a2a3)

All single-stage kernels still pass under their existing tolerances. Notable improvement: compressor_ratio128 previously failed strict allclose on kv (32/8192 ≈ 0.39% bad points) and on kv_cache (one BF16 point off by 5 ULPs); both now pass with the RTNE alignment. compressor_ratio4 remains borderline (kv at 11/8192 ≈ 0.13% bad points under strict allclose, well within the 0.5% outlier escape used for attention-class outputs).
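
The strict allclose plus outlier escape mentioned here amounts to a check like the following. A hypothetical helper with placeholder tolerances; the actual harness lives in the test code:

```python
import torch

def passes_with_outlier_escape(out, golden, rtol=1e-3, atol=1e-3, max_bad_frac=0.005):
    # strict elementwise closeness, but tolerate a small fraction of outlier points
    bad = ~torch.isclose(out, golden, rtol=rtol, atol=atol)
    return bad.float().mean().item() <= max_bad_frac

# compressor_ratio4 kv: 11/8192 ≈ 0.13% bad points, well under the 0.5% escape
```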

End-to-end attention_swa and attention_hca are out of scope for this PR — their fused-kernel tolerance is a separate design question (cumulative single-stage error, no upstream end-to-end counterpart).

Known caveat — pl.cast satmode

The reference also sets satmode=ON on the final INT8 cast; pl.cast does not expose a satmode parameter in the current pypto framework. INT8_SCALE_MAX = 127.0 mathematically rules out overflow, so the gap is benign for v4. Tracking the framework-side fix as a follow-up.
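
The no-overflow argument, sketched in torch terms; the helper below mirrors the shape of the golden _int8_quant_per_row, but its exact body is an assumption:

```python
import torch

INT8_SCALE_MAX = 127.0

def int8_quant_per_row(x):
    # per-row scale chosen so each row's max magnitude maps exactly to 127
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / INT8_SCALE_MAX
    q = torch.round(x / scale)       # |x / scale| <= 127 by construction,
    return q.to(torch.int8), scale   # so INT8 saturation can never trigger
```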

Unrelated pre-existing issue (not introduced by this PR)

moe_expert.py fails to compile on both this branch and upstream/main with "pl.col_expand_mul: second argument requires ≥2 dimensions, got 1". This is a pypto framework regression in col_expand_mul's API and is independent of the cast-alignment work.

@coderabbitai

coderabbitai Bot commented May 14, 2026


No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 597fc8f and 149c039.

📒 Files selected for processing (11)
  • models/deepseek/v4/attention_hca.py
  • models/deepseek/v4/attention_swa.py
  • models/deepseek/v4/compressor_ratio128.py
  • models/deepseek/v4/compressor_ratio4.py
  • models/deepseek/v4/hc_post.py
  • models/deepseek/v4/hc_pre.py
  • models/deepseek/v4/indexer.py
  • models/deepseek/v4/indexer_compressor.py
  • models/deepseek/v4/moe_expert.py
  • models/deepseek/v4/qkv_proj_rope.py
  • models/deepseek/v4/sparse_attn.py

📝 Walkthrough

This PR standardizes numerical rounding semantics across all DeepSeek-V4 decode kernels and their PyTorch golden references. Golden reference helpers replace custom half-away-from-zero rounding with torch.round(), and JIT kernels apply mode="rint" at BF16, INT32, and INT8 conversion boundaries throughout RoPE, quantization, KV-cache, and output paths.

Changes

DeepSeek-V4 Rounding Semantics Standardization

  • Golden Reference Quantization Helpers (models/deepseek/v4/indexer.py, moe_expert.py, qkv_proj_rope.py, sparse_attn.py, attention_hca.py, attention_swa.py): PyTorch golden references remove _round_half_away_from_zero and update the _int8_quant_per_row and _quant_w_per_channel helpers to use torch.round() for deterministic INT8 quantization across the indexer, MoE, RoPE-projection, sparse-attention, and attention kernels.
  • RoPE Path BF16 Casting (models/deepseek/v4/attention_hca.py, attention_swa.py, indexer.py, indexer_compressor.py, qkv_proj_rope.py, sparse_attn.py): RoPE cosine/sine intermediate tensors across the attention HCA/SWA, indexer, and sparse-attention kernels now cast to BF16 with mode="rint", affecting both the main and compressor RoPE computation steps in decode orchestration.
  • Activation & Weight Quantization Casting (models/deepseek/v4/indexer.py, qkv_proj_rope.py, moe_expert.py, sparse_attn.py): INT32/INT8 quantization in the QR-projection, hadamard, dynamic-activation, and MoE expert paths now uses mode="rint" instead of default or "round" modes, updating per-tile and per-row requantization across the indexed-projection and expert-selection kernels.
  • KV Cache & RMSNorm Path Casting (models/deepseek/v4/compressor_ratio128.py, compressor_ratio4.py, indexer_compressor.py, qkv_proj_rope.py): RMSNorm intermediate BF16 conversions and KV-cache writes across the compressor (ratio-4 and ratio-128) and indexer-compressor paths now use mode="rint", including pooled-KV chunk normalization, rope accumulator downcasts, normed-KV assembly, and final cache-write steps.
  • Output Assembly & Single-File Paths (models/deepseek/v4/qkv_proj_rope.py, moe_expert.py, hc_post.py, hc_pre.py): QKV-projection token assembly, shared-expert output writes, and pre/post-pipeline input/output paths now use mode="rint" for final FP32 → BF16 and INT8 conversions instead of the default cast behavior.
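
For orientation, the per-channel weight variant (_quant_w_per_channel above) plausibly differs from the per-row activation helper only in the reduction axis. A hypothetical sketch, not the actual implementation:

```python
import torch

def quant_w_per_channel(w, scale_max=127.0):
    # one scale per output channel (assumed to be dim 0 of the weight)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / scale_max
    return torch.round(w / scale).to(torch.int8), scale
```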

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#265: Modifies indexer_compressor.py in the decode compressor path where this PR applies the new mode="rint" rounding semantics to BF16/quantization conversions.
  • hw-native-sys/pypto-lib#246: Introduces HCA ratio-128 decode path orchestration in attention_hca.py where this PR updates RoPE and weight quantization rounding behavior.
  • hw-native-sys/pypto-lib#270: Modifies compressor_ratio128.py in the ratio-128 compressor kernel implementation where this PR tightens BF16 downcast rounding with mode="rint".

Poem

🐰 Hop through the kernels, rounding so neat,
From half-away-from-zero to even-more-sweet,
Each BF16 cast, each INT8 conversion tight,
Now rint brings determinism to the decode at night!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 30.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check ✅: the title clearly and specifically describes the main change, aligning DeepSeek V4 INT8/BF16 cast behavior to round-to-nearest-even (RTNE), which is the core objective of this PR.
  • Description check ✅: the description is comprehensive and directly related to the changeset, detailing what was changed (cast modes and rounding behavior), where (multiple files), validation results, and known caveats.
  • Linked Issues check ✅: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅: skipped because no linked issues were found for this pull request.


zhangqi-chen merged commit da8921c into hw-native-sys:main on May 14, 2026
6 checks passed
zhangqi-chen deleted the feat/align-cast-mode-rtne branch on May 14, 2026 at 07:29

gemini-code-assist Bot left a comment


Code Review

This pull request standardizes rounding behavior across multiple model components, including attention mechanisms (HCA, SWA, Sparse), MoE experts, and indexers. The changes replace custom 'round half away from zero' implementations with standard torch.round and explicitly set the rounding mode to 'rint' (round to nearest, ties to even) for pl.cast operations involving BF16 and INT32 conversions. These updates ensure consistent numerical behavior during quantization and type casting across the codebase. No review comments were provided for this pull request, and I have no additional feedback.
