
Align DeepSeek V4 INT8/BF16 cast to round-to-nearest-even #279

Merged

zhangqi-chen merged 1 commit into hw-native-sys:main from zhangqi-chen:feat/align-cast-mode-rtne on May 14, 2026

Conversation

@zhangqi-chen
Collaborator

Summary

Aligns all kernel narrow casts in v4 to round-to-nearest-even (RTNE), matching the upstream reference cast strategy and the golden side's torch defaults.

  • INT8 quant FP32 → INT32: mode="round" (half-away-from-zero) → mode="rint" (RTNE). Affected call sites: qkv_proj_rope (1), sparse_attn (1), indexer (3), moe_expert (4).
  • FP32 → BF16: added an explicit mode="rint" at every call site (44 in total, across hc_pre, hc_post, qkv_proj_rope, sparse_attn, indexer, indexer_compressor, compressor_ratio4/128, moe_expert, attention_swa, attention_hca). These sites previously inherited pl.cast's default mode="round" (half-away-from-zero); the explicit mode="rint" matches torch's default RTNE, which the goldens use via .to(torch.bfloat16).
  • Golden helpers: replaced the local _round_half_away_from_zero / round_half_away_from_zero with torch.round, which rounds half to even. The helpers are removed from qkv_proj_rope, sparse_attn, indexer, moe_expert, attention_swa, attention_hca; the snippet below shows the behavioral difference.
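
For context, the two rounding rules disagree only on exact .5 ties. A minimal torch snippet; the helper body is a hypothetical reconstruction of the removed one:

```python
import torch

x = torch.tensor([0.5, 1.5, 2.5, -0.5, -1.5])

# torch.round uses IEEE 754 round-half-to-even (RTNE)
print(torch.round(x))                # 0., 2., 2., -0., -2.  (ties to even)

# Hypothetical reconstruction of the removed helper:
def round_half_away_from_zero(x):
    return torch.sign(x) * torch.floor(torch.abs(x) + 0.5)

print(round_half_away_from_zero(x))  # 1., 2., 3., -1., -2.  (ties away from zero)
```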

The INT32 → FP16 step (still mode="round") and the FP16 → INT8 trunc step are unchanged: they already match the reference's CAST_ROUND + CAST_TRUNC semantics.
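
Taken together, the quant chain's cast modes line up as follows. A minimal sketch, assuming illustrative tensor names and dtype spellings (and a "trunc" mode string for the truncating cast, which this PR does not touch); only the mode="rint" and mode="round" arguments are taken from the change itself:

```python
# FP32 activation -> INT32: switched by this PR from mode="round" to RTNE
q_i32 = pl.cast(act_f32 * inv_scale, "int32", mode="rint")

# INT32 -> FP16: unchanged; matches the reference's CAST_ROUND
q_f16 = pl.cast(q_i32, "float16", mode="round")

# FP16 -> INT8: unchanged truncation; matches the reference's CAST_TRUNC
q_i8 = pl.cast(q_f16, "int8", mode="trunc")

# FP32 -> BF16 on all other paths: explicit RTNE instead of the implicit default
y_bf16 = pl.cast(y_f32, "bfloat16", mode="rint")
```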

Validation (device 8, a2a3)

All single-stage kernels still pass under their existing tolerances. Notable improvement: compressor_ratio128 previously failed strict allclose on kv (32/8192 ≈ 0.39% bad points) and on kv_cache (one BF16 point off by 5 ULPs); both now pass with the RTNE alignment. compressor_ratio4 remains borderline (kv at 11/8192 ≈ 0.13% bad points under strict allclose, well within the 0.5% outlier escape used for attention-class outputs).
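
The strict allclose plus outlier escape mentioned here amounts to a check like the following. A hypothetical helper with placeholder tolerances; the actual harness lives in the test code:

```python
import torch

def passes_with_outlier_escape(out, golden, rtol=1e-3, atol=1e-3, max_bad_frac=0.005):
    # strict elementwise closeness, but tolerate a small fraction of outlier points
    bad = ~torch.isclose(out, golden, rtol=rtol, atol=atol)
    return bad.float().mean().item() <= max_bad_frac

# compressor_ratio4 kv: 11/8192 ≈ 0.13% bad points, well under the 0.5% escape
```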

End-to-end attention_swa and attention_hca are out of scope for this PR — their fused-kernel tolerance is a separate design question (cumulative single-stage error, no upstream end-to-end counterpart).

Known caveat — pl.cast satmode

The reference also sets satmode=ON on the final INT8 cast; pl.cast does not expose a satmode parameter in the current pypto framework. INT8_SCALE_MAX = 127.0 mathematically rules out overflow, so the gap is benign for v4. Tracking the framework-side fix as a follow-up.
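
The no-overflow argument, sketched in torch terms; the helper below mirrors the shape of the golden _int8_quant_per_row, but its exact body is an assumption:

```python
import torch

INT8_SCALE_MAX = 127.0

def int8_quant_per_row(x):
    # per-row scale chosen so each row's max magnitude maps exactly to 127
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / INT8_SCALE_MAX
    q = torch.round(x / scale)       # |x / scale| <= 127 by construction,
    return q.to(torch.int8), scale   # so INT8 saturation can never trigger
```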

Unrelated pre-existing issue (not introduced by this PR)

moe_expert.py fails to compile on both this branch and upstream/main with "pl.col_expand_mul: second argument requires ≥2 dimensions, got 1". This is a pypto framework regression in col_expand_mul's API and is independent of the cast-alignment work.

@coderabbitai

coderabbitai Bot commented May 14, 2026


No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 597fc8f and 149c039.

📒 Files selected for processing (11)
  • models/deepseek/v4/attention_hca.py
  • models/deepseek/v4/attention_swa.py
  • models/deepseek/v4/compressor_ratio128.py
  • models/deepseek/v4/compressor_ratio4.py
  • models/deepseek/v4/hc_post.py
  • models/deepseek/v4/hc_pre.py
  • models/deepseek/v4/indexer.py
  • models/deepseek/v4/indexer_compressor.py
  • models/deepseek/v4/moe_expert.py
  • models/deepseek/v4/qkv_proj_rope.py
  • models/deepseek/v4/sparse_attn.py

📝 Walkthrough

This PR standardizes numerical rounding semantics across all DeepSeek-V4 decode kernels and their PyTorch golden references. Golden reference helpers replace custom half-away-from-zero rounding with torch.round(), and JIT kernels apply mode="rint" at BF16, INT32, and INT8 conversion boundaries throughout RoPE, quantization, KV-cache, and output paths.

Changes

DeepSeek-V4 Rounding Semantics Standardization

  • Golden Reference Quantization Helpers (models/deepseek/v4/indexer.py, moe_expert.py, qkv_proj_rope.py, sparse_attn.py, attention_hca.py, attention_swa.py): PyTorch golden references remove _round_half_away_from_zero and update the _int8_quant_per_row and _quant_w_per_channel helpers to use torch.round() for deterministic INT8 quantization across the indexer, MoE, RoPE-projection, sparse-attention, and attention kernels.
  • RoPE Path BF16 Casting (models/deepseek/v4/attention_hca.py, attention_swa.py, indexer.py, indexer_compressor.py, qkv_proj_rope.py, sparse_attn.py): RoPE cosine/sine intermediate tensors across the attention HCA/SWA, indexer, and sparse-attention kernels now cast to BF16 with mode="rint", affecting both the main and compressor RoPE computation steps in decode orchestration.
  • Activation & Weight Quantization Casting (models/deepseek/v4/indexer.py, qkv_proj_rope.py, moe_expert.py, sparse_attn.py): INT32/INT8 quantization in the QR-projection, hadamard, dynamic-activation, and MoE expert paths now uses mode="rint" instead of default or "round" modes, updating per-tile and per-row requantization across the indexed-projection and expert-selection kernels.
  • KV Cache & RMSNorm Path Casting (models/deepseek/v4/compressor_ratio128.py, compressor_ratio4.py, indexer_compressor.py, qkv_proj_rope.py): RMSNorm intermediate BF16 conversions and KV-cache writes across the compressor (ratio-4 and ratio-128) and indexer-compressor paths now use mode="rint", including pooled-KV chunk normalization, rope accumulator downcasts, normed-KV assembly, and final cache-write steps.
  • Output Assembly & Single-File Paths (models/deepseek/v4/qkv_proj_rope.py, moe_expert.py, hc_post.py, hc_pre.py): QKV-projection token assembly, shared-expert output writes, and pre/post-pipeline input/output paths now use mode="rint" for final FP32 → BF16 and INT8 conversions instead of the default cast behavior.
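
For orientation, the per-channel weight variant (_quant_w_per_channel above) plausibly differs from the per-row activation helper only in the reduction axis. A hypothetical sketch, not the actual implementation:

```python
import torch

def quant_w_per_channel(w, scale_max=127.0):
    # one scale per output channel (assumed to be dim 0 of the weight)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / scale_max
    return torch.round(w / scale).to(torch.int8), scale
```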

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#265: Modifies indexer_compressor.py in the decode compressor path where this PR applies the new mode="rint" rounding semantics to BF16/quantization conversions.
  • hw-native-sys/pypto-lib#246: Introduces HCA ratio-128 decode path orchestration in attention_hca.py where this PR updates RoPE and weight quantization rounding behavior.
  • hw-native-sys/pypto-lib#270: Modifies compressor_ratio128.py in the ratio-128 compressor kernel implementation where this PR tightens BF16 downcast rounding with mode="rint".

Poem

🐰 Hop through the kernels, rounding so neat,
From half-away-from-zero to even-more-sweet,
Each BF16 cast, each INT8 conversion tight,
Now rint brings determinism to the decode at night!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 30.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check ✅: the title clearly and specifically describes the main change, aligning DeepSeek V4 INT8/BF16 cast behavior to round-to-nearest-even (RTNE), which is the core objective of this PR.
  • Description check ✅: the description is comprehensive and directly related to the changeset, detailing what was changed (cast modes and rounding behavior), where (multiple files), validation results, and known caveats.
  • Linked Issues check ✅: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅: skipped because no linked issues were found for this pull request.


zhangqi-chen merged commit da8921c into hw-native-sys:main on May 14, 2026
6 checks passed
zhangqi-chen deleted the feat/align-cast-mode-rtne branch on May 14, 2026 at 07:29

gemini-code-assist Bot left a comment


Code Review

This pull request standardizes rounding behavior across multiple model components, including attention mechanisms (HCA, SWA, Sparse), MoE experts, and indexers. The changes replace custom 'round half away from zero' implementations with standard torch.round and explicitly set the rounding mode to 'rint' (round to nearest, ties to even) for pl.cast operations involving BF16 and INT32 conversions. These updates ensure consistent numerical behavior during quantization and type casting across the codebase. No review comments were provided for this pull request, and I have no additional feedback.
