
Align DeepSeek V4 per-output compare with reference tolerance #277

Merged

zhangqi-chen merged 1 commit into hw-native-sys:main from zhangqi-chen:feat/v4-align-compare-with-gitcode on May 14, 2026

Conversation

@zhangqi-chen
Collaborator

Summary

Switches per-output validation in the DeepSeek V4 kernels from the broad torch.allclose(rtol, atol) defaults to ratio_allclose with per-output tolerances matching the upstream reference scheme (a sketch follows this list):

  • hc_pre: x_mixed atol=1e-4 rtol=1/128; post/comb atol=2.5e-5 rtol=5e-3.
  • qkv_proj_rope: q/kv atol=1e-4 rtol=1/128; qr INT8 LSB exact (atol=1 rtol=0 max_error_ratio=0); qr_scale atol=2.5e-5 rtol=5e-3. Drops the bespoke int8_lsb_compare helper.
  • sparse_attn: attn_out atol=1e-4 rtol=1/128 across all three compress_ratio paths (0 / 4 / 128).
  • attention_swa: x_out atol=1e-4 rtol=1/128 (end-to-end fused kernel).
  • compressor_ratio4 / compressor_ratio128 / indexer_compressor: kv atol=1e-4 rtol=1/128, kv_state/score_state atol=1e-3 rtol=1e-3, all with max_error_ratio=0 to mirror strict allclose. kv_cache keeps bf16_allclose_or_ulp() (no reference counterpart).
  • indexer: score atol=1e-4 rtol=1/128 (closest analog to the prolog's weights output); idx_kv_cache and topk_idxs comparators unchanged.
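
For illustration, the qkv_proj_rope mapping above might be wired roughly as follows. This is a sketch rather than the committed code: the golden-module import path is an assumption, and only the output names and tolerance values are taken from this PR.

```python
# Sketch only: the import path is assumed ("golden module" per the walkthrough);
# tolerance values and output names mirror the list above.
from golden import ratio_allclose  # hypothetical import path

compare_fn = {
    # bf16-class outputs: rtol = 1/128 is one LSB of the bf16 mantissa.
    "q": ratio_allclose(atol=1e-4, rtol=1.0 / 128),
    "kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128),
    # INT8 output: at most 1 LSB of difference, with no outlier escape.
    "qr": ratio_allclose(atol=1, rtol=0, max_error_ratio=0.0),
    # Scale output: tight absolute tolerance, looser relative tolerance.
    "qr_scale": ratio_allclose(atol=2.5e-5, rtol=5e-3),
}
```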

Validation results (device 8, a2a3)

All single-stage kernels pass under the new tolerances: hc_pre, qkv_proj_rope, sparse_attn (×3 ratios), indexer, indexer_compressor. The compressor ratio=4/128 kv outputs fall just outside strict allclose (~0.085% and ~0.39% bad points) but well within the 0.5% outlier escape used by attention-class outputs. The end-to-end attention_swa kernel exceeds the single-stage tolerance due to error accumulation across hc_pre → qkv_proj_rope (W8A8) → sparse_attn → hc_post.
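
To make the outlier-escape semantics concrete, the following self-contained stand-in shows why roughly 0.39% bad points passes a 0.5% escape but fails when max_error_ratio=0. The real ratio_allclose in the golden module may differ in its details (reporting, dtype handling), so treat this purely as an illustration of the behaviour described above.

```python
import torch

def ratio_allclose_sketch(actual, expected, atol, rtol, max_error_ratio=0.005):
    """Illustrative stand-in: pass when the fraction of elements outside
    atol + rtol * |expected| is at most max_error_ratio (0.5% by default here)."""
    bad = (actual - expected).abs() > atol + rtol * expected.abs()
    bad_ratio = bad.float().mean().item()
    return bad_ratio <= max_error_ratio, bad_ratio

torch.manual_seed(0)
expected = torch.randn(1024, 512)
actual = expected.clone()
outliers = torch.rand_like(expected) < 0.0039   # perturb ~0.39% of the elements
actual[outliers] += 0.05                        # push them well outside tolerance
ok_escape, ratio = ratio_allclose_sketch(actual, expected, atol=1e-4, rtol=1.0 / 128)
ok_strict, _ = ratio_allclose_sketch(actual, expected, atol=1e-4, rtol=1.0 / 128,
                                     max_error_ratio=0.0)
print(ok_escape, ok_strict, f"{ratio:.3%}")     # True False ~0.39%
```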

Related Issues

Switches per-output validation in the v4 kernels from the broad
torch.allclose(rtol, atol) defaults to ratio_allclose with per-output
tolerances matching the upstream reference scheme:

* hc_pre: x_mixed atol=1e-4 rtol=1/128; post/comb atol=2.5e-5 rtol=5e-3.
* qkv_proj_rope: q/kv atol=1e-4 rtol=1/128; qr INT8 LSB exact
  (atol=1 rtol=0 max_error_ratio=0); qr_scale atol=2.5e-5 rtol=5e-3.
  Drops the bespoke int8_lsb_compare helper.
* sparse_attn: attn_out atol=1e-4 rtol=1/128 across all three
  compress_ratio paths (0 / 4 / 128).
* attention_swa: x_out atol=1e-4 rtol=1/128 (end-to-end fused kernel;
  see compare_settings_vs_gitcode.md notes on accumulated error).
* compressor_ratio4 / compressor_ratio128 / indexer_compressor:
  kv atol=1e-4 rtol=1/128, kv_state/score_state atol=1e-3 rtol=1e-3,
  all with max_error_ratio=0 to mirror strict allclose. kv_cache keeps
  bf16_allclose_or_ulp() which has no reference counterpart.
* indexer: score atol=1e-4 rtol=1/128 (closest analog to the prolog's
  weights output); idx_kv_cache and topk_idxs comparators unchanged.

All single-stage kernels pass under the new tolerances. The compressor
ratio=4/128 paths fall just outside strict allclose on kv (~0.085% and
~0.39% bad points) but well within the 0.5% outlier escape used by
attention-class outputs. The end-to-end attention_swa kernel exceeds
the single-stage tolerance due to error accumulation across hc_pre →
qkv_proj_rope (W8A8) → sparse_attn → hc_post — see
models/deepseek/v4/compare_settings_vs_gitcode.md (local reference, not
committed) for the per-stage cross-walk.
@coderabbitai

coderabbitai Bot commented May 14, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0065403c-c514-4d73-b65b-71ef783feb2a

📥 Commits

Reviewing files that changed from the base of the PR and between 8f7af9a and 7c33367.

📒 Files selected for processing (8)
  • models/deepseek/v4/attention_swa.py
  • models/deepseek/v4/compressor_ratio128.py
  • models/deepseek/v4/compressor_ratio4.py
  • models/deepseek/v4/hc_pre.py
  • models/deepseek/v4/indexer.py
  • models/deepseek/v4/indexer_compressor.py
  • models/deepseek/v4/qkv_proj_rope.py
  • models/deepseek/v4/sparse_attn.py
💤 Files with no reviewable changes (1)
  • models/deepseek/v4/attention_swa.py

📝 Walkthrough

This PR updates the test harnesses across eight DeepSeek V4 model files to adopt the new ratio_allclose comparison utility. Seven files add imports and update output validation to use ratio-based tolerances; one file removes redundant comment text. All changes standardize JIT test correctness checking.

Changes

Test Harness Unification with ratio_allclose

  • Harness imports for ratio_allclose (models/deepseek/v4/{compressor_ratio128, compressor_ratio4, hc_pre, indexer, indexer_compressor, qkv_proj_rope, sparse_attn}.py): all seven test harnesses import ratio_allclose from the golden module, providing the foundation for standardized tolerance-based output validation.
  • Multi-output validation for compressor tests (models/deepseek/v4/{compressor_ratio128, compressor_ratio4, indexer_compressor}.py): compare_fn configurations expand from single kv_cache validation to multi-output checking of kv, kv_state, and score_state using ratio_allclose with per-tensor tolerances (e.g. rtol=1.0/128 for kv), while retaining bf16_allclose_or_ulp for bfloat16 tensor comparisons; see the sketch after this list.
  • Single/targeted output validation updates (models/deepseek/v4/{hc_pre, indexer, qkv_proj_rope, sparse_attn}.py): hc_pre validates x_mixed, post, and comb; indexer switches score validation to ratio_allclose(atol=1e-4, rtol=1.0/128); qkv_proj_rope standardizes q, kv, qr, and qr_scale validation, replacing the bespoke int8_lsb_compare helper; sparse_attn adds attn_out validation with a scaled ratio tolerance.
  • Comment cleanup (models/deepseek/v4/attention_swa.py): removes an explanatory comment block preceding the RunConfig tolerance configuration, while preserving the numeric tolerance settings.
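
As a companion to the compressor row above, the expanded mapping might look roughly like this; again a sketch under the same assumption about the golden-module import path, with tolerances taken from the PR description and the diff hunks quoted further down.

```python
# Sketch of the compressor-harness compare_fn after this change
# (compressor_ratio4 / compressor_ratio128 / indexer_compressor).
from golden import bf16_allclose_or_ulp, ratio_allclose  # hypothetical import path

compare_fn = {
    # max_error_ratio=0.0 mirrors strict allclose: no outlier escape.
    "kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128, max_error_ratio=0.0),
    "kv_state": ratio_allclose(atol=1e-3, rtol=1e-3, max_error_ratio=0.0),
    "score_state": ratio_allclose(atol=1e-3, rtol=1e-3, max_error_ratio=0.0),
    # No reference counterpart, so the bf16 ULP comparator is kept as-is.
    "kv_cache": bf16_allclose_or_ulp(),
}
```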

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#273: Adds and exports ratio_allclose to the golden module, which is the foundation that these test harnesses depend on.
  • hw-native-sys/pypto-lib#270: Updates compressor APIs and state layouts (kv, kv_state, score_state) that align with the multi-output validation changes in compressor_ratio128 and compressor_ratio4 test harnesses.
  • hw-native-sys/pypto-lib#265: Introduces indexer compressor tensors (kv, kv_state, score_state) that correspond to the expanded compare_fn validation in indexer_compressor and compressor_ratio4 tests.

Poem

🐰 A rabbit hops through test files with glee,
Ratio checks now unified and free,
From sparse to compressor, each harness aligned,
With tolerance logic, so carefully designed,
Golden comparisons shine—QA's dream, you see! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check (✅ Passed): the title accurately describes the main objective of the changeset: aligning per-output comparison tolerances in DeepSeek V4 kernels with reference standards.
  • Description check (✅ Passed): the description comprehensively explains the changes, including specific tolerance parameters for each affected file and validation results demonstrating that the changes work correctly.
  • Docstring coverage (✅ Passed): no functions found in the changed files to evaluate docstring coverage; the check was skipped.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@gemini-code-assist Bot left a comment


Code Review

This pull request updates the validation logic across several DeepSeek V4 model components by replacing or augmenting existing comparison functions with ratio_allclose. Key changes include the removal of manual INT8 comparison logic in qkv_proj_rope.py and the addition of detailed compare_fn entries for outputs such as kv, kv_state, and score_state in the compressor and indexer modules. The review comments below point out that explicitly setting max_error_ratio=0.0 in the compressor modules will cause validation failures, as it prevents the 0.5% outlier allowance intended for these kernels.

),
compare_fn={"kv_cache": bf16_allclose_or_ulp()},
compare_fn={
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128, max_error_ratio=0.0),


high

The PR summary states that the kv outputs for ratio 4/128 fall just outside strict allclose (~0.085% and ~0.39% bad points) but are within the 0.5% outlier allowance. However, the code here explicitly sets max_error_ratio=0.0, which enforces strict allclose and will cause the validation to fail for these kernels. You should remove the max_error_ratio=0.0 argument for the kv output to allow the default 0.5% outlier escape.

Suggested change
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128, max_error_ratio=0.0),
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128),

),
compare_fn={"kv_cache": bf16_allclose_or_ulp()},
compare_fn={
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128, max_error_ratio=0.0),


high

The PR summary states that the kv outputs for ratio 4/128 fall just outside strict allclose (~0.085% and ~0.39% bad points) but are within the 0.5% outlier allowance. However, the code here explicitly sets max_error_ratio=0.0, which enforces strict allclose and will cause the validation to fail for these kernels. You should remove the max_error_ratio=0.0 argument for the kv output to allow the default 0.5% outlier escape.

Suggested change
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128, max_error_ratio=0.0),
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128),

),
compare_fn={"kv_cache": bf16_allclose_or_ulp()},
compare_fn={
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128, max_error_ratio=0.0),


high

The PR summary states that the kv outputs for ratio 4/128 fall just outside strict allclose (~0.085% and ~0.39% bad points) but are within the 0.5% outlier allowance. However, the code here explicitly sets max_error_ratio=0.0, which enforces strict allclose and will cause the validation to fail for these kernels. You should remove the max_error_ratio=0.0 argument for the kv output to allow the default 0.5% outlier escape.

Suggested change
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128, max_error_ratio=0.0),
"kv": ratio_allclose(atol=1e-4, rtol=1.0 / 128),

zhangqi-chen merged commit 3888a86 into hw-native-sys:main on May 14, 2026
6 checks passed
zhangqi-chen deleted the feat/v4-align-compare-with-gitcode branch on May 14, 2026 at 05:42