llama/rope: gate fp64 hf_precompute_freqs_cis on cos/sin scaling #19308
JacobSzwejbka merged 1 commit into pytorch:main
a79521b ("Add LongRoPE support and fp64 RoPE precompute for Phi-3 / Phi-4 family") unconditionally moved hf_precompute_freqs_cis to fp64 cos/sin precompute with a final cast to fp32. That works for the Phi-4 device validation that motivated the commit, but it broke test_static_attention.py::test_within_transformer on the Linux unittest runners (pull, pull-editable, trunk-release have been 100% red since the commit landed). The test compares mha_transformer (built with use_hf_rope=False, taking the pure-fp32 precompute_freqs_cis path) against static_transformer (built with use_hf_rope=True, taking hf_precompute_freqs_cis) at rtol=1e-3, with shared weights. Before a79521b, both paths produced bit-identical fp32 cos/sin tables (verified empirically: 0/192 entries differed). After the commit, HF cos/sin diverge from non-HF by ~1 ULP in 38/192 entries; that drift compounds across 4 transformer layers and tips past rtol=1e-3 on the CI runners (Python 3.10, source-built torch). Local Python 3.12 stayed just barely within tolerance, which is why review missed it. Gate the fp64 precompute on the property the original commit was actually protecting: a non-trivial cos/sin scale being applied. That is either LongRoPE active (Phi-3 / Phi-4 set short_factor and long_factor via config) or an explicit attention_factor != 1.0 passed through. Both cases preserve fp64; vanilla HF RoPE (Llama family, the test config) goes back to fp32 throughout and re-establishes bit-identical agreement with the non-HF path. Authored with Claude Code.
Summary
a79521b ("Add LongRoPE support and fp64 RoPE precompute for Phi-3 / Phi-4 family") unconditionally moved hf_precompute_freqs_cis to fp64 cos/sin precompute with a final cast to fp32. That works for the Phi-4 device validation that motivated the commit, but it broke test_static_attention.py::test_within_transformer on the Linux unittest runners (pull, pull-editable, trunk-release have been 100% red since the commit landed).
The test compares mha_transformer (built with use_hf_rope=False, taking the pure-fp32 precompute_freqs_cis path) against static_transformer (built with use_hf_rope=True, taking hf_precompute_freqs_cis) at rtol=1e-3, with shared weights. Before a79521b, both paths produced bit-identical fp32 cos/sin tables (verified empirically: 0/192 entries differed). After the commit, the HF cos/sin tables diverge from the non-HF ones by ~1 ULP in 38/192 entries; that drift compounds across 4 transformer layers and tips past rtol=1e-3 on the CI runners (Python 3.10, source-built torch). A local Python 3.12 run stayed just barely within tolerance, which is why review missed it.
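The divergence is straightforward to observe directly. The hypothetical repro below builds a cos table both ways (pure fp32 vs. fp64 cast to fp32) and counts entries that are not bit-identical; the dimensions are illustrative, and the exact mismatch count depends on platform and build.

```python
import torch

def count_ulp_divergent_entries(dim: int = 128, end: int = 3, theta: float = 10000.0) -> int:
    # Hypothetical repro of the table comparison described above; dim/end
    # are illustrative (128 and 3 give a 3 x 64 = 192-entry table).
    idx = torch.arange(0, dim, 2)
    inv32 = 1.0 / (theta ** (idx.to(torch.float32) / dim))
    inv64 = 1.0 / (theta ** (idx.to(torch.float64) / dim))
    cos32 = torch.outer(torch.arange(end, dtype=torch.float32), inv32).cos()
    cos64 = torch.outer(torch.arange(end, dtype=torch.float64), inv64).cos().to(torch.float32)
    # Bitwise compare: any mismatch is at least 1 ULP of drift after the cast.
    return (cos32.view(torch.int32) != cos64.view(torch.int32)).sum().item()

print(count_ulp_divergent_entries())  # nonzero wherever fp64 rounds differently
```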
Gate the fp64 precompute on the property the original commit was actually protecting: a non-trivial cos/sin scale being applied. That is either LongRoPE active (Phi-3 / Phi-4 set short_factor and long_factor via config) or an explicit attention_factor != 1.0 passed through. Both cases preserve fp64; vanilla HF RoPE (Llama family, the test config) goes back to fp32 throughout and re-establishes bit-identical agreement with the non-HF path.
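A minimal sketch of the gate, assuming illustrative parameter names (short_factor, long_factor, attention_factor are placeholders, not necessarily the merged signature):

```python
import torch

def rope_precompute_dtype(short_factor=None, long_factor=None,
                          attention_factor: float = 1.0) -> torch.dtype:
    # Sketch of the gating condition; parameter names are assumptions.
    # fp64 is kept only when a non-trivial cos/sin scale will be applied:
    # LongRoPE factors configured (Phi-3 / Phi-4), or an explicit
    # attention_factor != 1.0. Vanilla HF RoPE stays fp32 throughout,
    # restoring bit-identical agreement with the non-HF path.
    longrope_active = short_factor is not None and long_factor is not None
    if longrope_active or attention_factor != 1.0:
        return torch.float64
    return torch.float32
```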
Authored with Claude Code.
Test plan
CI