test(qwen3): cover TP=2 in the hf golden gate #322

Merged: xiaguan merged 1 commit into openinfer-project:main from FeathBow:test/245-qwen3-tp2-golden-gate on Jun 10, 2026

@FeathBow

Contributor

Description

Refs #245

Adds TP=2 coverage to the Qwen3-4B HF golden gate, which previously ran only with device_ordinals: vec![0]. The eager replay suite (bs=1 sequential with the determinism rerun, batched isolation, and both prefix-cached replays) is extracted into run_eager_suite(golden, model_path, devices, label) and rerun under TP=2 against the same golden data and tolerances, so no new golden files are needed. The TP=2 pass lives in the existing #[test], so it runs in the normal local test flow on any machine with at least 2 GPUs and skips automatically on machines with fewer. TP=8 is the follow-on already named in the issue.
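For readers unfamiliar with the pattern, the skip-below-two-GPUs behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `plan_tp_pass`, `run_gate` and `run_eager_suite_stub` are hypothetical names, the visible device count is passed in as a plain number instead of being queried from CUDA, and the stub only records which passes would run.

```rust
/// Hypothetical helper: pick the device ordinals for a tensor-parallel pass,
/// or return a skip message when too few devices are visible.
fn plan_tp_pass(visible: usize, tp: usize) -> Result<Vec<usize>, String> {
    if visible >= tp {
        // Use the first `tp` ordinals: 0, 1, ..., tp - 1.
        Ok((0..tp).collect())
    } else {
        Err(format!(
            "skipping hf_golden_gate TP={} pass: {} CUDA device(s) visible, need {}",
            tp, visible, tp
        ))
    }
}

/// Hypothetical stand-in for the extracted suite: it does no inference and
/// only reports the label and devices it was asked to run with.
fn run_eager_suite_stub(devices: &[usize], label: &str) -> String {
    format!("{}sequential bs=1 eager on {:?}", label, devices)
}

/// Runs the single-GPU pass unconditionally, then the TP=2 pass only if
/// enough devices are visible. Returns a description of each pass that ran.
fn run_gate(visible: usize) -> Vec<String> {
    let mut ran = Vec::new();
    ran.push(run_eager_suite_stub(&[0], ""));
    match plan_tp_pass(visible, 2) {
        Ok(devices) => ran.push(run_eager_suite_stub(&devices, "tp2 ")),
        // Skipping is not a failure: print the reason and carry on.
        Err(msg) => eprintln!("{}", msg),
    }
    ran
}
```

Calling `run_gate(1)` would run only the single-GPU pass and print the skip message, while `run_gate(2)` would run both. Keeping the decision in one small function means the same check could later be reused for a larger TP degree.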

Test Env

  • Dual-GH200 (aarch64, sm_90) node: full gate green, 7 single-GPU + 4 TP=2 passes in one process. TP=2 sits at the single-GPU noise floor (mean ~0.031, p99 ~0.12-0.13).
  • Single GPU (x86_64, sm_89): existing passes green and unchanged; the TP=2 pass skips with the message "skipping hf_golden_gate TP=2 pass: 1 CUDA device(s) visible, need 2".
Full gate output on the dual-GH200 (single-GPU + TP=2, one process):
hf_golden_gate [sequential bs=1 eager]: 816 positions, 6528 head deltas — mean 0.0318 p50 0.0242 p99 0.1173 max 0.3124
hf_golden_gate [sequential bs=1 eager]: worst head delta 0.3124 @ seq 7 pos 5 token 68172 (pega -9.9384, HF -10.2508)
hf_golden_gate [batched eager (9, no pad)]: 153 positions, 1224 head deltas — mean 0.0325 p50 0.0240 p99 0.1558 max 0.3124
hf_golden_gate [batched eager (9, no pad)]: worst head delta 0.3124 @ seq 7 pos 5 token 239 (pega -9.0635, HF -9.3758)
hf_golden_gate [sequential bs=1 eager cached replay]: 816 positions, 6528 head deltas — mean 0.0313 p50 0.0253 p99 0.1168 max 0.3124
hf_golden_gate [sequential bs=1 eager cached replay]: worst head delta 0.3124 @ seq 7 pos 5 token 68172 (pega -9.9384, HF -10.2508)
hf_golden_gate [batched eager cached replay (9)]: 153 positions, 1224 head deltas — mean 0.0313 p50 0.0248 p99 0.1204 max 0.3124
hf_golden_gate [batched eager cached replay (9)]: worst head delta 0.3124 @ seq 7 pos 5 token 68172 (pega -9.9384, HF -10.2508)
hf_golden_gate [batched cuda-graph (9 padded)]: 153 positions, 1224 head deltas — mean 0.0325 p50 0.0240 p99 0.1558 max 0.3124
hf_golden_gate [batched cuda-graph (9 padded)]: worst head delta 0.3124 @ seq 7 pos 5 token 239 (pega -9.0635, HF -9.3758)
hf_golden_gate [batched cuda-graph (5 padded)]: 85 positions, 680 head deltas — mean 0.0311 p50 0.0248 p99 0.1271 max 0.1928
hf_golden_gate [batched cuda-graph (5 padded)]: worst head delta 0.1928 @ seq 3 pos 9 token 398 (pega -5.5985, HF -5.4057)
hf_golden_gate [batched cuda-graph cached replay (5)]: 85 positions, 680 head deltas — mean 0.0309 p50 0.0262 p99 0.1260 max 0.1816
hf_golden_gate [batched cuda-graph cached replay (5)]: worst head delta 0.1816 @ seq 4 pos 9 token 47534 (pega -5.2187, HF -5.4003)
hf_golden_gate [tp2 sequential bs=1 eager]: 816 positions, 6528 head deltas — mean 0.0311 p50 0.0245 p99 0.1189 max 0.2741
hf_golden_gate [tp2 sequential bs=1 eager]: worst head delta 0.2741 @ seq 14 pos 7 token 291 (pega -4.8874, HF -4.6133)
hf_golden_gate [tp2 batched eager (9, no pad)]: 153 positions, 1224 head deltas — mean 0.0302 p50 0.0221 p99 0.1280 max 0.3124
hf_golden_gate [tp2 batched eager (9, no pad)]: worst head delta 0.3124 @ seq 7 pos 5 token 68172 (pega -9.9384, HF -10.2508)
hf_golden_gate [tp2 sequential bs=1 eager cached replay]: 816 positions, 6528 head deltas — mean 0.0313 p50 0.0246 p99 0.1175 max 0.2741
hf_golden_gate [tp2 sequential bs=1 eager cached replay]: worst head delta 0.2741 @ seq 14 pos 7 token 291 (pega -4.8874, HF -4.6133)
hf_golden_gate [tp2 batched eager cached replay (9)]: 153 positions, 1224 head deltas — mean 0.0309 p50 0.0250 p99 0.1234 max 0.2034
hf_golden_gate [tp2 batched eager cached replay (9)]: worst head delta 0.2034 @ seq 4 pos 1 token 59941 (pega -2.3543, HF -2.5578)
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 33.68s
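Each summary line above reports the mean, p50, p99 and max over a set of deltas between a pega value and an HF value (8 deltas per position, for example 6528 / 816). A minimal sketch of how such a summary could be computed is shown below. It assumes absolute differences and nearest-rank percentiles; the gate's actual percentile method is not shown in this PR, and `DeltaSummary`, `percentile` and `summarize` are hypothetical names.

```rust
/// Hypothetical summary of absolute deltas between two sets of values.
#[derive(Debug, PartialEq)]
struct DeltaSummary {
    mean: f64,
    p50: f64,
    p99: f64,
    max: f64,
}

/// Nearest-rank percentile over an ascending, non-empty slice
/// (an assumption; other interpolation schemes give slightly different values).
fn percentile(sorted: &[f64], q: f64) -> f64 {
    let n = sorted.len();
    let rank = ((q * n as f64).ceil() as usize).clamp(1, n);
    sorted[rank - 1]
}

/// Summarize |ours - reference| over all pairs; returns None for empty input.
fn summarize(pairs: &[(f64, f64)]) -> Option<DeltaSummary> {
    if pairs.is_empty() {
        return None;
    }
    let mut deltas: Vec<f64> = pairs.iter().map(|(a, b)| (a - b).abs()).collect();
    // total_cmp gives a total order on f64, so sorting cannot panic on NaN.
    deltas.sort_by(|a, b| a.total_cmp(b));
    let mean = deltas.iter().sum::<f64>() / deltas.len() as f64;
    Some(DeltaSummary {
        mean,
        p50: percentile(&deltas, 0.50),
        p99: percentile(&deltas, 0.99),
        max: *deltas.last().unwrap(),
    })
}
```

As a sanity check against the log, the pair (pega -9.9384, HF -10.2508) yields a delta of about 0.3124, matching the reported worst head delta for the single-GPU sequential pass.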

Type of Change

  • New feature (non-breaking change which adds functionality)

Checklist

  • My code follows the style guidelines of this project (see docs/conventions/coding-style.md).
  • I have performed a self-review of my own code.
  • I have formatted my commits according to Commitizen conventions.
  • I have run the local test suite and all tests pass (see CLAUDE.md).

@xiaguan left a comment

Collaborator

LGTM

@xiaguan merged commit 7c68f77 into openinfer-project:main on Jun 10, 2026
1 check passed