test: add real-GPU numerical tests for MoE topk_softmax/topk_sigmoid gating #269
sxvvv wants to merge 1 commit
Code Review
This pull request introduces a new test suite, tests/kernels/moe/test_topk_softmax.py, to verify the numerical correctness of the MetaX MoE top-k gating kernels (topk_softmax and topk_sigmoid). The tests compare the outputs of the custom kernels against a PyTorch reference implementation across various shapes, token counts, and configurations (with/without bias and renormalization). It also explicitly handles a known kernel bug at a specific shape using pytest.mark.xfail. No review comments were provided, so there is no additional feedback to address.
…gating Validated on MetaX C500: 210 passed, 36 xfailed. The xfails pin a real kernel dispatch bug at (num_experts=128, topk=8).
63ff860 to b68c455
What
Adds the first test coverage for the two _moe_C MoE gating ops that back the generic (non-grouped) fused-MoE routing path vLLM-metax inherits from vLLM (fused_moe.fused_topk -> ops.topk_softmax):
- _moe_C.topk_softmax
- _moe_C.topk_sigmoid
New file: tests/kernels/moe/test_topk_softmax.py. Addresses the 26Q2 roadmap item "Unit test that ported from vllm's tests folder" for cuda kernels / custom ops (#233).
How it tests
Runs the real kernels on MetaX hardware and compares against a PyTorch reference whose bias semantics mirror vLLM-metax's own maca_grouped_topk: experts are selected on the bias-corrected scores, but routing weights are gathered from the unbiased scores. Coverage: (num_experts, topk) shapes, token counts, seeds, with/without correction bias.
A real kernel bug it pins down
At exactly (num_experts=128, topk=8) the gating kernels take a wrong routing-weight branch.
Expert selection is still correct in all cases; only the weights are wrong. The defect is specific to this single tile (topk=8 is fine at every other expert count; 128 experts is fine at every other topk) and is data-independent.
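To make the selection-versus-weights distinction concrete, here is a minimal sketch of a PyTorch reference with the bias semantics described under "How it tests", plus a comparison that reports expert selection and routing weights separately. The function names, signatures, and tolerances are hypothetical and are not taken from the PR's test file.

```python
import torch


def reference_topk_gating(logits, topk, bias=None, renormalize=False,
                          scoring="softmax"):
    """Hypothetical reference for top-k gating (not the PR's actual helper)."""
    logits = logits.float()
    if scoring == "softmax":
        scores = torch.softmax(logits, dim=-1)
    else:
        scores = torch.sigmoid(logits)
    # Experts are selected on the bias-corrected scores ...
    select_scores = scores if bias is None else scores + bias.float()
    topk_ids = torch.topk(select_scores, k=topk, dim=-1).indices
    # ... but routing weights are gathered from the unbiased scores.
    topk_weights = scores.gather(-1, topk_ids)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


def compare_to_reference(weights, ids, ref_weights, ref_ids,
                         atol=1e-5, rtol=1e-4):
    """Return (selection_ok, weights_ok) so the two failure modes stay distinct."""
    # Sort each row by expert id so ordering within the top-k is ignored.
    order = ids.argsort(dim=-1)
    ref_order = ref_ids.argsort(dim=-1)
    selection_ok = torch.equal(ids.gather(-1, order).long(),
                               ref_ids.gather(-1, ref_order).long())
    weights_ok = torch.allclose(weights.gather(-1, order).float(),
                                ref_weights.gather(-1, ref_order).float(),
                                atol=atol, rtol=rtol)
    return selection_ok, weights_ok
```

With a check shaped like this, a failure pattern of "selection correct, weights wrong" shows up as `(True, False)` rather than as a single undifferentiated mismatch.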
(128, top-8) is a common MoE shape, so the affected cases are marked xfail(strict=True) rather than skipped: they keep running, and once the kernel is fixed they will show up as unexpected passes, which strict mode reports as failures, signalling that the marker should be removed.
Safety on non-MetaX CI
The module skips cleanly via importorskip("mcoplib._moe_C") when the compiled extension (and therefore a MetaX backend) is absent.
Validation
MetaX C500 (MACA 3.5.3.20, torch 2.8.0+metax3.5.3.9): 210 passed, 36 xfailed.
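The strict-xfail handling described under "A real kernel bug it pins down" could be wired into the parametrization roughly as follows. This is a sketch under assumptions: the shape grid, helper name, test name, and reason string are hypothetical, and the kernel call itself is omitted.

```python
import pytest

# In a real test module, a module-level guard such as
#   pytest.importorskip("mcoplib._moe_C")
# would come first, so collection skips cleanly without the extension.

# The (num_experts, topk) shape reported as defective in this PR.
KNOWN_BAD_SHAPE = (128, 8)

# Hypothetical shape grid; the PR's real grid is not reproduced here.
SHAPES = [(64, 8), (128, 4), (128, 8), (256, 8)]


def shape_params(shapes, known_bad=KNOWN_BAD_SHAPE):
    """Wrap each shape in pytest.param, marking the known-bad one strict-xfail."""
    params = []
    for num_experts, topk in shapes:
        marks = ()
        if (num_experts, topk) == known_bad:
            marks = (pytest.mark.xfail(
                strict=True,
                reason="wrong routing weights at this shape"),)
        params.append(pytest.param(num_experts, topk, marks=marks,
                                   id=f"e{num_experts}-k{topk}"))
    return params


@pytest.mark.parametrize("num_experts, topk", shape_params(SHAPES))
def test_gating_matches_reference(num_experts, topk):
    # Placeholder body: the real test would run the kernel and compare
    # its output against the PyTorch reference.
    pytest.skip("kernel call omitted in this sketch")
```

Keeping the known-bad shape in one constant means removing the marker after a kernel fix is a single-line change.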