test: add real-GPU numerical tests for MoE topk_softmax/topk_sigmoid gating #269

Open
sxvvv wants to merge 1 commit into MetaX-MACA:master from sxvvv:test/moe-topk-gating

sxvvv commented Jun 6, 2026

What

Adds the first test coverage for the two _moe_C MoE gating ops that back the generic (non-grouped) fused-MoE routing path vLLM-metax inherits from vLLM (fused_moe.fused_topk -> ops.topk_softmax):

  • _moe_C.topk_softmax
  • _moe_C.topk_sigmoid

New file: tests/kernels/moe/test_topk_softmax.py. Addresses the 26Q2 roadmap item "Unit test that ported from vllm's tests folder" for cuda kernels / custom ops (#233).

How it tests

Runs the real kernels on MetaX hardware and compares against a PyTorch reference whose bias semantics mirror vLLM-metax's own maca_grouped_topk: experts are selected on the bias-corrected scores, but routing weights are gathered from the unbiased scores. Coverage:

  • expert selection (order-independent), across softmax & sigmoid, several (num_experts, topk) shapes, token counts, seeds, with/without correction bias
  • routing weights
  • renormalization sums to 1
  • output dtypes / shapes / in-range distinct ids
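The reference semantics described above can be sketched for a single token as follows. This is a dependency-free illustration, not the PR's actual PyTorch reference: the function names `reference_topk_gating` and `same_selection` are hypothetical, and adding the correction bias to the post-activation scores is an assumption about how the bias is applied.

```python
import math


def reference_topk_gating(logits, topk, scoring="softmax", bias=None,
                          renormalize=False):
    """Hypothetical single-token reference for top-k gating.

    logits: one float per expert.
    bias:   optional per-expert correction bias (assumed to be added to the
            post-activation scores for selection only).
    Returns (expert_ids, weights), ordered by descending selection score.
    """
    if scoring == "softmax":
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        scores = [e / total for e in exps]
    elif scoring == "sigmoid":
        # tanh form of the logistic function; avoids overflow in exp()
        scores = [0.5 * (1.0 + math.tanh(0.5 * x)) for x in logits]
    else:
        raise ValueError(f"unknown scoring function: {scoring!r}")

    # Experts are selected on the bias-corrected scores ...
    if bias is None:
        select = scores
    else:
        select = [s + b for s, b in zip(scores, bias)]
    order = sorted(range(len(scores)), key=lambda i: select[i], reverse=True)
    ids = order[:topk]

    # ... but routing weights are gathered from the unbiased scores.
    weights = [scores[i] for i in ids]
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return ids, weights


def same_selection(ids_a, ids_b):
    """Order-independent comparison of two expert selections."""
    return set(ids_a) == set(ids_b)


logits = [2.0, 1.0, 0.0, -1.0]
bias = [0.0, 0.0, 0.0, 5.0]  # pushes expert 3 into the selection
ids, weights = reference_topk_gating(logits, topk=2, bias=bias)
```

In this example the bias moves expert 3 into the top 2, yet its returned weight is still its small unbiased softmax probability, which is the behaviour the real tests check the kernels against.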

A real kernel bug it pins down

At exactly (num_experts=128, topk=8) the gating kernels take a wrong routing-weight branch:

| case | weight returned | correct |
|---|---|---|
| topk_softmax, no bias | softmax(scores) | OK |
| topk_softmax, bias | biased scores | should be unbiased |
| topk_sigmoid, no bias | softmax path | should be sigmoid |
| topk_sigmoid, bias | softmax path | should be unbiased sigmoid |

Expert selection is still correct in all cases; only the weights are wrong. The defect is specific to this single tile (topk=8 is fine at every other expert count; 128 experts is fine at every other topk) and is data-independent. (128, top-8) is a common MoE shape, so the affected cases are marked xfail(strict=True) rather than skipped: once the kernel is fixed they will surface as unexpected passes, which strict mode reports as failures, signalling that the markers can be removed.
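One way to attach strict xfail markers to only the affected cases is to build the parameter list programmatically. The sketch below is an assumption about how this could be arranged, not the PR's actual code: the shape grid, `is_known_bad`, and `gating_params` are hypothetical names, and the predicate simply encodes the four rows of the table above.

```python
import itertools

import pytest

# Hypothetical grid; the real test file's shapes may differ.
NUM_EXPERTS = [64, 128, 256]
TOPKS = [2, 4, 8]
SCORINGS = ["softmax", "sigmoid"]
KNOWN_BAD_TILE = (128, 8)  # taken from the PR description


def is_known_bad(num_experts, topk, scoring, use_bias):
    """True for cases whose routing weights are expected to be wrong."""
    if (num_experts, topk) != KNOWN_BAD_TILE:
        return False
    # Per the table, softmax without bias is the one correct row on this tile.
    return not (scoring == "softmax" and not use_bias)


def gating_params():
    """Build pytest parameters, marking only the known-bad cases."""
    params = []
    grid = itertools.product(NUM_EXPERTS, TOPKS, SCORINGS, [False, True])
    for num_experts, topk, scoring, use_bias in grid:
        marks = []
        if is_known_bad(num_experts, topk, scoring, use_bias):
            marks.append(pytest.mark.xfail(
                strict=True,
                reason="wrong routing-weight branch at (128, 8)",
            ))
        params.append(
            pytest.param(num_experts, topk, scoring, use_bias, marks=marks)
        )
    return params
```

A test would then be decorated with `@pytest.mark.parametrize("num_experts,topk,scoring,use_bias", gating_params())`. Keeping the correct softmax/no-bias row unmarked matters under strict mode, because a marked case that passes is reported as a failure.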

Safety on non-MetaX CI

The module skips cleanly via importorskip("mcoplib._moe_C") when the compiled extension (and therefore a MetaX backend) is absent.
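For readers unfamiliar with `importorskip`, the availability check it relies on can be approximated with the standard library alone. This is only a sketch of the idea: `extension_available` and `HAVE_MOE_C` are hypothetical names, and the test module itself uses the `importorskip` one-liner quoted above rather than this helper.

```python
import importlib


def extension_available(module_name):
    """Return True if module_name imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers a missing package as well as a missing compiled submodule.
        return False
    return True


# On a machine without the MetaX build this evaluates to False, which is the
# condition under which the whole test module would be skipped.
HAVE_MOE_C = extension_available("mcoplib._moe_C")
```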

Validation

MetaX C500 (MACA 3.5.3.20, torch 2.8.0+metax3.5.3.9): 210 passed, 36 xfailed.

Contributor

gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new test suite, tests/kernels/moe/test_topk_softmax.py, to verify the numerical correctness of the MetaX MoE top-k gating kernels (topk_softmax and topk_sigmoid). The tests compare the outputs of the custom kernels against a PyTorch reference implementation across various shapes, token counts, and configurations (with/without bias and renormalization). It also explicitly handles a known kernel bug at a specific shape using pytest.mark.xfail. No review comments were provided, so there is no additional feedback to address.


Author

sxvvv commented Jun 6, 2026

The xfail(strict=True) cases in this PR correspond to a real kernel bug, now filed separately as #270: topk_softmax/topk_sigmoid return wrong routing weights at exactly (num_experts=128, topk=8). Once that kernel is fixed, the strict xfails here will surface as unexpected passes; the fix then only needs the xfail marker removed.

…gating

Validated on MetaX C500: 210 passed, 36 xfailed. The xfails pin a real
kernel dispatch bug at (num_experts=128, topk=8).