test: add real-GPU numerical tests for MoE topk_softmax/topk_sigmoid gating by sxvvv · Pull Request #269 · MetaX-MACA/vLLM-metax

sxvvv · 2026-06-06T10:51:04Z

What

Adds the first test coverage for the two _moe_C MoE gating ops that back the generic (non-grouped) fused-MoE routing path vLLM-metax inherits from vLLM (fused_moe.fused_topk -> ops.topk_softmax):

_moe_C.topk_softmax
_moe_C.topk_sigmoid

New file: tests/kernels/moe/test_topk_softmax.py. Addresses the 26Q2 roadmap item "Unit test that ported from vllm's tests folder" for cuda kernels / custom ops (#233).

How it tests

Runs the real kernels on MetaX hardware and compares against a PyTorch reference whose bias semantics mirror vLLM-metax's own maca_grouped_topk: experts are selected on the bias-corrected scores, but routing weights are gathered from the unbiased scores. Coverage:

expert selection (order-independent), across softmax & sigmoid, several (num_experts, topk) shapes, token counts, seeds, with/without correction bias
routing weights
renormalization sums to 1
output dtypes / shapes / in-range distinct ids
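
The bias semantics described above can be sketched in plain Python. This is not the PR's PyTorch reference, which is not shown in this thread. The names `reference_topk`, `softmax`, and `sigmoid` are hypothetical helpers chosen for illustration, and the function handles a single token's logits rather than a batched tensor.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(row):
    """Element-wise logistic sigmoid over one token's expert logits."""
    return [1.0 / (1.0 + math.exp(-x)) for x in row]


def reference_topk(logits, topk, scoring="softmax", bias=None, renormalize=False):
    """Hypothetical single-token reference: returns (weights, expert_ids)."""
    if scoring == "softmax":
        scores = softmax(logits)
    elif scoring == "sigmoid":
        scores = sigmoid(logits)
    else:
        raise ValueError(f"unknown scoring function: {scoring}")

    # Experts are selected on the bias-corrected scores ...
    biased = scores if bias is None else [s + b for s, b in zip(scores, bias)]
    ids = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:topk]

    # ... but routing weights are gathered from the unbiased scores.
    weights = [scores[i] for i in ids]

    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights, ids


# Expert 0 is selected only because of its bias; its weight is still the
# (small) unbiased softmax score.
weights, ids = reference_topk([0.0, 1.0, 2.0, 3.0], topk=2,
                              bias=[10.0, 0.0, 0.0, 0.0])
```

Because expert selection is compared order-independently in the PR, a test built on a reference like this would compare the sets of selected ids and match weights per expert id rather than by position.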

A real kernel bug it pins down

At exactly (num_experts=128, topk=8) the gating kernels take a wrong routing-weight branch:

case	weight returned	correct
topk_softmax, no bias	softmax(scores)	OK
topk_softmax, bias	biased scores	should be unbiased
topk_sigmoid, no bias	softmax path	should be sigmoid
topk_sigmoid, bias	softmax path	should be unbiased sigmoid

Expert selection is still correct in all cases; only the weights are wrong. The defect is specific to this single tile (topk=8 is fine at every other expert count; 128 experts is fine at every other topk) and is data-independent. (128, top-8) is a common MoE shape, so the affected cases are marked xfail(strict=True) rather than skipped: once the kernel is fixed, they surface as unexpected passes, which strict mode reports as failures until the xfail marker is removed.

Safety on non-MetaX CI

The module skips cleanly via importorskip("mcoplib._moe_C") when the compiled extension (and therefore a MetaX backend) is absent.

Validation

MetaX C500 (MACA 3.5.3.20, torch 2.8.0+metax3.5.3.9): 210 passed, 36 xfailed.

gemini-code-assist

Code Review

This pull request introduces a new test suite, tests/kernels/moe/test_topk_softmax.py, to verify the numerical correctness of the MetaX MoE top-k gating kernels (topk_softmax and topk_sigmoid). The tests compare the outputs of the custom kernels against a PyTorch reference implementation across various shapes, token counts, and configurations (with/without bias and renormalization). It also explicitly handles a known kernel bug at a specific shape using pytest.mark.xfail. No review comments were provided, so there is no additional feedback to address.

sxvvv · 2026-06-06T11:02:46Z

The xfail(strict=True) cases in this PR correspond to a real kernel bug, now filed separately as #270 — topk_softmax/topk_sigmoid return wrong routing weights at exactly (num_experts=128, topk=8). Once that kernel is fixed, the strict xfails here will surface as unexpected passes; the fix then only needs the xfail marker removed.

gemini-code-assist Bot reviewed Jun 6, 2026

sxvvv mentioned this pull request Jun 6, 2026

[Bug]: MoE topk_softmax/topk_sigmoid return wrong routing weights at (num_experts=128, topk=8) #270

Open

test: add real-GPU numerical tests for MoE topk_softmax/topk_sigmoid …

b68c455

…gating Validated on MetaX C500: 210 passed, 36 xfailed. The xfails pin a real kernel dispatch bug at (num_experts=128, topk=8).

sxvvv force-pushed the test/moe-topk-gating branch from 63ff860 to b68c455 · June 6, 2026 11:08

ILikeIneine force-pushed the master branch from 3e2816c to 3029d1e · June 8, 2026 08:25

sxvvv wants to merge 1 commit into MetaX-MACA:master from sxvvv:test/moe-topk-gating
