Potential numerical accuracy issue in the fused_moe implementation #23

@huanghua1994

Test environment: CTK 13.1, torch = 2.9.1+cu130, cuda-tile = 1.0.0, single B200 GPU

With the parameter set (num_tokens, hidden_size, moe_intermediate_size, n_experts, top_k) = (128, 4096, 2048, 16, 4) in tests/ops/test_moe.py and torch.manual_seed(0) added at line 101, the test fails with mismatched results:

(Earlier output omitted)
>       assert passed, f"\n{failed_msgs}"
               ^^^^^^
E       AssertionError:
E               *** OUTPUT 0 DID NOT MATCH THE REFERENCE (rtol=0.1, atol=0.1) ***
E                       allclose: False
E                       matched: 523466 / 524288 [99.84%]
E                       ref range:    -1.1700e+02 :  1.2300e+02
E                       test range:   -1.1650e+02 :  1.2300e+02
E                       |ref| range:   0.0000e+00 :  1.2300e+02
E                       |test| range:  0.0000e+00 :  1.2300e+02
E                       max absolute difference:  1.0000e+00
E                       max relative change:      4.3000e+01
E                       max max mean change:      2.0000e+00
E                       max arith mean change:    5.1400e+02
E                       shape: torch.Size([128, 4096]) stride: (4096, 1) dtype: torch.bfloat16
E                       mismatched indices:tensor([[   0, 1842],
E               [   0, 2071],
E               [   1,  675],
E               ...,
E               [ 127, 1728],
E               [ 127, 2207],
E               [ 127, 2628]])

For the same parameter set, the test passes if dtype is changed to float16.
It also passes with a smaller hidden_size or a smaller moe_intermediate_size.
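
For context on the float16 vs. bfloat16 difference, here is a rough sketch of how coarse each format is at the magnitudes seen in the log. It is not taken from the test suite, and it does not establish the root cause. The helper names `ulp_spacing` and `within_tolerance` are hypothetical. The sketch assumes the comparison follows the `torch.allclose` rule `|test - ref| <= atol + rtol * |ref|`, which the log does not confirm.

```python
import math

BF16_MANTISSA_BITS = 7   # explicit fraction bits in bfloat16
FP16_MANTISSA_BITS = 10  # explicit fraction bits in float16


def ulp_spacing(x, mantissa_bits):
    """Gap between adjacent representable values near a nonzero x.

    Ignores subnormals and overflow; intended for the normal range only.
    """
    _, exp = math.frexp(abs(x))  # abs(x) == m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 1 - mantissa_bits)


def within_tolerance(test, ref, rtol=0.1, atol=0.1):
    """Assumed element-wise check: |test - ref| <= atol + rtol * |ref|."""
    return abs(test - ref) <= atol + rtol * abs(ref)


if __name__ == "__main__":
    for magnitude in (1.0, 9.0, 117.0):
        bf16 = ulp_spacing(magnitude, BF16_MANTISSA_BITS)
        fp16 = ulp_spacing(magnitude, FP16_MANTISSA_BITS)
        print(f"|x| ~ {magnitude:6.1f}: bf16 step {bf16}, fp16 step {fp16}")
```

Near the extremes of the reported range (about ±120), adjacent bfloat16 values are 0.5 apart, against 0.0625 for float16. Under the assumed rule, an absolute difference of 1.0 fails only where |ref| is below 9. This suggests the mismatched elements are small-magnitude outputs, where error accumulated over the long reductions may dominate.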
