[feature] support mctlassEx bf16 moe#266
Signed-off-by: Yi Zhang (yi1zhang) - Linux Multimedia <yi1zhang@metax-tech.com>
Code Review
This pull request introduces support for bf16 MoE (Mixture of Experts) utilizing the mctlass library, including custom operator registration and routing logic in the fused MoE execution path. The review feedback highlights critical issues where the code assumes mctlass is successfully imported based solely on environment variables, which could lead to runtime crashes if the library is missing; it is recommended to explicitly verify that the mctlass operators are not None. Additionally, several type annotations for scale and bias parameters should be updated to allow None values to prevent static type-checking errors.
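The type-annotation point can be illustrated with a minimal sketch. The function and parameter names below (`describe_gemm_inputs`, `a_scale`, `b_scale`, `bias`) are hypothetical and not taken from the PR, and `Sequence[float]` stands in for `torch.Tensor` so the snippet runs without PyTorch. The idea is simply that a parameter which may legitimately be `None` should be annotated `Optional[...]`, so a static type checker accepts callers that omit it.

```python
from typing import Optional, Sequence


def describe_gemm_inputs(
    a_scale: Optional[Sequence[float]] = None,  # None means "no activation scale"
    b_scale: Optional[Sequence[float]] = None,  # None means "no weight scale"
    bias: Optional[Sequence[float]] = None,     # None means "no bias"
) -> str:
    """Report which optional inputs were supplied (hypothetical helper)."""
    candidates = (("a_scale", a_scale), ("b_scale", b_scale), ("bias", bias))
    supplied = [name for name, value in candidates if value is not None]
    return ", ".join(supplied) if supplied else "unquantized"
```

With a bare `Sequence[float]` annotation, a call such as `describe_gemm_inputs(a_scale=None)` would be flagged by a type checker even though it works at runtime.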
if (
    hidden_states.dtype == torch.bfloat16
    and not self.quant_config.use_int4_w4a8
    and not self.quant_config.use_int4_w4a16
    and not self.quant_config.use_int8_w8a8
    and not self.quant_config.use_int8_w8a16
    and mx_envs.MACA_VLLM_ENABLE_MCTLASS_FUSED_MOE
    and mx_envs.MACA_VLLM_ENABLE_MCTLASS_PYTHON_API
):
The bf16 moe block checks mx_envs.MACA_VLLM_ENABLE_MCTLASS_FUSED_MOE and mx_envs.MACA_VLLM_ENABLE_MCTLASS_PYTHON_API, but it does not verify if the optional mctlassEx library was successfully imported (i.e., mctlass_ops.mctlass_moe_gemm is not None). If these environment variables are enabled but mctlassEx is not installed, this will result in an AssertionError and crash the execution instead of falling back to the default Triton implementation.
Please add a check to ensure mctlass_ops.mctlass_moe_gemm is not None before entering this block.
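The failure mode described here can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `load_optional_op`, `choose_backend`, and the module name `hypothetical_mctlass_ex` are illustrative assumptions, and the backends are represented by plain strings. It shows an optional operator resolving to `None` when its library is missing, and a dispatch condition that checks both the flag and the operator before taking the accelerated path.

```python
import importlib
from typing import Callable, Optional


def load_optional_op(module_name: str, attr: str) -> Optional[Callable]:
    """Return module_name.attr, or None if the module or attribute is absent."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


# Hypothetical stand-in for the optional library import; this module name is
# deliberately one that is not installed, so the lookup yields None.
mctlass_moe_gemm = load_optional_op("hypothetical_mctlass_ex", "mctlass_moe_gemm")


def choose_backend(env_enabled: bool, op: Optional[Callable]) -> str:
    """Select the accelerated path only if the flag is set AND the op exists."""
    if env_enabled and op is not None:
        return "mctlass"
    # Either the flag is off or the library is missing: use the default path.
    return "triton"
```

With the environment flag on but the library absent, `choose_backend(True, mctlass_moe_gemm)` selects the fallback rather than reaching an assertion inside the accelerated path.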
if (
    hidden_states.dtype == torch.bfloat16
    and not self.quant_config.use_int4_w4a8
    and not self.quant_config.use_int4_w4a16
    and not self.quant_config.use_int8_w8a8
    and not self.quant_config.use_int8_w8a16
    and mx_envs.MACA_VLLM_ENABLE_MCTLASS_FUSED_MOE
    and mx_envs.MACA_VLLM_ENABLE_MCTLASS_PYTHON_API
    and getattr(mctlass_ops, "mctlass_moe_gemm", None) is not None
):

elif (
    A.dtype == torch.bfloat16 and B.dtype == torch.bfloat16 and C.dtype == torch.bfloat16
    and A_scale is None and B_scale is None
    and mx_envs.MACA_VLLM_ENABLE_MCTLASS_FUSED_MOE
    and mx_envs.MACA_VLLM_ENABLE_MCTLASS_PYTHON_API
):
Similar to the issue in triton_moe.py, this elif block will call mctlass_ops.cutlass_moe_mm_bf16 which asserts that mctlass_moe_gemm is not None. If mctlassEx is not installed, this will crash instead of falling back to the else block (Triton kernel).
Please add a check to ensure mctlass_ops.mctlass_moe_gemm is not None in the condition.
elif (
    A.dtype == torch.bfloat16 and B.dtype == torch.bfloat16 and C.dtype == torch.bfloat16
    and A_scale is None and B_scale is None
    and mx_envs.MACA_VLLM_ENABLE_MCTLASS_FUSED_MOE
    and mx_envs.MACA_VLLM_ENABLE_MCTLASS_PYTHON_API
    and getattr(mctlass_ops, "mctlass_moe_gemm", None) is not None
):

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>