feat: add MXFP8 fused operators for Wan transformer inference on SM120 by Fatemanx · Pull Request #1090 · ModelTC/LightX2V

Fatemanx · 2026-05-23T04:06:22Z

Implement three fused CUDA kernels for MXFP8 quantized inference on Blackwell (SM120):

scaled_mxfp8_gelu_quant: fuse GELU activation + E8M0 quantization
scaled_mxfp8_modulate_quant: fuse scale/shift modulation + quantization
cutlass_scaled_mxfp8_mm_residual_gate: fuse GEMM + residual + gate in CUTLASS 3.x epilogue
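
For readers unfamiliar with E8M0 scaling, here is a minimal host-side sketch of what "GELU + E8M0 quantization" could look like for one block of values. It is not taken from the PR's kernels. It assumes the common MX convention (blocks of 32 values sharing one power-of-two scale, stored as an exponent biased by 127, with FP8 E4M3 elements whose largest magnitude is 448) and the erf form of GELU. The function names are hypothetical, and mantissa rounding to E4M3 is deliberately left out.

```python
import math
from typing import List, Sequence, Tuple

E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8      # exponent of that maximum: 448 = 1.75 * 2**8
E8M0_BIAS = 127    # E8M0 stores only a biased power-of-two exponent


def gelu(x: float) -> float:
    """erf-based GELU (assumption: the real kernel may use the tanh form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def e8m0_biased_exponent(block: Sequence[float]) -> int:
    """Shared scale exponent for one block: floor(log2(amax)) - emax, plus bias."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return E8M0_BIAS  # scale 1.0 for an all-zero block (arbitrary choice here)
    _, exp = math.frexp(amax)  # amax = m * 2**exp with 0.5 <= m < 1
    unbiased = (exp - 1) - E4M3_EMAX
    return max(0, min(254, unbiased + E8M0_BIAS))  # 255 is reserved for NaN


def gelu_quant_block(block: Sequence[float]) -> Tuple[List[float], int]:
    """Apply GELU, then scale the block into E4M3 range (no mantissa rounding)."""
    activated = [gelu(v) for v in block]
    biased = e8m0_biased_exponent(activated)
    scale = 2.0 ** (biased - E8M0_BIAS)
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in activated]
    return scaled, biased


if __name__ == "__main__":
    values, exponent = gelu_quant_block([0.1 * i - 1.6 for i in range(32)])
    print(exponent, max(abs(v) for v in values))
```

Fusing the two steps means the activated values never have to be written out in FP16/BF16 and read back before quantization. Under the floor rule sketched here, scaled magnitudes between 448 and 512 saturate at 448.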

Performance on RTX 5090 (Wan 5B FFN, m=4096, hidden=1536, ffn=8960):

GELU+Quant: 1.30× faster (27.8μs → 21.3μs)
Modulate+Quant: 3.26× faster (92.7μs → 28.5μs)
GEMM+Residual+Gate: 1.40× faster (194.7μs → 138.9μs)
End-to-end FFN: 1.20× faster (608μs → 505μs, -103μs per block)
Reduces kernel launches from 7 to 3 per FFN block
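
As a quick sanity check, the speedup factors can be recomputed from the timings quoted above. The sketch below uses only those numbers; because the timings are already rounded to 0.1 μs, the recomputed ratios can differ from the quoted ones in the last digit.

```python
# (before_us, after_us) pairs copied from the PR description above.
timings_us = {
    "GELU+Quant": (27.8, 21.3),
    "Modulate+Quant": (92.7, 28.5),
    "GEMM+Residual+Gate": (194.7, 138.9),
    "End-to-end FFN": (608.0, 505.0),
}


def speedup(before_us: float, after_us: float) -> float:
    """Ratio of the unfused time to the fused time."""
    return before_us / after_us


def saved_us(before_us: float, after_us: float) -> float:
    """Absolute time saved per call, in microseconds."""
    return before_us - after_us


if __name__ == "__main__":
    for name, (before, after) in timings_us.items():
        print(f"{name}: {speedup(before, after):.2f}x, "
              f"saves {saved_us(before, after):.1f} us")
```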

Features:

Supports all Wan tasks (t2v/i2v/flf2v/animate/s2v/rs2v)
Auto-fallback on non-SM120 GPUs (H100/A100/RTX4090) with warning
Handles FP16/BF16 activations (kernel auto-detects dtype)
One-time device capability probe at init (eliminates ~4000 redundant checks per inference)
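
The one-time probe and the warning-based fallback could be combined roughly as follows. This is a sketch, not the PR's code: `Sm120Gate` and its `probe` argument are hypothetical names, and the probe is injected so the example runs without a GPU. With PyTorch, `torch.cuda.get_device_capability` would be a natural probe.

```python
import warnings
from typing import Callable, Dict, Tuple

Capability = Tuple[int, int]  # (major, minor) compute capability


class Sm120Gate:
    """Caches, per device index, whether the fused SM120 path may be used."""

    def __init__(self, probe: Callable[[int], Capability]) -> None:
        # `probe` is a stand-in for a real capability query.
        self._probe = probe
        self._supported: Dict[int, bool] = {}
        self.probe_calls = 0

    def use_fused(self, device_index: int) -> bool:
        """Return True if the fused kernels apply; probe each device only once."""
        if device_index not in self._supported:
            self.probe_calls += 1
            major, minor = self._probe(device_index)
            supported = major == 12  # same test as `cached_major == 12` in the diff
            if not supported:
                warnings.warn(
                    f"MXFP8 fused ops require SM120; device {device_index} "
                    f"reports {major}.{minor}, using the unfused path"
                )
            self._supported[device_index] = supported
        return self._supported[device_index]


if __name__ == "__main__":
    fake_caps = {0: (12, 0), 1: (9, 0)}  # made-up devices for illustration
    gate = Sm120Gate(lambda index: fake_caps[index])
    for _ in range(4000):
        gate.use_fused(0)
    print(gate.probe_calls)
```

Caching the answer also means the fallback warning is emitted once per device rather than once per call.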

Tested: 10/10 unit tests pass, 6/6 fallback scenarios verified

gemini-code-assist

Code Review

This pull request implements MXFP8 fused operations for the Wan transformer model, specifically optimized for SM120/SM120a GPUs. The changes include new CUDA kernels for MXFP8 GeLU quantization, modulate quantization, and a fused residual-gate GEMM utilizing CUTLASS, along with corresponding Python wrappers and unit tests. Reviewer feedback highlights several optimization and code quality improvements: moving static parameter device transfers out of the inference loop, consolidating duplicated hardware validation logic into a common utility, replacing std::cerr with idiomatic TORCH_CHECK calls, improving numerical precision by avoiding intermediate rounding in the residual update, and eliminating dynamic tensor allocations in the performance-critical path.

gemini-code-assist · 2026-05-23T04:08:10Z

+        return self._mxfp8_apply_quantized(module, input_tensor_quant, input_tensor_scale)
+
+    def _mxfp8_apply_quantized(self, module, input_tensor_quant, input_tensor_scale):
+        module.alpha = module.alpha.to(module.weight.device)


Moving module.alpha to the weight device in every iteration of the inference loop introduces unnecessary Python overhead and potential synchronization points. Since alpha is a static quantization parameter, it should ideally be moved to the correct device once during model initialization. At the very least, check if the device move is necessary before performing it to avoid redundant operations.

if module.alpha.device != module.weight.device:
    module.alpha = module.alpha.to(module.weight.device)

gemini-code-assist · 2026-05-23T04:08:10Z

+inline void check_sm120_or_throw(torch::Tensor const& tensor, char const* op_name) {
+  int device = tensor.get_device();
+  check_valid_cuda_device_index(device, op_name);
+
+  static std::array<std::once_flag, kMaxCudaDevices> device_once;
+  static std::array<int, kMaxCudaDevices> cached_major{};
+  static std::array<int, kMaxCudaDevices> cached_minor{};
+
+  std::call_once(device_once[device], [device]() {
+    CHECK_CUDA_SUCCESS(cudaDeviceGetAttribute(&cached_major[device], cudaDevAttrComputeCapabilityMajor, device));
+    CHECK_CUDA_SUCCESS(cudaDeviceGetAttribute(&cached_minor[device], cudaDevAttrComputeCapabilityMinor, device));
+  });
+
+  TORCH_CHECK(
+      cached_major[device] == 12,
+      op_name,
+      " is only supported on SM120/SM120a GPUs, got CUDA device ",
+      device,
+      " with compute capability ",
+      cached_major[device],
+      ".",
+      cached_minor[device]);
+}


The check_sm120_or_throw utility is duplicated in both mxfp8_quant_kernels_sm120.cu and mxfp8_scaled_mm_kernels_sm120.cu. Furthermore, the implementations differ (one uses cudaDeviceGetAttribute while the other uses cudaGetDeviceProperties). This function should be moved to a common header (e.g., utils.h) to ensure consistency and reduce code duplication.

gemini-code-assist · 2026-05-23T04:08:10Z

+    default: {
+      std::cerr << "Observing: " << input.scalar_type() << " for the input datatype which is invalid";
+      throw std::runtime_error("Unsupported input data type for gelu_quantize_to_fp8.");
+    }


Using std::cerr for error reporting in a library is not recommended as it bypasses standard logging mechanisms and can be missed in production environments. Using TORCH_CHECK is more idiomatic in PyTorch extensions and provides a better error message including the tensor type.

default: {
  TORCH_CHECK(false, "Unsupported input data type for gelu_quantize_to_fp8: ", input.scalar_type());
}

gemini-code-assist · 2026-05-23T04:08:11Z

+    float product = __bfloat162float(ffn_out[idx]) * __bfloat162float(gate[gate_idx]);
+    __nv_bfloat16 rounded_product = __float2bfloat16(product);
+    float sum = __bfloat162float(residual[idx]) + __bfloat162float(rounded_product);
+    residual[idx] = __float2bfloat16(sum);


The intermediate rounding to __nv_bfloat16 at line 661 is unnecessary and reduces numerical precision. You can perform the addition in float before the final rounding to the output type to maintain higher accuracy during the residual update.

float product = __bfloat162float(ffn_out[idx]) * __bfloat162float(gate[gate_idx]);
float sum = __bfloat162float(residual[idx]) + product;
residual[idx] = __float2bfloat16(sum);
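
The precision loss from the extra rounding step can be reproduced on the host. The sketch below emulates BF16 round-to-nearest-even by truncating the FP32 bit pattern with a rounding bias, which is assumed to match what `__float2bfloat16` does for finite inputs. The helper names and the input values are chosen for illustration only; NaN and infinity are not handled.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to FP32, then to BF16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even BF16 mantissa
    bits = (bits + bias) & 0xFFFF0000   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def update_with_intermediate_round(residual: float, ffn_out: float, gate: float) -> float:
    """Original path: the product is rounded to BF16 before the add."""
    rounded_product = to_bf16(ffn_out * gate)
    return to_bf16(residual + rounded_product)


def update_single_round(residual: float, ffn_out: float, gate: float) -> float:
    """Suggested path: the product stays in full precision until the final round."""
    return to_bf16(residual + ffn_out * gate)


if __name__ == "__main__":
    # All three inputs are exactly representable in BF16.
    residual, ffn_out, gate = 255.0, 1.9921875, 1.0078125
    print(update_with_intermediate_round(residual, ffn_out, gate))
    print(update_single_round(residual, ffn_out, gate))
```

For these inputs the exact product is just under the midpoint between two BF16 values, so rounding it first discards the part that would have pushed the sum past a rounding boundary. The two paths end up one BF16 step apart.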

gemini-code-assist · 2026-05-23T04:08:11Z

+        residual, A, B, A_sf, B_sf, alpha, bias, gate, meta.m, meta.n, meta.k, stream);
+    return;
+  }
+  auto ffn_out = torch::empty_like(residual);


Allocating a new tensor using torch::empty_like(residual) in the fallback path for 2D gates will significantly degrade performance during prefill (where M is large). Since this PR aims to optimize the FFN block, you should avoid dynamic allocations in the hot path. Consider using a pre-allocated workspace or allowing the caller to provide an output buffer.
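
One shape the "pre-allocated workspace" idea could take is a grow-only scratch buffer that is reused across calls. The sketch below uses a `bytearray` as a stand-in for a device tensor so it runs anywhere. `Workspace` is a hypothetical helper, not something in the PR or in PyTorch.

```python
class Workspace:
    """Grow-only scratch buffer reused across calls (stand-in for a GPU tensor)."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self.allocations = 0  # how many times backing storage was (re)allocated

    def get(self, nbytes: int) -> memoryview:
        """Return a view of at least `nbytes`, allocating only when it must grow."""
        if nbytes > len(self._buf):
            self._buf = bytearray(nbytes)
            self.allocations += 1
        return memoryview(self._buf)[:nbytes]


if __name__ == "__main__":
    workspace = Workspace()
    for _ in range(100):
        scratch = workspace.get(1 << 20)  # same size every call
    print(workspace.allocations)
```

The trade-off is that every view returned between two growths shares the same storage, so a caller must finish with one scratch view before requesting the next.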

Implement three fused CUDA kernels for MXFP8 quantized inference on Blackwell (SM120):

1. scaled_mxfp8_gelu_quant: fuse GELU activation + E8M0 quantization
2. scaled_mxfp8_modulate_quant: fuse scale/shift modulation + quantization
3. cutlass_scaled_mxfp8_mm_residual_gate: fuse GEMM + residual + gate in CUTLASS 3.x epilogue

Performance on RTX 5090 (Wan 5B FFN, m=4096, hidden=1536, ffn=8960):

- GELU+Quant: 1.30× faster (27.8μs → 21.3μs)
- Modulate+Quant: 3.26× faster (92.7μs → 28.5μs)
- GEMM+Residual+Gate: 1.40× faster (194.7μs → 138.9μs)
- End-to-end FFN: 1.20× faster (608μs → 505μs, -103μs per block)
- Reduces kernel launches from 7 to 3 per FFN block

Features:

- Supports all Wan tasks (t2v/i2v/flf2v/animate/s2v/rs2v)
- Auto-fallback on non-SM120 GPUs (H100/A100/RTX4090) with warning
- Handles FP16/BF16 activations (kernel auto-detects dtype)
- One-time device capability probe at init (eliminates ~4000 redundant checks per inference)

Tested: 10/10 unit tests pass, 6/6 fallback scenarios verified

Address review feedback (PR ModelTC#1090):

- Skip alpha device move when already on target device
- Extract check_sm120_or_throw to shared header sm120_utils.h
- Replace std::cerr with TORCH_CHECK in dtype switch fallbacks
- Avoid intermediate BF16 round in residual_gate kernel
- Apply ruff-format

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Fatemanx · 2026-05-23T04:46:13Z

Thanks @gemini-code-assist for the thorough review! Addressed all five comments in the latest amended commit:

✅ Comment 1 (alpha device move): Added if module.alpha.device != module.weight.device guard in both _mxfp8_apply_quantized and _mxfp8_apply_residual_gate_quantized to skip the redundant op when already on the target device.
✅ Comment 2 (duplicated check_sm120_or_throw): Extracted to a shared header lightx2v_kernel/csrc/gemm/sm120_utils.h under namespace lightx2v_kernel. Both kernel files now #include it and call lightx2v_kernel::check_sm120_or_throw. Standardized on cudaGetDeviceProperties for consistency.
✅ Comment 3 (std::cerr → TORCH_CHECK): Replaced three switch default cases (quantize_to_fp8, gelu_quantize_to_fp8, modulate_quantize_to_fp8) with single-line TORCH_CHECK(false, ...).
✅ Comment 4 (BF16 intermediate round): Removed the __nv_bfloat16 rounded_product step in mxfp8_residual_gate_bf16_kernel; now keeps the product in fp32 until the final round, improving numerical precision.
⏸️ Comment 5 (2D fallback dynamic allocation): Deferred to a follow-up PR. The production Wan inference path always passes a 1D gate (c_gate_msa.squeeze() is shape (hidden,)), so it never triggers the 2D fallback. Refactoring the fallback to use a pre-allocated workspace is best done in a separate PR alongside other potential 2D path optimizations.

Verification:

pre-commit run --all-files: all hooks pass
pytest lightx2v_kernel/test/mxfp8_mxfp8/test_fused_ffn.py: 10/10 pass
Net diff: +22 / -98 lines (deduplication shrinks the patch).

gemini-code-assist · 2026-05-23T04:46:27Z

Thanks for the detailed update and for addressing the review comments. The changes, including the extraction of the SM120 check to a shared header and the replacement of std::cerr with TORCH_CHECK, are well-implemented. I also appreciate the verification steps you've taken. The PR looks good to go from my perspective.

gemini-code-assist Bot reviewed May 23, 2026

Fatemanx force-pushed the perf/wan-mxfp8-fuse-op branch from 3b4ec5e to 94816cc Compare May 23, 2026 04:45

Fatemanx wants to merge 1 commit into ModelTC:main from Fatemanx:perf/wan-mxfp8-fuse-op
