
[MoE] Add swiglu_oai (OAI SwiGLU) for per-token fp8 CK XDL 2-stage MoE #3886

Open
LJ-underdog wants to merge 1 commit into main from feat/swiglu-oai-ck2stage

Conversation

LJ-underdog commented Jun 24, 2026

Contributor

Motivation

Enable the OAI-form SwiGLU activation (swiglu_oai, gate * sigmoid(1.702 * gate) * (up + 1), gpt-oss style) for the per-token fp8 (PTPC) CK XDL 2-stage MoE path. This path currently supports only silu/gelu, so gpt-oss / OAI-style MoE models cannot run on it.
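As a reading aid, the formula above can be written as a scalar reference in plain Python. This is a minimal sketch of the stated expression only: the helper names and the `alpha` parameter are illustrative, and it omits any clamping or gate/up memory layout that the actual kernel may apply.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swiglu_oai(gate: float, up: float, alpha: float = 1.702) -> float:
    """gate * sigmoid(alpha * gate) * (up + 1), as given in the PR description.

    `alpha` is a hypothetical parameter name; 1.702 is the constant from the PR.
    """
    return gate * _sigmoid(alpha * gate) * (up + 1.0)


def swiglu_oai_rows(gates, ups):
    """Apply the activation element-wise to two equal-length sequences."""
    if len(gates) != len(ups):
        raise ValueError("gates and ups must have the same length")
    return [swiglu_oai(g, u) for g, u in zip(gates, ups)]
```

Note that the `(up + 1)` factor means an `up` value of -1 zeroes the output, while an `up` value of 0 leaves `gate * sigmoid(1.702 * gate)` unchanged.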

Technical Details

  • fused_moe.py: add AITER_SWIGLU_CK2STAGE escape hatch and route ActivationType.Swiglu into the 2-stage path.
  • utility/dtypes.py: swiglu activation plumbing.
  • gemm_moe_ck2stages.cu / gemm_moe_ck2stages_common.py: map_activation_to_ck_stage1 maps Swiglu to CK ActOP 3.
  • gen_instances.py: add ACT_OP_MAP, the swiglu argparse choice, and plain-fp8 per_token / per_tensor swiglu codegen instances.
  • swiglu_oai routes through the plain gridwise_moe_gemm.hpp (per_token fp8 -> a8w8 tag, plain cuh), the same path as silu/gelu. The activation is computed in fp32 in the epilogue and is orthogonal to the GEMM compute (MFMA/tile/pipeline untouched) and to quantization (existing per-token dequant reused).
  • The submodule pin is kept identical to aiter main; the CK-side epilogue comes from rocm-libraries#8749 ([CK][MoE] Add swiglu_oai (OAI SwiGLU) activation to XDL 2-stage MoE epilogue), so this PR must be built and merged together with it (CK must provide swiglu_oai_and_mul).
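The activation-to-ActOP mapping described in the list above might look roughly like the following. This is a hedged sketch, not the PR's code: the `ActivationType` enum here is a hypothetical stand-in for aiter's own, and only the `Swiglu -> 3` entry is stated in this PR; the integer values shown for Silu and Gelu are placeholder assumptions.

```python
from enum import Enum


class ActivationType(Enum):
    """Hypothetical stand-in for aiter's ActivationType enum."""
    Silu = "silu"
    Gelu = "gelu"
    Swiglu = "swiglu"


# Only Swiglu -> 3 comes from the PR description.
# The Silu and Gelu values below are ASSUMED placeholders for illustration.
ACT_OP_MAP = {
    ActivationType.Gelu: 0,    # assumption
    ActivationType.Silu: 1,    # assumption
    ActivationType.Swiglu: 3,  # stated in the PR
}


def map_activation_to_ck_stage1(act: ActivationType) -> int:
    """Return the CK stage-1 ActOP integer, failing loudly on unknown input."""
    try:
        return ACT_OP_MAP[act]
    except (KeyError, TypeError):
        raise ValueError(
            f"unsupported activation for CK 2-stage MoE: {act!r}"
        ) from None
```

An explicit table makes an unsupported activation fail with a clear error instead of silently selecting a different epilogue.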

Test Plan

Op-isolate on gfx942 (MI308X): run the per-token fp8 swiglu_oai MoE via JIT on-demand (no manual codegen patch) and compare the output against a torch fp32 OAI-SwiGLU reference (cos_sim).
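The comparison step could be sketched as below, using plain Python lists in place of tensors. The function names and the 0.999 pass threshold are assumptions for illustration; the PR states only the measured cos_sim value and the absence of NaN, not a pass/fail cutoff.

```python
import math


def cos_sim(a, b) -> float:
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def matches_reference(out, ref, threshold: float = 0.999) -> bool:
    """True if `out` has no NaN and is close to `ref` by cosine similarity.

    `threshold` is a hypothetical cutoff, not a value taken from the PR.
    """
    if any(math.isnan(v) for v in out):
        return False
    return cos_sim(out, ref) >= threshold
```

Cosine similarity is insensitive to a uniform scale error, so a check like this is usually paired with a NaN test and, where needed, an absolute or relative tolerance check.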

Test Result

JIT auto-generates and compiles the swiglu x per_token x f8 instances and the op runs without error. cos_sim = 0.999993 vs the torch fp32 OAI-SwiGLU reference; no NaN; dispatched to the CK 2-stage GridwiseMoeGemm kernel (verified via rocprofv3).

Submission Checklist

@github-actions

Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3886 --add-label <label>

LJ-underdog force-pushed the feat/swiglu-oai-ck2stage branch 2 times, most recently from 7112b15 to f0f75f8 (June 24, 2026 02:18)
LJ-underdog changed the title from "Feat/swiglu oai ck2stage" to "[MoE] Add swiglu_oai (OAI SwiGLU) for per-token fp8 CK XDL 2-stage MoE" (Jun 24, 2026)
LJ-underdog force-pushed the feat/swiglu-oai-ck2stage branch 2 times, most recently from 01a55ec to 3d1291a (June 24, 2026 05:22)
LJ-underdog marked this pull request as ready for review (June 24, 2026 05:49)
LJ-underdog requested review from a team, Copilot and valarLip (June 24, 2026 05:49)

Copilot AI left a comment

Contributor

Pull request overview

Adds OAI-style SwiGLU (swiglu_oai) support to the CK XDL 2-stage MoE path (including per-token FP8/PTPC) by plumbing ActivationType.Swiglu through codegen and runtime activation mapping so GPT-OSS/OAI-style MoE models can run on this kernel family.

Changes:

  • Add/standardize CK stage1 activation-op mapping for Swiglu (CK ActOP=3) in both runtime dispatch (gemm_moe_ck2stages.cu) and codegen (gemm_moe_ck2stages_common.py, gen_instances.py).
  • Extend CK 2-stage instance generation to produce AOT “plain-f8” swiglu instances for {per_tensor, per_token} × {f16, b16} outputs.
  • Make activation string parsing explicit and add "swiglu" to supported activation strings (aiter/utility/dtypes.py).
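The explicit string parsing mentioned in the last item could take a shape like the following. This is only a sketch under assumptions: `ActivationType` is a hypothetical stand-in for aiter's enum, and the case-insensitive matching and the error wording are illustrative choices, not behavior confirmed by the PR.

```python
from enum import Enum


class ActivationType(Enum):
    """Hypothetical stand-in for aiter's ActivationType enum."""
    Silu = "silu"
    Gelu = "gelu"
    Swiglu = "swiglu"


# Explicit table of accepted strings; anything else is rejected.
_STR_TO_ACTIVATION = {
    "silu": ActivationType.Silu,
    "gelu": ActivationType.Gelu,
    "swiglu": ActivationType.Swiglu,
}


def str2ActivationType(name: str) -> ActivationType:
    """Parse an activation name, raising a descriptive error if unknown.

    Stripping and lowercasing the input is an assumption of this sketch.
    """
    key = name.strip().lower()
    if key not in _STR_TO_ACTIVATION:
        supported = ", ".join(sorted(_STR_TO_ACTIVATION))
        raise ValueError(
            f"unknown activation {name!r}; supported values: {supported}"
        )
    return _STR_TO_ACTIVATION[key]
```

Listing the supported names in the error message lets a caller who passes an unsupported string see the valid choices immediately.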

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| csrc/ck_gemm_moe_2stages_codegen/gen_instances.py | Adds swiglu act-op mapping + AOT generation loop for plain-f8 swiglu instances; updates activation CLI choices. |
| csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages.cu | Replaces boolean inversion with explicit ActivationType→CK ActOP mapping (adds Swiglu support). |
| csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py | Introduces shared ACT_OP_MAP/ACT_OP_NAME and switches GEMM1 kernel naming/config from bool ActOP to int ActOP. |
| aiter/utility/dtypes.py | Updates str2ActivationType to explicitly accept "swiglu" (and provide clearer errors). |

Comment thread csrc/ck_gemm_moe_2stages_codegen/gen_instances.py Outdated
LJ-underdog force-pushed the feat/swiglu-oai-ck2stage branch from 3d1291a to e611971 (June 24, 2026 06:31)