Conversation

@am17an (Collaborator) commented Sep 20, 2025

This kernel does the following:

  1. softmax over the logits per token [n_experts, n_tokens]
  2. argmax reduce over the top-k (n_experts_used) logits
  3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models.

It should be more useful for TG (token generation) than for PP (prompt processing).

| Model | Test | t/s master | t/s patch | Speedup |
| --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K_M | pp512 | 6502.24 | 6561.50 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | tg128 | 194.63 | 207.75 | 1.07 |
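
For readers unfamiliar with the approach, here is a heavily simplified sketch of the three steps above. It is not the kernel from this PR: it assumes one warp per token, n_experts <= 32 (so each lane keeps its logit in a register) and n_expert_used <= n_experts; the real kernel handles larger expert counts and uses a better coalescing/writeback pattern.

```cuda
// Simplified sketch (not the PR's kernel): one warp per token, n_experts <= 32.
// Each lane loads its logit once into a register and reuses it for both the
// softmax and the iterative argmax top-k selection.
#include <cuda_runtime.h>
#include <math.h>

__global__ void topk_moe_sketch(const float * logits, float * weights, int * ids,
                                const int n_experts, const int n_expert_used) {
    const int token = blockIdx.x;
    const int lane  = threadIdx.x; // 0..31, one warp per block

    // 1. softmax over this token's logits
    const float logit = lane < n_experts ? logits[token*n_experts + lane] : -INFINITY;
    float max_val = logit;
    for (int off = 16; off > 0; off >>= 1) {
        max_val = fmaxf(max_val, __shfl_xor_sync(0xffffffff, max_val, off));
    }
    const float e = lane < n_experts ? expf(logit - max_val) : 0.0f;
    float sum = e;
    for (int off = 16; off > 0; off >>= 1) {
        sum += __shfl_xor_sync(0xffffffff, sum, off);
    }
    float p = lane < n_experts ? e/sum : -INFINITY;

    // 2. + 3. argmax-reduce n_expert_used times, writing weight + id each round
    for (int k = 0; k < n_expert_used; ++k) {
        float best = p;
        int   arg  = lane;
        for (int off = 16; off > 0; off >>= 1) {
            const float ob = __shfl_xor_sync(0xffffffff, best, off);
            const int   oa = __shfl_xor_sync(0xffffffff, arg,  off);
            if (ob > best || (ob == best && oa < arg)) { // break ties on the lower id
                best = ob;
                arg  = oa;
            }
        }
        if (lane == 0) {
            weights[token*n_expert_used + k] = best;
            ids    [token*n_expert_used + k] = arg;
        }
        if (lane == arg) {
            p = -INFINITY; // exclude the selected expert from the next round
        }
    }
}
```

A launch of the form topk_moe_sketch<<<n_tokens, 32>>>(logits, weights, ids, n_experts, n_expert_used) then replaces the separate softmax, top-k and get_rows launches with a single pass over the logits.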

@github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) on Sep 20, 2025
@am17an (Collaborator Author) commented Sep 21, 2025

Some more performance numbers

| Model | Microbatch size | Test | t/s master | t/s patch | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K_M | 1 | pp4096 | 194.25 | 205.88 | 1.06 |
| qwen3moe 30B.A3B Q4_K_M | 2 | pp4096 | 206.51 | 213.80 | 1.04 |
| qwen3moe 30B.A3B Q4_K_M | 4 | pp4096 | 355.94 | 366.72 | 1.03 |
| qwen3moe 30B.A3B Q4_K_M | 8 | pp4096 | 585.58 | 600.32 | 1.03 |
| qwen3moe 30B.A3B Q4_K_M | 16 | pp4096 | 922.33 | 940.61 | 1.02 |
| qwen3moe 30B.A3B Q4_K_M | 32 | pp4096 | 1533.54 | 1564.10 | 1.02 |
| qwen3moe 30B.A3B Q4_K_M | 256 | pp4096 | 3722.88 | 3751.81 | 1.01 |
| qwen3moe 30B.A3B Q4_K_M | 512 | pp4096 | 4559.34 | 4588.63 | 1.01 |

@JohannesGaessler (Collaborator) left a comment

I would recommend you cache the logits in registers from the start instead of reading the same data from VRAM twice.

@am17an (Collaborator Author) commented Sep 22, 2025

I did not see further perf improvements after the changes; TG still improves by 6-7%.

Comment on lines 2829 to 2836:

```cpp
//special case for topk-moe
if (ops.size() == 5 && ops.begin()[0] == GGML_OP_SOFT_MAX && ops.begin()[1] == GGML_OP_RESHAPE && ops.begin()[2] == GGML_OP_ARGSORT
    && ops.begin()[3] == GGML_OP_VIEW && ops.begin()[4] == GGML_OP_GET_ROWS) {

    for (int i = 0; i < 5; i++) {
        if (cgraph->nodes[node_idx + i]->op != ops.begin()[i]) return false;
    }
```

Member

Isn't all this redundant since it is performed in ggml_can_fuse?

@ggerganov (Member) commented Sep 22, 2025

I think the logic of this function should be that you always apply ggml_can_fuse first and only then do special-cases.

Edit: got it, the RESHAPE and the VIEW are problematic in this case.

@am17an (Collaborator Author)

Yeah, as I mentioned in #16102 (comment), if I have to remove the empty ops before passing to ggml_can_fuse then it would still be a special case.
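
To illustrate the point, skipping the empty ops before matching would itself amount to a special case. A hypothetical sketch of such a matcher (is_noop_view and match_skipping_views are illustrative names, not existing ggml/llama.cpp functions; it assumes the full ggml_cgraph definition is visible, as it is inside backend code):

```cuda
// Hypothetical matcher that skips RESHAPE/VIEW placeholder nodes before comparing
// ops; illustrative only, not code from this PR or from ggml.
#include <initializer_list>
#include "ggml.h"

static bool is_noop_view(const ggml_tensor * node) {
    return node->op == GGML_OP_RESHAPE || node->op == GGML_OP_VIEW;
}

static bool match_skipping_views(const ggml_cgraph * cgraph, int start,
                                 std::initializer_list<enum ggml_op> ops) {
    int i = start;
    for (const enum ggml_op op : ops) {
        while (i < cgraph->n_nodes && is_noop_view(cgraph->nodes[i])) {
            i++; // drop the "empty" ops from consideration
        }
        if (i >= cgraph->n_nodes || cgraph->nodes[i]->op != op) {
            return false;
        }
        i++;
    }
    return true;
}
```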

@am17an (Collaborator Author) commented Sep 23, 2025

Hmm, it looks like this causes an illegal memory access when running Qwen3-30B-A3B-Q4_0.gguf locally; the tests in test-backend-ops don't catch this.

@JohannesGaessler (Collaborator)

Compile the CUDA code with -lineinfo, then use the compute-sanitizer to get the exact line causing the issue.

@am17an (Collaborator Author) commented Sep 23, 2025

Yeah, I did that, but I don't understand it yet. It seems like some tensor is not getting propagated downstream properly:

========= Invalid __global__ read of size 16 bytes
=========     at void quantize_mmq_q8_1<(mmq_q8_1_ds_layout)1>(const float *, const int *, void *, long, long, long, long, long, int, int)+0x530
=========     by thread (96,0,0) in block (2,0,0)
=========     Access to 0xb1f1a36600 is out of bounds
=========     and is 652591982081 bytes after the nearest allocation at 0x1a00000000 of size 2097152 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: quantize_mmq_q8_1_cuda(float const*, int const*, void*, ggml_type, long, long, long, long, long, long, long, long, CUstream_st*) in quantize.cu:181 [0x289726] in libggml-cuda.so
=========         Host Frame: ggml_cuda_mul_mat_q(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) in mmq.cu:341 [0x1a5f82] in libggml-cuda.so
=========         Host Frame: ggml_cuda_mul_mat_id(ggml_backend_cuda_context&, ggml_tensor*) in ggml-cuda.cu:2110 [0x177ea7] in libggml-cuda.so
=========         Host Frame: ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) in ggml-cuda.cu:2410 [0x17947d] in libggml-cuda.so
=========         Host Frame: evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, bool&, bool&, bool&) in ggml-cuda.cu:3008 [0x17bf39] in libggml-cuda.so
=========         Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) in ggml-cuda.cu:3126 [0x17c787] in libggml-cuda.so
=========         Host Frame: ggml_backend_graph_compute_async in ggml-backend.cpp:359 [0x6e896] in libggml-base.so
=========         Host Frame: ggml_backend_sched_compute_splits(ggml_backend_sched*) in ggml-backend.cpp:1553 [0x738e5] in libggml-base.so
=========         Host Frame: ggml_backend_sched_graph_compute_async in ggml-backend.cpp:1753 [0x74719] in libggml-base.so
=========         Host Frame: llama_context::graph_compute(ggml_cgraph*, bool) in llama-context.cpp:1460 [0x40cea1] in libllama.so
=========         Host Frame: llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) in llama-context.cpp:784 [0x409b57] in libllama.so
=========         Host Frame: llama_context::decode(llama_batch const&) in llama-context.cpp:1088 [0x40b0d2] in libllama.so
=========         Host Frame: llama_decode in llama-context.cpp:2726 [0x411ba6] in libllama.so
=========         Host Frame: common_init_from_params(common_params&) in common.cpp:1066 [0x256b3d] in llama-cli
=========         Host Frame: main in main.cpp:140 [0x833e2] in llama-cli

@am17an (Collaborator Author) commented Sep 23, 2025

It looks like the dimensions are not right when using llama-cli with a model, but they are correct when using test-backend-ops. With llama-cli I get n_expert = 128 and n_expert_used = 128 (where I expect n_expert_used to be 8); I'm not sure what happens to the RESHAPE/VIEW operations. @slaren could you please check if what I'm doing in llama-graph is correct?

@slaren (Member) commented Sep 23, 2025

> @slaren could you please check if what I'm doing in llama-graph is correct?

Yes, there is no problem with calling ggml_build_forward_expand to force the nodes to be in a certain order; it is already done in other places.

@am17an (Collaborator Author) commented Sep 24, 2025

It does not crash with --no-warmup. Somehow setting n_expert_used = n_expert during warmup triggers an illegal memory access, but doing the same in test-backend-ops does not, and compute-sanitizer is also clean. I don't see anything that changes during warmup except setting n_expert_used = n_expert; any help would be appreciated!

Changing this line to use n_expert_used makes everything work:

```cpp
n_expert_used (cparams.warmup ? hparams.n_expert : hparams.n_expert_used),
```

@am17an (Collaborator Author) commented Sep 24, 2025

Interestingly, if I move ggml_build_forward_expand after norm_w (in build_moe_ffn) and also fuse the norm, the warmup works as well. So my suspicion is that it has something to do with how we build the graph during warmup, but I'm not sure.

@ggerganov (Member)

I think the special-case path added in this PR does not have a bounds check for the number of nodes in the graph - could this be causing the illegal memory access?
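
For reference, the kind of check being suggested is simply that the fixed five-node window stays inside the graph before any nodes[node_idx + i] access; a hypothetical form (not the PR's exact code):

```cuda
// Illustrative only: refuse to match a fusion window that would index past the
// end of the graph. Hypothetical helper, not code from this PR.
static bool fusion_window_in_bounds(const ggml_cgraph * cgraph, int node_idx, int n_ops) {
    return node_idx >= 0 && node_idx + n_ops <= cgraph->n_nodes;
}
```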

@am17an (Collaborator Author) commented Sep 24, 2025

> I think the special-case path added in this PR does not have a bounds check for the number of nodes in the graph - could this be causing the illegal memory access?

Added the bounds check and the problem is still there. Since it only happens on warmup and test-backend-ops cannot replicate it, my suspicion is that this is somehow messing up the graph. Is there another way to debug this?

@ggerganov (Member)

Does it still happen if you keep the build forward expand and remove the new fusing logic?

@am17an (Collaborator Author) commented Sep 24, 2025

> Does it still happen if you keep the build forward expand and remove the new fusing logic?

No, it doesn't happen if I remove the fusing logic and keep the build forward expand.

@am17an (Collaborator Author) commented Sep 24, 2025

OK, the bug was that ties were not being handled properly in the kernel; after fixing that, it all works. I'm not exactly sure why, though.
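
For context on the tie bug: in a warp-level argmax reduction, every lane has to break ties the same way (for example, always preferring the lower expert id), otherwise lanes can converge on different winners when two probabilities compare equal. A minimal illustration of a tie-aware merge (illustrative, not the PR's exact code):

```cuda
// Illustrative only: merge two (value, index) candidates deterministically, so
// all lanes in the butterfly reduction agree on the winner even for equal values.
__device__ void argmax_merge(float & val, int & idx, const float other_val, const int other_idx) {
    if (other_val > val || (other_val == val && other_idx < idx)) {
        val = other_val;
        idx = other_idx;
    }
}
```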

@am17an (Collaborator Author) commented Sep 24, 2025

> If this has no side-effects on the scheduler/allocator logic, would be the best option. I don't think the backends would ever need to see the empty nodes of the graph - they should always need only the nodes that do actual reads and writes.

This would complicate fusion at least. Right now you can look at llama-graph and call ggml_build_forward_expand to get the exact sequence you see in llama-graph. I don't know how it would look once these ops are gone.

@jeffbolznv (Collaborator)

I'm in favor of removing the empty nodes from the graph. I think it will simplify fusion and graph_optimize.

@am17an (Collaborator Author) commented Sep 24, 2025

Added the optional norm + a TODO about the changes to make once we figure out how to handle empty ops. Performance results for Qwen3-30B-A3B-Q4_0.gguf on an RTX 4090:

| Model | Microbatch size | Test | t/s master | t/s patch | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K_M | 1 | pp512 | 203.32 | 219.91 | 1.08 |
| qwen3moe 30B.A3B Q4_K_M | 2 | pp512 | 212.57 | 223.57 | 1.05 |
| qwen3moe 30B.A3B Q4_K_M | 4 | pp512 | 368.25 | 383.43 | 1.04 |
| qwen3moe 30B.A3B Q4_K_M | 8 | pp512 | 602.85 | 626.88 | 1.04 |
| qwen3moe 30B.A3B Q4_K_M | 16 | pp512 | 966.81 | 986.07 | 1.02 |
| qwen3moe 30B.A3B Q4_K_M | 32 | pp512 | 1644.49 | 1684.47 | 1.02 |
| qwen3moe 30B.A3B Q4_K_M | 512 | pp512 | 6324.69 | 6394.87 | 1.01 |
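
Assuming the "optional norm" is the usual top-k weight renormalization (the norm_w step in build_moe_ffn mentioned earlier, i.e. rescaling the selected weights so they sum to 1), a minimal device-side sketch could look like the following; the function name is illustrative and this is not the PR's exact code:

```cuda
// Illustrative only: renormalize the k selected expert weights in place so they
// sum to 1, matching what the separate norm step would otherwise compute.
__device__ void normalize_topk_weights(float * w, const int n_expert_used) {
    float sum = 0.0f;
    for (int k = 0; k < n_expert_used; ++k) {
        sum += w[k];
    }
    const float inv_sum = 1.0f/sum;
    for (int k = 0; k < n_expert_used; ++k) {
        w[k] *= inv_sum;
    }
}
```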

@am17an force-pushed the cuda_topk_moe branch 2 times, most recently from 4b2d2b9 to 639e954 on September 25, 2025, 02:28
@am17an (Collaborator Author) commented Sep 25, 2025

@JohannesGaessler will you merge once CI passes?

@JohannesGaessler (Collaborator)

As described in CONTRIBUTING.md: "Let other maintainers merge their own PRs". So I won't merge this PR unless you specifically ask me to.

@am17an (Collaborator Author) commented Sep 25, 2025

> As described in CONTRIBUTING.md: "Let other maintainers merge their own PRs". So I won't merge this PR unless you specifically ask me to.

I don't have write access. I'll ping you to merge once the CI passes.

@am17an (Collaborator Author) commented Sep 25, 2025

The CI failures don't seem related, so this should be good to merge. @JohannesGaessler

@JohannesGaessler merged commit 077c94d into ggml-org:master on Sep 25, 2025 (58 of 64 checks passed).
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Sep 25, 2025
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
@am17an deleted the cuda_topk_moe branch on September 26, 2025, 08:43
struct pushed a commit to struct/llama.cpp that referenced this pull request Sep 26, 2025