CUDA: add a fused top-K MoE kernel #16130
Conversation
(force-pushed from 4d6c41a to 7345668)
Some more performance numbers: …
(force-pushed from 324ecbb to 613b6c3)
I would recommend you cache the logits in registers from the start instead of reading the same data from VRAM twice.
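For reference, a minimal sketch of the register-caching pattern being suggested, assuming one warp per token; NPT (logits per lane) and the function name are illustrative, not the PR's actual code:

    #include <cfloat>

    #define WARP_SIZE 32  // same value as ggml-cuda's common header

    // Load each token's logits from VRAM exactly once; the softmax (and any later
    // top-k pass) then works on the register copy instead of re-reading global
    // memory. Assumes n_experts <= WARP_SIZE * NPT and blockDim.x == WARP_SIZE.
    template <int NPT>
    __device__ void softmax_from_registers(const float * __restrict__ logits,
                                           float       * __restrict__ probs,
                                           const int n_experts) {
        float cached[NPT];
    #pragma unroll
        for (int i = 0; i < NPT; ++i) {
            const int e = threadIdx.x + i * WARP_SIZE;  // coalesced load
            cached[i] = e < n_experts ? logits[e] : -FLT_MAX;
        }

        // warp-wide max over the cached values
        float max_val = -FLT_MAX;
    #pragma unroll
        for (int i = 0; i < NPT; ++i) {
            max_val = fmaxf(max_val, cached[i]);
        }
        for (int off = WARP_SIZE/2; off > 0; off >>= 1) {
            max_val = fmaxf(max_val, __shfl_xor_sync(0xffffffff, max_val, off));
        }

        // exp + warp-wide sum, still reading registers rather than VRAM
        float sum = 0.0f;
    #pragma unroll
        for (int i = 0; i < NPT; ++i) {
            cached[i] = expf(cached[i] - max_val);
            sum      += cached[i];
        }
        for (int off = WARP_SIZE/2; off > 0; off >>= 1) {
            sum += __shfl_xor_sync(0xffffffff, sum, off);
        }

    #pragma unroll
        for (int i = 0; i < NPT; ++i) {
            const int e = threadIdx.x + i * WARP_SIZE;
            if (e < n_experts) {
                probs[e] = cached[i] / sum;
            }
        }
    }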
(force-pushed from 613b6c3 to 17c9e7c)
I did not see perf improvements after the changes; TG still improves by 6-7%.
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)

    // special case for topk-moe
    if (ops.size() == 5 && ops.begin()[0] == GGML_OP_SOFT_MAX && ops.begin()[1] == GGML_OP_RESHAPE && ops.begin()[2] == GGML_OP_ARGSORT
        && ops.begin()[3] == GGML_OP_VIEW && ops.begin()[4] == GGML_OP_GET_ROWS) {

        for (int i = 0; i < 5; i++) {
            if (cgraph->nodes[node_idx + i]->op != ops.begin()[i]) return false;
        }
Isn't all this redundant since it is performed in ggml_can_fuse?
I think the logic of this function should be that you always apply ggml_can_fuse first and only then do special cases.
Edit: got it, the RESHAPE and the VIEW are problematic in this case.
Yeah, as I mentioned in #16102 (comment), if I have to remove the empty ops before passing to ggml_can_fuse, it would still be a special case.
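A rough sketch of the alternative being weighed here (illustrative names, not the PR's code): strip the empty RESHAPE/VIEW nodes from the comparison and match only the data-moving ops, which, as noted above, still amounts to a special case rather than a plain ggml_can_fuse call:

    #include <initializer_list>

    #include "ggml.h"
    #include "ggml-impl.h"  // assumed here for the full ggml_cgraph definition

    // Compare the ops starting at node_idx against `ops`, skipping "empty" ops
    // (RESHAPE/VIEW) that involve no data movement. Illustrative only.
    static bool ops_match_skipping_empty(const struct ggml_cgraph * cgraph, int node_idx,
                                         std::initializer_list<ggml_op> ops) {
        const ggml_op * it = ops.begin();
        for (int i = node_idx; i < cgraph->n_nodes && it != ops.end(); ++i) {
            const ggml_op op = cgraph->nodes[i]->op;
            if (op == GGML_OP_RESHAPE || op == GGML_OP_VIEW) {
                continue;  // empty op: ignore it in the comparison
            }
            if (op != *it) {
                return false;
            }
            ++it;
        }
        return it == ops.end();
    }

    // hypothetical call site:
    // ops_match_skipping_empty(cgraph, i, { GGML_OP_SOFT_MAX, GGML_OP_ARGSORT, GGML_OP_GET_ROWS });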
(force-pushed from 17c9e7c to 7a258bf)
Hmm, looks like this causes an illegal memory access when running …
Compile the CUDA code with …
Yeah, I did that, but I don't understand it yet. It seems like some tensor is not getting propagated downstream properly.
It looks like the dimensions are not right when using …
Yes, there is no problem with calling …
It does not crash when doing … Changing this line to use n_expert_used makes everything work (line 545 in 8ba548d).
Interestingly, if I move …
I think the special-case path added in this PR does not have a bounds check for the number of nodes in the graph - could this be causing the illegal memory access?
I added the bounds check and the problem is still there. Since it only happens on warmup and …
Does it still happen if you keep the build forward expand and remove the new fusing logic?
No, it doesn't happen if I remove the fusing logic and keep the build forward expand.
(force-pushed from 7a258bf to a275f10)
Ok, the bug was the kernel not handling ties properly; after fixing that it all works. I'm not exactly sure why, though.
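For context, a sketch of what a tie-aware warp argmax looks like (illustrative, not the PR's exact code): on equal values the reduction has to pick a deterministic winner, here the smaller expert index, so that the fused kernel selects the same expert as the reference softmax->argsort path when logits tie.

    // Butterfly argmax across a warp with a deterministic tie-break.
    __device__ void warp_argmax_tiebreak(float & val, int & idx) {
    #pragma unroll
        for (int off = 32/2; off > 0; off >>= 1) {
            const float other_val = __shfl_xor_sync(0xffffffff, val, off);
            const int   other_idx = __shfl_xor_sync(0xffffffff, idx, off);
            // strictly greater value wins; on a tie, the smaller index wins
            if (other_val > val || (other_val == val && other_idx < idx)) {
                val = other_val;
                idx = other_idx;
            }
        }
    }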
This would complicate fusion at least. Right now you can look at llama-graph and call build forward to get the exact sequence you see in llama-graph. I don't know what it would look like once these ops go …
I'm in favor of removing the empty nodes from the graph. I think it will simplify fusion and graph_optimize.
(force-pushed from a275f10 to bb0e5d0)
Added the optional norm + a TODO about the changes to make once we figure out how to handle empty ops. Performance results for …
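On the optional norm: a minimal sketch of what it presumably does (an assumption based on the usual MoE graphs, not lifted from the PR), namely re-normalizing the selected top-k weights so they sum to 1 before the writeback:

    // Divide the k selected weights by their sum when normalization is requested.
    template <bool with_norm>
    __device__ void write_topk_weights(float * dst, const float * topk_w, const int k) {
        float sum = 0.0f;
        if (with_norm) {
            for (int i = 0; i < k; ++i) {
                sum += topk_w[i];
            }
        }
        for (int i = 0; i < k; ++i) {
            dst[i] = with_norm ? topk_w[i]/sum : topk_w[i];
        }
    }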
(force-pushed from bb0e5d0 to 2ea8133)
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models.
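A compact sketch of that pipeline in a single kernel (illustrative only; the actual kernel in this PR is templated and more heavily optimized, and the launch configuration assumed here is one 32-thread warp per token with n_experts <= 32):

    #include <cfloat>
    #include <cstdint>

    // One warp per token: softmax over its logits, then n_expert_used rounds of
    // argmax, each round writing the winning weight + expert id and masking the
    // winner before the next round.
    __global__ void topk_moe_sketch(const float * logits, float * weights, int32_t * ids,
                                    const int n_experts, const int n_expert_used) {
        const int     token = blockIdx.x;
        const float * x     = logits + (int64_t) token * n_experts;

        // 1. softmax (one expert per lane for simplicity)
        const int e   = threadIdx.x;
        float     val = e < n_experts ? x[e] : -FLT_MAX;

        float max_val = val;
        for (int off = 16; off > 0; off >>= 1) {
            max_val = fmaxf(max_val, __shfl_xor_sync(0xffffffff, max_val, off));
        }
        float p   = e < n_experts ? expf(val - max_val) : 0.0f;
        float sum = p;
        for (int off = 16; off > 0; off >>= 1) {
            sum += __shfl_xor_sync(0xffffffff, sum, off);
        }
        p = e < n_experts ? p/sum : -FLT_MAX;  // invalid lanes can never win below

        // 2. + 3. repeated argmax with writeback of weight + id
        for (int k = 0; k < n_expert_used; ++k) {
            float best_p = p;
            int   best_e = e;
            for (int off = 16; off > 0; off >>= 1) {
                const float other_p = __shfl_xor_sync(0xffffffff, best_p, off);
                const int   other_e = __shfl_xor_sync(0xffffffff, best_e, off);
                if (other_p > best_p || (other_p == best_p && other_e < best_e)) {
                    best_p = other_p;
                    best_e = other_e;
                }
            }
            if (threadIdx.x == 0) {
                weights[(int64_t) token * n_expert_used + k] = best_p;
                ids    [(int64_t) token * n_expert_used + k] = best_e;
            }
            if (e == best_e) {
                p = -FLT_MAX;  // mask out this round's winner
            }
        }
    }

    // hypothetical launch:
    // topk_moe_sketch<<<n_tokens, 32, 0, stream>>>(logits, weights, ids, n_experts, n_expert_used);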
(force-pushed from 4b2d2b9 to 639e954)
(force-pushed from 639e954 to e772b28)
(force-pushed from 941bc9e to 53acfe6)
@JohannesGaessler will you merge once CI passes?
As described in …
I don't have write access; I'll ping you once the CI passes so it can be merged.
The CI failures don't seem related, so this should be good to merge. @JohannesGaessler
* CUDA: add a fused top-K MoE kernel

  This kernel does the following:
  1. softmax over the logits per token [n_experts, n_tokens]
  2. argmax reduce over the top-k (n_experts_used) logits
  3. write weights + ids to global memory

  It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
Should be more useful in TG than PP