[Performance, Hardware] MoE weights padding to AMD MI300x GPUs #1836
Conversation
Please fix the CI.

@merrymercy Fixed the CI just now. Thanks!
@@ -572,6 +588,18 @@ def process_weights_after_loading(self, layer: Module) -> None:
                start += shard_size

            layer.w13_scale = torch.nn.Parameter(max_w13_scales, requires_grad=False)
            # If ROCm, apply weight padding (min. Mem channel contention) only if set
            if is_hip() and bool(int(os.getenv("MOE_PADDING", "0"))):
Move all `is_hip` checks under a single branch, e.g., L555.
@merrymercy Understood. The order of data crunching makes me prefer to keep the dummy padding as the very last step, to avoid error-prone interleaving with `normalize_`, `_dequantize`, `_fp8_quant`, etc., and to keep the code easier to read.
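For context, a minimal sketch of what such a last-step padding could look like; the pad width, the helper name, and the `F.pad`-then-slice pattern are illustrative assumptions, not necessarily the exact code in this PR:

```python
import os

import torch
import torch.nn.functional as F

# Hypothetical pad width; the value used by the PR may differ.
MOE_PADDING_SIZE = 256


def maybe_pad_moe_weight(weight: torch.Tensor) -> torch.nn.Parameter:
    """Zero-pad the last dim of an MoE weight to reduce memory channel
    contention on AMD Instinct GPUs; a no-op unless MOE_PADDING=1 is set.
    (In the PR this is additionally gated on is_hip().)"""
    if bool(int(os.getenv("MOE_PADDING", "0"))):
        # Pad the storage along the last dimension, then slice the view back
        # to the original logical shape, so downstream kernels see the same
        # sizes but a larger row stride over the padded storage.
        weight = F.pad(weight, (0, MOE_PADDING_SIZE), "constant", 0)[
            ..., :-MOE_PADDING_SIZE
        ]
    return torch.nn.Parameter(weight, requires_grad=False)
```

Keeping this as the final step in process_weights_after_loading, after the scale and quantization handling above, keeps the padded stride from interacting with `normalize_`, `_dequantize`, or `_fp8_quant`.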
Motivation
Pad the MoE weights along the last dimension to minimize memory channel contention (AMD Instinct GPUs only).
Tests show an approximate performance boost of +2.2% on prefill and +3.0% on decode for Grok-1 at the setting b32/i1024/o512.
Modifications
As mentioned above, the changes are in fused_moe.py and layer.py.
To enable this feature, set the binary flag `MOE_PADDING=1` on the command line, or run `export MOE_PADDING=1` in the console.
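For example, a hypothetical Python launch snippet (the flag name comes from this PR; everything else here is illustrative and not a specific SGLang API):

```python
import os

# Enable the ROCm MoE weight padding before model weights are loaded.
# Equivalent to running `export MOE_PADDING=1` in the shell before launch.
os.environ.setdefault("MOE_PADDING", "1")

# The loader reads the flag like the guarded branch in the diff above:
enabled = bool(int(os.getenv("MOE_PADDING", "0")))
print(f"MoE weight padding requested: {enabled}")
```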
Checklist