Better sharding for dsv3 moe layer #2373
Conversation
@richjames0 for thoughts =D
Thanks Qinwen! Could you also attach test results in the description?
Thanks! I meant to say the original links, so that we can go directly to the profiles or anything else. One example here; we usually put the original links in for reviewers.
Just nits
lgtm
Commits:
update
refactor
update
Add optional config
Explicitly shard input tensors across mesh devices
Run on 0.7.2 candidate image
Fix typo in image tag
Revert to use latest tag
update test for new jax version
Remove sharding rules for q_lora and kv_lora from base.yml
update with configs
clean up
update
Force-pushed from b0cbe7f to 859abdf
if raw_keys["fsdp_shard_on_exp"] and raw_keys["num_experts"] % raw_keys["ici_fsdp_parallelism"]!=0: | ||
raise ValueError("fsdp_shard_on_exp requires num_experts is divisiable by ici_fsdp_parallelism.") | ||
if raw_keys["fsdp_shard_on_exp"] and (using_tensor_parallelism(raw_keys) or useing_expert_parallelism(raw_keys)): | ||
raise ValueError("fsdp_shard_on_exp requires ici_expert_parallelism = 1 and ici_tensor_parallelism/ici_tensor_transpose_parallelism = 1.") |
Nit: probably just say fsdp_shard_on_exp does not support EP and TP shardings?
Thanks @suexu1025
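For reviewers skimming the check above, here is a minimal sketch (the values are hypothetical, not from any real config) of the constraint it enforces: sharding the expert dimension over the fsdp mesh axis only works when every fsdp shard receives a whole number of experts.

```python
# Hypothetical values; in MaxText these come from raw_keys in the parsed config.
num_experts = 256
ici_fsdp_parallelism = 64

# fsdp_shard_on_exp splits the expert dimension across the fsdp mesh axis,
# so num_experts must divide evenly by the fsdp axis size.
assert num_experts % ici_fsdp_parallelism == 0, (
    "fsdp_shard_on_exp requires num_experts to be divisible by ici_fsdp_parallelism"
)
experts_per_shard = num_experts // ici_fsdp_parallelism  # 4 whole experts per fsdp shard
```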
  axis=-1,
  kernel_init=self.kernel_init,
- kernel_axes=("embed", "q_lora"),
+ kernel_axes=("embed", "q_lora_up_proj"),
This part is helpful for all MLA models or only DS V3?
cc @richjames0 another case.
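For context on the rename above, here is a minimal sketch (not the MaxText implementation; the mesh shape and rule values are assumptions) of how a logical kernel axis name is resolved to a mesh axis. Giving the up-projection kernel its own logical name, q_lora_up_proj, lets it carry a sharding rule independent of whatever q_lora maps to.

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Illustrative 2-D mesh; MaxText builds its mesh from the ici_* config values.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("fsdp", "tensor"))

# Illustrative logical-to-mesh axis rules (assumed values, not the ones in base.yml).
rules = {
    "embed": "fsdp",
    "q_lora": None,             # old shared name: replicated here
    "q_lora_up_proj": "tensor"  # dedicated name: can get its own rule
}

def sharding_for(kernel_axes):
  """Resolve a tuple of logical axis names to a NamedSharding on the mesh."""
  return NamedSharding(mesh, PartitionSpec(*(rules.get(name) for name in kernel_axes)))

# A kernel annotated with ("embed", "q_lora_up_proj") now resolves through its own
# rule instead of whatever "q_lora" maps to.
print(sharding_for(("embed", "q_lora_up_proj")).spec)  # PartitionSpec('fsdp', 'tensor')
```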
Description
Land the sharding strategy for the MoE layer. Set
--fsdp_shard_on_exp=true
to shard fsdp on the num_experts dimension [43s].
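A minimal JAX sketch of what the flag changes, using a hypothetical mesh and made-up shapes (and assuming the existing behavior splits the embed dimension under fsdp): with fsdp_shard_on_exp=true the expert weights are instead split over the fsdp axis along num_experts.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Hypothetical 1-D fsdp mesh and small illustrative shapes.
devices = np.array(jax.devices())
n_fsdp = devices.size
mesh = Mesh(devices, axis_names=("fsdp",))

num_experts, embed, mlp = 4 * n_fsdp, 128 * n_fsdp, 256
w = jnp.zeros((num_experts, embed, mlp), dtype=jnp.bfloat16)  # MoE expert kernel

# Without the flag (assumed baseline), fsdp splits the embed dimension.
w_shard_on_embed = jax.device_put(w, NamedSharding(mesh, PartitionSpec(None, "fsdp", None)))

# With fsdp_shard_on_exp=true, fsdp splits the num_experts dimension instead, so each
# fsdp shard owns num_experts // n_fsdp whole experts (hence the divisibility check).
w_shard_on_exp = jax.device_put(w, NamedSharding(mesh, PartitionSpec("fsdp", None, None)))
```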
If the change fixes a bug or a Github issue, please include a link, e.g.,:
FIXES: b/123456
FIXES: #123456
Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.
Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.
Tests
Please describe how you tested this change, and include any instructions and/or
commands to reproduce.
Checklist
Before submitting this PR, please make sure (put X in square brackets):