Add single parameter allgather optimization for zero3 #7661

aeeeeeep · 2025-10-31T13:41:44Z

Perform allgather operations on individual parameters instead of parameter lists.
Significantly reduce peak memory usage in high memory pressure scenarios.
Improve performance by minimizing temporary buffer requirements.
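To make the memory claim concrete, here is a toy back-of-the-envelope model, not DeepSpeed's actual implementation. It assumes that gathering a list of parameters in one call needs a temporary buffer covering every parameter in the list, while gathering one parameter at a time needs at most one parameter-sized buffer live at once. The shapes and the 2-byte element size are hypothetical, chosen only for illustration.

```python
from math import prod

# Hypothetical parameter shapes for one group of parameters (illustrative only).
param_shapes = [(4096, 4096), (4096,), (4096, 11008), (11008, 4096)]
bytes_per_element = 2  # assumption: fp16/bf16 parameters

# Full (gathered) size of each parameter in bytes.
sizes = [prod(shape) * bytes_per_element for shape in param_shapes]

# Assumed list-based gather: one temporary buffer covering all parameters.
batched_buffer_bytes = sum(sizes)

# Assumed single-parameter gather: at most one parameter-sized buffer at a time.
single_param_buffer_bytes = max(sizes)

print(f"list gather temp buffer:   {batched_buffer_bytes / 2**20:.1f} MiB")
print(f"single-param temp buffer:  {single_param_buffer_bytes / 2**20:.1f} MiB")
```

Under these assumptions the gap grows with the number of parameters gathered together; whether it shows up in practice depends on the allocator and the hardware.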
The behavior is enabled via a new boolean flag under the `zero_optimization` section of the DeepSpeed config:

"zero_optimization": {
  "stage3_allgather_single_param": true
}

By default the optimization is not enabled.
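As a minimal sketch of where the flag sits in a full config, the snippet below builds a DeepSpeed-style config as a Python dict and prints it as JSON. Only `stage3_allgather_single_param` comes from this PR; the `stage` and `train_micro_batch_size_per_gpu` values are placeholder assumptions for a ZeRO stage 3 setup.

```python
import json

# Sketch of a config dict with the new flag enabled.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value (assumption)
    "zero_optimization": {
        "stage": 3,  # the flag is described for ZeRO stage 3
        "stage3_allgather_single_param": True,  # flag added by this PR; off by default
    },
}

# Serialize to the JSON form shown in the PR description.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

A dict like this would typically be written to a JSON file or passed as the config when setting up the DeepSpeed engine; that step is left out here so the snippet runs without DeepSpeed installed.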

Signed-off-by: aeeeeeep <[email protected]>

sfc-gh-truwase · 2025-11-01T17:04:05Z

@aeeeeeep thanks for this contribution. Are you able to share some data showing the benefits of this optimization?

aeeeeeep · 2025-11-01T17:38:07Z

> @aeeeeeep thanks for this contribution. Are you able to share some data showing the benefits of this optimization?

Thanks for your feedback! I’ll share detailed data within the next few days.

Signed-off-by: aeeeeeep <[email protected]>

Make it very clear that `TiledMLP`'s memory saving has a cost of recomputing forward. Signed-off-by: aeeeeeep <[email protected]>

@sfc-gh-truwase

…eepspeedai#7659) fixes deepspeedai#7650 adding a `value.dim()>0` check to prevent slicing of 0-dim tensors cc @sfc-gh-truwase Signed-off-by: Naveenraj Kamalakannan <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: aeeeeeep <[email protected]>

Signed-off-by: aeeeeeep <[email protected]>

aeeeeeep · 2025-12-10T03:48:05Z

The optimization shows measurable benefits only on a specific accelerator (under NDA), where hardware/driver overhead for memory allocation is significantly higher. This extended allocation latency indirectly delays memory release on communication streams.

While this PR's reduction of allocation operations is theoretically sound, the practical benefit appears hardware-dependent and isn't observable on NVIDIA platforms.

sfc-gh-truwase · 2025-12-10T13:23:33Z

> While this PR's reduction of allocation operations is theoretically sound, the practical benefit appears hardware-dependent and isn't observable on NVIDIA platforms.

Thanks for the explanation. Can you confirm there is no regression on NVIDIA platform?

aeeeeeep · 2025-12-10T13:33:59Z

Confirmed: my tests from a few weeks ago showed no regression on NVIDIA platforms, in either performance or accuracy.
I’ll re-validate with the latest code tomorrow (~20 h from now) and update the code immediately if anything changes.

aeeeeeep requested review from tjruwase and tohtana as code owners October 31, 2025 13:41

Add single parameter allgather optimization for zero3

77a51f7

Signed-off-by: aeeeeeep <[email protected]>

aeeeeeep force-pushed the allgather_single_param branch from f580558 to 77a51f7 Compare October 31, 2025 13:53

aeeeeeep marked this pull request as draft November 1, 2025 06:21

aeeeeeep force-pushed the allgather_single_param branch from 3613b25 to d55f736 Compare November 1, 2025 16:26

aeeeeeep marked this pull request as ready for review November 1, 2025 16:27

aeeeeeep force-pushed the allgather_single_param branch from d55f736 to 30814fa Compare November 1, 2025 16:28

aeeeeeep force-pushed the allgather_single_param branch from efecf04 to d6cd73d Compare November 1, 2025 17:38

aeeeeeep and others added 6 commits November 13, 2025 12:40

adaptor func _allgather_params

d0f5950

Signed-off-by: aeeeeeep <[email protected]>

fix undefined name

205ae2c

Signed-off-by: aeeeeeep <[email protected]>

UlyssesSP: TiledMLP doc - recomputes forward twice (deepspeedai#7664)

5ce2c6c

Make it very clear that `TiledMLP`'s memory saving has a cost of recomputing forward. Signed-off-by: aeeeeeep <[email protected]>

fix prefetch bucket size

d5f5525

Signed-off-by: aeeeeeep <[email protected]>

format

c943e0e

Signed-off-by: aeeeeeep <[email protected]>

aeeeeeep force-pushed the allgather_single_param branch from 52ae961 to c943e0e Compare November 13, 2025 12:40

Merge branch 'master' into allgather_single_param

aac9f5a

Merge branch 'master' into allgather_single_param

4f05e97
