Gradient sync not working on some model architectures with FSDP in TorchTitan #1014
-
cc @mori360
-
@JohanSchalkwyk1 Thanks for the issue.
-
Some update on things I've tried that work. The following change results in weights being updated:

[rank0]: (adapter): OptimizedModule(

Note that here every layer became an FSDP layer. The following does not work:

[rank0]: (adapter): OptimizedModule(

i.e. FSDP is only on the outer layer. Looking at the transformer FSDP code, it doesn't look like it wraps each Linear in an FSDPLinear.
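For reference, a rough sketch of the two wrapping schemes described above, using the FSDP2 `fully_shard` API that TorchTitan builds on. The module names (`model.adapter.layers`, etc.) are placeholders rather than the actual model, and the import path can differ between PyTorch versions:

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard


def wrap_per_layer(model: nn.Module) -> nn.Module:
    # Variant where every layer becomes an FSDP layer: shard each adapter
    # layer individually, then the adapter, then the root module.
    for layer in model.adapter.layers:
        fully_shard(layer)
    fully_shard(model.adapter)
    fully_shard(model)
    return model


def wrap_root_only(model: nn.Module) -> nn.Module:
    # Variant where FSDP sits only on the outermost module, so inner
    # Linear / RMSNorm layers are not individually wrapped.
    fully_shard(model)
    return model
```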
-
The transformer structure would be as follows:

[rank0]: (0): FSDPOptimizedModule(

FSDP is only on the top level.
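One way to confirm how far down the tree FSDP was applied is to walk the module hierarchy and list which submodules were actually converted. A minimal sketch, assuming FSDP2, where `fully_shard` swaps the wrapped module's class to an `FSDP*` subclass (the `FSDPModule` import path may differ by PyTorch version):

```python
from torch.distributed._composable.fsdp import FSDPModule


def report_fsdp_wrapping(model):
    # Print every submodule that fully_shard converted; a wrapped Linear
    # shows up as FSDPLinear, while an unwrapped one stays Linear.
    for name, module in model.named_modules():
        if isinstance(module, FSDPModule):
            print(f"FSDP-wrapped: {name or '<root>'} -> {type(module).__name__}")
```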
-
I have the following model architecture (a Speech Language Model), which I parallelize with FSDP. The basic design looks like this:

FSDPWhisperEncoder -> FSDPAdapter -> FSDPLlamaLLM
Only the weights of the FSDPAdapter (26M) are trained. I set param.requires_grad accordingly.
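As a rough sketch of that freezing step (the `adapter` attribute name is a placeholder, not the real code):

```python
def freeze_all_but_adapter(model):
    # Freeze everything, then re-enable gradients only for the adapter,
    # so only its ~26M parameters receive updates.
    for param in model.parameters():
        param.requires_grad = False
    for param in model.adapter.parameters():
        param.requires_grad = True
```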
I've noticed that the model never trains. I track both the per-layer grad_norm and the weight_norm of the variables, and none of them change (for example, the weight norm of the RMSNorm layer stays constant throughout training). After some time debugging I tried the following composition:
FSDPWhisperEncoder -> DDPAdapter -> FSDPLlamaLLM
which does train correctly (on a smaller model): losses converge, the grad norm goes down, etc. However, larger models seem unstable (NCCL timeouts), so this doesn't feel like a robust solution; it feels like the standard FSDP approach should work. Any advice on how to pinpoint why the model is not updating under this configuration would be much appreciated.
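For reference, a minimal sketch of the per-layer tracking mentioned above; it assumes FSDP2-style DTensor parameters and reports local-shard norms, which is still enough to see whether a weight is changing at all:

```python
import torch


@torch.no_grad()
def log_param_norms(model, step):
    # Log weight norm and grad norm for every trainable parameter; if a
    # trainable weight's norm never changes across steps, it is not being
    # updated.
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        weight = param.to_local() if hasattr(param, "to_local") else param
        grad = param.grad
        if grad is not None and hasattr(grad, "to_local"):
            grad = grad.to_local()
        grad_norm = grad.norm().item() if grad is not None else 0.0
        print(f"step={step} {name}: weight_norm={weight.norm().item():.4f} "
              f"grad_norm={grad_norm:.4f}")
```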