
Some problems when running the code with multiple GPUs #31

@zzzt9

Description


I configured accelerate to use DeepSpeed stage 2 (with CPU offloading) via `accelerate config`, then launched with:

accelerate launch --config_file /root/.cache/huggingface/accelerate/default_config.yaml --debug Train.py --model_name=final_multisubject_subj01 --subj=1 --max_lr=3e-4 --mixup_pct=.33 --num_epochs=150 --use_prior --prior_scale=30 --clip_scale=1 --blurry_recon --blur_scale=.5 --n_blocks=4 --hidden_dim=4096 --num_sessions=40 --batch_size=21

However, it hangs on `accelerator.backward(loss)` without producing any logs.
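For reference, here is a minimal, hypothetical repro of the setup at the hang point, stripped of the MindEyeV2 specifics (the Linear model, TensorDataset, and sizes are placeholders, not the real Train.py objects). If this toy loop also stalls at `accelerator.backward(loss)`, the problem is likely in the accelerate/DeepSpeed configuration rather than the training code; `NCCL_DEBUG=INFO` may reveal which collective op is blocking:

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Better exported in the shell before `accelerate launch`; shown here for visibility.
os.environ.setdefault("NCCL_DEBUG", "INFO")

accelerator = Accelerator()  # picks up the DeepSpeed settings from default_config.yaml

# Toy stand-ins for the real model and data in Train.py.
model = torch.nn.Linear(4096, 4096)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
dl = DataLoader(TensorDataset(torch.randn(210, 4096)), batch_size=21)

model, opt, dl = accelerator.prepare(model, opt, dl)

for step, (x,) in enumerate(dl):
    loss = model(x).pow(2).mean()
    accelerator.print(f"step {step}: pre-backward")
    accelerator.backward(loss)  # the call that reportedly hangs under ZeRO-2 + CPU offload
    opt.step()
    opt.zero_grad()
    accelerator.print(f"step {step}: done")
```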

Besides, when I tried to train a multi-subject model, another issue appeared:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/autodl-tmp/MindEyeV2/src/Train.py", line 736, in <module>
[rank1]:     for behav0, past_behav0, future_behav0, old_behav0 in train_dl: 
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/lib/python3.12/site-packages/accelerate/data_loader.py", line 687, in __iter__
[rank1]:     batch = broadcast(batch, from_process=0)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 392, in wrapper
[rank1]:     raise DistributedOperationException(
[rank1]: accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.

[rank1]: Operation: `accelerate.utils.operations.broadcast`
[rank1]: Input shapes:
[rank1]:   - Process 0: TensorInformation(shape=torch.Size([6, 1, 17]), dtype=torch.float64)
[rank1]:   - Process 1: TensorInformation(shape=torch.Size([6, 15, 17]), dtype=torch.float64)
[rank1]:   - Process 2: TensorInformation(shape=torch.Size([6, 15, 17]), dtype=torch.float64)
[rank1]:   - Process 3: TensorInformation(shape=torch.Size([6, 3, 17]), dtype=torch.float64)
[rank1]:   - Process 4: [[6, 1, 17], [6, 15, 17], [6, 15, 17], [6, 3, 17]]
W0819 21:27:29.179000 140245539309376 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 11424 closing signal SIGTERM
E0819 21:27:32.547000 140245539309376 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 11425) of binary: /root/miniconda3/bin/python
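Judging from the shapes in the error, the batch broadcast from rank 0 fails because one tensor in the batch (apparently `past_behav`) has a variable middle dimension (1 vs. 15 vs. 3 rows) across processes, while accelerate's dispatching dataloader broadcasts the rank-0 batch and requires identical shapes on every rank. Below is a hedged sketch of one possible workaround: padding that tensor to a fixed length before it crosses ranks. The max length of 15 and the `(batch, n_past, feat)` layout are assumptions read off the traceback, not confirmed against the MindEyeV2 data pipeline.

```python
import torch
import torch.nn.functional as F

MAX_PAST = 15  # assumed upper bound, taken from the largest shape in the traceback


def pad_past_behav(t: torch.Tensor, max_len: int = MAX_PAST) -> torch.Tensor:
    """Zero-pad a (batch, n_past, feat) tensor along dim 1 to a fixed length."""
    pad_rows = max_len - t.shape[1]
    if pad_rows <= 0:
        return t[:, :max_len]
    # F.pad pads trailing dims first: (0, 0) leaves dim 2 alone,
    # (0, pad_rows) appends zero rows to dim 1.
    return F.pad(t, (0, 0, 0, pad_rows))


# e.g. inside the training loop, before any cross-rank collective touches the batch:
# past_behav0 = pad_past_behav(past_behav0)
```

Alternatively, constructing the Accelerator with `dispatch_batches=False` (so each rank iterates its own dataloader instead of receiving broadcasts from rank 0) might sidestep the broadcast entirely, though whether that preserves the intended data sharding here is an open question.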

I wonder whether you ran the Python file converted directly from Train.ipynb. Have you encountered the same issues? Hoping for your reply.
