
Some problems when running the code with multiple GPUs #31

@zzzt9

Description


I configured accelerate to use DeepSpeed stage 2 (with CPU offloading) via `accelerate config`, then launched with:

accelerate launch --config_file /root/.cache/huggingface/accelerate/default_config.yaml --debug Train.py --model_name=final_multisubject_subj01 --subj=1 --max_lr=3e-4 --mixup_pct=.33 --num_epochs=150 --use_prior --prior_scale=30 --clip_scale=1 --blurry_recon --blur_scale=.5 --n_blocks=4 --hidden_dim=4096 --num_sessions=40 --batch_size=21

However, it hangs on `accelerator.backward(loss)` without producing any logs.
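For reference, here is a minimal, hypothetical repro of the setup at the hang point, stripped of the MindEyeV2 specifics (the Linear model, TensorDataset, and sizes are placeholders, not the real Train.py objects). If this toy loop also stalls at `accelerator.backward(loss)`, the problem is likely in the accelerate/DeepSpeed configuration rather than the training code; `NCCL_DEBUG=INFO` may reveal which collective op is blocking:

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Better exported in the shell before `accelerate launch`; shown here for visibility.
os.environ.setdefault("NCCL_DEBUG", "INFO")

accelerator = Accelerator()  # picks up the DeepSpeed settings from default_config.yaml

# Toy stand-ins for the real model and data in Train.py.
model = torch.nn.Linear(4096, 4096)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
dl = DataLoader(TensorDataset(torch.randn(210, 4096)), batch_size=21)

model, opt, dl = accelerator.prepare(model, opt, dl)

for step, (x,) in enumerate(dl):
    loss = model(x).pow(2).mean()
    accelerator.print(f"step {step}: pre-backward")
    accelerator.backward(loss)  # the call that reportedly hangs under ZeRO-2 + CPU offload
    opt.step()
    opt.zero_grad()
    accelerator.print(f"step {step}: done")
```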

Besides, when I tried to train a multi-subject model, another issue appeared:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/autodl-tmp/MindEyeV2/src/Train.py", line 736, in <module>
[rank1]:     for behav0, past_behav0, future_behav0, old_behav0 in train_dl: 
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/lib/python3.12/site-packages/accelerate/data_loader.py", line 687, in __iter__
[rank1]:     batch = broadcast(batch, from_process=0)
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/root/miniconda3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 392, in wrapper
[rank1]:     raise DistributedOperationException(
[rank1]: accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.

[rank1]: Operation: `accelerate.utils.operations.broadcast`
[rank1]: Input shapes:
[rank1]:   - Process 0: TensorInformation(shape=torch.Size([6, 1, 17]), dtype=torch.float64)
[rank1]:   - Process 1: TensorInformation(shape=torch.Size([6, 15, 17]), dtype=torch.float64)
[rank1]:   - Process 2: TensorInformation(shape=torch.Size([6, 15, 17]), dtype=torch.float64)
[rank1]:   - Process 3: TensorInformation(shape=torch.Size([6, 3, 17]), dtype=torch.float64)
[rank1]:   - Process 4: [[6, 1, 17], [6, 15, 17], [6, 15, 17], [6, 3, 17]]
W0819 21:27:29.179000 140245539309376 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 11424 closing signal SIGTERM
E0819 21:27:32.547000 140245539309376 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 11425) of binary: /root/miniconda3/bin/python
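Judging from the shapes in the error, the batch broadcast from rank 0 fails because one tensor in the batch (apparently `past_behav`) has a variable middle dimension (1 vs. 15 vs. 3 rows) across processes, while accelerate's dispatching dataloader broadcasts the rank-0 batch and requires identical shapes on every rank. Below is a hedged sketch of one possible workaround: padding that tensor to a fixed length before it crosses ranks. The max length of 15 and the `(batch, n_past, feat)` layout are assumptions read off the traceback, not confirmed against the MindEyeV2 data pipeline.

```python
import torch
import torch.nn.functional as F

MAX_PAST = 15  # assumed upper bound, taken from the largest shape in the traceback


def pad_past_behav(t: torch.Tensor, max_len: int = MAX_PAST) -> torch.Tensor:
    """Zero-pad a (batch, n_past, feat) tensor along dim 1 to a fixed length."""
    pad_rows = max_len - t.shape[1]
    if pad_rows <= 0:
        return t[:, :max_len]
    # F.pad pads trailing dims first: (0, 0) leaves dim 2 alone,
    # (0, pad_rows) appends zero rows to dim 1.
    return F.pad(t, (0, 0, 0, pad_rows))


# e.g. inside the training loop, before any cross-rank collective touches the batch:
# past_behav0 = pad_past_behav(past_behav0)
```

Alternatively, constructing the Accelerator with `dispatch_batches=False` (so each rank iterates its own dataloader instead of receiving broadcasts from rank 0) might sidestep the broadcast entirely, though whether that preserves the intended data sharding here is an open question.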

I wonder whether you ran the Python file converted directly from Train.ipynb. Have you encountered the same issues? Hoping for your reply.
