Conversation

@EnriqueGlv commented Mar 18, 2025

Hi! This PR enables multi-GPU support for training DEIM models.

The current deimkit version doesn't support multi-GPU training because it uses the "gloo" backend for torch.distributed (https://pytorch.org/docs/stable/distributed.html).
I enabled multi-GPU support in my local environment by calling dist_utils.setup_distributed() (as in the original DEIM repo).
I also disabled the logger for processes with rank != 0 to keep the output readable.
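For reference, here is a minimal sketch of the idea (the import path and logger handling are assumptions for illustration; the exact code in the commit may differ):

import logging

import torch.distributed as dist

from deimkit.engine.misc import dist_utils  # assumed import path; deimkit vendors the DEIM engine

def setup_multi_gpu():
    # Initialize the process group as in the original DEIM repo (NCCL on GPU).
    dist_utils.setup_distributed()

    # Only rank 0 keeps a verbose logger, so multi-process output stays readable.
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    if rank != 0:
        logging.getLogger().setLevel(logging.ERROR)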

This PR also adds support for torchvision >= 0.21, as described in Intellindust-AI-Lab#47.

Thank you very much for your amazing work on this DEIM wrapper. Do not hesitate to ask for modifications to this PR if needed.

@dnth (Owner) commented Mar 19, 2025

Hi @EnriqueGlv, thanks for this PR! Have you tested this feature?

@EnriqueGlv (Author)

Hi! For sure, I tested this feature in a 4-GPU environment (4x Tesla T4) with the following script:

from deimkit import Trainer, Config, configure_dataset

conf = Config.from_model_name("deim_hgnetv2_s")

conf = configure_dataset(
    config=conf,
    image_size=[640, 640],
    train_ann_file="/path/to/coco/annotations/instances_train.json",
    train_img_folder="/path/to/coco/images/train",
    val_ann_file="/path/to/coco/annotations/instances_val.json",
    val_img_folder="/path/to/coco/images/val",
    train_batch_size=16,
    val_batch_size=16,
    num_classes=7,  # my own dataset has only 7 classes; modify this to fit your dataset
    output_dir="./outputs/deim_hgnetv2_s_pcb",
)

trainer = Trainer(conf)

trainer.fit(
    epochs=100,
    flat_epoch=50,
    no_aug_epoch=3,
    warmup_iter=50,
    ema_warmups=50
)

I ran this script with the command: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7778 --nproc_per_node=4 train.py.

However, my original commit broke support for scripts launched without torchrun. I just fixed that by adding a condition that checks whether the script was launched with torchrun; if not, your original implementation is called.
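A rough sketch of the kind of check I mean (the actual commit may express it differently):

import os

import torch.distributed as dist

def launched_with_torchrun() -> bool:
    # torchrun exports these variables to every worker process it spawns.
    return "RANK" in os.environ and "WORLD_SIZE" in os.environ

if launched_with_torchrun():
    # Multi-GPU path: join the process group created by torchrun (NCCL on GPU).
    dist.init_process_group(backend="nccl")
else:
    # Single-process path: keep the original deimkit behaviour.
    pass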

This is probably not the best way to handle distributed environments in deimkit, but it is the easiest way I found to make it work for my experiments, which is why I wanted to share it here.

A more robust solution, which could also allow running distributed training from notebooks, might be to use torch.multiprocessing to spawn processes without torchrun.
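Something along these lines (purely illustrative, not part of this PR; the rendezvous address and worker body are placeholders):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int):
    # Each spawned process joins the same process group, no torchrun needed.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # placeholder rendezvous address
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    # ... build the Config and Trainer, then call trainer.fit() here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)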

@dnth (Owner) commented Mar 20, 2025

I tried the branch and managed to get torchrun working on multiple GPUs!

But I'm getting an error when I'm not using torchrun, i.e. just plain python train.py. Do you know what might be the cause?

Epoch 0:   0%|          | 0/263 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-5edf0f6432d2> in <cell line: 1>()
----> 1 trainer.fit(epochs=20, save_best_only=True)

/usr/local/lib/python3.10/dist-packages/deimkit/trainer.py in fit(self, epochs, flat_epoch, no_aug_epoch, warmup_iter, ema_warmups, lr, stop_epoch, mixup_epochs, save_best_only)
    380 
    381             # Train for one epoch
--> 382             train_stats = train_one_epoch(
    383                 self_lr_scheduler,
    384                 self.lr_scheduler,

/usr/local/lib/python3.10/dist-packages/deimkit/engine/solver/det_engine.py in train_one_epoch(self_lr_scheduler, lr_scheduler, model, criterion, data_loader, optimizer, device, epoch, max_norm, **kwargs)
     52         if scaler is not None:
     53             with torch.autocast(device_type=str(device), cache_enabled=True):
---> 54                 outputs = model(samples, targets=targets)
     55 
     56             if torch.isnan(outputs['pred_boxes']).any() or torch.isinf(outputs['pred_boxes']).any():

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1734             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735         else:
-> 1736             return self._call_impl(*args, **kwargs)
   1737 
   1738     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1745                 or _global_backward_pre_hooks or _global_backward_hooks
   1746                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747             return forward_call(*args, **kwargs)
   1748 
   1749         result = None

/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py in forward(self, *inputs, **kwargs)
   1641                 self.module.forward(*inputs, **kwargs)
   1642                 if self._delay_all_reduce_all_params
-> 1643                 else self._run_ddp_forward(*inputs, **kwargs)
   1644             )
   1645             return self._post_forward(output)

/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py in _run_ddp_forward(self, *inputs, **kwargs)
   1457         else:
   1458             with self._inside_ddp_forward():
-> 1459                 return self.module(*inputs, **kwargs)  # type: ignore[index]
   1460 
   1461     def _clear_grad_buffer(self):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1734             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735         else:
-> 1736             return self._call_impl(*args, **kwargs)
   1737 
   1738     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1745                 or _global_backward_pre_hooks or _global_backward_hooks
   1746                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747             return forward_call(*args, **kwargs)
   1748 
   1749         result = None

/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/deim.py in forward(self, x, targets)
     27         x = self.backbone(x)
     28         x = self.encoder(x)
---> 29         x = self.decoder(x, targets)
     30 
     31         return x

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1734             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735         else:
-> 1736             return self._call_impl(*args, **kwargs)
   1737 
   1738     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1745                 or _global_backward_pre_hooks or _global_backward_hooks
   1746                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747             return forward_call(*args, **kwargs)
   1748 
   1749         result = None

/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in forward(self, feats, targets)
    722 
    723         init_ref_contents, init_ref_points_unact, enc_topk_bboxes_list, enc_topk_logits_list = \
--> 724             self._get_decoder_input(memory, spatial_shapes, denoising_logits, denoising_bbox_unact)
    725 
    726         # decoder

/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in _get_decoder_input(self, memory, spatial_shapes, denoising_logits, denoising_bbox_unact)
    638         # prepare input for decoder
    639         if self.training or self.eval_spatial_size is None:
--> 640             anchors, valid_mask = self._generate_anchors(spatial_shapes, device=memory.device)
    641         else:
    642             anchors = self.anchors

/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in _generate_anchors(self, spatial_shapes, grid_size, dtype, device)
    622             anchors.append(lvl_anchors)
    623 
--> 624         anchors = torch.concat(anchors, dim=1).to(device)
    625         valid_mask = ((anchors > self.eps) * (anchors < 1 - self.eps)).all(-1, keepdim=True)
    626         anchors = torch.log(anchors / (1 - anchors))

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@EnriqueGlv (Author)

Hi, thanks for trying the feature! On my machine, I get no errors when I run the code with plain python train.py or when I execute it from a notebook.

Indeed, the condition I added in the last commit ensures that if the script is not run with torchrun, the original deimkit code is called.

However, I managed to reproduce your error by running CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7778 --nproc_per_node=4 train.py in a notebook cell and then executing trainer.fit() in the same notebook.

The reason for the issue is that notebooks keep the environment between cell executions; since torchrun initializes torch.distributed, you get this error because the leftover environment variables no longer correspond to the distributed initialization you actually have.

To avoid this issue, just restart your notebook's kernel.
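If you want to confirm that leftover distributed state is the cause before restarting, a quick check (illustrative only) is to print the variables torchrun sets:

import os

for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(key, os.environ.get(key))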

If this is not the issue you had, do not hesitate to tell me!

@dnth merged commit 529cb86 into dnth:main Mar 21, 2025
@dnth (Owner) commented Mar 21, 2025

Very cool! Thanks a lot for the contribution @EnriqueGlv
