Conversation

@EnriqueGlv commented Mar 18, 2025

Hi! This PR enables multi-GPU support for training DEIM models.

The current deimkit version doesn't support multi-GPU training because it uses the "gloo" backend for torch.distributed (https://pytorch.org/docs/stable/distributed.html).
I enabled multi-GPU support in my local environment by calling dist_utils.setup_distributed() (as in the original DEIM repo).
I also disabled the logger for processes with rank != 0 to keep the output readable.
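For reference, here is a minimal sketch of the idea (the import path and logger handling are assumptions for illustration; the exact code in the commit may differ):

import logging

import torch.distributed as dist

from deimkit.engine.misc import dist_utils  # assumed import path; deimkit vendors the DEIM engine

def setup_multi_gpu():
    # Initialize the process group as in the original DEIM repo (NCCL on GPU).
    dist_utils.setup_distributed()

    # Only rank 0 keeps a verbose logger, so multi-process output stays readable.
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    if rank != 0:
        logging.getLogger().setLevel(logging.ERROR)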

This PR also adds support for torchvision >= 0.21, as described in Intellindust-AI-Lab#47.

Thank you very much for your amazing work on this DEIM wrapper. Do not hesitate to ask for modifications to this PR if needed.

@dnth (Owner) commented Mar 19, 2025

Hi @EnriqueGlv, thanks for this PR! Have you tested this feature?

@EnriqueGlv (Author)

Hi! For sure, I tested this feature in a 4-GPU environment (4x Tesla T4) with the following script:

from deimkit import Trainer, Config, configure_dataset

conf = Config.from_model_name("deim_hgnetv2_s")

conf = configure_dataset(
    config=conf,
    image_size=[640, 640],
    train_ann_file="/path/to/coco/annotations/instances_train.json",
    train_img_folder="/path/to/coco/images/train",
    val_ann_file="/path/to/coco/annotations/instances_val.json",
    val_img_folder="/path/to/coco/images/val",
    train_batch_size=16,
    val_batch_size=16,
    num_classes=7,  # my own dataset has only 7 classes; modify this to fit your dataset
    output_dir="./outputs/deim_hgnetv2_s_pcb",
)

trainer = Trainer(conf)

trainer.fit(
    epochs=100,
    flat_epoch=50,
    no_aug_epoch=3,
    warmup_iter=50,
    ema_warmups=50
)

I ran this script with the command: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7778 --nproc_per_node=4 train.py.

However, my original commit broke support for scripts launched without torchrun. I just fixed that by adding a condition that checks whether the script was launched with torchrun; if not, your original implementation is called.
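A rough sketch of the kind of check I mean (the actual commit may express it differently):

import os

import torch.distributed as dist

def launched_with_torchrun() -> bool:
    # torchrun exports these variables to every worker process it spawns.
    return "RANK" in os.environ and "WORLD_SIZE" in os.environ

if launched_with_torchrun():
    # Multi-GPU path: join the process group created by torchrun (NCCL on GPU).
    dist.init_process_group(backend="nccl")
else:
    # Single-process path: keep the original deimkit behaviour.
    pass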

This is probably not the best way to handle distributed environments in deimkit, but it is the easiest way I found to make it work for my experiments, which is why I wanted to share it here.

A more robust solution, which could also allow running distributed training from notebooks, might be to use torch.multiprocessing to spawn processes without torchrun.
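Something along these lines (purely illustrative, not part of this PR; the rendezvous address and worker body are placeholders):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int):
    # Each spawned process joins the same process group, no torchrun needed.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",  # placeholder rendezvous address
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(rank)
    # ... build the Config and Trainer, then call trainer.fit() here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)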

@dnth (Owner) commented Mar 20, 2025

I tried the branch and managed to get torchrun working on multiple GPUs!

But I'm getting an error when I'm not using torchrun, i.e. just plain python train.py. Do you know what might be the cause?

Epoch 0:   0%|          | 0/263 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-5edf0f6432d2> in <cell line: 1>()
----> 1 trainer.fit(epochs=20, save_best_only=True)

/usr/local/lib/python3.10/dist-packages/deimkit/trainer.py in fit(self, epochs, flat_epoch, no_aug_epoch, warmup_iter, ema_warmups, lr, stop_epoch, mixup_epochs, save_best_only)
    380 
    381             # Train for one epoch
--> 382             train_stats = train_one_epoch(
    383                 self_lr_scheduler,
    384                 self.lr_scheduler,

/usr/local/lib/python3.10/dist-packages/deimkit/engine/solver/det_engine.py in train_one_epoch(self_lr_scheduler, lr_scheduler, model, criterion, data_loader, optimizer, device, epoch, max_norm, **kwargs)
     52         if scaler is not None:
     53             with torch.autocast(device_type=str(device), cache_enabled=True):
---> 54                 outputs = model(samples, targets=targets)
     55 
     56             if torch.isnan(outputs['pred_boxes']).any() or torch.isinf(outputs['pred_boxes']).any():

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1734             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735         else:
-> 1736             return self._call_impl(*args, **kwargs)
   1737 
   1738     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1745                 or _global_backward_pre_hooks or _global_backward_hooks
   1746                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747             return forward_call(*args, **kwargs)
   1748 
   1749         result = None

/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py in forward(self, *inputs, **kwargs)
   1641                 self.module.forward(*inputs, **kwargs)
   1642                 if self._delay_all_reduce_all_params
-> 1643                 else self._run_ddp_forward(*inputs, **kwargs)
   1644             )
   1645             return self._post_forward(output)

/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py in _run_ddp_forward(self, *inputs, **kwargs)
   1457         else:
   1458             with self._inside_ddp_forward():
-> 1459                 return self.module(*inputs, **kwargs)  # type: ignore[index]
   1460 
   1461     def _clear_grad_buffer(self):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1734             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735         else:
-> 1736             return self._call_impl(*args, **kwargs)
   1737 
   1738     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1745                 or _global_backward_pre_hooks or _global_backward_hooks
   1746                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747             return forward_call(*args, **kwargs)
   1748 
   1749         result = None

/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/deim.py in forward(self, x, targets)
     27         x = self.backbone(x)
     28         x = self.encoder(x)
---> 29         x = self.decoder(x, targets)
     30 
     31         return x

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1734             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735         else:
-> 1736             return self._call_impl(*args, **kwargs)
   1737 
   1738     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1745                 or _global_backward_pre_hooks or _global_backward_hooks
   1746                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747             return forward_call(*args, **kwargs)
   1748 
   1749         result = None

/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in forward(self, feats, targets)
    722 
    723         init_ref_contents, init_ref_points_unact, enc_topk_bboxes_list, enc_topk_logits_list = \
--> 724             self._get_decoder_input(memory, spatial_shapes, denoising_logits, denoising_bbox_unact)
    725 
    726         # decoder

/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in _get_decoder_input(self, memory, spatial_shapes, denoising_logits, denoising_bbox_unact)
    638         # prepare input for decoder
    639         if self.training or self.eval_spatial_size is None:
--> 640             anchors, valid_mask = self._generate_anchors(spatial_shapes, device=memory.device)
    641         else:
    642             anchors = self.anchors

/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in _generate_anchors(self, spatial_shapes, grid_size, dtype, device)
    622             anchors.append(lvl_anchors)
    623 
--> 624         anchors = torch.concat(anchors, dim=1).to(device)
    625         valid_mask = ((anchors > self.eps) * (anchors < 1 - self.eps)).all(-1, keepdim=True)
    626         anchors = torch.log(anchors / (1 - anchors))

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@EnriqueGlv (Author)

Hi, thanks for trying the feature! On my machine, I get no errors when I run the code with plain python train.py or when I execute it from a notebook.

Indeed, the condition I added in the last commit ensures that if the script is not run with torchrun, the original deimkit code is called.

However, I managed to reproduce your error by running CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=7778 --nproc_per_node=4 train.py in a notebook cell and then executing trainer.fit() in the same notebook.

The reason for the issue is that notebooks keep the environment between cell executions; since torchrun initializes torch.distributed, you get this error because the leftover environment variables no longer correspond to the distributed initialization you actually have.

To avoid this issue, just restart your notebook's kernel.
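If you want to confirm that leftover distributed state is the cause before restarting, a quick check (illustrative only) is to print the variables torchrun sets:

import os

for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(key, os.environ.get(key))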

If this is not the issue you had, do not hesitate to tell me!

@dnth merged commit 529cb86 into dnth:main Mar 21, 2025
@dnth (Owner) commented Mar 21, 2025

Very cool! Thanks a lot for the contribution @EnriqueGlv
