Fixed multi-GPU training #6
Conversation
Hi @EnriqueGlv, thanks for this PR! Have you tested this feature?
Hi! For sure, I tested this feature on a 4-GPU environment (4x Tesla T4) with the following script:

from deimkit import Trainer, Config, configure_dataset

conf = Config.from_model_name("deim_hgnetv2_s")
conf = configure_dataset(
    config=conf,
    image_size=[640, 640],
    train_ann_file="/path/to/coco/annotations/instances_train.json",
    train_img_folder="/path/to/coco/images/train",
    val_ann_file="/path/to/coco/annotations/instances_val.json",
    val_img_folder="/path/to/coco/images/val",
    train_batch_size=16,
    val_batch_size=16,
    num_classes=7,  # I used my own dataset with only 7 classes, modify it to fit your dataset
    output_dir="./outputs/deim_hgnetv2_s_pcb",
)

trainer = Trainer(conf)
trainer.fit(
    epochs=100,
    flat_epoch=50,
    no_aug_epoch=3,
    warmup_iter=50,
    ema_warmups=50,
)

And I ran this script with torchrun. However, my original commit was breaking support for scripts launched without torchrun. This is probably not the best way to handle distributed environments in deimkit, but it is the easiest way I found to make it work for my experiments, which is why I wanted to share it here. A more robust solution that could allow running distributed training from notebooks might be to use another launching mechanism.
I tried the branch and managed to get torchrun working on multi-GPU! But I'm getting an error when I'm not using torchrun, i.e. just a plain run:

Epoch 0: 0%| | 0/263 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-5-5edf0f6432d2> in <cell line: 1>()
----> 1 trainer.fit(epochs=20, save_best_only=True)
/usr/local/lib/python3.10/dist-packages/deimkit/trainer.py in fit(self, epochs, flat_epoch, no_aug_epoch, warmup_iter, ema_warmups, lr, stop_epoch, mixup_epochs, save_best_only)
380
381 # Train for one epoch
--> 382 train_stats = train_one_epoch(
383 self_lr_scheduler,
384 self.lr_scheduler,
/usr/local/lib/python3.10/dist-packages/deimkit/engine/solver/det_engine.py in train_one_epoch(self_lr_scheduler, lr_scheduler, model, criterion, data_loader, optimizer, device, epoch, max_norm, **kwargs)
52 if scaler is not None:
53 with torch.autocast(device_type=str(device), cache_enabled=True):
---> 54 outputs = model(samples, targets=targets)
55
56 if torch.isnan(outputs['pred_boxes']).any() or torch.isinf(outputs['pred_boxes']).any():
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
1734 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
1737
1738 # torchrec tests the code consistency with the following code
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1748
1749 result = None
/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py in forward(self, *inputs, **kwargs)
1641 self.module.forward(*inputs, **kwargs)
1642 if self._delay_all_reduce_all_params
-> 1643 else self._run_ddp_forward(*inputs, **kwargs)
1644 )
1645 return self._post_forward(output)
/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py in _run_ddp_forward(self, *inputs, **kwargs)
1457 else:
1458 with self._inside_ddp_forward():
-> 1459 return self.module(*inputs, **kwargs) # type: ignore[index]
1460
1461 def _clear_grad_buffer(self):
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
1734 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
1737
1738 # torchrec tests the code consistency with the following code
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1748
1749 result = None
/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/deim.py in forward(self, x, targets)
27 x = self.backbone(x)
28 x = self.encoder(x)
---> 29 x = self.decoder(x, targets)
30
31 return x
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
1734 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
1737
1738 # torchrec tests the code consistency with the following code
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1748
1749 result = None
/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in forward(self, feats, targets)
722
723 init_ref_contents, init_ref_points_unact, enc_topk_bboxes_list, enc_topk_logits_list = \
--> 724 self._get_decoder_input(memory, spatial_shapes, denoising_logits, denoising_bbox_unact)
725
726 # decoder
/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in _get_decoder_input(self, memory, spatial_shapes, denoising_logits, denoising_bbox_unact)
638 # prepare input for decoder
639 if self.training or self.eval_spatial_size is None:
--> 640 anchors, valid_mask = self._generate_anchors(spatial_shapes, device=memory.device)
641 else:
642 anchors = self.anchors
/usr/local/lib/python3.10/dist-packages/deimkit/engine/deim/dfine_decoder.py in _generate_anchors(self, spatial_shapes, grid_size, dtype, device)
622 anchors.append(lvl_anchors)
623
--> 624 anchors = torch.concat(anchors, dim=1).to(device)
625 valid_mask = ((anchors > self.eps) * (anchors < 1 - self.eps)).all(-1, keepdim=True)
626 anchors = torch.log(anchors / (1 - anchors))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Hi, thanks for trying the feature! On my device, I get no errors when I run the code with plain python only. Indeed, the condition I added in the last commit ensures that if the script is not run with torchrun, the original deimkit code path is used.

However, I managed to reproduce your error by running the code in a notebook. The reason for the issue is that notebooks keep the environment between cell executions, and since torchrun initializes the distributed environment, a later plain run in the same session can still take the distributed code path.

To avoid this issue, just restart your notebook's kernel. If this is not the issue you had, do not hesitate to tell me!
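To make the lingering-state explanation concrete, here is a small illustrative check one can run in a notebook cell; the exact variables deimkit looks at are an assumption on my part:

import os

import torch.distributed as dist

# Environment variables typically exported by torchrun. If they are still set
# (or a process group is still initialized) from an earlier run in the same
# kernel, a later plain run may wrongly take the distributed code path.
for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(var, "=", os.environ.get(var))

print("process group initialized:", dist.is_initialized())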
Very cool! Thanks a lot for the contribution @EnriqueGlv |
Hi! This PR aims at enabling multi-GPU support for training DEIM models.

Indeed, the current deimkit version doesn't support multi-GPU training because it uses the "gloo" backend for torch.distributed (https://pytorch.org/docs/stable/distributed.html). The way I enabled multi-GPU support in my local environment was by calling dist_utils.setup_distributed() (as in the original DEIM repo). I also disabled the logger for processes with rank != 0 for more readable output.
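A minimal sketch of the rank-0-only logging idea, using the standard logging module for illustration; the actual logger and rank lookup used in deimkit may differ:

import logging
import os


def configure_logging() -> None:
    # Only the rank-0 process logs at INFO level; other ranks are raised to
    # ERROR so multi-GPU output stays readable. RANK is the variable torchrun
    # exports; defaulting to 0 keeps single-process runs verbose as before.
    rank = int(os.environ.get("RANK", "0"))
    logging.basicConfig(level=logging.INFO if rank == 0 else logging.ERROR)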
This PR also adds support for torchvision >= 0.21, as described in Intellindust-AI-Lab#47.
Thank you very much for your amazing work on this DEIM wrapper. Do not hesitate to ask me for modifications to this PR if needed.