
Different DDP ranks have varying batch normalization (BN) statistics after the PreciseBN hook because precise_bn in fvcore does not synchronize the batch size across ranks. #5409

Open

jobmart opened this issue Dec 19, 2024 · 1 comment

jobmart commented Dec 19, 2024

Bug Report: Different Batch Normalization (BN) Statistics After PreciseBN Hook
Steps to Reproduce:

1. Clone the repository:

   git clone https://github.com/facebookresearch/moco.git
   cd moco/detection

2. Run the following command:

   python train_net.py \
     --config-file configs/pascal_voc_R_50_C4_24k.yaml \
     --num-gpus 8 \
     OUTPUT_DIR "temp/train" \
     SEED 0 \
     SOLVER.MAX_ITER 1
Observations:

During update_bn_stats, the batch sizes differ across Distributed Data Parallel (DDP) ranks.
This lack of synchronization causes inconsistent BN statistics on different ranks after the PreciseBN hook.
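
To make the divergence concrete, the following is a minimal diagnostic sketch (my own code, not part of Detectron2 or fvcore) that gathers each rank's BN running_mean after update_bn_stats and reports the largest gap. It assumes torch.distributed has already been initialized by the launcher and that model is the DDP-wrapped model:

import torch
import torch.distributed as dist

def report_bn_divergence(model):
    # Iterate over the BN layers of the underlying module (model.module for DDP).
    for name, module in model.module.named_modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            buf = module.running_mean.detach().clone()
            gathered = [torch.zeros_like(buf) for _ in range(dist.get_world_size())]
            dist.all_gather(gathered, buf)
            if dist.get_rank() == 0:
                # Largest element-wise difference between rank 0 and any other rank.
                max_diff = max((g - gathered[0]).abs().max().item() for g in gathered)
                print(f"{name}: max running_mean gap across ranks = {max_diff:.3e}")

all_gather is used rather than all_reduce so that rank 0 can inspect every rank's buffer instead of only an aggregate.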
Expected Behavior:

The BN statistics should be consistent across all ranks after the PreciseBN hook.
Environment:

OS: Linux
Python Version: 3.8.0
Framework: Detectron2 v0.6
PyTorch: 2.4.1+cu124
GPU: NVIDIA GeForce RTX 4090 (8 GPUs)
CUDA: Version 12.4
Other dependencies:
fvcore: 0.1.5.post20221221
iopath: 0.1.9
Command to Collect Full Environment Details:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/main/detectron2/utils/collect_env.py
python collect_env.py
Note:
No private dataset is required to reproduce this bug. All required resources are publicly available.

@micedevai

After running the PreciseBN hook, different DDP ranks end up with different Batch Normalization (BN) statistics. This occurs because precise_bn in fvcore does not synchronize the batch size across DDP ranks.

Steps to Reproduce:

  1. Clone the repository:
    git clone https://github.com/facebookresearch/moco.git
    cd moco/detection
  2. Run the training script:
    python train_net.py --config-file configs/pascal_voc_R_50_C4_24k.yaml --num-gpus 8 OUTPUT_DIR "temp/train" SEED 0 SOLVER.MAX_ITER 1

Observations:

During the update_bn_stats phase, the batch sizes across DDP ranks are inconsistent. This lack of synchronization causes the BN statistics to differ after the PreciseBN hook.

Expected Behavior:

Batch Normalization statistics should be synchronized across all ranks after the PreciseBN hook, so that every GPU ends up with the same running mean and variance.
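
One possible workaround, sketched below as my own code rather than an existing fvcore API, is to all-reduce the BN buffers after update_bn_stats so every rank ends up with the same averaged statistics. A per-rank sample-count weighting would match precise BN more faithfully; the unweighted mean shown here only restores consistency across ranks:

import torch
import torch.distributed as dist

def sync_bn_buffers(model):
    world_size = dist.get_world_size()
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            # Average running_mean and running_var over all ranks, in place.
            for buf in (module.running_mean, module.running_var):
                dist.all_reduce(buf, op=dist.ReduceOp.SUM)
                buf.div_(world_size)

Calling sync_bn_buffers(model) on every rank immediately after the PreciseBN hook would leave identical buffers everywhere; the proper fix would presumably live inside fvcore's update_bn_stats itself.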

Environment:

  • OS: Linux
  • Python Version: 3.8.0
  • Framework: Detectron2 v0.6
  • PyTorch Version: 2.4.1+cu124
  • GPU: NVIDIA GeForce RTX 4090 (8 GPUs)
  • CUDA Version: 12.4
  • fvcore: 0.1.5.post20221221
  • iopath: 0.1.9

Steps to Collect Full Environment Details:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/main/detectron2/utils/collect_env.py
python collect_env.py

This should help explain the issue and provide clear steps for reproduction and expectations for a fix.
