
Different DDP ranks have varying batch normalization (BN) statistics after the PreciseBN hook because precise_bn in fvcore does not synchronize the batch size across ranks. #5409

Open

jobmart opened this issue Dec 19, 2024 · 1 comment

jobmart commented Dec 19, 2024

Bug Report: Different Batch Normalization (BN) Statistics After PreciseBN Hook
Steps to Reproduce:

1. Clone the repository:

   git clone https://github.com/facebookresearch/moco.git
   cd moco/detection

2. Run the following command:

   python train_net.py \
     --config-file configs/pascal_voc_R_50_C4_24k.yaml \
     --num-gpus 8 \
     OUTPUT_DIR "temp/train" \
     SEED 0 \
     SOLVER.MAX_ITER 1
Observations:

During update_bn_stats, the batch sizes differ across Distributed Data Parallel (DDP) ranks.
This lack of synchronization causes inconsistent BN statistics on different ranks after the PreciseBN hook.
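
To make the divergence concrete, the following is a minimal diagnostic sketch (my own code, not part of Detectron2 or fvcore) that gathers each rank's BN running_mean after update_bn_stats and reports the largest gap. It assumes torch.distributed has already been initialized by the launcher and that model is the DDP-wrapped model:

import torch
import torch.distributed as dist

def report_bn_divergence(model):
    # Iterate over the BN layers of the underlying module (model.module for DDP).
    for name, module in model.module.named_modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            buf = module.running_mean.detach().clone()
            gathered = [torch.zeros_like(buf) for _ in range(dist.get_world_size())]
            dist.all_gather(gathered, buf)
            if dist.get_rank() == 0:
                # Largest element-wise difference between rank 0 and any other rank.
                max_diff = max((g - gathered[0]).abs().max().item() for g in gathered)
                print(f"{name}: max running_mean gap across ranks = {max_diff:.3e}")

all_gather is used rather than all_reduce so that rank 0 can inspect every rank's buffer instead of only an aggregate.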
Expected Behavior:

The BN statistics should be consistent across all ranks after the PreciseBN hook.
Environment:

OS: Linux
Python Version: 3.8.0
Framework: Detectron2 v0.6
PyTorch: 2.4.1+cu124
GPU: NVIDIA GeForce RTX 4090 (8 GPUs)
CUDA: Version 12.4
Other dependencies:
fvcore: 0.1.5.post20221221
iopath: 0.1.9
Command to Collect Full Environment Details:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/main/detectron2/utils/collect_env.py
python collect_env.py
Note:
No private dataset is required to reproduce this bug. All required resources are publicly available.

@micedevai

After running the PreciseBN hook, different DDP ranks end up with different Batch Normalization (BN) statistics. This occurs because precise_bn in fvcore does not synchronize the batch size across DDP ranks.

Steps to Reproduce:

  1. Clone the repository:
    git clone https://github.com/facebookresearch/moco.git
    cd moco/detection
  2. Run the training script:
    python train_net.py --config-file configs/pascal_voc_R_50_C4_24k.yaml --num-gpus 8 OUTPUT_DIR "temp/train" SEED 0 SOLVER.MAX_ITER 1

Observations:

During the update_bn_stats phase, the batch sizes across DDP ranks are inconsistent. This lack of synchronization causes the BN statistics to differ after the PreciseBN hook.

Expected Behavior:

Batch Normalization statistics should be synchronized across all ranks after the PreciseBN hook, so that every GPU ends up with the same running mean and variance.
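
One possible workaround, sketched below as my own code rather than an existing fvcore API, is to all-reduce the BN buffers after update_bn_stats so every rank ends up with the same averaged statistics. A per-rank sample-count weighting would match precise BN more faithfully; the unweighted mean shown here only restores consistency across ranks:

import torch
import torch.distributed as dist

def sync_bn_buffers(model):
    world_size = dist.get_world_size()
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            # Average running_mean and running_var over all ranks, in place.
            for buf in (module.running_mean, module.running_var):
                dist.all_reduce(buf, op=dist.ReduceOp.SUM)
                buf.div_(world_size)

Calling sync_bn_buffers(model) on every rank immediately after the PreciseBN hook would leave identical buffers everywhere; the proper fix would presumably live inside fvcore's update_bn_stats itself.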

Environment:

  • OS: Linux
  • Python Version: 3.8.0
  • Framework: Detectron2 v0.6
  • PyTorch Version: 2.4.1+cu124
  • GPU: NVIDIA GeForce RTX 4090 (8 GPUs)
  • CUDA Version: 12.4
  • fvcore: 0.1.5.post20221221
  • iopath: 0.1.9

Steps to Collect Full Environment Details:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/main/detectron2/utils/collect_env.py
python collect_env.py

This should help explain the issue and provide clear steps for reproduction and expectations for a fix.
