Different DDP ranks have varying batch normalization (BN) statistics after the PreciseBN hook because precise_bn in fvcore does not synchronize the batch size across ranks.
#5409 · Open
jobmart opened this issue Dec 19, 2024 · 1 comment
Bug Report: Different Batch Normalization (BN) Statistics After PreciseBN Hook
Steps to Reproduce:
Clone the repository:
```bash
git clone https://github.com/facebookresearch/moco.git
cd moco/detection
```
Run the following command:
```bash
python train_net.py \
  --config-file configs/pascal_voc_R_50_C4_24k.yaml \
  --num-gpus 8 \
  OUTPUT_DIR "temp/train" \
  SEED 0 \
  SOLVER.MAX_ITER 1
```
Observations:
During update_bn_stats, the batch sizes differ across Distributed Data Parallel (DDP) ranks.
This lack of synchronization causes inconsistent BN statistics on different ranks after the PreciseBN hook.
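The effect of the missing synchronization can be illustrated with a small single-process sketch (plain Python, no torch or fvcore; the per-rank batches and sizes are made up for illustration):

```python
# Hypothetical per-rank batches; sizes differ across ranks, as observed above.
rank_batches = {
    0: [1.0, 2.0, 3.0],  # rank 0 sees 3 samples
    1: [10.0],           # rank 1 sees 1 sample
}

# Without synchronization each rank keeps only the mean of its own samples,
# so the BN buffers end up different on every rank.
local_means = {r: sum(b) / len(b) for r, b in rank_batches.items()}
# rank 0 ends with mean 2.0, rank 1 with mean 10.0

# The batch-size-weighted global mean that every rank should agree on.
n_total = sum(len(b) for b in rank_batches.values())
global_mean = sum(sum(b) for b in rank_batches.values()) / n_total
# (3 * 2.0 + 1 * 10.0) / 4 = 4.0
```

Note that even averaging the per-rank means without batch-size weights would give (2.0 + 10.0) / 2 = 6.0, not 4.0 — which is why the batch sizes themselves must be synchronized, not just the statistics.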
Expected Behavior:
The BN statistics should be consistent across all ranks after the PreciseBN hook.
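One possible direction for a fix, shown here only as a sketch: the helper name is hypothetical, it is not part of fvcore's or detectron2's API, and it would have to be called on every rank after `update_bn_stats`. It all-reduces the BN buffers with batch-size weights so all ranks end with identical statistics:

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def sync_bn_stats(model: nn.Module, local_batch_size: int) -> None:
    """Average BN running statistics across DDP ranks, weighting each rank
    by its local batch size. Sketch only: not fvcore's API, and the
    variance averaging ignores the spread between per-rank means
    (an approximation)."""
    if not (dist.is_available() and dist.is_initialized()):
        return  # single process: nothing to synchronize
    # Total sample count across all ranks.
    total = torch.tensor([float(local_batch_size)])
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    total_n = total.item()
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            for buf in (module.running_mean, module.running_var):
                weighted = buf * local_batch_size
                dist.all_reduce(weighted, op=dist.ReduceOp.SUM)
                buf.copy_(weighted / total_n)
```

In a single-process run the function is a no-op, so it is safe to call unconditionally after the PreciseBN pass.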
Environment:
OS: Linux
Python Version: 3.8.0
Framework: Detectron2 v0.6
PyTorch: 2.4.1+cu124
GPU: NVIDIA GeForce RTX 4090 (8 GPUs)
CUDA: Version 12.4
Other dependencies:
fvcore: 0.1.5.post20221221
iopath: 0.1.9
Command to Collect Full Environment Details:
```bash
wget -nc -q https://github.com/facebookresearch/detectron2/raw/main/detectron2/utils/collect_env.py
python collect_env.py
```
Note:
No private dataset is required to reproduce this bug. All required resources are publicly available.