Commit 5aa65f6
add gpu_health_check for communication test (flagos-ai#1051)
### PR Category
Others
### PR Types
New Features
### PR Description
- Integrate healthy node detection function:
- Tensor parallel communication testing
- Data parallel communication testing
- Pipeline parallel communication testing
- GPU hardware validation
- Computation capability verification
- Usage:
```
python run.py \
--config-path ./examples/aquila/conf \
--config-name train \
train.data.data_path=/root/FlagScale/data/pile_wikipedia_demo \
+experiment.runner.enable_gpu_health_check=true \
action=run
```
```
============================================================
COMPREHENSIVE GPU HEALTH CHECK
============================================================
Configuration:
• World Size: 8
• Tensor Parallel Size: 1
• Pipeline Parallel Size: 1
• Backend: nccl
• Timeout: 5 minutes
• Need Distributed: True
============================================================
Multi-process distributed mode detected
Initializing distributed environment...
...
============================================================
PHASE 1: PARALLEL COMMUNICATION TESTING
============================================================
...
Tensor parallel communication test completed successfully
✓ tensor_parallel: PASSED
Testing data parallel communication (World size: 8)
...
Data parallel communication test completed successfully
✓ data_parallel: PASSED
Testing pipeline parallel communication (PP size: 1)
...
Data parallel communication test completed successfully
✓ data_parallel: PASSED
Testing pipeline parallel communication (PP size: 1)
...
Pipeline parallel communication test completed successfully
✓ pipeline_parallel: PASSED
Parallel communication testing phase completed
============================================================
============================================================
PHASE 2: GPU HARDWARE TESTING
============================================================
Testing GPU hardware
=== Checking GPU 0: NVIDIA A800-SXM4-80GB ===
Current GPU temperature: 38°C
Power usage: 79.67W / 400.00W
Total memory: 81920.00 MB
Used memory: 6309.62 MB
GPU 0 memory utilization rate: 7.70%
...
✓ gpu_hardware: PASSED
GPU hardware testing phase completed
============================================================
============================================================
PHASE 3: GPU COMPUTATION TESTING
============================================================
Testing Float calculation...
Float calculation passed
Testing Double calculation...
Double calculation passed
Testing Half calculation...
Half calculation passed
Testing Endurance test (60s)...
Endurance test (60s) passed
Testing ECC Error Detection...
ECC Error Detection: No errors detected
✓ ecc_error: PASSED
✓ computation: PASSED
GPU computation testing phase completed
============================================================
============================================================
ALL TEST PHASES COMPLETED
============================================================
============================================================
GPU HEALTH CHECK SUMMARY
============================================================
✓ Tensor Parallel: PASSED
✓ Data Parallel: PASSED
✓ Pipeline Parallel: PASSED
✓ Gpu Hardware: PASSED
✓ Computation: PASSED
✓ Ecc Error: PASSED
Results: 6 passed, 0 failed, 0 skipped out of 6 total
🎉 All GPU health checks PASSED!
============================================================
```
for multiple machine:
```
[2025-10-23 15:02:04,387 FlagScale logger.py:25 INFO] Starting GPU health check before training setup...
[2025-10-23 15:02:04,387 FlagScale logger.py:25 INFO] Running GPU health check across 2 nodes
[2025-10-23 15:02:04,388 FlagScale logger.py:25 INFO] Checking node 0 (10.1.15.141) with 2 GPUs
[2025-10-23 15:02:04,388 FlagScale logger.py:25 INFO] Running GPU health check on 10.1.15.141 (node_rank=0)
[2025-10-23 15:02:04,528 FlagScale logger.py:25 INFO] Checking node 1 (10.1.15.237) with 2 GPUs
[2025-10-23 15:02:04,528 FlagScale logger.py:25 INFO] Running GPU health check on 10.1.15.237 (node_rank=1)
[2025-10-23 15:02:04,655 FlagScale logger.py:25 INFO] GPU health check passed on all nodes
[2025-10-23 15:02:04,655 FlagScale logger.py:25 INFO] GPU health check passed successfully!
[2025-10-23 15:02:04,655 FlagScale logger.py:25 INFO] Proceeding with training script generation...
```
---------
Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>1 parent 4d1059a commit 5aa65f6
4 files changed
Lines changed: 897 additions & 3 deletions
0 commit comments