Skip to content

Commit 5aa65f6

Browse files
add gpu_health_check for communication test (flagos-ai#1051)
### PR Category Others ### PR Types New Features ### PR Description - Integrate healthy node detection function: - Tensor parallel communication testing - Data parallel communication testing - Pipeline parallel communication testing - GPU hardware validation - Computation capability verification - Usage: ``` python run.py \ --config-path ./examples/aquila/conf \ --config-name train \ train.data.data_path=/root/FlagScale/data/pile_wikipedia_demo \ +experiment.runner.enable_gpu_health_check=true \ action=run ``` ``` ============================================================ COMPREHENSIVE GPU HEALTH CHECK ============================================================ Configuration: • World Size: 8 • Tensor Parallel Size: 1 • Pipeline Parallel Size: 1 • Backend: nccl • Timeout: 5 minutes • Need Distributed: True ============================================================ Multi-process distributed mode detected Initializing distributed environment... ... ============================================================ PHASE 1: PARALLEL COMMUNICATION TESTING ============================================================ ... Tensor parallel communication test completed successfully ✓ tensor_parallel: PASSED Testing data parallel communication (World size: 8) ... Data parallel communication test completed successfully ✓ data_parallel: PASSED Testing pipeline parallel communication (PP size: 1) ... Data parallel communication test completed successfully ✓ data_parallel: PASSED Testing pipeline parallel communication (PP size: 1) ... Pipeline parallel communication test completed successfully ✓ pipeline_parallel: PASSED Parallel communication testing phase completed ============================================================ ============================================================ PHASE 2: GPU HARDWARE TESTING ============================================================ Testing GPU hardware === Checking GPU 0: NVIDIA A800-SXM4-80GB === Current GPU temperature: 38°C Power usage: 79.67W / 400.00W Total memory: 81920.00 MB Used memory: 6309.62 MB GPU 0 memory utilization rate: 7.70% ... ✓ gpu_hardware: PASSED GPU hardware testing phase completed ============================================================ ============================================================ PHASE 3: GPU COMPUTATION TESTING ============================================================ Testing Float calculation... Float calculation passed Testing Double calculation... Double calculation passed Testing Half calculation... Half calculation passed Testing Endurance test (60s)... Endurance test (60s) passed Testing ECC Error Detection... ECC Error Detection: No errors detected ✓ ecc_error: PASSED ✓ computation: PASSED GPU computation testing phase completed ============================================================ ============================================================ ALL TEST PHASES COMPLETED ============================================================ ============================================================ GPU HEALTH CHECK SUMMARY ============================================================ ✓ Tensor Parallel: PASSED ✓ Data Parallel: PASSED ✓ Pipeline Parallel: PASSED ✓ Gpu Hardware: PASSED ✓ Computation: PASSED ✓ Ecc Error: PASSED Results: 6 passed, 0 failed, 0 skipped out of 6 total 🎉 All GPU health checks PASSED! ============================================================ ``` for multiple machine: ``` [2025-10-23 15:02:04,387 FlagScale logger.py:25 INFO] Starting GPU health check before training setup... [2025-10-23 15:02:04,387 FlagScale logger.py:25 INFO] Running GPU health check across 2 nodes [2025-10-23 15:02:04,388 FlagScale logger.py:25 INFO] Checking node 0 (10.1.15.141) with 2 GPUs [2025-10-23 15:02:04,388 FlagScale logger.py:25 INFO] Running GPU health check on 10.1.15.141 (node_rank=0) [2025-10-23 15:02:04,528 FlagScale logger.py:25 INFO] Checking node 1 (10.1.15.237) with 2 GPUs [2025-10-23 15:02:04,528 FlagScale logger.py:25 INFO] Running GPU health check on 10.1.15.237 (node_rank=1) [2025-10-23 15:02:04,655 FlagScale logger.py:25 INFO] GPU health check passed on all nodes [2025-10-23 15:02:04,655 FlagScale logger.py:25 INFO] GPU health check passed successfully! [2025-10-23 15:02:04,655 FlagScale logger.py:25 INFO] Proceeding with training script generation... ``` --------- Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>
1 parent 4d1059a commit 5aa65f6

4 files changed

Lines changed: 897 additions & 3 deletions

File tree

0 commit comments

Comments
 (0)