add straggler detection integration for the training#1179

Open
wanglei19991004 wants to merge 2 commits into flagos-ai:main from wanglei19991004:straggler_new

Conversation

@wanglei19991004
Contributor

PR Category

Train

PR Types

New Features

PR Description

Straggler Detection is used to monitor the performance of each node and GPU during distributed training, and to detect whether there are any “stragglers” (i.e., workers that are significantly slower than others). If a GPU or node runs noticeably slower, it will become a bottleneck and slow down the overall distributed training process.

Quick Start

Single-node example (using GPT-2)

When running run.py, enable the Straggler Detection feature by overriding system configuration parameters:

python run.py \
  --config-path ./examples/gpt2/conf \
  --config-name train_single \
  action=run \
  +train.system.enable_straggler_detection=true \
  +train.system.straggler_profiling_interval=5 \
  +train.system.straggler_report_interval=10 \
  +train.system.straggler_log_dir=./outputs_gpt2/logs/straggler

Multi-node example (2 nodes × 4 GPUs)

The multi-node setup mirrors the single-node case; you additionally specify parameters such as the hostfile and the master node address in the runner configuration:

python run.py \
  --config-path ./examples/gpt2/conf \
  --config-name train_single \
  action=run \
  +train.system.enable_straggler_detection=true \
  +train.system.straggler_profiling_interval=5 \
  +train.system.straggler_report_interval=10 \
  +train.system.straggler_log_dir=./outputs_gpt2/logs/straggler
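
The hostfile and master-address settings live in the runner section of the YAML rather than on the command line. A sketch of what that section might look like for 2 nodes × 4 GPUs (the field names below are illustrative assumptions, not verified against the actual schema; consult the runner configuration shipped under examples/ for the exact keys):

```yaml
# Illustrative runner settings for 2 nodes x 4 GPUs.
# Field names are assumptions; check the example configs for the real schema.
experiment:
  runner:
    nnodes: 2
    nproc_per_node: 4
    hostfile: /path/to/hostfile   # one line per node
    master_addr: <master-node-ip>
```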

Note

The program automatically identifies the physical machine where each rank is located (by retrieving the node name via os.environ.get('HOSTNAME') or socket.gethostname()).
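
In code, that lookup amounts to:

```python
import os
import socket

# Resolve the node name the same way the detector does:
# prefer the HOSTNAME environment variable, fall back to the socket hostname.
node_name = os.environ.get("HOSTNAME") or socket.gethostname()
print(node_name)
```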

Core Configuration Parameters

These parameters can be overridden via the command line (as shown above) or modified in the YAML configuration file:

  • enable_straggler_detection (bool): Whether to enable Straggler detection (default: False).
  • straggler_profiling_interval (int): Profiling interval, i.e., how many steps between recording runtime statistics (default: 10).
  • straggler_report_interval (int): Reporting interval, i.e., how many steps between generating and saving a statistical analysis report (default: 100).
  • straggler_threshold (float): Relative latency threshold for identifying a straggler. For example, 1.5 means a node is considered a straggler if it is 50% slower or more than the fastest node (default: 1.5).
  • straggler_log_dir (str): Directory where Straggler JSON report files are saved.
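
To make the threshold concrete, here is a minimal sketch of how a relative-latency threshold of 1.5 flags ranks. This is illustrative only, not the detector's actual implementation:

```python
def find_stragglers(avg_times_ms, threshold=1.5):
    """Return ranks whose average section time is at least `threshold` times
    the fastest rank's time (e.g. 1.5 = 50% slower or more)."""
    fastest = min(avg_times_ms.values())
    return sorted(rank for rank, t in avg_times_ms.items() if t >= threshold * fastest)

# Illustrative per-rank forward_backward averages in milliseconds.
times = {0: 300.0, 1: 310.0, 2: 480.0, 3: 295.0}
print(find_stragglers(times))  # [2]: 480 ms >= 1.5 x 295 ms
```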

Interpreting Output Reports

When the configured straggler_report_interval is reached:

  • A text-based report is printed to the console on Rank 0.
  • A JSON report file is written to the specified straggler_log_dir.

  1. Console Output Example
=== Straggler Report at Step 10 ===

✓ No stragglers detected.

Section Timings (ms):

  optimizer:
    Min: 9.17ms, Max: 10.12ms, Avg: 9.55ms, Slowdown: 1.10x
    Rank 0: 9.98ms
    Rank 1: 10.12ms
    ...

  forward_backward:
    Min: 281.94ms, Max: 343.19ms, Avg: 314.23ms, Slowdown: 1.22x
    Rank 0: 343.19ms
    Rank 1: 341.51ms
    ...

GPU Performance Scores (higher = faster):
  Rank 0 (p-phy-zy-daxing-kt-lc-a800-node-prod-15-128): 2.9138
  Rank 1 (p-phy-zy-daxing-kt-lc-a800-node-prod-15-128): 2.9282
  ...

  2. JSON Report Format

A file such as straggler_report_step_10.json contains detailed aggregated data:

{
  "step": 10,
  "section_scores": {
    "optimizer": {
      "0": 0.004562,
      "1": 0.004560
    },
    "forward_backward": {
      "0": 0.391165,
      "1": 0.391496
    }
  },
  "comm_stats": {},
  "gpu_scores": {
    "0": 2.556464,
    "1": 2.554298
  },
  "straggler_ranks": [],
  "node_names": {
    "0": "p-phy-zy-daxing-kt-lc-a800-node-prod-15-128",
    "1": "p-phy-zy-daxing-kt-lc-a800-node-prod-15-128"
  },
  "timestamp": 1764885816.7089095
}

JSON Field Descriptions

  • step: The iteration/step index at which the report is generated.
  • section_scores: Average execution time (in seconds) for each rank (i.e., GPU/node) across specific code sections (e.g., forward_backward, optimizer).
  • gpu_scores: Aggregated GPU performance scores, computed based on inverse execution times of different sections.
    Higher values indicate better performance.
  • straggler_ranks: Ranks identified as severe stragglers based on the threshold. An empty list ([]) means no stragglers were detected; a list such as [2, 7] means those ranks are significantly slower than the rest.
  • node_names: Mapping from rank to machine hostname. This is critical for diagnosing issues in multi-node multi-GPU training, as it pinpoints which server a problematic rank resides on.
  • timestamp: Timestamp when the report was generated.
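
Because the reports are plain JSON files, post-hoc analysis is straightforward. A minimal sketch (field names taken from the example above; the inline string stands in for a real straggler_report_step_*.json file):

```python
import json

# Inline stand-in for a saved report file such as straggler_report_step_10.json.
report = json.loads("""{
  "step": 10,
  "gpu_scores": {"0": 2.556464, "1": 2.554298},
  "straggler_ranks": [],
  "node_names": {"0": "node-a", "1": "node-a"}
}""")

if report["straggler_ranks"]:
    # JSON object keys are strings, so index node_names with str(rank).
    for rank in report["straggler_ranks"]:
        print(f"rank {rank} on {report['node_names'][str(rank)]} is a straggler")
else:
    slowest = min(report["gpu_scores"], key=report["gpu_scores"].get)
    print(f"no stragglers at step {report['step']}; slowest rank: {slowest}")
```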

Custom Monitoring (for Advanced / Secondary Development)

In addition to the default monitored sections (forward_backward and optimizer), users can instrument custom code regions for profiling.

Use case:
If you implement complex logic (e.g., custom loss functions or data preprocessing) and want to analyze its performance across GPUs or detect stragglers, you can manually instrument the code as follows:

from flagscale.runner.straggler import get_fs_straggler_detector
import time

fs_straggler = get_fs_straggler_detector()

# Start
if fs_straggler is not None and fs_straggler.should_profile():
    my_section_start = time.perf_counter()
    
# ... your code logic ...

# Stop & Profiling
if fs_straggler is not None and fs_straggler.should_profile():
    my_section_end = time.perf_counter()
    fs_straggler.record_section(
        name="my_custom_processing",
        cpu_time=my_section_end - my_section_start
    )

Alternatively, you can use lower-level utilities such as SectionContext or decorators.

The recorded section data will be collected and reported when the report_interval is reached.
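
If you prefer not to repeat the start/stop boilerplate, the pattern above can be wrapped in a small context manager. This is a sketch, not the SectionContext utility itself; it assumes only the should_profile() / record_section(name=..., cpu_time=...) interface shown above, and the stub detector exists purely to keep the example self-contained:

```python
import time
from contextlib import contextmanager

@contextmanager
def profile_section(detector, name):
    """Time a code region and record it on a detector exposing
    should_profile() and record_section(name=..., cpu_time=...).
    Sketch only; not part of the flagscale API."""
    if detector is not None and detector.should_profile():
        start = time.perf_counter()
        try:
            yield
        finally:
            detector.record_section(name=name, cpu_time=time.perf_counter() - start)
    else:
        yield

# Stub detector for demonstration; in training code you would pass
# the object returned by get_fs_straggler_detector() instead.
class _StubDetector:
    def __init__(self):
        self.sections = {}
    def should_profile(self):
        return True
    def record_section(self, name, cpu_time):
        self.sections[name] = cpu_time

det = _StubDetector()
with profile_section(det, "my_custom_processing"):
    sum(range(1000))  # ... your code logic ...
print("my_custom_processing" in det.sections)  # True
```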
