Add perf_monitor integration for the training#1180

Open
wanglei19991004 wants to merge 2 commits into flagos-ai:main from wanglei19991004:perf_monitor

Conversation

@wanglei19991004
Contributor

PR Category

Train

PR Types

New Features

PR Description

This feature extends the original distributed training framework with a comprehensive stability monitoring and performance tracing system.

Quick Start

When launching run.py, you can enable the system by adding configuration overrides via command-line flags:

Example launch command:

python run.py \
  --config-path ./examples/gpt2/conf \
  --config-name train_single \
  action=run \
  +experiment.runner.enable_gpu_health_check=true \
  +experiment.runner.enable_monitoring=true \
  train.data.data_path=../pile_wikipedia_demo/pile_wikipedia_demo

Key execution notes:

  • After running the above run.py command, the main console prints INFO-level logs, including the actual startup script used to launch training on each node (e.g., bash outputs/.../host_0_localhost_run.sh).
  • You must manually copy and execute the generated .sh startup script; only after it runs are the training process and the background monitoring services actually started.
  • Once started, monitoring and performance logs are written under the corresponding outputs/logs/ directory.
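Since the startup script path must be copied from the console output, a small helper can pull it out automatically. This is a hedged sketch: the exact INFO log format is an assumption based on the example above (`bash outputs/.../host_0_localhost_run.sh`), not a documented interface.

```python
import re

# Matches "bash <path ending in _run.sh>" in a console log line.
# The log format is assumed from the example in this PR description.
SCRIPT_RE = re.compile(r"bash\s+(\S+_run\.sh)")

def find_startup_scripts(console_log: str) -> list[str]:
    """Extract generated startup-script paths from run.py console output."""
    return SCRIPT_RE.findall(console_log)

sample = "INFO: launch with: bash outputs/demo/host_0_localhost_run.sh"
print(find_startup_scripts(sample))  # ['outputs/demo/host_0_localhost_run.sh']
```

The returned paths can then be executed manually on each node, as the notes above require.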

Performance Analysis (Perf Monitor)

Trigger condition:

Enabled by default, or explicitly activated via the +experiment.runner.enable_monitoring=true override shown in the launch command above.

Working mechanism:

During large-scale model training, the system continuously collects runtime statistics and periodically (or at completion) writes detailed JSON performance reports. This enables intuitive visibility into hardware utilization and training efficiency.
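The collect-and-periodically-flush mechanism described above can be sketched as follows. This is an illustrative model only: the class name, flush interval, and constructor are assumptions, while the record field names follow the example report shown below.

```python
import json
import os
import tempfile

class PerfMonitor:
    """Sketch: accumulate per-iteration stats, periodically write a JSON report."""

    def __init__(self, report_dir: str, flush_every: int = 2):
        self.report_dir = report_dir
        self.flush_every = flush_every  # flush cadence is illustrative
        self.iteration_logs = []

    def log_iteration(self, iteration, tflops, tokens_per_sec, memory_gb):
        self.iteration_logs.append({
            "iteration": iteration,
            "TFLOPS_per_GPU": tflops,
            "tokens_per_sec": tokens_per_sec,
            "memory_GB": memory_gb,
        })
        if len(self.iteration_logs) % self.flush_every == 0:
            self.flush()

    def flush(self) -> str:
        report = {
            "session_info": {"total_iterations": len(self.iteration_logs)},
            "iteration_logs": self.iteration_logs,
        }
        path = os.path.join(self.report_dir, "perf_summary.json")
        with open(path, "w") as f:
            json.dump(report, f, indent=2)
        return path

mon = PerfMonitor(tempfile.mkdtemp())
mon.log_iteration(15, 2.30, 19150.87, 0.73)
mon.log_iteration(16, 2.28, 19000.0, 0.73)  # second record triggers a flush
```

A real implementation would also record wall-clock session times and aggregate final statistics, as in the report below.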

Output artifacts:

Located in:

outputs/logs/perf_monitor/

Example file:
perf_summary_<timestamp>.json

{
  "session_info": {
    "start_time": "...",
    "end_time": "...",
    "total_iterations": 4
  },
  "final_statistics": {
    "avg_tflops_per_gpu": 2.30,
    "avg_step_time_ms": 213.88,
    "peak_memory_gb": 0.73
  },
  "iteration_logs": [
    {
      "iteration": 15,
      "TFLOPS_per_GPU": 2.30,
      "tokens_per_sec": 19150.87,
      "memory_GB": 0.73
    }
  ]
}
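Reading the headline numbers back out of such a report is straightforward. A minimal sketch, assuming only the final_statistics keys shown in the example above (the real file may contain additional fields):

```python
import json

def summarize(report: dict) -> str:
    """Format the headline metrics from a perf_summary report dict."""
    stats = report["final_statistics"]
    return (f"{stats['avg_tflops_per_gpu']:.2f} TFLOPS/GPU, "
            f"{stats['avg_step_time_ms']:.1f} ms/step, "
            f"peak {stats['peak_memory_gb']:.2f} GB")

# Values copied from the example report above.
example = {
    "final_statistics": {
        "avg_tflops_per_gpu": 2.30,
        "avg_step_time_ms": 213.88,
        "peak_memory_gb": 0.73,
    }
}
print(summarize(example))  # 2.30 TFLOPS/GPU, 213.9 ms/step, peak 0.73 GB
```

In practice you would pass the dict from `json.load(open(path))` for a file under outputs/logs/perf_monitor/.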

Metric explanation:

  • final_statistics: aggregated metrics over the monitoring period, chiefly avg_tflops_per_gpu (compute throughput), avg_step_time_ms, and peak_memory_gb (peak memory usage).
  • iteration_logs: per-iteration sampling data, including:
    • samples_per_sec (samples per second)
    • tokens_per_sec (token throughput)
    • and other key optimization indicators, such as TFLOPS_per_GPU and memory_GB in the example above.
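The two throughput metrics are related by the sequence length: tokens_per_sec is samples_per_sec times tokens per sample. A small sketch, where batch_size and seq_len are illustrative values chosen to be consistent with the step time in the example report, not a real training configuration:

```python
def throughput(batch_size: int, seq_len: int, step_time_ms: float):
    """Derive samples/sec and tokens/sec from batch shape and step time."""
    samples_per_sec = batch_size / (step_time_ms / 1000.0)
    tokens_per_sec = samples_per_sec * seq_len
    return samples_per_sec, tokens_per_sec

# With an avg step time of 213.88 ms (from the example report) and an
# assumed 4 x 1024-token batch, tokens_per_sec lands near the reported ~19150.
s, t = throughput(batch_size=4, seq_len=1024, step_time_ms=213.88)
```

This is the sanity check to run if the reported throughput looks off for your batch configuration.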
