Add perf_monitor integration for the training#1180

Open
wanglei19991004 wants to merge 2 commits into flagos-ai:main from wanglei19991004:perf_monitor

Conversation

@wanglei19991004
Contributor

PR Category

Train

PR Types

New Features

PR Description

This feature extends the original distributed training framework with a comprehensive stability monitoring and performance tracing system.

Quick Start

When launching run.py, you can enable the system by adding configuration overrides via command-line flags:

Example launch command:

python run.py \
  --config-path ./examples/gpt2/conf \
  --config-name train_single \
  action=run \
  +experiment.runner.enable_gpu_health_check=true \
  +experiment.runner.enable_monitoring=true \
  train.data.data_path=../pile_wikipedia_demo/pile_wikipedia_demo

Key execution notes:

  • After running the above run.py command, the main console prints INFO-level logs, including the actual startup script used to launch training on each node (e.g., bash outputs/.../host_0_localhost_run.sh).
  • You must manually copy and execute the generated .sh startup script; only after it runs are the training process and the background monitoring services actually started.
  • Once started, monitoring and performance logs are written under the corresponding outputs/logs/ directory.
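Since the startup script path must be copied from the console output, a small helper can pull it out automatically. This is a hedged sketch: the exact INFO log format is an assumption based on the example above (`bash outputs/.../host_0_localhost_run.sh`), not a documented interface.

```python
import re

# Matches "bash <path ending in _run.sh>" in a console log line.
# The log format is assumed from the example in this PR description.
SCRIPT_RE = re.compile(r"bash\s+(\S+_run\.sh)")

def find_startup_scripts(console_log: str) -> list[str]:
    """Extract generated startup-script paths from run.py console output."""
    return SCRIPT_RE.findall(console_log)

sample = "INFO: launch with: bash outputs/demo/host_0_localhost_run.sh"
print(find_startup_scripts(sample))  # ['outputs/demo/host_0_localhost_run.sh']
```

The returned paths can then be executed manually on each node, as the notes above require.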

Performance Analysis (Perf Monitor)

Trigger condition:

Enabled by default, or explicitly activated via the +experiment.runner.enable_monitoring=true override shown in the launch command above.

Working mechanism:

During large-scale model training, the system continuously collects runtime statistics and periodically (or at completion) writes detailed JSON performance reports. This enables intuitive visibility into hardware utilization and training efficiency.
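The collect-and-periodically-flush mechanism described above can be sketched as follows. This is an illustrative model only: the class name, flush interval, and constructor are assumptions, while the record field names follow the example report shown below.

```python
import json
import os
import tempfile

class PerfMonitor:
    """Sketch: accumulate per-iteration stats, periodically write a JSON report."""

    def __init__(self, report_dir: str, flush_every: int = 2):
        self.report_dir = report_dir
        self.flush_every = flush_every  # flush cadence is illustrative
        self.iteration_logs = []

    def log_iteration(self, iteration, tflops, tokens_per_sec, memory_gb):
        self.iteration_logs.append({
            "iteration": iteration,
            "TFLOPS_per_GPU": tflops,
            "tokens_per_sec": tokens_per_sec,
            "memory_GB": memory_gb,
        })
        if len(self.iteration_logs) % self.flush_every == 0:
            self.flush()

    def flush(self) -> str:
        report = {
            "session_info": {"total_iterations": len(self.iteration_logs)},
            "iteration_logs": self.iteration_logs,
        }
        path = os.path.join(self.report_dir, "perf_summary.json")
        with open(path, "w") as f:
            json.dump(report, f, indent=2)
        return path

mon = PerfMonitor(tempfile.mkdtemp())
mon.log_iteration(15, 2.30, 19150.87, 0.73)
mon.log_iteration(16, 2.28, 19000.0, 0.73)  # second record triggers a flush
```

A real implementation would also record wall-clock session times and aggregate final statistics, as in the report below.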

Output artifacts:

Located in:

outputs/logs/perf_monitor/

Example file:
perf_summary_<timestamp>.json

{
  "session_info": {
    "start_time": "...",
    "end_time": "...",
    "total_iterations": 4
  },
  "final_statistics": {
    "avg_tflops_per_gpu": 2.30,
    "avg_step_time_ms": 213.88,
    "peak_memory_gb": 0.73
  },
  "iteration_logs": [
    {
      "iteration": 15,
      "TFLOPS_per_GPU": 2.30,
      "tokens_per_sec": 19150.87,
      "memory_GB": 0.73
    }
  ]
}
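Reading the headline numbers back out of such a report is straightforward. A minimal sketch, assuming only the final_statistics keys shown in the example above (the real file may contain additional fields):

```python
import json

def summarize(report: dict) -> str:
    """Format the headline metrics from a perf_summary report dict."""
    stats = report["final_statistics"]
    return (f"{stats['avg_tflops_per_gpu']:.2f} TFLOPS/GPU, "
            f"{stats['avg_step_time_ms']:.1f} ms/step, "
            f"peak {stats['peak_memory_gb']:.2f} GB")

# Values copied from the example report above.
example = {
    "final_statistics": {
        "avg_tflops_per_gpu": 2.30,
        "avg_step_time_ms": 213.88,
        "peak_memory_gb": 0.73,
    }
}
print(summarize(example))  # 2.30 TFLOPS/GPU, 213.9 ms/step, peak 0.73 GB
```

In practice you would pass the dict from `json.load(open(path))` for a file under outputs/logs/perf_monitor/.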

Metric explanation:

  • final_statistics: aggregated metrics over the monitoring period, chiefly avg_tflops_per_gpu (compute throughput), avg_step_time_ms, and peak_memory_gb (peak memory usage).
  • iteration_logs: per-iteration sampling data, including:
    • samples_per_sec (samples per second)
    • tokens_per_sec (token throughput)
    • and other key optimization indicators, such as TFLOPS_per_GPU and memory_GB in the example above.
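The two throughput metrics are related by the sequence length: tokens_per_sec is samples_per_sec times tokens per sample. A small sketch, where batch_size and seq_len are illustrative values chosen to be consistent with the step time in the example report, not a real training configuration:

```python
def throughput(batch_size: int, seq_len: int, step_time_ms: float):
    """Derive samples/sec and tokens/sec from batch shape and step time."""
    samples_per_sec = batch_size / (step_time_ms / 1000.0)
    tokens_per_sec = samples_per_sec * seq_len
    return samples_per_sec, tokens_per_sec

# With an avg step time of 213.88 ms (from the example report) and an
# assumed 4 x 1024-token batch, tokens_per_sec lands near the reported ~19150.
s, t = throughput(batch_size=4, seq_len=1024, step_time_ms=213.88)
```

This is the sanity check to run if the reported throughput looks off for your batch configuration.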
