From 1d389590a886a7b067bbd46264ab2e6fb88865d5 Mon Sep 17 00:00:00 2001
From: sonbol_y
Date: Thu, 11 Dec 2025 10:13:17 -0600
Subject: [PATCH 1/2] Update GEMM analysis documentation

- Reorganize README.md for better clarity
- Add rocprof_guide.md with profiling workflow
---
 scripts/gemm_analysis/README.md        | 285 ++++++++-----------------
 scripts/gemm_analysis/rocprof_guide.md | 133 ++++++++++++
 2 files changed, 224 insertions(+), 194 deletions(-)
 create mode 100644 scripts/gemm_analysis/rocprof_guide.md

diff --git a/scripts/gemm_analysis/README.md b/scripts/gemm_analysis/README.md
index dbb5d6a..69998e6 100644
--- a/scripts/gemm_analysis/README.md
+++ b/scripts/gemm_analysis/README.md
@@ -1,13 +1,14 @@
-# GEMM Sweep Profiling
+# GEMM Sweep Profiling and Analysis
 
 Profile GEMM kernel performance across multiple NCCL configurations.
 
 ## Prerequisites
 
 - Docker with ROCm support
-- TraceLens installed
+- TraceLens installed (optional for some scripts)
+- Python packages: pandas, openpyxl, matplotlib, seaborn
 
-## Pipeline Steps
+## Workflow
 
 ### 1. Build Docker Container
 
@@ -20,6 +21,7 @@ docker exec -it training-overlap-bugs-rocm70_9-1 bash
 
 ### 2. Run Training Sweep
 
+Basic sweep:
 ```bash
 bash scripts/gemm_analysis/run_train_various_channels.sh \
   --channels 28,42,56,70 \
@@ -27,158 +29,38 @@ bash scripts/gemm_analysis/run_train_various_channels.sh \
   --config config/gemm_overlap/gemm_test_1.yaml
 ```
 
-#### rocprof Tracing Options
-
-Add rocprofv3 kernel tracing to capture detailed GEMM performance:
-
-Simple YAML-based tracing (recommended)
----------------------------------------
-
-Use the `rocprof_cu_only.yaml` configuration file for CU utilization metrics:
-
-```yaml
-jobs:
-  - kernel_include_regex: "(gemm|Cijk_.*)"  # pattern for kernels to trace
-    kernel_trace: true  # enable kernel tracing
-    stats: true  # timing statistics only (not CU utilization)
-    output_format: [json, csv]  # add perfetto for Chrome tracing
-    sys_trace: false
-    advanced_thread_trace: false  # leave false unless ATT decoder is installed
-```
-
-Run the sweep with the CU-only YAML:
+With rocprof tracing:
 ```bash
 bash scripts/gemm_analysis/run_train_various_channels.sh \
   --rocprof \
   --rocprof-input scripts/gemm_analysis/rocprof_cu_only.yaml \
-  --channels 28,42,56 --threads 256,512 \
+  --channels 28,42,56,70 \
+  --threads 256,512 \
   --config config/gemm_overlap/gemm_test_1.yaml
 ```
-Notes:
-- Kernel filtering/stats come from the YAML. The current rocprofv3 build ignores CLI kernel filters, so use the YAML to include/exclude kernels.
-- Remove `advanced_thread_trace` or keep it `false` unless the ATT decoder debs are installed.
-- **Important**: `stats: true` only collects timing statistics, NOT CU utilization metrics.
-- **Output Files**: rocprof generates 5 files per rank/process:
-  - `PID_agent_info.csv`: Hardware information about CPUs and GPUs
-  - `PID_counter_collection.csv`: **Main file with CU utilization metrics** (focus on this)
-  - `PID_kernel_trace.csv`: Kernel execution timeline data
-  - `PID_results.json`: Chrome trace format for visualization
-  - `PID_results.csv`: Summary statistics
-
-**Analyzing Unique GEMM Kernels (counter_collection.csv columns):**
-- `Grid_Size`: Total number of workgroups in the kernel launch
-- `Kernel_Name`: Name of the GEMM kernel (e.g., Cijk_Alik_Bljk_SB_MT128x128x32_MI32x32x1x2)
-- `Workgroup_Size`: Number of work-items per workgroup
-- `LDS_Block_Size`: Local Data Share memory allocation per workgroup
-- `Scratch_Size`: Private memory allocation per work-item
-- `VGPR_Count`: Vector General Purpose Registers used
-- `Accum_VGPR_Count`: Accumulator VGPRs (for matrix operations)
-- `SGPR_Count`: Scalar General Purpose Registers used
-- `Counter_Name`: Performance counter being measured (e.g., SQ_BUSY_CU_CYCLES)
-- `Counter_Value`: Value of the performance counter
-- `Start_Timestamp` / `End_Timestamp`: Kernel execution timing
-
-**Key Options:**
-- `--rocprof` : Enable rocprofv3 tracing
-- `--stats` : Include timing statistics (not CU utilization)
-- `--channels VALUES` : Comma-separated NCCL channel values
-- `--threads VALUES` : Comma-separated thread values
-
-**Output:** Traces saved to `rocprof_traces/` in each run directory.
-
-**Key Performance Counters (found in counter_collection.csv files):**
-- `SQ_BUSY_CU_CYCLES`: Percentage of time CUs are active (CU utilization)
-- `SQ_WAVES`: Number of active wavefronts (occupancy indicator)
-- `SQ_INSTS_MFMA`: Matrix FMA instructions (critical for GEMM performance)
-- `SQ_INSTS_VALU`: Vector ALU instructions (general compute)
+
+For rocprof configuration details, see [rocprof_guide.md](rocprof_guide.md).
 
 ### 3. Generate TraceLens Reports
 
 ```bash
-bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_20251124_222204
+bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_YYYYMMDD_HHMMSS
 ```
 
 ### 4. Extract Top GEMM Kernels
 
 ```bash
 python scripts/gemm_analysis/analyze_gemm_reports.py \
-  --base-path experiments/sweep_20251124_222204/tracelens_analysis \
+  --base-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis \
   --threads 256 512 \
   --channels 28 42 56 70 \
   --ranks 0 1 2 3 4 5 6 7 \
   --top-k 5
 ```
 
-This generates `top5_gemm_kernels_time_variance.csv` with the kernels showing highest time variance across runs.
-
-## Output Structure
-
-```
-experiments/sweep_YYYYMMDD_HHMMSS/
-├── 256thread/
-│   └── nccl_XXchannels/
-│       ├── torch_profiler/rank*/
-│       ├── rocprof_traces/  # if --rocprof flag used
-│       │   ├── PID_agent_info.csv  # Hardware info for each rank
-│       │   ├── PID_counter_collection.csv  # CU utilization metrics (main focus)
-│       │   ├── PID_kernel_trace.csv  # Kernel execution timeline
-│       │   ├── PID_results.json  # Chrome trace format
-│       │   └── PID_results.csv  # Summary statistics
-│       └── run_output.log
-├── 512thread/
-│   └── nccl_XXchannels/
-└── tracelens_analysis/
-    ├── 256thread/
-    │   ├── individual_reports/perf_*ch_rank*.xlsx
-    │   └── collective_reports/collective_*ch.xlsx
-    ├── 512thread/
-    └── top5_gemm_kernels_time_variance.csv
-```
-
-## Quick Reference
+Output: `top5_gemm_kernels_time_variance.csv`
 
-```bash
-# Run complete sweep
-bash scripts/gemm_analysis/run_train_various_channels.sh \
-  --channels 28,42,56,70 \
-  --threads 256,512 \
-  --config config/gemm_overlap/gemm_test_1.yaml
-
-# Run with rocprof tracing (all kernels with stats)
-bash scripts/gemm_analysis/run_train_various_channels.sh \
-  --rocprof --stats \
-  --channels 28,42,56,70 \
-  --threads 256,512
-
-# Run with rocprof using CU-only YAML (recommended)
-bash scripts/gemm_analysis/run_train_various_channels.sh \
-  --rocprof --stats \
-  --rocprof-input scripts/gemm_analysis/rocprof_cu_only.yaml \
-  --channels 28,42,56,70 \
-  --threads 256,512
-
-# Generate TraceLens reports
-bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_YYYYMMDD_HHMMSS
-
-# Extract top GEMM kernels
-python scripts/gemm_analysis/analyze_gemm_reports.py \
-  --base-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis \
-  --threads 256 512 --channels 28 42 56 70 --top-k 5
-```
-# GEMM Visualization and Reporting
-
-Visualization, overlap analysis, and reporting tools for GEMM kernel performance data.
-
-## Prerequisites
-
-- Python packages: pandas, openpyxl, matplotlib, seaborn
-- Completed GEMM sweep profiling with generated `top5_gemm_kernels_time_variance.csv`
-
-## Pipeline Steps
-
-### 1. Generate Variance Plots
-
-Create comprehensive visualization of GEMM kernel variance:
+### 5. Generate Variance Plots
 
 ```bash
 python scripts/gemm_analysis/plot_gemm_variance.py \
@@ -186,14 +68,9 @@ python scripts/gemm_analysis/plot_gemm_variance.py \
   --output-dir experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/plots
 ```
 
-Generates:
-- Box plots by thread count, channel count, and rank
-- Violin plots showing distribution
-- Thread-channel interaction plots
-
-### 2. Add Timestamp Information
+Generates box plots, violin plots, and interaction plots.
 
-Enhance variance data with kernel execution timestamps:
+### 6. Add Timestamp Information
 
 ```bash
 python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
@@ -203,9 +80,7 @@ python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
 
 Output: `top5_gemm_kernels_time_variance_with_timestamps.csv`
 
-### 3. Analyze Collective Overlap
-
-Identify NCCL collective operations overlapping with GEMM kernels:
+### 7. Analyze Collective Overlap
 
 ```bash
 python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
@@ -215,61 +90,53 @@ python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
 
 Output: `top5_gemm_kernels_time_variance_with_collective_overlap.csv`
 
-### 4. Create Comparison HTML Report
-
-Generate side-by-side comparison of two experiment sweeps:
+### 8. Process GPU Timeline
 
 ```bash
-python scripts/gemm_analysis/create_embeded_html_report.py \
-  --sweep1 experiments/sweep_20251121_155219 \
-  --sweep2 experiments/sweep_20251124_222204 \
-  --label1 "Base ROCm" \
-  --label2 "ROCm 7.0" \
-  --output sweep_comparison.html
+python scripts/gemm_analysis/process_gpu_timeline.py \
+  --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
 ```
 
-Creates self-contained HTML with embedded images.
+Output: `gpu_timeline_all_configs_mean.xlsx`
 
-Note: Currently supports pairwise (2-sweep) comparison. For comparing multiple sweeps,
-run multiple pairwise comparisons or aggregate data using the process_gpu_timeline.py
-and process_comms.py scripts.
-
-## Additional Analysis Tools
-
-### Process GPU Timeline
-
-Aggregate GPU timeline data across all ranks and configurations:
+### 9. Process NCCL Communication Data
 
 ```bash
-python scripts/gemm_analysis/process_gpu_timeline.py \
+python scripts/gemm_analysis/process_comms.py \
   --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
 ```
 
-Output: `gpu_timeline_all_configs_mean.xlsx` with multiple sheets:
-- All_Data - Complete dataset
-- Pivot_Time_ms - Matrix view of time
-- Pivot_Percent - Matrix view of percentages
-- Summary_By_Config - Key metrics per configuration
-
-### Process NCCL Communication Data
+Output: `nccl_master_all_configs.xlsx`
 
-Extract and aggregate NCCL collective operation data:
+### 10. Create Comparison HTML Report
 
 ```bash
-python scripts/gemm_analysis/process_comms.py \
-  --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
+python scripts/gemm_analysis/create_embeded_html_report.py \
+  --sweep1 experiments/sweep_20251121_155219 \
+  --sweep2 experiments/sweep_20251124_222204 \
+  --label1 "Base ROCm" \
+  --label2 "ROCm 7.0" \
+  --output sweep_comparison.html
 ```
 
-Output: `nccl_master_all_configs.xlsx` and `.csv` with:
-- Communication latency statistics
-- Bandwidth metrics
-- Time skew analysis
+Creates self-contained HTML with embedded images.
 
 ## Output Structure
 
 ```
 experiments/sweep_YYYYMMDD_HHMMSS/
+├── 256thread/
+│   └── nccl_XXchannels/
+│       ├── torch_profiler/rank*/
+│       ├── rocprof_traces/  # if --rocprof used
+│       └── run_output.log
+├── 512thread/
+│   └── nccl_XXchannels/
 └── tracelens_analysis/
+    ├── 256thread/
+    │   ├── individual_reports/perf_*ch_rank*.xlsx
+    │   └── collective_reports/collective_*ch.xlsx
+    ├── 512thread/
     ├── top5_gemm_kernels_time_variance.csv
     ├── top5_gemm_kernels_time_variance_with_timestamps.csv
     ├── top5_gemm_kernels_time_variance_with_collective_overlap.csv
@@ -283,28 +150,58 @@ experiments/sweep_YYYYMMDD_HHMMSS/
         └── variance_thread_channel_interaction.png
 ```
 
-## Quick Reference
+## Script Reference
+
+### Core Pipeline Scripts
+
+- `run_train_various_channels.sh` - Execute training sweep across configurations
+- `run_tracelens_analysis.sh` - Generate TraceLens Excel reports
+- `analyze_gemm_reports.py` - Extract top GEMM kernels by variance
+- `plot_gemm_variance.py` - Generate visualization plots
+
+### Enhancement Scripts
+
+- `enhance_gemm_variance_with_timestamps.py` - Add kernel execution timestamps
+- `gemm_report_with_collective_overlap.py` - Identify NCCL overlap with GEMM
+- `process_gpu_timeline.py` - Aggregate GPU timeline across configurations
+- `process_comms.py` - Extract NCCL communication statistics
+
+### Reporting Scripts
+- `create_embeded_html_report.py` - Generate HTML comparison reports (pairwise)
+
+## Regression Testing
+
+The pipeline includes automated regression tests to ensure script changes don't break functionality.
+
+Setup and run tests:
 ```bash
-# Generate all visualizations
-python scripts/gemm_analysis/plot_gemm_variance.py \
-  --csv-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance.csv
+source ~/venvs/aorta/bin/activate
+pytest tests/gemm_analysis/test_gemm_regression.py -v
+```
 
-# Add timestamps and analyze overlap
-python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
-  --input-csv experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
-  --base-path experiments/sweep_YYYYMMDD_HHMMSS
+For details on test architecture and adding new tests, see [tests/gemm_analysis/README.md](../../tests/gemm_analysis/README.md).
 
-python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
-  --input-csv experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \
-  --tracelens-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis
+## Troubleshooting
 
-# Process Excel reports
-python scripts/gemm_analysis/process_gpu_timeline.py --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
-python scripts/gemm_analysis/process_comms.py --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
+### TraceLens Not Installed
 
-# Create comparison report
-python scripts/gemm_analysis/create_embeded_html_report.py \
-  --sweep1 experiments/sweep1 --sweep2 experiments/sweep2 \
-  --label1 "Baseline" --label2 "Optimized" --output comparison.html
+If TraceLens is not available, analysis scripts will skip TraceLens-dependent processing.
+
+### Missing Dependencies
+
+```bash
+pip install pandas openpyxl matplotlib seaborn
 ```
+
+### Import Errors
+
+Ensure virtual environment is activated:
+```bash
+source ~/venvs/aorta/bin/activate
+```
+
+## Additional Documentation
+
+- [rocprof_guide.md](rocprof_guide.md) - Detailed rocprof configuration and performance counters
+- [tests/gemm_analysis/README.md](../../tests/gemm_analysis/README.md) - Test architecture and development
diff --git a/scripts/gemm_analysis/rocprof_guide.md b/scripts/gemm_analysis/rocprof_guide.md
new file mode 100644
index 0000000..3fbee6f
--- /dev/null
+++ b/scripts/gemm_analysis/rocprof_guide.md
@@ -0,0 +1,133 @@
+# rocprof Tracing Configuration Guide
+
+Detailed guide for configuring rocprofv3 kernel tracing to capture GEMM performance metrics.
+
+## Configuration Methods
+
+### YAML-Based Configuration (Recommended)
+
+Use YAML configuration file for CU utilization metrics:
+
+```yaml
+jobs:
+  - kernel_include_regex: "(gemm|Cijk_.*)"  # pattern for kernels to trace
+    kernel_trace: true  # enable kernel tracing
+    stats: true  # timing statistics only (not CU utilization)
+    output_format: [json, csv]  # add perfetto for Chrome tracing
+    sys_trace: false
+    advanced_thread_trace: false  # leave false unless ATT decoder is installed
+```
+
+Example configuration: `scripts/gemm_analysis/rocprof_cu_only.yaml`
+
+### Running with YAML Config
+
+```bash
+bash scripts/gemm_analysis/run_train_various_channels.sh \
+  --rocprof \
+  --rocprof-input scripts/gemm_analysis/rocprof_cu_only.yaml \
+  --channels 28,42,56 \
+  --threads 256,512 \
+  --config config/gemm_overlap/gemm_test_1.yaml
+```
+
+## Configuration Notes
+
+- Kernel filtering/stats come from the YAML
+- Current rocprofv3 build ignores CLI kernel filters, use YAML to include/exclude kernels
+- Remove `advanced_thread_trace` or keep it `false` unless ATT decoder debs are installed
+- `stats: true` only collects timing statistics, NOT CU utilization metrics
+
+## Output Files
+
+rocprof generates 5 files per rank/process:
+
+| File | Description |
+|------|-------------|
+| `PID_agent_info.csv` | Hardware information about CPUs and GPUs |
+| `PID_counter_collection.csv` | **CU utilization metrics (main focus)** |
+| `PID_kernel_trace.csv` | Kernel execution timeline data |
+| `PID_results.json` | Chrome trace format for visualization |
+| `PID_results.csv` | Summary statistics |
+
+## counter_collection.csv Columns
+
+### Kernel Configuration
+
+- `Grid_Size` - Total number of workgroups in the kernel launch
+- `Kernel_Name` - Name of the GEMM kernel (e.g., Cijk_Alik_Bljk_SB_MT128x128x32_MI32x32x1x2)
+- `Workgroup_Size` - Number of work-items per workgroup
+
+### Memory Configuration
+
+- `LDS_Block_Size` - Local Data Share memory allocation per workgroup
+- `Scratch_Size` - Private memory allocation per work-item
+
+### Register Usage
+
+- `VGPR_Count` - Vector General Purpose Registers used
+- `Accum_VGPR_Count` - Accumulator VGPRs (for matrix operations)
+- `SGPR_Count` - Scalar General Purpose Registers used
+
+### Performance Metrics
+
+- `Counter_Name` - Performance counter being measured
+- `Counter_Value` - Value of the performance counter
+- `Start_Timestamp` / `End_Timestamp` - Kernel execution timing
+
+## Key Performance Counters
+
+Focus on these counters in `counter_collection.csv`:
+
+| Counter | Description | Purpose |
+|---------|-------------|---------|
+| `SQ_BUSY_CU_CYCLES` | Percentage of time CUs are active | CU utilization |
+| `SQ_WAVES` | Number of active wavefronts | Occupancy indicator |
+| `SQ_INSTS_MFMA` | Matrix FMA instructions | Critical for GEMM performance |
+| `SQ_INSTS_VALU` | Vector ALU instructions | General compute |
+
+## Command Line Options
+
+- `--rocprof` - Enable rocprofv3 tracing
+- `--rocprof-input FILE` - Use YAML configuration file
+- `--stats` - Include timing statistics (not CU utilization)
+- `--channels VALUES` - Comma-separated NCCL channel values
+- `--threads VALUES` - Comma-separated thread values
+
+## Output Location
+
+Traces saved to `rocprof_traces/` in each run directory:
+
+```
+experiments/sweep_YYYYMMDD_HHMMSS/
+└── 256thread/
+    └── nccl_XXchannels/
+        └── rocprof_traces/
+            ├── PID_agent_info.csv
+            ├── PID_counter_collection.csv
+            ├── PID_kernel_trace.csv
+            ├── PID_results.json
+            └── PID_results.csv
+```
+
+## Analysis Workflow
+
+1. Run sweep with rocprof enabled
+2. Focus on `counter_collection.csv` files
+3. Extract CU utilization metrics (`SQ_BUSY_CU_CYCLES`)
+4. Correlate with GEMM kernel performance variance
+5. Identify bottlenecks (low CU utilization, register pressure)
+
+## Common Issues
+
+### ATT Decoder Not Found
+
+If you see warnings about ATT decoder, set `advanced_thread_trace: false` in YAML.
+
+### Missing Counter Data
+
+Ensure `stats: true` is set in YAML configuration.
+
+### Large Output Files
+
+Use `kernel_include_regex` to filter only GEMM kernels and reduce output size.

From c65864e4184d64ba3ad0c57ec3cf9cf64ff959b6 Mon Sep 17 00:00:00 2001
From: sonbol_y
Date: Thu, 11 Dec 2025 10:25:39 -0600
Subject: [PATCH 2/2] Fix trailing whitespace in comprehensive_report.html

---
 docs/comprehensive_report.html | 314 ++++++++++++++++-----------------
 1 file changed, 157 insertions(+), 157 deletions(-)

diff --git a/docs/comprehensive_report.html b/docs/comprehensive_report.html
index 803cb60..445aa0f 100644
--- a/docs/comprehensive_report.html
+++ b/docs/comprehensive_report.html
@@ -201,7 +201,7 @@

GEMM RCCL Overlap Comprehensive Performance Report

Complete Analysis: rocm-7.0.8-meta (Baseline) vs rocm-7.0.10-meta (Test)
Generated: 2025-12-08 14:59:23

Executive Summary

This comprehensive report contains all performance analysis plots and metrics for RCCL comparisons across multiple configurations.

Test Configuration

@@ -232,650 +232,650 @@ Test Configuration

  • Total Plots: 96 visualizations

Configuration: 256 Threads, 28 Channels

Base: rocm-7.0.8-meta
Test: rocm-7.0.10-meta

Overview Plots

[Figure: Percentage Change Overview]
[Figure: Absolute Time Comparison]
[Figure: Performance Heatmap]
[Figure: Total Execution Time by Rank]

Detailed Metrics

[Figure: Computation Time Across Ranks]
[Figure: Communication Time Across Ranks]
[Figure: Idle Time Across Ranks]
[Figure: Percentage Difference All Metrics]

NCCL Analysis

[Figure: NCCL Latency Analysis]
[Figure: NCCL Summary Analysis]

Back to Summary

Configuration: 256 Threads, 42 Channels

Base: rocm-7.0.8-meta
Test: rocm-7.0.10-meta

Overview Plots

[Figure: Percentage Change Overview]
[Figure: Absolute Time Comparison]
[Figure: Performance Heatmap]
[Figure: Total Execution Time by Rank]

Detailed Metrics

[Figure: Computation Time Across Ranks]
[Figure: Communication Time Across Ranks]
[Figure: Idle Time Across Ranks]
[Figure: Percentage Difference All Metrics]

NCCL Analysis

[Figure: NCCL Latency Analysis]
[Figure: NCCL Summary Analysis]

Back to Summary

Configuration: 256 Threads, 56 Channels

Base: rocm-7.0.8-meta
Test: rocm-7.0.10-meta

Overview Plots

[Figure: Percentage Change Overview]
[Figure: Absolute Time Comparison]
[Figure: Performance Heatmap]
[Figure: Total Execution Time by Rank]

Detailed Metrics

[Figure: Computation Time Across Ranks]
[Figure: Communication Time Across Ranks]
[Figure: Idle Time Across Ranks]
[Figure: Percentage Difference All Metrics]

NCCL Analysis

[Figure: NCCL Latency Analysis]
[Figure: NCCL Summary Analysis]

Back to Summary

Configuration: 256 Threads, 70 Channels

Base: rocm-7.0.8-meta
Test: rocm-7.0.10-meta

Overview Plots

[Figure: Percentage Change Overview]
[Figure: Absolute Time Comparison]
[Figure: Performance Heatmap]
[Figure: Total Execution Time by Rank]

Detailed Metrics

[Figure: Computation Time Across Ranks]
[Figure: Communication Time Across Ranks]
[Figure: Idle Time Across Ranks]
[Figure: Percentage Difference All Metrics]

NCCL Analysis

[Figure: NCCL Latency Analysis]
[Figure: NCCL Summary Analysis]

Back to Summary

Configuration: 512 Threads, 28 Channels

Base: rocm-7.0.8-meta
Test: rocm-7.0.10-meta

Overview Plots

[Figure: Percentage Change Overview]
[Figure: Absolute Time Comparison]
[Figure: Performance Heatmap]
[Figure: Total Execution Time by Rank]

Detailed Metrics

[Figure: Computation Time Across Ranks]
[Figure: Communication Time Across Ranks]
[Figure: Idle Time Across Ranks]
[Figure: Percentage Difference All Metrics]

NCCL Analysis

[Figure: NCCL Latency Analysis]
[Figure: NCCL Summary Analysis]

Back to Summary

Configuration: 512 Threads, 42 Channels

Base: rocm-7.0.8-meta
Test: rocm-7.0.10-meta

Overview Plots

[Figure: Percentage Change Overview]
[Figure: Absolute Time Comparison]
[Figure: Performance Heatmap]
[Figure: Total Execution Time by Rank]

Detailed Metrics

[Figure: Computation Time Across Ranks]
[Figure: Communication Time Across Ranks]
[Figure: Idle Time Across Ranks]
[Figure: Percentage Difference All Metrics]

NCCL Analysis

[Figure: NCCL Latency Analysis]
[Figure: NCCL Summary Analysis]

Back to Summary

Configuration: 512 Threads, 56 Channels

Base: rocm-7.0.8-meta
Test: rocm-7.0.10-meta

Overview Plots

[Figure: Percentage Change Overview]
[Figure: Absolute Time Comparison]
[Figure: Performance Heatmap]
[Figure: Total Execution Time by Rank]

Detailed Metrics

[Figure: Computation Time Across Ranks]
[Figure: Communication Time Across Ranks]
[Figure: Idle Time Across Ranks]
[Figure: Percentage Difference All Metrics]

NCCL Analysis

[Figure: NCCL Latency Analysis]
[Figure: NCCL Summary Analysis]

Back to Summary

Configuration: 512 Threads, 70 Channels

Base: rocm-7.0.8-meta
Test: rocm-7.0.10-meta

Overview Plots

[Figure: Percentage Change Overview]
[Figure: Absolute Time Comparison]
[Figure: Performance Heatmap]
[Figure: Total Execution Time by Rank]

Detailed Metrics

[Figure: Computation Time Across Ranks]
[Figure: Communication Time Across Ranks]
[Figure: Idle Time Across Ranks]
[Figure: Percentage Difference All Metrics]

NCCL Analysis

[Figure: NCCL Latency Analysis]
[Figure: NCCL Summary Analysis]

Back to Summary
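
Review note on PATCH 1/2: the "Analysis Workflow" in `rocprof_guide.md` (focus on `counter_collection.csv`, extract `SQ_BUSY_CU_CYCLES` for GEMM kernels) might be sketched roughly as below. This is a minimal illustration and not part of the patch. It assumes the CSV carries the `Kernel_Name`, `Counter_Name`, and `Counter_Value` columns the guide lists; the function name `summarize_counter`, the sample rows, and the simplified kernel pattern are hypothetical, and real rocprofv3 output may contain additional columns or differ between versions.

```python
import csv
import io
import re
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; only the three column names come from rocprof_guide.md.
SAMPLE_CSV = """Kernel_Name,Counter_Name,Counter_Value
Cijk_A,SQ_BUSY_CU_CYCLES,100
Cijk_A,SQ_BUSY_CU_CYCLES,300
Cijk_A,SQ_WAVES,64
gemm_B,SQ_BUSY_CU_CYCLES,50
other_kernel,SQ_BUSY_CU_CYCLES,999
"""


def summarize_counter(csv_file, counter="SQ_BUSY_CU_CYCLES", kernel_pattern=r"gemm|Cijk_"):
    """Return {kernel_name: (sample_count, mean_value)} for one counter.

    kernel_pattern is a simplified stand-in for the YAML's
    kernel_include_regex "(gemm|Cijk_.*)".
    """
    kernel_re = re.compile(kernel_pattern)
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if row["Counter_Name"] != counter:
            continue  # skip other counters such as SQ_WAVES
        if not kernel_re.search(row["Kernel_Name"]):
            continue  # skip non-GEMM kernels
        values[row["Kernel_Name"]].append(float(row["Counter_Value"]))
    return {name: (len(vals), mean(vals)) for name, vals in values.items()}


summary = summarize_counter(io.StringIO(SAMPLE_CSV))
# Print kernels ordered by mean counter value, lowest first.
for name, (count, avg) in sorted(summary.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: n={count}, mean={avg:.1f}")
```

For a real run, the `io.StringIO` object would presumably be replaced with `open(path, newline="")` on one of the `PID_counter_collection.csv` files under `rocprof_traces/`. Sorting by the mean puts the kernels with the lowest counter values first, which is one way to approach step 5 of the workflow.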