314 changes: 157 additions & 157 deletions docs/comprehensive_report.html

Large diffs are not rendered by default.

285 changes: 91 additions & 194 deletions scripts/gemm_analysis/README.md
@@ -1,13 +1,14 @@
# GEMM Sweep Profiling
# GEMM Sweep Profiling and Analysis

Profile GEMM kernel performance across multiple NCCL configurations.

## Prerequisites

- Docker with ROCm support
- TraceLens installed
- TraceLens installed (optional for some scripts)
- Python packages: pandas, openpyxl, matplotlib, seaborn

## Pipeline Steps
## Workflow

### 1. Build Docker Container

@@ -20,180 +21,56 @@ docker exec -it training-overlap-bugs-rocm70_9-1 bash

### 2. Run Training Sweep

Basic sweep:
```bash
bash scripts/gemm_analysis/run_train_various_channels.sh \
--channels 28,42,56,70 \
--threads 256,512 \
--config config/gemm_overlap/gemm_test_1.yaml
```

#### rocprof Tracing Options

Add rocprofv3 kernel tracing to capture detailed GEMM performance:

##### Simple YAML-based tracing (recommended)

Use the `rocprof_cu_only.yaml` configuration file for CU utilization metrics:

```yaml
jobs:
  - kernel_include_regex: "(gemm|Cijk_.*)"  # pattern for kernels to trace
    kernel_trace: true                      # enable kernel tracing
    stats: true                             # timing statistics only (not CU utilization)
    output_format: [json, csv]              # add perfetto for Chrome tracing
    sys_trace: false
    advanced_thread_trace: false            # leave false unless ATT decoder is installed
```

Run the sweep with the CU-only YAML:
With rocprof tracing:
```bash
bash scripts/gemm_analysis/run_train_various_channels.sh \
--rocprof \
--rocprof-input scripts/gemm_analysis/rocprof_cu_only.yaml \
--channels 28,42,56 --threads 256,512 \
--channels 28,42,56,70 \
--threads 256,512 \
--config config/gemm_overlap/gemm_test_1.yaml
```
Notes:
- Kernel filtering/stats come from the YAML. The current rocprofv3 build ignores CLI kernel filters, so use the YAML to include/exclude kernels.
- Remove `advanced_thread_trace` or keep it `false` unless the ATT decoder debs are installed.
- **Important**: `stats: true` only collects timing statistics, NOT CU utilization metrics.
- **Output Files**: rocprof generates 5 files per rank/process:
  - `PID_agent_info.csv`: Hardware information about CPUs and GPUs
  - `PID_counter_collection.csv`: **Main file with CU utilization metrics** (focus on this)
  - `PID_kernel_trace.csv`: Kernel execution timeline data
  - `PID_results.json`: Chrome trace format for visualization
  - `PID_results.csv`: Summary statistics

**Analyzing Unique GEMM Kernels (counter_collection.csv columns):**
- `Grid_Size`: Total number of workgroups in the kernel launch
- `Kernel_Name`: Name of the GEMM kernel (e.g., Cijk_Alik_Bljk_SB_MT128x128x32_MI32x32x1x2)
- `Workgroup_Size`: Number of work-items per workgroup
- `LDS_Block_Size`: Local Data Share memory allocation per workgroup
- `Scratch_Size`: Private memory allocation per work-item
- `VGPR_Count`: Vector General Purpose Registers used
- `Accum_VGPR_Count`: Accumulator VGPRs (for matrix operations)
- `SGPR_Count`: Scalar General Purpose Registers used
- `Counter_Name`: Performance counter being measured (e.g., SQ_BUSY_CU_CYCLES)
- `Counter_Value`: Value of the performance counter
- `Start_Timestamp` / `End_Timestamp`: Kernel execution timing

**Key Options:**
- `--rocprof` : Enable rocprofv3 tracing
- `--stats` : Include timing statistics (not CU utilization)
- `--channels VALUES` : Comma-separated NCCL channel values
- `--threads VALUES` : Comma-separated thread values

**Output:** Traces saved to `rocprof_traces/` in each run directory.

**Key Performance Counters (found in counter_collection.csv files):**
- `SQ_BUSY_CU_CYCLES`: Cycles during which CUs are busy (a raw count, used as the basis for CU utilization)
- `SQ_WAVES`: Number of active wavefronts (occupancy indicator)
- `SQ_INSTS_MFMA`: Matrix FMA instructions (critical for GEMM performance)
- `SQ_INSTS_VALU`: Vector ALU instructions (general compute)

For rocprof configuration details, see [rocprof_guide.md](rocprof_guide.md).
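As a rough illustration of how the `counter_collection.csv` columns listed above could be consumed, the sketch below sums `Counter_Value` per kernel and counter using only the standard library. The column names and the kernel regex come from this README; the function name `sum_counters_by_kernel` and the sample rows are hypothetical, and real files may contain additional columns or a different layout.

```python
import csv
import io
import re
from collections import defaultdict

# Same pattern as kernel_include_regex in the YAML example above.
GEMM_PATTERN = re.compile(r"(gemm|Cijk_.*)")


def sum_counters_by_kernel(csv_file, pattern=GEMM_PATTERN):
    """Sum Counter_Value per (Kernel_Name, Counter_Name) for matching kernels."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        if not pattern.search(row["Kernel_Name"]):
            continue  # skip kernels that do not look like GEMMs
        key = (row["Kernel_Name"], row["Counter_Name"])
        totals[key] += float(row["Counter_Value"])
    return dict(totals)


# Made-up stand-in for a PID_counter_collection.csv file (assumption, not real data).
SAMPLE = """Kernel_Name,Counter_Name,Counter_Value
Cijk_A,SQ_WAVES,128
Cijk_A,SQ_WAVES,64
Cijk_A,SQ_INSTS_MFMA,4096
other_kernel,SQ_WAVES,32
"""

totals = sum_counters_by_kernel(io.StringIO(SAMPLE))
for (kernel, counter), value in sorted(totals.items()):
    print(f"{kernel:15s} {counter:15s} {value:.0f}")
```

For a real run, pass an open file handle for one of the `PID_counter_collection.csv` files instead of the in-memory sample.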

### 3. Generate TraceLens Reports

```bash
bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_20251124_222204
bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_YYYYMMDD_HHMMSS
```

### 4. Extract Top GEMM Kernels

```bash
python scripts/gemm_analysis/analyze_gemm_reports.py \
--base-path experiments/sweep_20251124_222204/tracelens_analysis \
--base-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis \
--threads 256 512 \
--channels 28 42 56 70 \
--ranks 0 1 2 3 4 5 6 7 \
--top-k 5
```

This generates `top5_gemm_kernels_time_variance.csv`, which lists the kernels with the highest time variance across runs.
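To make "highest time variance across runs" concrete, here is a minimal sketch of one way such a ranking could be computed. It is not taken from `analyze_gemm_reports.py`: the function `top_k_by_variance`, the use of population variance, and the sample durations are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import pvariance


def top_k_by_variance(samples, k):
    """Rank kernels by the variance of their durations across runs.

    samples: iterable of (kernel_name, duration) pairs, one per observation.
    Returns up to k (kernel_name, variance) pairs, highest variance first.
    """
    durations = defaultdict(list)
    for kernel, duration in samples:
        durations[kernel].append(duration)
    # Variance is only meaningful with at least two observations.
    variances = {
        name: pvariance(values)
        for name, values in durations.items()
        if len(values) > 1
    }
    return sorted(variances.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical durations (arbitrary time units) for three kernels.
samples = [
    ("kernel_a", 100.0), ("kernel_a", 140.0),
    ("kernel_b", 50.0), ("kernel_b", 52.0),
    ("kernel_c", 10.0), ("kernel_c", 30.0), ("kernel_c", 20.0),
]
ranked = top_k_by_variance(samples, k=2)
for name, variance in ranked:
    print(f"{name}: variance={variance:.2f}")
```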

## Output Structure

```
experiments/sweep_YYYYMMDD_HHMMSS/
├── 256thread/
│   └── nccl_XXchannels/
│       ├── torch_profiler/rank*/
│       ├── rocprof_traces/                  # if --rocprof flag used
│       │   ├── PID_agent_info.csv           # Hardware info for each rank
│       │   ├── PID_counter_collection.csv   # CU utilization metrics (main focus)
│       │   ├── PID_kernel_trace.csv         # Kernel execution timeline
│       │   ├── PID_results.json             # Chrome trace format
│       │   └── PID_results.csv              # Summary statistics
│       └── run_output.log
├── 512thread/
│   └── nccl_XXchannels/
└── tracelens_analysis/
    ├── 256thread/
    │   ├── individual_reports/perf_*ch_rank*.xlsx
    │   └── collective_reports/collective_*ch.xlsx
    ├── 512thread/
    └── top5_gemm_kernels_time_variance.csv
```
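The layout above suggests that each run can be identified by its thread count and channel count from the directory names alone. The sketch below discovers runs that way; the helper `find_runs` is hypothetical, and it assumes the `<N>thread/nccl_<C>channels` naming shown in the tree holds for every run.

```python
import re
import tempfile
from pathlib import Path

# Matches the trailing "<N>thread/nccl_<C>channels" part of a run directory.
RUN_DIR = re.compile(r"(\d+)thread/nccl_(\d+)channels$")


def find_runs(sweep_dir):
    """Return sorted (threads, channels, path) tuples for each run directory."""
    runs = []
    # Only top-level "<N>thread" directories are searched, so the copies
    # under tracelens_analysis/ are not picked up.
    for path in Path(sweep_dir).glob("*thread/nccl_*channels"):
        match = RUN_DIR.search(path.as_posix())
        if match and path.is_dir():
            runs.append((int(match.group(1)), int(match.group(2)), path))
    return sorted(runs)


# Build a throwaway directory tree that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    sweep = Path(tmp) / "sweep_demo"
    for sub in (
        "256thread/nccl_28channels",
        "512thread/nccl_56channels",
        "tracelens_analysis/256thread",
    ):
        (sweep / sub).mkdir(parents=True)
    found = [(threads, channels) for threads, channels, _ in find_runs(sweep)]

print(found)
```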

## Quick Reference
Output: `top5_gemm_kernels_time_variance.csv`

```bash
# Run complete sweep
bash scripts/gemm_analysis/run_train_various_channels.sh \
--channels 28,42,56,70 \
--threads 256,512 \
--config config/gemm_overlap/gemm_test_1.yaml

# Run with rocprof tracing (all kernels with stats)
bash scripts/gemm_analysis/run_train_various_channels.sh \
--rocprof --stats \
--channels 28,42,56,70 \
--threads 256,512

# Run with rocprof using CU-only YAML (recommended)
bash scripts/gemm_analysis/run_train_various_channels.sh \
--rocprof --stats \
--rocprof-input scripts/gemm_analysis/rocprof_cu_only.yaml \
--channels 28,42,56,70 \
--threads 256,512

# Generate TraceLens reports
bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_YYYYMMDD_HHMMSS

# Extract top GEMM kernels
python scripts/gemm_analysis/analyze_gemm_reports.py \
--base-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis \
--threads 256 512 --channels 28 42 56 70 --top-k 5
```
# GEMM Visualization and Reporting

Visualization, overlap analysis, and reporting tools for GEMM kernel performance data.

## Prerequisites

- Python packages: pandas, openpyxl, matplotlib, seaborn
- Completed GEMM sweep profiling with generated `top5_gemm_kernels_time_variance.csv`

## Pipeline Steps

### 1. Generate Variance Plots

Create comprehensive visualization of GEMM kernel variance:
### 5. Generate Variance Plots

```bash
python scripts/gemm_analysis/plot_gemm_variance.py \
--csv-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
--output-dir experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/plots
```

Generates:
- Box plots by thread count, channel count, and rank
- Violin plots showing distribution
- Thread-channel interaction plots

### 2. Add Timestamp Information
Generates box plots, violin plots, and interaction plots.

Enhance variance data with kernel execution timestamps:
### 6. Add Timestamp Information

```bash
python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
@@ -203,9 +80,7 @@ python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \

Output: `top5_gemm_kernels_time_variance_with_timestamps.csv`

### 3. Analyze Collective Overlap

Identify NCCL collective operations overlapping with GEMM kernels:
### 7. Analyze Collective Overlap

```bash
python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
@@ -215,61 +90,53 @@ python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \

Output: `top5_gemm_kernels_time_variance_with_collective_overlap.csv`
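Conceptually, overlap detection comes down to comparing time intervals. The following sketch shows a generic interval-overlap check; it is not the logic of `gemm_report_with_collective_overlap.py`, and the function names, tuple layout, and timestamps are assumptions made for illustration.

```python
def overlap_duration(a_start, a_end, b_start, b_end):
    """Length of the intersection of [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def overlapping_collectives(gemm_interval, collectives):
    """Return (name, shared_time) for each collective that overlaps the GEMM.

    gemm_interval: (start, end) timestamps of one GEMM kernel.
    collectives: iterable of (name, start, end) tuples.
    """
    gemm_start, gemm_end = gemm_interval
    hits = []
    for name, start, end in collectives:
        shared = overlap_duration(gemm_start, gemm_end, start, end)
        if shared > 0:
            hits.append((name, shared))
    return hits


# Hypothetical timestamps in arbitrary units.
gemm_interval = (1_000, 2_000)
collectives = [
    ("all_reduce", 500, 1_200),        # overlaps the first 200 units
    ("all_gather", 1_500, 1_700),      # fully inside the GEMM
    ("reduce_scatter", 2_000, 2_500),  # starts as the GEMM ends, no overlap
]
hits = overlapping_collectives(gemm_interval, collectives)
print(hits)
```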

### 4. Create Comparison HTML Report

Generate side-by-side comparison of two experiment sweeps:
### 8. Process GPU Timeline

```bash
python scripts/gemm_analysis/create_embeded_html_report.py \
--sweep1 experiments/sweep_20251121_155219 \
--sweep2 experiments/sweep_20251124_222204 \
--label1 "Base ROCm" \
--label2 "ROCm 7.0" \
--output sweep_comparison.html
python scripts/gemm_analysis/process_gpu_timeline.py \
--sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
```

Creates self-contained HTML with embedded images.
Output: `gpu_timeline_all_configs_mean.xlsx`

Note: Only pairwise (two-sweep) comparison is currently supported. To compare more than two sweeps,
run several pairwise comparisons or aggregate the data with the `process_gpu_timeline.py`
and `process_comms.py` scripts.
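One way to script several pairwise comparisons is to generate one command per pair of sweeps. The sketch below only builds and prints the command lines; the flags are the ones shown in this README, while the helper `pairwise_report_commands`, the sweep labels, the directory names, and the output file naming are hypothetical.

```python
import shlex
from itertools import combinations

REPORT_SCRIPT = "scripts/gemm_analysis/create_embeded_html_report.py"


def pairwise_report_commands(sweeps):
    """Build one report command per unordered pair of sweeps.

    sweeps: dict mapping a label to its sweep directory.
    """
    commands = []
    for (label1, dir1), (label2, dir2) in combinations(sweeps.items(), 2):
        # Output naming is an arbitrary choice for this sketch.
        output = f"{label1}_vs_{label2}.html".replace(" ", "_")
        commands.append([
            "python", REPORT_SCRIPT,
            "--sweep1", dir1, "--sweep2", dir2,
            "--label1", label1, "--label2", label2,
            "--output", output,
        ])
    return commands


# Placeholder sweep directories (assumptions, not real experiment names).
sweeps = {
    "run_a": "experiments/sweep_a",
    "run_b": "experiments/sweep_b",
    "run_c": "experiments/sweep_c",
}
commands = pairwise_report_commands(sweeps)
for command in commands:
    print(shlex.join(command))
```

Each printed line can be run as-is, or the argument lists can be passed to `subprocess.run` once the placeholder paths are replaced.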

## Additional Analysis Tools

### Process GPU Timeline

Aggregate GPU timeline data across all ranks and configurations:
### 9. Process NCCL Communication Data

```bash
python scripts/gemm_analysis/process_gpu_timeline.py \
python scripts/gemm_analysis/process_comms.py \
--sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
```

Output: `gpu_timeline_all_configs_mean.xlsx` with multiple sheets:
- All_Data - Complete dataset
- Pivot_Time_ms - Matrix view of time
- Pivot_Percent - Matrix view of percentages
- Summary_By_Config - Key metrics per configuration

### Process NCCL Communication Data
Output: `nccl_master_all_configs.xlsx`

Extract and aggregate NCCL collective operation data:
### 10. Create Comparison HTML Report

```bash
python scripts/gemm_analysis/process_comms.py \
--sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
python scripts/gemm_analysis/create_embeded_html_report.py \
--sweep1 experiments/sweep_20251121_155219 \
--sweep2 experiments/sweep_20251124_222204 \
--label1 "Base ROCm" \
--label2 "ROCm 7.0" \
--output sweep_comparison.html
```

Output: `nccl_master_all_configs.xlsx` and `.csv` with:
- Communication latency statistics
- Bandwidth metrics
- Time skew analysis
Creates self-contained HTML with embedded images.

## Output Structure

```
experiments/sweep_YYYYMMDD_HHMMSS/
├── 256thread/
│   └── nccl_XXchannels/
│       ├── torch_profiler/rank*/
│       ├── rocprof_traces/  # if --rocprof used
│       └── run_output.log
├── 512thread/
│   └── nccl_XXchannels/
└── tracelens_analysis/
    ├── 256thread/
    │   ├── individual_reports/perf_*ch_rank*.xlsx
    │   └── collective_reports/collective_*ch.xlsx
    ├── 512thread/
    ├── top5_gemm_kernels_time_variance.csv
    ├── top5_gemm_kernels_time_variance_with_timestamps.csv
    ├── top5_gemm_kernels_time_variance_with_collective_overlap.csv
@@ -283,28 +150,58 @@ experiments/sweep_YYYYMMDD_HHMMSS/
└── variance_thread_channel_interaction.png
```

## Quick Reference
## Script Reference

### Core Pipeline Scripts

- `run_train_various_channels.sh` - Execute training sweep across configurations
- `run_tracelens_analysis.sh` - Generate TraceLens Excel reports
- `analyze_gemm_reports.py` - Extract top GEMM kernels by variance
- `plot_gemm_variance.py` - Generate visualization plots

### Enhancement Scripts

- `enhance_gemm_variance_with_timestamps.py` - Add kernel execution timestamps
- `gemm_report_with_collective_overlap.py` - Identify NCCL overlap with GEMM
- `process_gpu_timeline.py` - Aggregate GPU timeline across configurations
- `process_comms.py` - Extract NCCL communication statistics

### Reporting Scripts

- `create_embeded_html_report.py` - Generate HTML comparison reports (pairwise)

## Regression Testing

The pipeline includes automated regression tests to ensure script changes don't break functionality.

Setup and run tests:
```bash
# Generate all visualizations
python scripts/gemm_analysis/plot_gemm_variance.py \
--csv-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance.csv
source ~/venvs/aorta/bin/activate
pytest tests/gemm_analysis/test_gemm_regression.py -v
```

# Add timestamps and analyze overlap
python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
--input-csv experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
--base-path experiments/sweep_YYYYMMDD_HHMMSS
For details on test architecture and adding new tests, see [tests/gemm_analysis/README.md](../../tests/gemm_analysis/README.md).

python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
--input-csv experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \
--tracelens-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis
## Troubleshooting

# Process Excel reports
python scripts/gemm_analysis/process_gpu_timeline.py --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
python scripts/gemm_analysis/process_comms.py --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
### TraceLens Not Installed

# Create comparison report
python scripts/gemm_analysis/create_embeded_html_report.py \
--sweep1 experiments/sweep1 --sweep2 experiments/sweep2 \
--label1 "Baseline" --label2 "Optimized" --output comparison.html
If TraceLens is not available, analysis scripts will skip TraceLens-dependent processing.

### Missing Dependencies

```bash
pip install pandas openpyxl matplotlib seaborn
```

### Import Errors

Ensure the virtual environment is activated:
```bash
source ~/venvs/aorta/bin/activate
```

## Additional Documentation

- [rocprof_guide.md](rocprof_guide.md) - Detailed rocprof configuration and performance counters
- [tests/gemm_analysis/README.md](../../tests/gemm_analysis/README.md) - Test architecture and development