ROCm · oyazdanb · Dec 11, 2025 · Dec 11, 2025
diff --git a/docs/comprehensive_report.html b/docs/comprehensive_report.html
diff --git a/scripts/gemm_analysis/README.md b/scripts/gemm_analysis/README.md
@@ -1,13 +1,14 @@
-# GEMM Sweep Profiling
+# GEMM Sweep Profiling and Analysis
 
 Profile GEMM kernel performance across multiple NCCL configurations.
 
 ## Prerequisites
 
 - Docker with ROCm support
-- TraceLens installed
+- TraceLens installed (optional for some scripts)
+- Python packages: pandas, openpyxl, matplotlib, seaborn
 
-## Pipeline Steps
+## Workflow
 
 ### 1. Build Docker Container
 
@@ -20,180 +21,56 @@ docker exec -it training-overlap-bugs-rocm70_9-1 bash
 
 ### 2. Run Training Sweep
 
+Basic sweep:
 ```bash
 bash scripts/gemm_analysis/run_train_various_channels.sh \
   --channels 28,42,56,70 \
   --threads 256,512 \
   --config config/gemm_overlap/gemm_test_1.yaml
 ```
 
-#### rocprof Tracing Options
-
-Add rocprofv3 kernel tracing to capture detailed GEMM performance:
-
-Simple YAML-based tracing (recommended)
----------------------------------------
-
-Use the `rocprof_cu_only.yaml` configuration file for CU utilization metrics:
-
-```yaml
-jobs:
-  - kernel_include_regex: "(gemm|Cijk_.*)"  # pattern for kernels to trace
-    kernel_trace: true                      # enable kernel tracing
-    stats: true                             # timing statistics only (not CU utilization)
-    output_format: [json, csv]              # add perfetto for Chrome tracing
-    sys_trace: false
-    advanced_thread_trace: false            # leave false unless ATT decoder is installed
-```
-
-Run the sweep with the CU-only YAML:
+With rocprof tracing:
 ```bash
 bash scripts/gemm_analysis/run_train_various_channels.sh \
   --rocprof \
   --rocprof-input scripts/gemm_analysis/rocprof_cu_only.yaml \
-  --channels 28,42,56 --threads 256,512 \
+  --channels 28,42,56,70 \
+  --threads 256,512 \
   --config config/gemm_overlap/gemm_test_1.yaml
 ```
-Notes:
-- Kernel filtering/stats come from the YAML. The current rocprofv3 build ignores CLI kernel filters, so use the YAML to include/exclude kernels.
-- Remove `advanced_thread_trace` or keep it `false` unless the ATT decoder debs are installed.
-- **Important**: `stats: true` only collects timing statistics, NOT CU utilization metrics.
-- **Output Files**: rocprof generates 5 files per rank/process:
-  - `PID_agent_info.csv`: Hardware information about CPUs and GPUs
-  - `PID_counter_collection.csv`: **Main file with CU utilization metrics** (focus on this)
-  - `PID_kernel_trace.csv`: Kernel execution timeline data
-  - `PID_results.json`: Chrome trace format for visualization
-  - `PID_results.csv`: Summary statistics
-
-**Analyzing Unique GEMM Kernels (counter_collection.csv columns):**
-- `Grid_Size`: Total number of workgroups in the kernel launch
-- `Kernel_Name`: Name of the GEMM kernel (e.g., Cijk_Alik_Bljk_SB_MT128x128x32_MI32x32x1x2)
-- `Workgroup_Size`: Number of work-items per workgroup
-- `LDS_Block_Size`: Local Data Share memory allocation per workgroup
-- `Scratch_Size`: Private memory allocation per work-item
-- `VGPR_Count`: Vector General Purpose Registers used
-- `Accum_VGPR_Count`: Accumulator VGPRs (for matrix operations)
-- `SGPR_Count`: Scalar General Purpose Registers used
-- `Counter_Name`: Performance counter being measured (e.g., SQ_BUSY_CU_CYCLES)
-- `Counter_Value`: Value of the performance counter
-- `Start_Timestamp` / `End_Timestamp`: Kernel execution timing
-
-**Key Options:**
-- `--rocprof` : Enable rocprofv3 tracing
-- `--stats` : Include timing statistics (not CU utilization)
-- `--channels VALUES` : Comma-separated NCCL channel values
-- `--threads VALUES` : Comma-separated thread values
-
-**Output:** Traces saved to `rocprof_traces/` in each run directory.
-
-**Key Performance Counters (found in counter_collection.csv files):**
-- `SQ_BUSY_CU_CYCLES`: Percentage of time CUs are active (CU utilization)
-- `SQ_WAVES`: Number of active wavefronts (occupancy indicator)
-- `SQ_INSTS_MFMA`: Matrix FMA instructions (critical for GEMM performance)
-- `SQ_INSTS_VALU`: Vector ALU instructions (general compute)
+
+For rocprof configuration details, see [rocprof_guide.md](rocprof_guide.md).
 
 ### 3. Generate TraceLens Reports
 
 ```bash
-bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_20251124_222204
+bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_YYYYMMDD_HHMMSS
 ```
 
 ### 4. Extract Top GEMM Kernels
 
 ```bash
 python scripts/gemm_analysis/analyze_gemm_reports.py \
-  --base-path experiments/sweep_20251124_222204/tracelens_analysis \
+  --base-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis \
   --threads 256 512 \
   --channels 28 42 56 70 \
   --ranks 0 1 2 3 4 5 6 7 \
   --top-k 5
 ```
 
-This generates `top5_gemm_kernels_time_variance.csv` with the kernels showing highest time variance across runs.
-
-## Output Structure
-
-```
-experiments/sweep_YYYYMMDD_HHMMSS/
-├── 256thread/
-│   └── nccl_XXchannels/
-│       ├── torch_profiler/rank*/
-│       ├── rocprof_traces/           # if --rocprof flag used
-│       │   ├── PID_agent_info.csv    # Hardware info for each rank
-│       │   ├── PID_counter_collection.csv  # CU utilization metrics (main focus)
-│       │   ├── PID_kernel_trace.csv  # Kernel execution timeline
-│       │   ├── PID_results.json      # Chrome trace format
-│       │   └── PID_results.csv       # Summary statistics
-│       └── run_output.log
-├── 512thread/
-│   └── nccl_XXchannels/
-└── tracelens_analysis/
-    ├── 256thread/
-    │   ├── individual_reports/perf_*ch_rank*.xlsx
-    │   └── collective_reports/collective_*ch.xlsx
-    ├── 512thread/
-    └── top5_gemm_kernels_time_variance.csv
-```
-
-## Quick Reference
+Output: `top5_gemm_kernels_time_variance.csv`
 
-```bash
-# Run complete sweep
-bash scripts/gemm_analysis/run_train_various_channels.sh \
-  --channels 28,42,56,70 \
-  --threads 256,512 \
-  --config config/gemm_overlap/gemm_test_1.yaml
-
-# Run with rocprof tracing (all kernels with stats)
-bash scripts/gemm_analysis/run_train_various_channels.sh \
-  --rocprof --stats \
-  --channels 28,42,56,70 \
-  --threads 256,512
-
-# Run with rocprof using CU-only YAML (recommended)
-bash scripts/gemm_analysis/run_train_various_channels.sh \
-  --rocprof --stats \
-  --rocprof-input scripts/gemm_analysis/rocprof_cu_only.yaml \
-  --channels 28,42,56,70 \
-  --threads 256,512
-
-# Generate TraceLens reports
-bash scripts/gemm_analysis/run_tracelens_analysis.sh experiments/sweep_YYYYMMDD_HHMMSS
-
-# Extract top GEMM kernels
-python scripts/gemm_analysis/analyze_gemm_reports.py \
-  --base-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis \
-  --threads 256 512 --channels 28 42 56 70 --top-k 5
-```
-# GEMM Visualization and Reporting
-
-Visualization, overlap analysis, and reporting tools for GEMM kernel performance data.
-
-## Prerequisites
-
-- Python packages: pandas, openpyxl, matplotlib, seaborn
-- Completed GEMM sweep profiling with generated `top5_gemm_kernels_time_variance.csv`
-
-## Pipeline Steps
-
-### 1. Generate Variance Plots
-
-Create comprehensive visualization of GEMM kernel variance:
+### 5. Generate Variance Plots
 
 ```bash
 python scripts/gemm_analysis/plot_gemm_variance.py \
   --csv-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
   --output-dir experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/plots
 ```
 
-Generates:
-- Box plots by thread count, channel count, and rank
-- Violin plots showing distribution
-- Thread-channel interaction plots
-
-### 2. Add Timestamp Information
+Generates box plots, violin plots, and interaction plots.
 
-Enhance variance data with kernel execution timestamps:
+### 6. Add Timestamp Information
 
 ```bash
 python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
@@ -203,9 +80,7 @@ python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
 
 Output: `top5_gemm_kernels_time_variance_with_timestamps.csv`
 
-### 3. Analyze Collective Overlap
-
-Identify NCCL collective operations overlapping with GEMM kernels:
+### 7. Analyze Collective Overlap
 
 ```bash
 python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
@@ -215,61 +90,53 @@ python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
 
 Output: `top5_gemm_kernels_time_variance_with_collective_overlap.csv`
 
-### 4. Create Comparison HTML Report
-
-Generate side-by-side comparison of two experiment sweeps:
+### 8. Process GPU Timeline
 
 ```bash
-python scripts/gemm_analysis/create_embeded_html_report.py \
-  --sweep1 experiments/sweep_20251121_155219 \
-  --sweep2 experiments/sweep_20251124_222204 \
-  --label1 "Base ROCm" \
-  --label2 "ROCm 7.0" \
-  --output sweep_comparison.html
+python scripts/gemm_analysis/process_gpu_timeline.py \
+  --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
 ```
 
-Creates self-contained HTML with embedded images.
+Output: `gpu_timeline_all_configs_mean.xlsx`
 
-Note: Currently supports pairwise (2-sweep) comparison. For comparing multiple sweeps,
-run multiple pairwise comparisons or aggregate data using the process_gpu_timeline.py
-and process_comms.py scripts.
-
-## Additional Analysis Tools
-
-### Process GPU Timeline
-
-Aggregate GPU timeline data across all ranks and configurations:
+### 9. Process NCCL Communication Data
 
 ```bash
-python scripts/gemm_analysis/process_gpu_timeline.py \
+python scripts/gemm_analysis/process_comms.py \
   --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
 ```
 
-Output: `gpu_timeline_all_configs_mean.xlsx` with multiple sheets:
-- All_Data - Complete dataset
-- Pivot_Time_ms - Matrix view of time
-- Pivot_Percent - Matrix view of percentages
-- Summary_By_Config - Key metrics per configuration
-
-### Process NCCL Communication Data
+Output: `nccl_master_all_configs.xlsx`
 
-Extract and aggregate NCCL collective operation data:
+### 10. Create Comparison HTML Report
 
 ```bash
-python scripts/gemm_analysis/process_comms.py \
-  --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
+python scripts/gemm_analysis/create_embeded_html_report.py \
+  --sweep1 experiments/sweep_20251121_155219 \
+  --sweep2 experiments/sweep_20251124_222204 \
+  --label1 "Base ROCm" \
+  --label2 "ROCm 7.0" \
+  --output sweep_comparison.html
 ```
 
-Output: `nccl_master_all_configs.xlsx` and `.csv` with:
-- Communication latency statistics
-- Bandwidth metrics
-- Time skew analysis
+Creates a self-contained HTML file with embedded images.
 
 ## Output Structure
 
 ```
 experiments/sweep_YYYYMMDD_HHMMSS/
+├── 256thread/
+│   └── nccl_XXchannels/
+│       ├── torch_profiler/rank*/
+│       ├── rocprof_traces/           # if --rocprof used
+│       └── run_output.log
+├── 512thread/
+│   └── nccl_XXchannels/
 └── tracelens_analysis/
+    ├── 256thread/
+    │   ├── individual_reports/perf_*ch_rank*.xlsx
+    │   └── collective_reports/collective_*ch.xlsx
+    ├── 512thread/
     ├── top5_gemm_kernels_time_variance.csv
     ├── top5_gemm_kernels_time_variance_with_timestamps.csv
     ├── top5_gemm_kernels_time_variance_with_collective_overlap.csv
@@ -283,28 +150,58 @@ experiments/sweep_YYYYMMDD_HHMMSS/
         └── variance_thread_channel_interaction.png
 ```
 
-## Quick Reference
+## Script Reference
+
+### Core Pipeline Scripts
+
+- `run_train_various_channels.sh` - Execute training sweep across configurations
+- `run_tracelens_analysis.sh` - Generate TraceLens Excel reports
+- `analyze_gemm_reports.py` - Extract top GEMM kernels by variance
+- `plot_gemm_variance.py` - Generate visualization plots
+
+### Enhancement Scripts
+
+- `enhance_gemm_variance_with_timestamps.py` - Add kernel execution timestamps
+- `gemm_report_with_collective_overlap.py` - Identify NCCL overlap with GEMM
+- `process_gpu_timeline.py` - Aggregate GPU timeline across configurations
+- `process_comms.py` - Extract NCCL communication statistics
+
+### Reporting Scripts
 
+- `create_embeded_html_report.py` - Generate HTML comparison reports (pairwise)
+
+## Regression Testing
+
+The pipeline includes automated regression tests to ensure script changes don't break functionality.
+
+Setup and run tests:
 ```bash
-# Generate all visualizations
-python scripts/gemm_analysis/plot_gemm_variance.py \
-  --csv-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance.csv
+source ~/venvs/aorta/bin/activate
+pytest tests/gemm_analysis/test_gemm_regression.py -v
+```
 
-# Add timestamps and analyze overlap
-python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
-  --input-csv experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
-  --base-path experiments/sweep_YYYYMMDD_HHMMSS
+For details on test architecture and adding new tests, see [tests/gemm_analysis/README.md](../../tests/gemm_analysis/README.md).
 
-python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
-  --input-csv experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \
-  --tracelens-path experiments/sweep_YYYYMMDD_HHMMSS/tracelens_analysis
+## Troubleshooting
 
-# Process Excel reports
-python scripts/gemm_analysis/process_gpu_timeline.py --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
-python scripts/gemm_analysis/process_comms.py --sweep-dir experiments/sweep_YYYYMMDD_HHMMSS
+### TraceLens Not Installed
 
-# Create comparison report
-python scripts/gemm_analysis/create_embeded_html_report.py \
-  --sweep1 experiments/sweep1 --sweep2 experiments/sweep2 \
-  --label1 "Baseline" --label2 "Optimized" --output comparison.html
+If TraceLens is not available, analysis scripts will skip TraceLens-dependent processing.
+
+### Missing Dependencies
+
+```bash
+pip install pandas openpyxl matplotlib seaborn
 ```
+
+### Import Errors
+
+Ensure the virtual environment is activated:
+```bash
+source ~/venvs/aorta/bin/activate
+```
+
+## Additional Documentation
+
+- [rocprof_guide.md](rocprof_guide.md) - Detailed rocprof configuration and performance counters
+- [tests/gemm_analysis/README.md](../../tests/gemm_analysis/README.md) - Test architecture and development
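
Steps 3 to 5 of the workflow in the README above all derive their paths from a single sweep directory, so they can be assembled programmatically. The following is a minimal sketch, not part of this PR: `build_analysis_commands` is a hypothetical helper name, the defaults simply mirror the example values shown in the README, and paths are assumed to be relative to the repository root. It only builds the command lines and does not run anything.

```python
import shlex
from pathlib import PurePosixPath

# Location of the analysis scripts, as used throughout the README.
SCRIPTS = PurePosixPath("scripts/gemm_analysis")


def build_analysis_commands(sweep_dir, threads=(256, 512),
                            channels=(28, 42, 56, 70), ranks=range(8)):
    """Return argv lists for workflow steps 3-5 (hypothetical helper).

    Flags and file names are copied from the README examples; the
    variance CSV name assumes the README's --top-k 5 setting.
    """
    sweep = PurePosixPath(sweep_dir)
    analysis = sweep / "tracelens_analysis"
    variance_csv = analysis / "top5_gemm_kernels_time_variance.csv"
    return [
        # Step 3: generate TraceLens reports for the sweep.
        ["bash", str(SCRIPTS / "run_tracelens_analysis.sh"), str(sweep)],
        # Step 4: extract the top GEMM kernels by time variance.
        ["python", str(SCRIPTS / "analyze_gemm_reports.py"),
         "--base-path", str(analysis),
         "--threads", *map(str, threads),
         "--channels", *map(str, channels),
         "--ranks", *map(str, ranks),
         "--top-k", "5"],
        # Step 5: plot the variance data produced by step 4.
        ["python", str(SCRIPTS / "plot_gemm_variance.py"),
         "--csv-path", str(variance_csv),
         "--output-dir", str(analysis / "plots")],
    ]


if __name__ == "__main__":
    # Print the commands for review; replace the placeholder with a real sweep.
    for argv in build_analysis_commands("experiments/sweep_YYYYMMDD_HHMMSS"):
        print(shlex.join(argv))
```

Returning argv lists rather than shell strings keeps quoting unambiguous, so the same lists could be passed to `subprocess.run(argv, check=True)` once the printed commands look right.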