4 changes: 1 addition & 3 deletions docker/docker-compose.rocm70_9-1.yaml
@@ -3,7 +3,7 @@ services:
container_name: training-overlap-bugs-rocm70
build:
context: .
dockerfile: Dockerfile.rocm70
dockerfile: Dockerfile.rocm70_9-1
user: root
privileged: true
network_mode: host
@@ -15,8 +15,6 @@ services:
security_opt:
- seccomp=unconfined
environment:
- RCCL_FOLDER=/rccl
- LD_LIBRARY_PATH=/rccl/build/release:$LD_LIBRARY_PATH
- TORCH_NCCL_HIGH_PRIORITY=1

volumes:
166 changes: 166 additions & 0 deletions scripts/tracelens_single_config/README.md
@@ -0,0 +1,166 @@
# RCCL Warp Speed Performance Testing

Tests the RCCL `warp_speed_v1` branch from https://github.com/mustafabar/rccl.git.

## Prerequisites

```bash
pip install pandas openpyxl matplotlib seaborn numpy
```

## Run Tests

### Step 1: Start Container and Build RCCL

```bash
cd docker
docker-compose -f docker-compose.rocm70_9-1.yaml build
docker-compose -f docker-compose.rocm70_9-1.yaml up -d
docker-compose -f docker-compose.rocm70_9-1.yaml exec torchenv-rocm70 bash

# Inside container - build warp_speed_v1 (always rebuild)
# Note: Set --amdgpu_targets to match your GPU architecture
# Run 'rocminfo | grep gfx' to find your GPU target (e.g., gfx942, gfx950)
cd /opt
if [ -d "rccl" ]; then
    cd rccl
    git checkout warp_speed_v1
    git pull
else
    git clone --recursive https://github.com/mustafabar/rccl.git
    cd rccl
    git checkout warp_speed_v1
fi
./install.sh -l --amdgpu_targets=gfx950

cd /workspace/aorta
```
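
After the build, it can be worth confirming which RCCL PyTorch actually loads before running the tests. A minimal sanity-check sketch (run inside the container; on ROCm builds the NCCL API reports the RCCL version, and whether the freshly built library is the one picked up depends on where `install.sh` placed it and on your library search path):

```python
# Sanity check from inside the container's Python environment.
# torch.cuda.nccl.version() reports the NCCL/RCCL version PyTorch has loaded;
# compare it against the warp_speed_v1 build you just installed.
import torch

print("GPUs visible:", torch.cuda.device_count())
print("RCCL/NCCL version seen by PyTorch:", torch.cuda.nccl.version())
```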

### Step 2: Run RCCL Tests

```bash
# Default 3 configurations
./scripts/tracelens_single_config/run_rccl_warp_speed_comparison.sh

# Custom configurations (CU_count,threads pairs)
./scripts/tracelens_single_config/run_rccl_warp_speed_comparison.sh -p "56,256 37,384 32,512" -c ./config/single_node/gemm_overlap_comm.yaml
```

Output structure:
```
experiments/
  rccl_warp_speed_YYYYMMDD_HHMMSS/
    56cu_256threads/
      torch_profiler/      # Raw profiler traces
      run_output.log       # Training output log
    37cu_384threads/
    32cu_512threads/
    rccl_warp_speed_summary_YYYYMMDD_HHMMSS.txt
```

### Step 3: Generate Reports (Outside Container)

```bash
# Exit container
exit

# Run complete analysis
python scripts/tracelens_single_config/run_full_analysis.py \
    --baseline experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/56cu_256threads \
    --test experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/37cu_384threads \
    --output comparison_results \
    --all

# Or skip TraceLens if already done
python scripts/tracelens_single_config/run_full_analysis.py \
    --baseline experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/56cu_256threads \
    --test experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/37cu_384threads \
    --output comparison_results \
    --all --skip-tracelens
```

## Generated Excel Reports

### Individual TraceLens Reports (per configuration)
Each configuration generates:
- `tracelens_analysis/individual_reports/perf_rank*.xlsx` - Per-rank performance breakdown
- `tracelens_analysis/collective_reports/collective_all_ranks.xlsx` - Collective operations summary
- `tracelens_analysis/gpu_timeline_summary_mean.xlsx` - GPU timeline averages
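
These per-configuration workbooks can also be inspected directly with pandas; a minimal sketch (the experiment path and rank are illustrative, adjust to your run):

```python
import pandas as pd

# Per-rank report for one configuration; adjust the experiment directory and rank.
path = ("experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/56cu_256threads/"
        "tracelens_analysis/individual_reports/perf_rank0.xlsx")

xl = pd.ExcelFile(path)
print("Sheets:", xl.sheet_names)

# Peek at the first sheet.
df = xl.parse(xl.sheet_names[0])
print(df.head())
```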

### Final Analysis Report (`final_analysis_report.xlsx`)

Contains multiple sheets:

**Summary Sheets:**
- `Summary_Dashboard` - High-level comparison metrics with percentage changes
- `Summary_Comparison` - Side-by-side summary comparison
- `GPU_ByRank_Comparison` - Detailed per-rank performance comparison
- `Comparison_By_Rank` - Rank-wise metric comparison with differences

**GPU Timeline Sheets:**
- `All_Ranks_Combined` - Combined GPU timeline data from all ranks
- `Summary` - Aggregated GPU timeline summary
- `Rank_*` - Individual rank GPU timelines

**Collective/NCCL Sheets:**
- `nccl_summary_implicit_sync` - NCCL operations with implicit synchronization
- `nccl_summary_long` - Long-running NCCL operations
- `nccl_implicit_sync_cmp` - Comparison of implicit sync operations (sheet names are shortened to fit Excel's 31-character limit)
- `nccl_long_cmp` - Comparison of long-running operations

**Raw Data Sheets (hidden by default):**
- `gpu_timeline_combined` - Raw combined GPU timeline data
- `gpu_timeline_comparison` - Raw GPU timeline comparison data
- `collective_combined` - Raw collective operations data
- `collective_comparison` - Raw collective comparison data
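
Because these sheets ship hidden, a small openpyxl sketch to make them visible again (workbook and sheet names as listed above):

```python
from openpyxl import load_workbook

wb = load_workbook("final_analysis_report.xlsx")
for name in ("gpu_timeline_combined", "gpu_timeline_comparison",
             "collective_combined", "collective_comparison"):
    if name in wb.sheetnames:
        wb[name].sheet_state = "visible"  # was 'hidden'
wb.save("final_analysis_report_unhidden.xlsx")
```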

### Comparison Reports

- `gpu_timeline_combined.xlsx` - Baseline and test GPU metrics combined
- `gpu_timeline_comparison.xlsx` - GPU metrics with comparison analysis
- `collective_combined.xlsx` - Baseline and test collective operations combined
- `collective_comparison.xlsx` - Collective operations with comparison analysis

## Generated Visualizations

### HTML Report
- `performance_analysis_report.html` - Complete report with all embedded plots

### Individual Plot Files (12 Total)
1. `plot1_percentage_change_overview.png` - Horizontal bar chart showing performance changes
2. `plot2_absolute_time_comparison.png` - Bar chart comparing absolute times
3. `plot3_performance_heatmap.png` - Heatmap of performance by rank
4. `plot4_total_execution_time.png` - Line plot of total execution time per rank
5. `plot5_computation_time.png` - Line plot of computation time across ranks
6. `plot6_communication_time.png` - Line plot of communication time across ranks
7. `plot7_idle_time.png` - Line plot of idle time across ranks
8. `plot8_percentage_difference_all_metrics.png` - Bar plot showing percentage differences for all metrics
9. `plot9_nccl_latency.png` - Line plot of latency vs message size
10. `plot10_algorithm_bandwidth.png` - Line plot of algorithm bandwidth vs message size
11. `plot11_bus_bandwidth.png` - Line plot of bus bandwidth vs message size
12. `plot12_nccl_summary.png` - Combined percentage summary and total latency

## Key Metrics Analyzed

**GPU Metrics:**
- `computation_time` - Time spent in computation
- `total_comm_time` - Total communication time
- `exposed_comm_time` - Non-overlapped communication time
- `idle_time` - GPU idle time
- `total_memcpy_time` - Memory copy time
- `exposed_memcpy_time` - Non-overlapped memory copy time
- `busy_time` - Total GPU busy time
- `total_time` - Total execution time
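
One derived number worth watching is how much communication is actually hidden behind compute. A sketch under the definitions above (column names as listed; the file path and exact sheet layout are assumptions):

```python
import pandas as pd

# GPU timeline comparison workbook from Step 3; the path is illustrative.
df = pd.read_excel("comparison_results/gpu_timeline_comparison.xlsx")

# exposed_comm_time is the non-overlapped part of total_comm_time,
# so the fraction of communication hidden behind compute is 1 - exposed/total.
df["comm_overlap_frac"] = 1.0 - df["exposed_comm_time"] / df["total_comm_time"]
print(df[["exposed_comm_time", "total_comm_time", "comm_overlap_frac"]].head())
```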

**NCCL Metrics:**
- `comm_latency_mean` - Average communication latency
- `algo bw (GB/s)_mean` - Algorithm bandwidth
- `bus bw (GB/s)_mean` - Bus bandwidth
- `Total comm latency (ms)` - Total communication latency
- `count` - Number of operations
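
The `percent_change_*` columns in the comparison sheets follow the sign convention used by `add_collective_comparison.py` (positive is always an improvement). A simplified sketch of that convention:

```python
def percent_change(baseline: float, test: float, metric: str) -> float:
    """Sign convention used by the comparison sheets:
    - latency/time metrics: positive means the test run is faster (less time)
    - bandwidth metrics:    positive means the test run is better (more GB/s)
    """
    if baseline == 0:
        return 0.0
    if "latency" in metric.lower() or "time" in metric.lower():
        return (baseline - test) / baseline * 100
    return (test - baseline) / baseline * 100

# A latency drop from 2.0 ms to 1.5 ms reads as +25 (faster).
print(percent_change(2.0, 1.5, "comm_latency_mean"))        # 25.0
# Bandwidth rising from 100 GB/s to 120 GB/s reads as +20 (better).
print(percent_change(100.0, 120.0, "bus bw (GB/s)_mean"))   # 20.0
```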

## Convert to PDF

1. Open `performance_analysis_report.html` in browser
2. Print to PDF (Ctrl+P or Cmd+P)
3. Choose landscape orientation for better plot visibility
170 changes: 170 additions & 0 deletions scripts/tracelens_single_config/add_collective_comparison.py
@@ -0,0 +1,170 @@
#!/usr/bin/env python3
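"""Add baseline-vs-test comparison sheets to a combined collective workbook.

The input workbook holds NCCL summary sheets whose rows are tagged with a
'source' column (baseline first, then test). Summary sheets are copied
through unchanged; for 'nccl_summary_implicit_sync' and 'nccl_summary_long'
a shortened '*_cmp' sheet is appended with diff, percent-change and ratio
columns, plus a red-white-green color scale on the percent-change columns.
"""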
import pandas as pd
import argparse
from openpyxl.utils import get_column_letter
from openpyxl.formatting.rule import ColorScaleRule


def add_collective_comparison_sheets(input_path, output_path, baseline_label='baseline', test_label='test'):
print(f"Loading: {input_path}")
print(f" Baseline label: {baseline_label}")
print(f" Test label: {test_label}")

xl = pd.ExcelFile(input_path)

with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
# Copy only summary sheets
for sheet_name in xl.sheet_names:
# Only keep sheets with 'summary' in the name
if 'summary' not in sheet_name.lower():
print(f" Skip {sheet_name} (keeping only summary sheets)")
continue
df = pd.read_excel(input_path, sheet_name=sheet_name)
df.to_excel(writer, sheet_name=sheet_name, index=False)
print(f" Copied {sheet_name}")

# Process summary sheets for comparison
for sheet_name in ['nccl_summary_implicit_sync', 'nccl_summary_long']:
if sheet_name not in xl.sheet_names:
continue

df = pd.read_excel(input_path, sheet_name=sheet_name)

# Get actual source values from the dataframe
sources = df['source'].unique()
# Determine which is baseline and which is test (baseline should be first)
if len(sources) >= 2:
actual_baseline = sources[0]
actual_test = sources[1]
else:
actual_baseline = baseline_label
actual_test = test_label

# Separate baseline and test
baseline_df = df[df['source'] == actual_baseline].copy()
test_df = df[df['source'] == actual_test].copy()

if len(baseline_df) == 0 or len(test_df) == 0:
print(f" Skip {sheet_name} - missing data")
continue

# Create comparison dataframe
comparison = pd.DataFrame()

# Identify key columns for grouping
group_cols = ['Collective name', 'dtype', 'In msg nelems']
if not all(col in baseline_df.columns for col in group_cols):
group_cols = ['Collective name']

# Group and compare
baseline_grouped = baseline_df.groupby(group_cols, as_index=False)
test_grouped = test_df.groupby(group_cols, as_index=False)

for name, base_group in baseline_grouped:
# Find matching test group
if isinstance(name, tuple):
mask = pd.Series([True] * len(test_df), index=test_df.index)
for col, val in zip(group_cols, name):
mask = mask & (test_df[col] == val)
else:
mask = (test_df[group_cols[0]] == name)

test_group = test_df.loc[mask]

if len(test_group) == 0:
continue

# Create comparison row
comp_row = {}

# Copy grouping columns
if isinstance(name, tuple):
for col, val in zip(group_cols, name):
comp_row[col] = val
else:
comp_row[group_cols[0]] = name

# Compare numeric columns
numeric_cols = ['comm_latency_mean', 'algo bw (GB/s)_mean', 'bus bw (GB/s)_mean',
'Total comm latency (ms)', 'count']

for col in numeric_cols:
if col not in base_group.columns or col not in test_group.columns:
continue

base_val = base_group[col].values[0]
test_val = test_group[col].values[0]

comp_row[f'{baseline_label}_{col}'] = base_val
comp_row[f'{test_label}_{col}'] = test_val
comp_row[f'diff_{col}'] = test_val - base_val

# For latency/time: positive percent_change means faster (less time)
# For bandwidth: positive percent_change means better (more bandwidth)
if 'latency' in col.lower() or 'time' in col.lower():
                        # Lower is better - positive when the test run is faster
pct_change = (base_val - test_val) / base_val * 100 if base_val != 0 else 0
comp_row[f'percent_change_{col}'] = pct_change
elif 'bw' in col.lower() or 'bandwidth' in col.lower():
                        # Higher is better - positive when the test run is better
pct_change = (test_val - base_val) / base_val * 100 if base_val != 0 else 0
comp_row[f'percent_change_{col}'] = pct_change

comp_row[f'ratio_{col}'] = test_val / base_val if base_val != 0 else 0

comparison = pd.concat([comparison, pd.DataFrame([comp_row])], ignore_index=True)

# Write comparison sheet (shorten name to fit Excel's 31 char limit)
            # Replace the 'nccl_summary_' prefix with 'nccl_' and append '_cmp'
comparison_sheet_name = sheet_name.replace('nccl_summary_', 'nccl_') + '_cmp'
comparison.to_excel(writer, sheet_name=comparison_sheet_name, index=False)
print(f" Added {comparison_sheet_name}")

# Add conditional formatting to percent_change columns
print(f" Applying conditional formatting to {comparison_sheet_name}...")

ws = writer.sheets[comparison_sheet_name]

# Format all percent_change columns with color scale
for col_idx, col in enumerate(comparison.columns, start=1):
if 'percent_change' in col:
                    # Convert the 1-based column index to an Excel letter (A, B, ..., AA, ...)
                    col_letter = get_column_letter(col_idx)

data_range = f'{col_letter}2:{col_letter}{len(comparison)+1}'

# Color scale: red (min/negative) -> white (0) -> green (max/positive)
ws.conditional_formatting.add(data_range,
ColorScaleRule(
start_type='min', start_color='F8696B', # Red
mid_type='num', mid_value=0, mid_color='FFFFFF', # White
end_type='max', end_color='63BE7B' # Green
))

print(f" Formatted {col}")

print(f"\nSaved: {output_path}")
print("\nNew comparison sheets added")
print("percent_change interpretation:")
print(" For latency/time: Positive = faster (less time)")
print(" For bandwidth: Positive = better (more bandwidth)")
return 0


def main():
parser = argparse.ArgumentParser(description='Add comparison sheets to combined collective reports')
parser.add_argument('--input', required=True, help='Input combined collective Excel file')
parser.add_argument('--output', required=True, help='Output Excel file with comparison sheets')
parser.add_argument('--baseline-label', default='baseline', help='Label for baseline data')
parser.add_argument('--test-label', default='test', help='Label for test data')

args = parser.parse_args()

return add_collective_comparison_sheets(args.input, args.output, args.baseline_label, args.test_label)


if __name__ == '__main__':
exit(main())