4 changes: 1 addition & 3 deletions docker/docker-compose.rocm70_9-1.yaml
@@ -3,7 +3,7 @@ services:
container_name: training-overlap-bugs-rocm70
build:
context: .
dockerfile: Dockerfile.rocm70
dockerfile: Dockerfile.rocm70_9-1
user: root
privileged: true
network_mode: host
@@ -15,8 +15,6 @@ services:
security_opt:
- seccomp=unconfined
environment:
- RCCL_FOLDER=/rccl
- LD_LIBRARY_PATH=/rccl/build/release:$LD_LIBRARY_PATH
- TORCH_NCCL_HIGH_PRIORITY=1

volumes:
166 changes: 166 additions & 0 deletions scripts/tracelens_single_config/README.md
@@ -0,0 +1,166 @@
# RCCL Warp Speed Performance Testing

Tests the RCCL `warp_speed_v1` branch from https://github.com/mustafabar/rccl.git.

## Prerequisites

```bash
pip install pandas openpyxl matplotlib seaborn numpy
```

## Run Tests

### Step 1: Start Container and Build RCCL

```bash
cd docker
docker-compose -f docker-compose.rocm70_9-1.yaml build
docker-compose -f docker-compose.rocm70_9-1.yaml up -d
docker-compose -f docker-compose.rocm70_9-1.yaml exec torchenv-rocm70 bash

# Inside container - build warp_speed_v1 (always rebuild)
# Note: Set --amdgpu_targets to match your GPU architecture
# Run 'rocminfo | grep gfx' to find your GPU target (e.g., gfx942, gfx950)
cd /opt
if [ -d "rccl" ]; then
    cd rccl
    git checkout warp_speed_v1
    git pull
else
    git clone --recursive https://github.com/mustafabar/rccl.git
    cd rccl
    git checkout warp_speed_v1
fi
./install.sh -l --amdgpu_targets=gfx950

cd /workspace/aorta
```
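
After the build, it can be worth confirming which RCCL PyTorch actually loads before running the tests. A minimal sanity-check sketch (run inside the container; on ROCm builds the NCCL API reports the RCCL version, and whether the freshly built library is the one picked up depends on where `install.sh` placed it and on your library search path):

```python
# Sanity check from inside the container's Python environment.
# torch.cuda.nccl.version() reports the NCCL/RCCL version PyTorch has loaded;
# compare it against the warp_speed_v1 build you just installed.
import torch

print("GPUs visible:", torch.cuda.device_count())
print("RCCL/NCCL version seen by PyTorch:", torch.cuda.nccl.version())
```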

### Step 2: Run RCCL Tests

```bash
# Default 3 configurations
./scripts/tracelens_single_config/run_rccl_warp_speed_comparison.sh

# Custom configurations (CU_count,threads pairs)
./scripts/tracelens_single_config/run_rccl_warp_speed_comparison.sh -p "56,256 37,384 32,512" -c ./config/single_node/gemm_overlap_comm.yaml
```

Output structure:
```
experiments/
  rccl_warp_speed_YYYYMMDD_HHMMSS/
    56cu_256threads/
      torch_profiler/      # Raw profiler traces
      run_output.log       # Training output log
    37cu_384threads/
    32cu_512threads/
    rccl_warp_speed_summary_YYYYMMDD_HHMMSS.txt
```

### Step 3: Generate Reports (Outside Container)

```bash
# Exit container
exit

# Run complete analysis
python scripts/tracelens_single_config/run_full_analysis.py \
    --baseline experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/56cu_256threads \
    --test experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/37cu_384threads \
    --output comparison_results \
    --all

# Or skip TraceLens if already done
python scripts/tracelens_single_config/run_full_analysis.py \
    --baseline experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/56cu_256threads \
    --test experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/37cu_384threads \
    --output comparison_results \
    --all --skip-tracelens
```

## Generated Excel Reports

### Individual TraceLens Reports (per configuration)
Each configuration generates:
- `tracelens_analysis/individual_reports/perf_rank*.xlsx` - Per-rank performance breakdown
- `tracelens_analysis/collective_reports/collective_all_ranks.xlsx` - Collective operations summary
- `tracelens_analysis/gpu_timeline_summary_mean.xlsx` - GPU timeline averages
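
These per-configuration workbooks can also be inspected directly with pandas; a minimal sketch (the experiment path and rank are illustrative, adjust to your run):

```python
import pandas as pd

# Per-rank report for one configuration; adjust the experiment directory and rank.
path = ("experiments/rccl_warp_speed_YYYYMMDD_HHMMSS/56cu_256threads/"
        "tracelens_analysis/individual_reports/perf_rank0.xlsx")

xl = pd.ExcelFile(path)
print("Sheets:", xl.sheet_names)

# Peek at the first sheet.
df = xl.parse(xl.sheet_names[0])
print(df.head())
```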

### Final Analysis Report (`final_analysis_report.xlsx`)

Contains multiple sheets:

**Summary Sheets:**
- `Summary_Dashboard` - High-level comparison metrics with percentage changes
- `Summary_Comparison` - Side-by-side summary comparison
- `GPU_ByRank_Comparison` - Detailed per-rank performance comparison
- `Comparison_By_Rank` - Rank-wise metric comparison with differences

**GPU Timeline Sheets:**
- `All_Ranks_Combined` - Combined GPU timeline data from all ranks
- `Summary` - Aggregated GPU timeline summary
- `Rank_*` - Individual rank GPU timelines

**Collective/NCCL Sheets:**
- `nccl_summary_implicit_sync` - NCCL operations with implicit synchronization
- `nccl_summary_long` - Long-running NCCL operations
- `nccl_implicit_sync_cmp` - Comparison of implicit sync operations (sheet names are shortened to fit Excel's 31-character limit)
- `nccl_long_cmp` - Comparison of long-running operations

**Raw Data Sheets (hidden by default):**
- `gpu_timeline_combined` - Raw combined GPU timeline data
- `gpu_timeline_comparison` - Raw GPU timeline comparison data
- `collective_combined` - Raw collective operations data
- `collective_comparison` - Raw collective comparison data
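
Because these sheets ship hidden, a small openpyxl sketch to make them visible again (workbook and sheet names as listed above):

```python
from openpyxl import load_workbook

wb = load_workbook("final_analysis_report.xlsx")
for name in ("gpu_timeline_combined", "gpu_timeline_comparison",
             "collective_combined", "collective_comparison"):
    if name in wb.sheetnames:
        wb[name].sheet_state = "visible"  # was 'hidden'
wb.save("final_analysis_report_unhidden.xlsx")
```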

### Comparison Reports

- `gpu_timeline_combined.xlsx` - Baseline and test GPU metrics combined
- `gpu_timeline_comparison.xlsx` - GPU metrics with comparison analysis
- `collective_combined.xlsx` - Baseline and test collective operations combined
- `collective_comparison.xlsx` - Collective operations with comparison analysis

## Generated Visualizations

### HTML Report
- `performance_analysis_report.html` - Complete report with all embedded plots

### Individual Plot Files (12 Total)
1. `plot1_percentage_change_overview.png` - Horizontal bar chart showing performance changes
2. `plot2_absolute_time_comparison.png` - Bar chart comparing absolute times
3. `plot3_performance_heatmap.png` - Heatmap of performance by rank
4. `plot4_total_execution_time.png` - Line plot of total execution time per rank
5. `plot5_computation_time.png` - Line plot of computation time across ranks
6. `plot6_communication_time.png` - Line plot of communication time across ranks
7. `plot7_idle_time.png` - Line plot of idle time across ranks
8. `plot8_percentage_difference_all_metrics.png` - Bar plot showing percentage differences for all metrics
9. `plot9_nccl_latency.png` - Line plot of latency vs message size
10. `plot10_algorithm_bandwidth.png` - Line plot of algorithm bandwidth vs message size
11. `plot11_bus_bandwidth.png` - Line plot of bus bandwidth vs message size
12. `plot12_nccl_summary.png` - Combined percentage summary and total latency

## Key Metrics Analyzed

**GPU Metrics:**
- `computation_time` - Time spent in computation
- `total_comm_time` - Total communication time
- `exposed_comm_time` - Non-overlapped communication time
- `idle_time` - GPU idle time
- `total_memcpy_time` - Memory copy time
- `exposed_memcpy_time` - Non-overlapped memory copy time
- `busy_time` - Total GPU busy time
- `total_time` - Total execution time
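
One derived number worth watching is how much communication is actually hidden behind compute. A sketch under the definitions above (column names as listed; the file path and exact sheet layout are assumptions):

```python
import pandas as pd

# GPU timeline comparison workbook from Step 3; the path is illustrative.
df = pd.read_excel("comparison_results/gpu_timeline_comparison.xlsx")

# exposed_comm_time is the non-overlapped part of total_comm_time,
# so the fraction of communication hidden behind compute is 1 - exposed/total.
df["comm_overlap_frac"] = 1.0 - df["exposed_comm_time"] / df["total_comm_time"]
print(df[["exposed_comm_time", "total_comm_time", "comm_overlap_frac"]].head())
```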

**NCCL Metrics:**
- `comm_latency_mean` - Average communication latency
- `algo bw (GB/s)_mean` - Algorithm bandwidth
- `bus bw (GB/s)_mean` - Bus bandwidth
- `Total comm latency (ms)` - Total communication latency
- `count` - Number of operations
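
The `percent_change_*` columns in the comparison sheets follow the sign convention used by `add_collective_comparison.py` (positive is always an improvement). A simplified sketch of that convention:

```python
def percent_change(baseline: float, test: float, metric: str) -> float:
    """Sign convention used by the comparison sheets:
    - latency/time metrics: positive means the test run is faster (less time)
    - bandwidth metrics:    positive means the test run is better (more GB/s)
    """
    if baseline == 0:
        return 0.0
    if "latency" in metric.lower() or "time" in metric.lower():
        return (baseline - test) / baseline * 100
    return (test - baseline) / baseline * 100

# A latency drop from 2.0 ms to 1.5 ms reads as +25 (faster).
print(percent_change(2.0, 1.5, "comm_latency_mean"))        # 25.0
# Bandwidth rising from 100 GB/s to 120 GB/s reads as +20 (better).
print(percent_change(100.0, 120.0, "bus bw (GB/s)_mean"))   # 20.0
```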

## Convert to PDF

1. Open `performance_analysis_report.html` in browser
2. Print to PDF (Ctrl+P or Cmd+P)
3. Choose landscape orientation for better plot visibility
170 changes: 170 additions & 0 deletions scripts/tracelens_single_config/add_collective_comparison.py
@@ -0,0 +1,170 @@
#!/usr/bin/env python3
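"""Add baseline-vs-test comparison sheets to a combined collective workbook.

The input workbook holds NCCL summary sheets whose rows are tagged with a
'source' column (baseline first, then test). Summary sheets are copied
through unchanged; for 'nccl_summary_implicit_sync' and 'nccl_summary_long'
a shortened '*_cmp' sheet is appended with diff, percent-change and ratio
columns, plus a red-white-green color scale on the percent-change columns.
"""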
import pandas as pd
import argparse
from openpyxl.utils import get_column_letter
from openpyxl.formatting.rule import ColorScaleRule


def add_collective_comparison_sheets(input_path, output_path, baseline_label='baseline', test_label='test'):
print(f"Loading: {input_path}")
print(f" Baseline label: {baseline_label}")
print(f" Test label: {test_label}")

xl = pd.ExcelFile(input_path)

with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
# Copy only summary sheets
for sheet_name in xl.sheet_names:
# Only keep sheets with 'summary' in the name
if 'summary' not in sheet_name.lower():
print(f" Skip {sheet_name} (keeping only summary sheets)")
continue
df = pd.read_excel(input_path, sheet_name=sheet_name)
df.to_excel(writer, sheet_name=sheet_name, index=False)
print(f" Copied {sheet_name}")

# Process summary sheets for comparison
for sheet_name in ['nccl_summary_implicit_sync', 'nccl_summary_long']:
if sheet_name not in xl.sheet_names:
continue

df = pd.read_excel(input_path, sheet_name=sheet_name)

# Get actual source values from the dataframe
sources = df['source'].unique()
# Determine which is baseline and which is test (baseline should be first)
if len(sources) >= 2:
actual_baseline = sources[0]
actual_test = sources[1]
else:
actual_baseline = baseline_label
actual_test = test_label

# Separate baseline and test
baseline_df = df[df['source'] == actual_baseline].copy()
test_df = df[df['source'] == actual_test].copy()

if len(baseline_df) == 0 or len(test_df) == 0:
print(f" Skip {sheet_name} - missing data")
continue

# Create comparison dataframe
comparison = pd.DataFrame()

# Identify key columns for grouping
group_cols = ['Collective name', 'dtype', 'In msg nelems']
if not all(col in baseline_df.columns for col in group_cols):
group_cols = ['Collective name']

# Group and compare
baseline_grouped = baseline_df.groupby(group_cols, as_index=False)
test_grouped = test_df.groupby(group_cols, as_index=False)

for name, base_group in baseline_grouped:
# Find matching test group
if isinstance(name, tuple):
mask = pd.Series([True] * len(test_df), index=test_df.index)
for col, val in zip(group_cols, name):
mask = mask & (test_df[col] == val)
else:
mask = (test_df[group_cols[0]] == name)

test_group = test_df.loc[mask]

if len(test_group) == 0:
continue

# Create comparison row
comp_row = {}

# Copy grouping columns
if isinstance(name, tuple):
for col, val in zip(group_cols, name):
comp_row[col] = val
else:
comp_row[group_cols[0]] = name

# Compare numeric columns
numeric_cols = ['comm_latency_mean', 'algo bw (GB/s)_mean', 'bus bw (GB/s)_mean',
'Total comm latency (ms)', 'count']

for col in numeric_cols:
if col not in base_group.columns or col not in test_group.columns:
continue

base_val = base_group[col].values[0]
test_val = test_group[col].values[0]

comp_row[f'{baseline_label}_{col}'] = base_val
comp_row[f'{test_label}_{col}'] = test_val
comp_row[f'diff_{col}'] = test_val - base_val

# For latency/time: positive percent_change means faster (less time)
# For bandwidth: positive percent_change means better (more bandwidth)
if 'latency' in col.lower() or 'time' in col.lower():
                        # Lower is better - positive when the test run is faster
pct_change = (base_val - test_val) / base_val * 100 if base_val != 0 else 0
comp_row[f'percent_change_{col}'] = pct_change
elif 'bw' in col.lower() or 'bandwidth' in col.lower():
                        # Higher is better - positive when the test run is better
pct_change = (test_val - base_val) / base_val * 100 if base_val != 0 else 0
comp_row[f'percent_change_{col}'] = pct_change

comp_row[f'ratio_{col}'] = test_val / base_val if base_val != 0 else 0

comparison = pd.concat([comparison, pd.DataFrame([comp_row])], ignore_index=True)

# Write comparison sheet (shorten name to fit Excel's 31 char limit)
            # Replace the 'nccl_summary_' prefix with 'nccl_' and append '_cmp'
comparison_sheet_name = sheet_name.replace('nccl_summary_', 'nccl_') + '_cmp'
comparison.to_excel(writer, sheet_name=comparison_sheet_name, index=False)
print(f" Added {comparison_sheet_name}")

# Add conditional formatting to percent_change columns
print(f" Applying conditional formatting to {comparison_sheet_name}...")

ws = writer.sheets[comparison_sheet_name]

# Format all percent_change columns with color scale
for col_idx, col in enumerate(comparison.columns, start=1):
if 'percent_change' in col:
                    # Convert the 1-based column index to an Excel letter (A, B, ..., AA, ...)
                    col_letter = get_column_letter(col_idx)

data_range = f'{col_letter}2:{col_letter}{len(comparison)+1}'

# Color scale: red (min/negative) -> white (0) -> green (max/positive)
ws.conditional_formatting.add(data_range,
ColorScaleRule(
start_type='min', start_color='F8696B', # Red
mid_type='num', mid_value=0, mid_color='FFFFFF', # White
end_type='max', end_color='63BE7B' # Green
))

print(f" Formatted {col}")

print(f"\nSaved: {output_path}")
print("\nNew comparison sheets added")
print("percent_change interpretation:")
print(" For latency/time: Positive = faster (less time)")
print(" For bandwidth: Positive = better (more bandwidth)")
return 0


def main():
parser = argparse.ArgumentParser(description='Add comparison sheets to combined collective reports')
parser.add_argument('--input', required=True, help='Input combined collective Excel file')
parser.add_argument('--output', required=True, help='Output Excel file with comparison sheets')
parser.add_argument('--baseline-label', default='baseline', help='Label for baseline data')
parser.add_argument('--test-label', default='test', help='Label for test data')

args = parser.parse_args()

return add_collective_comparison_sheets(args.input, args.output, args.baseline_label, args.test_label)


if __name__ == '__main__':
exit(main())