Merged
27 commits
9a230e4
Installing amdgpu from local path
Jan 5, 2026
c15d9b4
Adding runner on Aorta
Jan 5, 2026
48d168b
Changed piped password to use -p
Jan 5, 2026
f16b73e
Corrected volume mounts
Jan 5, 2026
a09697f
Changed runner for visualization workflow
Jan 6, 2026
db5e574
Merge vis and sweep run together
Jan 6, 2026
49db3a0
Corrected yml
Jan 6, 2026
0ec0264
Corrected yml
Jan 6, 2026
e11d583
Commented docker down : Temporary
Jan 6, 2026
6413774
Moved script invocation inside docker : Temporary
Jan 6, 2026
6194ac8
Moved all script invocation inside docker : Temporary
Jan 6, 2026
be0560d
Rccl wrap speed test yml
Jan 7, 2026
8740ea0
Change rccl path in docker path
Jan 7, 2026
21eda7d
Stop already existing docker
Jan 7, 2026
cc61ec4
Invoke script using bash instead of directly calling the script
Jan 7, 2026
903a15a
Merged two jobs into 1 due to permission issues in local runner
Jan 7, 2026
f6d9f04
Fix bug in invoking tracelens analysis
Jan 7, 2026
07c6360
Adding missing script for tracelens single config
Jan 7, 2026
b9727c6
Fixed pre-commit issues
Jan 7, 2026
2ba912f
Adding reports to aorta-report
Jan 9, 2026
c028246
Addressed review comments for gemm-sweep
Jan 9, 2026
c679720
Push the rccl test results to aorta-report
Jan 10, 2026
f9b535b
Commented push/pull invocation and cleaning for pre-commit fails
Jan 10, 2026
5ca5b1b
Removed unused docker compose inside rccl_test
Jan 12, 2026
d90052d
Replaced manual argument parsing with argparser
Jan 12, 2026
c51ee61
Removed the push/pull dependency introduced for testing
Jan 12, 2026
a15079a
Pre-commit fixes
Jan 12, 2026
153 changes: 62 additions & 91 deletions .github/workflows/gemm-sweep-analysis.yml
@@ -22,12 +22,7 @@ on:
description: 'Number of top GEMM kernels to extract'
required: false
default: '5'
push:
branches:
- main
paths:
- 'scripts/gemm_analysis/**'
- 'config/gemm_overlap/**'


env:
DOCKER_COMPOSE_FILE: docker/docker-compose.rocm70_9-1.yaml
@@ -36,29 +31,35 @@ env:
jobs:
gemm-sweep:
name: Run GEMM Sweep Profiling
runs-on: [self-hosted, gpu, rocm]
runs-on: self-hosted
Review comment (Contributor): I hope you are not planning to keep self-hosted in the final merge. Please change it to the runner machine name once you get it.

Reply (Collaborator Author): For now I am keeping self-hosted, as we have not received the runner yet.

timeout-minutes: 180
outputs:
sweep_dir: ${{ steps.setup.outputs.sweep_dir }}

steps:
- name: Checkout repository
- name: Checkout AORTA repository
uses: actions/checkout@v4
with:
repository: ROCm/aorta
ref : prosenj_gh_action
path: aorta

- name: Set up experiment directory
id: setup
working-directory: aorta
run: |
SWEEP_DIR="experiments/sweep_$(date +%Y%m%d_%H%M%S)"
echo "sweep_dir=$SWEEP_DIR" >> $GITHUB_OUTPUT
mkdir -p $SWEEP_DIR
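
The setup step above can be sketched as a standalone shell function. This is a minimal sketch, not the workflow's actual code: `make_sweep_dir` is a hypothetical helper name, and quoting the variables is an addition that the diff does not show. `GITHUB_OUTPUT` is the file path that GitHub Actions provides for step outputs.

```shell
# Hypothetical helper: create a timestamped sweep directory and publish
# its path as the step output "sweep_dir".
make_sweep_dir() {
  local base="${1:-experiments}"
  local sweep_dir
  sweep_dir="${base}/sweep_$(date +%Y%m%d_%H%M%S)"
  mkdir -p "$sweep_dir"
  # GITHUB_OUTPUT is set by the Actions runner; fall back to /dev/null
  # so the function also runs outside a workflow.
  echo "sweep_dir=${sweep_dir}" >> "${GITHUB_OUTPUT:-/dev/null}"
  # Print the path so callers can capture it.
  echo "$sweep_dir"
}
```

Quoting matters mainly if the base path ever contains spaces; with the fixed `experiments/` prefix used in the diff, the unquoted form behaves the same.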

- name: Build Docker container
working-directory: docker
working-directory: aorta
run: |
docker compose version
docker login -u rocmshared -p ${{ secrets.ROCM_SHARED_KEY }}
docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} build
docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} up -d
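
Passing the secret with `-p` places it on the command line, where it can be visible in the process list on the runner. As a hedged alternative, `docker login` also accepts the password on standard input via `--password-stdin`. The sketch below assumes the secret is exposed to the step as an environment variable named `ROCM_SHARED_KEY` (an assumption; the diff interpolates the secret directly into the command), and `registry_login` is a hypothetical helper name. The commit history notes that an earlier piped form was replaced with `-p`, so this variant may not suit the runner in question.

```shell
# Hypothetical helper: log in without putting the secret on the command line.
# Assumes ROCM_SHARED_KEY is provided through the step's env mapping.
registry_login() {
  local user="$1"
  if [ -z "${ROCM_SHARED_KEY:-}" ]; then
    echo "ROCM_SHARED_KEY is not set" >&2
    return 1
  fi
  # printf avoids the trailing newline that echo would append.
  printf '%s' "$ROCM_SHARED_KEY" | docker login -u "$user" --password-stdin
}

# Usage inside the step: registry_login rocmshared
```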

- name: Run training sweep
working-directory: aorta
run: |
docker exec ${{ env.CONTAINER_NAME }} bash -c "
bash scripts/gemm_analysis/run_train_various_channels.sh \
@@ -69,13 +70,15 @@ jobs:
"

- name: Generate TraceLens reports
working-directory: aorta
run: |
docker exec ${{ env.CONTAINER_NAME }} bash -c "
pip install -r requirements.txt && \
bash scripts/gemm_analysis/run_tracelens_analysis.sh ${{ steps.setup.outputs.sweep_dir }}
"

- name: Extract top GEMM kernels
working-directory: aorta
run: |
# Parse channels and threads into space-separated format
CHANNELS=$(echo "${{ github.event.inputs.channels || '28,56' }}" | tr ',' ' ')
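
The comma-to-space conversion above can be tried in isolation. In this sketch `to_space_list` is a hypothetical helper, and the default `28,56` mirrors the fallback in the workflow expression. It additionally strips stray whitespace, which the workflow line does not do.

```shell
# Hypothetical helper: turn "28,56" into "28 56", using a default when
# no input value is supplied.
to_space_list() {
  local raw="${1:-}"
  local default="$2"
  printf '%s' "${raw:-$default}" | tr -d '[:space:]' | tr ',' ' '
}

# An empty first argument stands in for a missing workflow input.
CHANNELS=$(to_space_list "" "28,56")
echo "$CHANNELS"
```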
@@ -94,128 +97,96 @@ jobs:
uses: actions/upload-artifact@v4
with:
name: gemm-sweep-results
path: ${{ steps.setup.outputs.sweep_dir }}
path: aorta/${{ steps.setup.outputs.sweep_dir }}
retention-days: 30

- name: Cleanup Docker container
if: always()
run: |
docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} down || true

visualization:
name: Generate Visualizations and Reports
needs: gemm-sweep
runs-on: [self-hosted, gpu, rocm]
timeout-minutes: 60
env:
SWEEP_DIR: ${{ needs.gemm-sweep.outputs.sweep_dir }}

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Download sweep results
uses: actions/download-artifact@v4
with:
name: gemm-sweep-results
path: ${{ env.SWEEP_DIR }}

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
cache: 'pip'

- name: Install Python dependencies
working-directory: aorta
run: |
pip install -r requirements.txt
docker exec ${{ env.CONTAINER_NAME }} bash -c "pip install -r requirements.txt"

- name: Generate variance plots
working-directory: aorta
run: |
docker exec ${{ env.CONTAINER_NAME }} bash -c "
python scripts/gemm_analysis/plot_gemm_variance.py \
--csv-path ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
--output-dir ${{ env.SWEEP_DIR }}/tracelens_analysis/plots
--csv-path ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
--output-dir ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/plots"

- name: Add timestamp information
working-directory: aorta
run: |
docker exec ${{ env.CONTAINER_NAME }} bash -c "
python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \
--input-csv ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
--base-path ${{ env.SWEEP_DIR }}
--input-csv ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \
--base-path ${{ steps.setup.outputs.sweep_dir }}"

- name: Analyze collective overlap
working-directory: aorta
run: |
docker exec ${{ env.CONTAINER_NAME }} bash -c "
python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \
--input-csv ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \
--tracelens-path ${{ env.SWEEP_DIR }}/tracelens_analysis
--input-csv ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \
--tracelens-path ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis"

- name: Process GPU timeline
working-directory: aorta
run: |
docker exec ${{ env.CONTAINER_NAME }} bash -c "
python scripts/gemm_analysis/process_gpu_timeline.py \
--sweep-dir ${{ env.SWEEP_DIR }}
--sweep-dir ${{ steps.setup.outputs.sweep_dir }}"

- name: Process NCCL communication data
working-directory: aorta
run: |
docker exec ${{ env.CONTAINER_NAME }} bash -c "
python scripts/gemm_analysis/process_comms.py \
--sweep-dir ${{ env.SWEEP_DIR }}
--sweep-dir ${{ steps.setup.outputs.sweep_dir }}"

- name: Stop Docker container
if: always()
working-directory: aorta
run: |
docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} down

- name: Upload analysis results
uses: actions/upload-artifact@v4
with:
name: gemm-analysis-results
path: |
${{ env.SWEEP_DIR }}/tracelens_analysis/plots/
${{ env.SWEEP_DIR }}/tracelens_analysis/*.csv
${{ env.SWEEP_DIR }}/tracelens_analysis/*.xlsx
aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/plots/
aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/*.csv
aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/*.xlsx
retention-days: 30

comparison-report:
name: Generate Comparison Report
needs: [gemm-sweep, visualization]
runs-on: ubuntu-latest
if: github.event_name == 'workflow_dispatch'
env:
SWEEP_DIR: ${{ needs.gemm-sweep.outputs.sweep_dir }}

steps:
- name: Checkout repository
- name: Checkout aorta-report repository
uses: actions/checkout@v4

- name: Download analysis results
uses: actions/download-artifact@v4
with:
name: gemm-analysis-results
path: ${{ env.SWEEP_DIR }}/tracelens_analysis

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.10'
cache: 'pip'
repository: ROCm/aorta-report
ref: main
token: ${{ secrets.AORTA_REPORT_GITHUB_TOKEN }}
path: aorta-report

- name: Install Python dependencies
- name: Create date directory and copy sweep results
run: |
pip install -r requirements.txt
date=$(date '+%Y-%m-%d')
mkdir -p aorta-report/${date}/gemm-sweep
cp -r aorta/${{ steps.setup.outputs.sweep_dir }}/* aorta-report/${date}/gemm-sweep/
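
The copy step above can be sketched as a small function. `publish_results` is a hypothetical name and the paths are illustrative. Two choices here differ from the diff and are assumptions about what is wanted: the date is passed in, so that the later `git add` step can reuse the same value even if the job crosses midnight, and `cp -r src/.` is used so that hidden files are copied as well.

```shell
# Hypothetical helper: copy one sweep directory into a dated folder of
# the report checkout. Arguments: source dir, report root, date string.
publish_results() {
  local src="$1" report_root="$2" day="$3"
  local dest="${report_root}/${day}/gemm-sweep"
  if [ ! -d "$src" ]; then
    echo "missing sweep directory: $src" >&2
    return 1
  fi
  mkdir -p "$dest"
  # "src/." copies the directory contents, including dotfiles.
  cp -r "$src/." "$dest/"
  echo "$dest"
}
```

Failing early when the source directory is missing gives a clearer error than letting the `cp` glob fail.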

- name: Generate summary report
- name: Push results to aorta-report
working-directory: aorta-report
run: |
echo "## GEMM Sweep Analysis Summary" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Configuration" >> $GITHUB_STEP_SUMMARY
echo "- **Sweep Directory**: ${{ env.SWEEP_DIR }}" >> $GITHUB_STEP_SUMMARY
echo "- **Channels**: ${{ github.event.inputs.channels || '28,56' }}" >> $GITHUB_STEP_SUMMARY
echo "- **Threads**: ${{ github.event.inputs.threads || '256,512' }}" >> $GITHUB_STEP_SUMMARY
echo "- **Top-K Kernels**: ${{ github.event.inputs.top_k || '5' }}" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### Generated Artifacts" >> $GITHUB_STEP_SUMMARY
echo "- Variance plots (box plots, violin plots)" >> $GITHUB_STEP_SUMMARY
echo "- GEMM kernels with timestamps" >> $GITHUB_STEP_SUMMARY
echo "- Collective overlap analysis" >> $GITHUB_STEP_SUMMARY
echo "- GPU timeline data" >> $GITHUB_STEP_SUMMARY
echo "- NCCL communication data" >> $GITHUB_STEP_SUMMARY

- name: Upload final report
uses: actions/upload-artifact@v4
with:
name: gemm-final-report
path: ${{ env.SWEEP_DIR }}/
retention-days: 90
git config user.name "GitHub Actions Bot"
git config user.email "<>"
Review comment (Contributor): Is it allowed to leave the email empty and still push to GitHub?

Reply (Collaborator Author): It is working, and that is how we do it in the other shark test suites.

git pull --rebase origin main
date=$(date '+%Y-%m-%d')
git add ${date}
git commit -m "Add GEMM sweep results for ${date}"
git push origin main
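
If a run produces no new files, `git commit` exits with a non-zero status and the step fails. A minimal sketch of a guard follows. `commit_if_changed` is a hypothetical helper, and whether an empty run should count as success is an assumption about the intended behavior.

```shell
# Hypothetical helper: stage a path and commit only when something changed.
# Arguments: path to stage, commit message. Run inside the report checkout.
commit_if_changed() {
  local path="$1" message="$2"
  git add -- "$path"
  # `git diff --cached --quiet` exits 0 when nothing is staged.
  if git diff --cached --quiet; then
    echo "no changes to commit"
    return 0
  fi
  git commit -q -m "$message"
}
```

In the push step this would replace the separate `git add` and `git commit` lines, with `git push` left unchanged after it.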