Enabling RCCL and GEMM sweep test on GH action #55
Changes from 25 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,12 +22,13 @@ on: | |
| description: 'Number of top GEMM kernels to extract' | ||
| required: false | ||
| default: '5' | ||
| push: | ||
| branches: | ||
| - main | ||
| paths: | ||
| - 'scripts/gemm_analysis/**' | ||
| - 'config/gemm_overlap/**' | ||
| #pull_request: | ||
| #push: | ||
| # branches: | ||
| # - main | ||
| # paths: | ||
| # - 'scripts/gemm_analysis/**' | ||
| # - 'config/gemm_overlap/**' | ||
|
|
||
| env: | ||
| DOCKER_COMPOSE_FILE: docker/docker-compose.rocm70_9-1.yaml | ||
|
|
@@ -36,29 +37,35 @@ env: | |
| jobs: | ||
| gemm-sweep: | ||
| name: Run GEMM Sweep Profiling | ||
| runs-on: [self-hosted, gpu, rocm] | ||
| runs-on: self-hosted | ||
|
Contributor
I hope you are not planning to keep `self-hosted` in the final merge. Please change it to the runner machine name once you get it.
Collaborator
Author
For now, I am keeping this as `self-hosted` because we have not received the runner yet. |
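The runner selection discussed in this thread could be sketched as below. This is a minimal illustration, not the final configuration: the `gpu` and `rocm` labels are taken from the removed `runs-on` line, the workflow name is hypothetical, and the labels the dedicated runner will eventually carry are an assumption.

```yaml
# Sketch only: the workflow name is hypothetical and the labels are assumptions.
name: runner-label-sketch
on: workflow_dispatch

jobs:
  gemm-sweep:
    name: Run GEMM Sweep Profiling
    # A bare `self-hosted` matches any self-hosted runner available to the repository.
    # Listing several labels restricts the job to runners that carry all of them.
    runs-on: [self-hosted, gpu, rocm]
    timeout-minutes: 180
    steps:
      - name: Show which machine picked up the job
        run: hostname
```

Once the dedicated machine is registered, its own label can replace or extend the list so that the job no longer lands on an arbitrary self-hosted runner.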
||
| timeout-minutes: 180 | ||
| outputs: | ||
| sweep_dir: ${{ steps.setup.outputs.sweep_dir }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| - name: Checkout AORTA repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| repository: ROCm/aorta | ||
| ref : prosenj_gh_action | ||
| path: aorta | ||
|
|
||
| - name: Set up experiment directory | ||
| id: setup | ||
| working-directory: aorta | ||
| run: | | ||
| SWEEP_DIR="experiments/sweep_$(date +%Y%m%d_%H%M%S)" | ||
| echo "sweep_dir=$SWEEP_DIR" >> $GITHUB_OUTPUT | ||
| mkdir -p $SWEEP_DIR | ||
|
|
||
| - name: Build Docker container | ||
| working-directory: docker | ||
| working-directory: aorta | ||
| run: | | ||
| docker compose version | ||
| docker login -u rocmshared -p ${{ secrets.ROCM_SHARED_KEY }} | ||
| docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} build | ||
| docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} up -d | ||
|
|
||
| - name: Run training sweep | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| bash scripts/gemm_analysis/run_train_various_channels.sh \ | ||
|
|
@@ -69,13 +76,15 @@ jobs: | |
| " | ||
|
|
||
| - name: Generate TraceLens reports | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| pip install -r requirements.txt && \ | ||
| bash scripts/gemm_analysis/run_tracelens_analysis.sh ${{ steps.setup.outputs.sweep_dir }} | ||
| " | ||
|
|
||
| - name: Extract top GEMM kernels | ||
| working-directory: aorta | ||
| run: | | ||
| # Parse channels and threads into space-separated format | ||
| CHANNELS=$(echo "${{ github.event.inputs.channels || '28,56' }}" | tr ',' ' ') | ||
|
|
@@ -94,128 +103,96 @@ jobs: | |
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: gemm-sweep-results | ||
| path: ${{ steps.setup.outputs.sweep_dir }} | ||
| path: aorta/${{ steps.setup.outputs.sweep_dir }} | ||
| retention-days: 30 | ||
|
|
||
| - name: Cleanup Docker container | ||
| if: always() | ||
| run: | | ||
| docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} down || true | ||
|
|
||
| visualization: | ||
| name: Generate Visualizations and Reports | ||
| needs: gemm-sweep | ||
| runs-on: [self-hosted, gpu, rocm] | ||
| timeout-minutes: 60 | ||
| env: | ||
| SWEEP_DIR: ${{ needs.gemm-sweep.outputs.sweep_dir }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Download sweep results | ||
| uses: actions/download-artifact@v4 | ||
| with: | ||
| name: gemm-sweep-results | ||
| path: ${{ env.SWEEP_DIR }} | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.10' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install Python dependencies | ||
| working-directory: aorta | ||
| run: | | ||
| pip install -r requirements.txt | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c "pip install -r requirements.txt" | ||
|
|
||
| - name: Generate variance plots | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/plot_gemm_variance.py \ | ||
| --csv-path ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \ | ||
| --output-dir ${{ env.SWEEP_DIR }}/tracelens_analysis/plots | ||
| --csv-path ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \ | ||
| --output-dir ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/plots" | ||
|
|
||
prosenjitdhole marked this conversation as resolved.
|
||
| - name: Add timestamp information | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \ | ||
| --input-csv ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \ | ||
| --base-path ${{ env.SWEEP_DIR }} | ||
| --input-csv ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \ | ||
| --base-path ${{ steps.setup.outputs.sweep_dir }}" | ||
|
|
||
| - name: Analyze collective overlap | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \ | ||
| --input-csv ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \ | ||
| --tracelens-path ${{ env.SWEEP_DIR }}/tracelens_analysis | ||
| --input-csv ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \ | ||
| --tracelens-path ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis" | ||
|
|
||
| - name: Process GPU timeline | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/process_gpu_timeline.py \ | ||
| --sweep-dir ${{ env.SWEEP_DIR }} | ||
| --sweep-dir ${{ steps.setup.outputs.sweep_dir }}" | ||
|
|
||
| - name: Process NCCL communication data | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/process_comms.py \ | ||
| --sweep-dir ${{ env.SWEEP_DIR }} | ||
| --sweep-dir ${{ steps.setup.outputs.sweep_dir }}" | ||
prosenjitdhole marked this conversation as resolved.
|
||
|
|
||
| - name: Stop Docker container | ||
| if: always() | ||
| working-directory: aorta | ||
| run: | | ||
| docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} down | ||
|
|
||
| - name: Upload analysis results | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: gemm-analysis-results | ||
| path: | | ||
| ${{ env.SWEEP_DIR }}/tracelens_analysis/plots/ | ||
| ${{ env.SWEEP_DIR }}/tracelens_analysis/*.csv | ||
| ${{ env.SWEEP_DIR }}/tracelens_analysis/*.xlsx | ||
| aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/plots/ | ||
| aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/*.csv | ||
| aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/*.xlsx | ||
| retention-days: 30 | ||
|
|
||
| comparison-report: | ||
| name: Generate Comparison Report | ||
| needs: [gemm-sweep, visualization] | ||
| runs-on: ubuntu-latest | ||
| if: github.event_name == 'workflow_dispatch' | ||
| env: | ||
| SWEEP_DIR: ${{ needs.gemm-sweep.outputs.sweep_dir }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| - name: Checkout aorta-report repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Download analysis results | ||
| uses: actions/download-artifact@v4 | ||
| with: | ||
| name: gemm-analysis-results | ||
| path: ${{ env.SWEEP_DIR }}/tracelens_analysis | ||
| repository: ROCm/aorta-report | ||
| ref: main | ||
| token: ${{ secrets.AORTA_REPORT_GITHUB_TOKEN }} | ||
| path: aorta-report | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.10' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install Python dependencies | ||
| - name: Create date directory and copy sweep results | ||
| run: | | ||
| pip install -r requirements.txt | ||
| date=$(date '+%Y-%m-%d') | ||
| mkdir -p aorta-report/${date}/gemm-sweep | ||
| cp -r aorta/${{ steps.setup.outputs.sweep_dir }}/* aorta-report/${date}/gemm-sweep/ | ||
|
|
||
| - name: Generate summary report | ||
| - name: Push results to aorta-report | ||
| working-directory: aorta-report | ||
| run: | | ||
| echo "## GEMM Sweep Analysis Summary" >> $GITHUB_STEP_SUMMARY | ||
| echo "" >> $GITHUB_STEP_SUMMARY | ||
| echo "### Configuration" >> $GITHUB_STEP_SUMMARY | ||
| echo "- **Sweep Directory**: ${{ env.SWEEP_DIR }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- **Channels**: ${{ github.event.inputs.channels || '28,56' }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- **Threads**: ${{ github.event.inputs.threads || '256,512' }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- **Top-K Kernels**: ${{ github.event.inputs.top_k || '5' }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "" >> $GITHUB_STEP_SUMMARY | ||
| echo "### Generated Artifacts" >> $GITHUB_STEP_SUMMARY | ||
| echo "- Variance plots (box plots, violin plots)" >> $GITHUB_STEP_SUMMARY | ||
| echo "- GEMM kernels with timestamps" >> $GITHUB_STEP_SUMMARY | ||
| echo "- Collective overlap analysis" >> $GITHUB_STEP_SUMMARY | ||
| echo "- GPU timeline data" >> $GITHUB_STEP_SUMMARY | ||
| echo "- NCCL communication data" >> $GITHUB_STEP_SUMMARY | ||
|
|
||
| - name: Upload final report | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: gemm-final-report | ||
| path: ${{ env.SWEEP_DIR }}/ | ||
| retention-days: 90 | ||
| git config user.name "GitHub Actions Bot" | ||
| git config user.email "<>" | ||
|
Contributor
Is it allowed to leave the email empty and still push to GitHub?
Collaborator
Author
It is working, and that is how we do it in the other shark test suites. |
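If an explicit identity is preferred over the empty email, one possible variant of the push step is sketched below. It assumes the commit should be attributed to the `github-actions[bot]` account through its noreply address. The job name and workflow name are hypothetical, and the guard around `git commit` is an addition that is not part of the PR.

```yaml
# Sketch only: workflow and job names are hypothetical.
name: push-report-sketch
on: workflow_dispatch

jobs:
  push-report:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout aorta-report repository
        uses: actions/checkout@v4
        with:
          repository: ROCm/aorta-report
          ref: main
          token: ${{ secrets.AORTA_REPORT_GITHUB_TOKEN }}
          path: aorta-report

      - name: Push results to aorta-report
        working-directory: aorta-report
        run: |
          # Assumption: attribute the commit to the github-actions bot account.
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git pull --rebase origin main
          date=$(date '+%Y-%m-%d')
          # In the real workflow an earlier step copies the sweep results here.
          mkdir -p "${date}/gemm-sweep"
          git add "${date}"
          # Commit only when something is staged, so a run without new files does not fail.
          git diff --cached --quiet || git commit -m "Add GEMM sweep results for ${date}"
          git push origin main
```

The `git diff --cached --quiet` check exits non-zero only when staged changes exist, so the commit runs only in that case.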
||
| git pull --rebase origin main | ||
| date=$(date '+%Y-%m-%d') | ||
| git add ${date} | ||
| git commit -m "Add GEMM sweep results for ${date}" | ||
| git push origin main | ||
If this code is not required, please delete these lines.
We are keeping this for now because we are still testing the workflow YAML locally, and we may need to repeat this testing in the near future. In production, we do not want this workflow to run on every PR, so the trigger is commented out. The code stays available for testing on a branch but is not enabled for production.
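One alternative to commenting the trigger in and out is sketched below, assuming testing happens on a dedicated branch. The branch name `ci-testing`, the workflow name, and the placeholder job are hypothetical. The `top_k` input and the path filters are copied from the workflow in this PR.

```yaml
# Sketch only: `ci-testing` is a hypothetical branch name.
name: gemm-sweep-trigger-sketch
on:
  # Manual runs stay available in production.
  workflow_dispatch:
    inputs:
      top_k:
        description: 'Number of top GEMM kernels to extract'
        required: false
        default: '5'
  # Automatic runs fire only for pushes to the test branch that touch these paths,
  # so ordinary PRs and pushes to main do not start the sweep.
  push:
    branches:
      - ci-testing
    paths:
      - 'scripts/gemm_analysis/**'
      - 'config/gemm_overlap/**'

jobs:
  placeholder:
    runs-on: ubuntu-latest
    steps:
      - name: Report the triggering event
        run: echo "Triggered by ${{ github.event_name }}"
```

With this layout the test trigger stays in the file without running in production, at the cost of keeping the branch name in sync with wherever testing actually happens.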