Enabling RCCL and GEMM sweep test on GH action #55
Merged
27 commits
9a230e4
Installing amdgpu from local path
c15d9b4
Adding runner on Aorta
48d168b
Changed piped password to use -p
f16b73e
Corrected volume mounts
a09697f
Changed runner for visualization workflow
db5e574
Merge vis and sweep run together
49db3a0
Corrected yml
0ec0264
Corrected yml
e11d583
Commented docker down : Temporary
6413774
Moved script invocation inside docker : Temporary
6194ac8
Moved all script invocation inside docker : Temporary
be0560d
Rccl wrap speed test yml
8740ea0
Change rccl path in docker path
21eda7d
Stop already existing docker
cc61ec4
Invoke script using bash instead of directly calling the script
903a15a
Merged two jobs into 1 due to permission issues in local runner
f6d9f04
Fix bug in invoking tracelens analysis
07c6360
Adding missing script for tracelens single config
b9727c6
Fixed pre-commit issues
2ba912f
Adding reports to aorta-report
c028246
Addressed review comments for gemm-sweep
c679720
Push the rccl test results to aorta-report
f9b535b
Commented push/pull invocation and cleaning for pre-commit fails
5ca5b1b
Removed unused docker compose inside rccl_test
d90052d
Replaced manual argument parsing with argparser
c51ee61
Removed the push/pull dependency introduced for testing
a15079a
Pre-commit fixes
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -22,12 +22,7 @@ on: | |
| description: 'Number of top GEMM kernels to extract' | ||
| required: false | ||
| default: '5' | ||
| push: | ||
| branches: | ||
| - main | ||
| paths: | ||
| - 'scripts/gemm_analysis/**' | ||
| - 'config/gemm_overlap/**' | ||
|
|
||
|
|
||
| env: | ||
| DOCKER_COMPOSE_FILE: docker/docker-compose.rocm70_9-1.yaml | ||
|
|
@@ -36,29 +31,35 @@ env: | |
| jobs: | ||
| gemm-sweep: | ||
| name: Run GEMM Sweep Profiling | ||
| runs-on: [self-hosted, gpu, rocm] | ||
| runs-on: self-hosted | ||
| timeout-minutes: 180 | ||
| outputs: | ||
| sweep_dir: ${{ steps.setup.outputs.sweep_dir }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| - name: Checkout AORTA repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| repository: ROCm/aorta | ||
| ref : prosenj_gh_action | ||
| path: aorta | ||
|
|
||
| - name: Set up experiment directory | ||
| id: setup | ||
| working-directory: aorta | ||
| run: | | ||
| SWEEP_DIR="experiments/sweep_$(date +%Y%m%d_%H%M%S)" | ||
| echo "sweep_dir=$SWEEP_DIR" >> $GITHUB_OUTPUT | ||
| mkdir -p $SWEEP_DIR | ||
|
|
||
| - name: Build Docker container | ||
| working-directory: docker | ||
| working-directory: aorta | ||
| run: | | ||
| docker compose version | ||
| docker login -u rocmshared -p ${{ secrets.ROCM_SHARED_KEY }} | ||
| docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} build | ||
| docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} up -d | ||
|
|
||
| - name: Run training sweep | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| bash scripts/gemm_analysis/run_train_various_channels.sh \ | ||
|
|
@@ -69,13 +70,15 @@ jobs: | |
| " | ||
|
|
||
| - name: Generate TraceLens reports | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| pip install -r requirements.txt && \ | ||
| bash scripts/gemm_analysis/run_tracelens_analysis.sh ${{ steps.setup.outputs.sweep_dir }} | ||
| " | ||
|
|
||
| - name: Extract top GEMM kernels | ||
| working-directory: aorta | ||
| run: | | ||
| # Parse channels and threads into space-separated format | ||
| CHANNELS=$(echo "${{ github.event.inputs.channels || '28,56' }}" | tr ',' ' ') | ||
|
|
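The `tr ',' ' '` conversion in the step above turns the comma-separated `channels` input into a space-separated list. A minimal sketch of the same idea that reads the input into a Bash array and rejects non-numeric entries before they reach the sweep script. The function name `parse_csv_ints` and the validation rule are illustrative assumptions, not part of this PR:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): split a comma-separated list such
# as "28,56" into space-separated integers, rejecting anything non-numeric.
parse_csv_ints() {
    local raw="$1"
    local -a items
    local item
    # Split on commas only; IFS is changed just for this one read.
    IFS=',' read -r -a items <<< "$raw"
    if [ "${#items[@]}" -eq 0 ]; then
        echo "empty list" >&2
        return 1
    fi
    for item in "${items[@]}"; do
        if ! [[ "$item" =~ ^[0-9]+$ ]]; then
            echo "invalid entry: '$item'" >&2
            return 1
        fi
    done
    # Join the validated entries with single spaces.
    echo "${items[*]}"
}
```

In the workflow this could be called as `CHANNELS=$(parse_csv_ints "$CHANNELS_INPUT")`, where `CHANNELS_INPUT` is a hypothetical variable set from the dispatch input in the step's `env:` block, so that a malformed value fails the step early.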
@@ -94,128 +97,96 @@ jobs: | |
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: gemm-sweep-results | ||
| path: ${{ steps.setup.outputs.sweep_dir }} | ||
| path: aorta/${{ steps.setup.outputs.sweep_dir }} | ||
| retention-days: 30 | ||
|
|
||
| - name: Cleanup Docker container | ||
| if: always() | ||
| run: | | ||
| docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} down || true | ||
|
|
||
| visualization: | ||
| name: Generate Visualizations and Reports | ||
| needs: gemm-sweep | ||
| runs-on: [self-hosted, gpu, rocm] | ||
| timeout-minutes: 60 | ||
| env: | ||
| SWEEP_DIR: ${{ needs.gemm-sweep.outputs.sweep_dir }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Download sweep results | ||
| uses: actions/download-artifact@v4 | ||
| with: | ||
| name: gemm-sweep-results | ||
| path: ${{ env.SWEEP_DIR }} | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.10' | ||
| cache: 'pip' | ||
|
|
||
| - name: Install Python dependencies | ||
| working-directory: aorta | ||
| run: | | ||
| pip install -r requirements.txt | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c "pip install -r requirements.txt" | ||
|
|
||
| - name: Generate variance plots | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/plot_gemm_variance.py \ | ||
| --csv-path ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \ | ||
| --output-dir ${{ env.SWEEP_DIR }}/tracelens_analysis/plots | ||
| --csv-path ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \ | ||
| --output-dir ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/plots" | ||
|
|
||
prosenjitdhole marked this conversation as resolved.
|
||
| - name: Add timestamp information | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/enhance_gemm_variance_with_timestamps.py \ | ||
| --input-csv ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \ | ||
| --base-path ${{ env.SWEEP_DIR }} | ||
| --input-csv ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance.csv \ | ||
| --base-path ${{ steps.setup.outputs.sweep_dir }}" | ||
|
|
||
| - name: Analyze collective overlap | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/gemm_report_with_collective_overlap.py \ | ||
| --input-csv ${{ env.SWEEP_DIR }}/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \ | ||
| --tracelens-path ${{ env.SWEEP_DIR }}/tracelens_analysis | ||
| --input-csv ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/top5_gemm_kernels_time_variance_with_timestamps.csv \ | ||
| --tracelens-path ${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis" | ||
|
|
||
| - name: Process GPU timeline | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/process_gpu_timeline.py \ | ||
| --sweep-dir ${{ env.SWEEP_DIR }} | ||
| --sweep-dir ${{ steps.setup.outputs.sweep_dir }}" | ||
|
|
||
| - name: Process NCCL communication data | ||
| working-directory: aorta | ||
| run: | | ||
| docker exec ${{ env.CONTAINER_NAME }} bash -c " | ||
| python scripts/gemm_analysis/process_comms.py \ | ||
| --sweep-dir ${{ env.SWEEP_DIR }} | ||
| --sweep-dir ${{ steps.setup.outputs.sweep_dir }}" | ||
prosenjitdhole marked this conversation as resolved.
|
||
|
|
||
| - name: Stop Docker container | ||
| if: always() | ||
| working-directory: aorta | ||
| run: | | ||
| docker compose -f ${{ env.DOCKER_COMPOSE_FILE }} down | ||
|
|
||
| - name: Upload analysis results | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: gemm-analysis-results | ||
| path: | | ||
| ${{ env.SWEEP_DIR }}/tracelens_analysis/plots/ | ||
| ${{ env.SWEEP_DIR }}/tracelens_analysis/*.csv | ||
| ${{ env.SWEEP_DIR }}/tracelens_analysis/*.xlsx | ||
| aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/plots/ | ||
| aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/*.csv | ||
| aorta/${{ steps.setup.outputs.sweep_dir }}/tracelens_analysis/*.xlsx | ||
| retention-days: 30 | ||
|
|
||
| comparison-report: | ||
| name: Generate Comparison Report | ||
| needs: [gemm-sweep, visualization] | ||
| runs-on: ubuntu-latest | ||
| if: github.event_name == 'workflow_dispatch' | ||
| env: | ||
| SWEEP_DIR: ${{ needs.gemm-sweep.outputs.sweep_dir }} | ||
|
|
||
| steps: | ||
| - name: Checkout repository | ||
| - name: Checkout aorta-report repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Download analysis results | ||
| uses: actions/download-artifact@v4 | ||
| with: | ||
| name: gemm-analysis-results | ||
| path: ${{ env.SWEEP_DIR }}/tracelens_analysis | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: '3.10' | ||
| cache: 'pip' | ||
| repository: ROCm/aorta-report | ||
| ref: main | ||
| token: ${{ secrets.AORTA_REPORT_GITHUB_TOKEN }} | ||
| path: aorta-report | ||
|
|
||
| - name: Install Python dependencies | ||
| - name: Create date directory and copy sweep results | ||
| run: | | ||
| pip install -r requirements.txt | ||
| date=$(date '+%Y-%m-%d') | ||
| mkdir -p aorta-report/${date}/gemm-sweep | ||
| cp -r aorta/${{ steps.setup.outputs.sweep_dir }}/* aorta-report/${date}/gemm-sweep/ | ||
|
|
||
| - name: Generate summary report | ||
| - name: Push results to aorta-report | ||
| working-directory: aorta-report | ||
| run: | | ||
| echo "## GEMM Sweep Analysis Summary" >> $GITHUB_STEP_SUMMARY | ||
| echo "" >> $GITHUB_STEP_SUMMARY | ||
| echo "### Configuration" >> $GITHUB_STEP_SUMMARY | ||
| echo "- **Sweep Directory**: ${{ env.SWEEP_DIR }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- **Channels**: ${{ github.event.inputs.channels || '28,56' }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- **Threads**: ${{ github.event.inputs.threads || '256,512' }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "- **Top-K Kernels**: ${{ github.event.inputs.top_k || '5' }}" >> $GITHUB_STEP_SUMMARY | ||
| echo "" >> $GITHUB_STEP_SUMMARY | ||
| echo "### Generated Artifacts" >> $GITHUB_STEP_SUMMARY | ||
| echo "- Variance plots (box plots, violin plots)" >> $GITHUB_STEP_SUMMARY | ||
| echo "- GEMM kernels with timestamps" >> $GITHUB_STEP_SUMMARY | ||
| echo "- Collective overlap analysis" >> $GITHUB_STEP_SUMMARY | ||
| echo "- GPU timeline data" >> $GITHUB_STEP_SUMMARY | ||
| echo "- NCCL communication data" >> $GITHUB_STEP_SUMMARY | ||
|
|
||
| - name: Upload final report | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: gemm-final-report | ||
| path: ${{ env.SWEEP_DIR }}/ | ||
| retention-days: 90 | ||
| git config user.name "GitHub Actions Bot" | ||
| git config user.email "<>" | ||
|
Contributor
Is it allowed to keep the email empty and still be able to push to GitHub?
Collaborator, Author
It is working, and that's how we are doing it in other shark test suites.
||
| git pull --rebase origin main | ||
| date=$(date '+%Y-%m-%d') | ||
| git add ${date} | ||
| git commit -m "Add GEMM sweep results for ${date}" | ||
| git push origin main | ||
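The push step above runs `git commit` unconditionally, so it would fail on a re-run where the date directory is unchanged, and the review thread asks whether an empty email is acceptable. A minimal sketch of one alternative: commit under an explicit bot identity and skip the commit when nothing is staged. The function name `commit_results`, its return codes, and the choice of the GitHub Actions bot noreply address are assumptions for illustration, not what this PR does:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): stage one directory and commit it
# under a bot identity, skipping the commit when nothing changed.
# Returns 0 after a commit, 2 when there was nothing to commit.
commit_results() {
    local dir="$1" message="$2"
    git config user.name "github-actions[bot]"
    # Assumed identity: the noreply address of the GitHub Actions bot.
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
    git add -- "$dir" || return 1
    # `git diff --cached --quiet` exits 0 when nothing is staged.
    if git diff --cached --quiet; then
        echo "No changes under $dir; skipping commit."
        return 2
    fi
    git commit --quiet -m "$message"
}
```

In the step this would sit between `git pull --rebase origin main` and `git push origin main`, for example `commit_results "${date}" "Add GEMM sweep results for ${date}"`. Computing `date` once and reusing it in both the copy step and the push step would also avoid a mismatch if a run crosses midnight.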
I hope you are not planning to keep the self-hosted in the final merge. Please change it to the runner machine name once you get it.
For now, I am keeping this self-hosted as we have not got the runner.
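A separate note on the `docker login` call in the Build Docker container step: passing the secret with `-p` places it on the command line, and the Docker CLI itself prints a warning that this is insecure and recommends `--password-stdin`. A minimal sketch of that variant follows. The function name `registry_login` and the `REGISTRY_PASSWORD` variable are illustrative assumptions; the variable would be filled from the existing secret through the step's `env:` block:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): log in to a registry without
# putting the password in the docker process's argument list.
registry_login() {
    local user="$1"
    if [ -z "${REGISTRY_PASSWORD:-}" ]; then
        echo "REGISTRY_PASSWORD is not set" >&2
        return 1
    fi
    # printf is a shell builtin, so the secret is not passed as an
    # argument to any external process; docker reads it from stdin.
    printf '%s' "$REGISTRY_PASSWORD" | docker login -u "$user" --password-stdin
}
```

With `REGISTRY_PASSWORD` exported in the step, the call would be `registry_login rocmshared`, matching the user name already used in the workflow.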