Add a script to gather runner info when uploading benchmark results #6425
This depends on #6429
    device_type = torch.cuda.get_device_name()
except ImportError:
    pass
Consider logging the error info here to help with debugging.
runner_info["type"] = device_type
runner_info["gpu_count"] = torch.cuda.device_count()
runner_info["avail_gpu_mem_in_gb"] = int(
    torch.cuda.get_device_properties(0).total_memory
Does each device have the same amount of memory?
Yup, that's the regular setup
While working on #6425, I discovered several bugs in the upload scripts:

* If there is an invalid JSON file in the directory, the script returns instead of continuing, which skips all records after it. Covered by https://github.com/pytorch/test-infra/blob/main/.github/scripts/benchmark-results-dir-for-testing/v3/mock.json
* The script didn't correctly handle the JSONEachRow format with only one record. Covered by a new test JSON from https://github.com/pytorch/test-infra/pull/6425/files#diff-bff954994eb33173b7119ff8d280f3367117b2daa9b8c54888be5f48f183a280
* The script didn't correctly handle the JSONEachRow format mixed with lists of records. Covered by https://github.com/pytorch/test-infra/blob/main/.github/scripts/benchmark-results-dir-for-testing/v3/json-each-row.json#L3

### Testing

https://github.com/pytorch/test-infra/actions/runs/13909203687/job/38919334944#step:5:125 looks correct now
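For illustration, the three fixes described above could be sketched roughly as follows. This is not the actual upload script: `parse_records` and `read_benchmark_dir` are hypothetical names, and the sketch assumes each JSONEachRow line holds either a single record or a list of records.

```python
import json
import logging
from pathlib import Path
from typing import Any, List


def parse_records(text: str) -> List[Any]:
    """Flatten a JSON document or JSONEachRow text into a list of records."""
    try:
        # A single JSON document, including a one-record JSONEachRow file
        values = [json.loads(text)]
    except json.JSONDecodeError:
        # Otherwise treat the text as JSONEachRow: one JSON value per line
        values = [json.loads(line) for line in text.splitlines() if line.strip()]

    records: List[Any] = []
    for value in values:
        if isinstance(value, list):
            # A line (or document) may itself be a list of records
            records.extend(value)
        else:
            records.append(value)
    return records


def read_benchmark_dir(directory: str) -> List[Any]:
    """Collect records from every *.json file, skipping invalid files."""
    records: List[Any] = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            records.extend(parse_records(path.read_text()))
        except json.JSONDecodeError as error:
            # Continue with the next file instead of returning early
            logging.warning("Skipping invalid JSON file %s: %s", path, error)
            continue
    return records
```

The key point is the `continue` in the error handler: one bad file no longer drops the records from the files that come after it.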
Implement the logic to gather runner info for GPU. I adopted this logic from https://github.com/pytorch/pytorch-integration-testing/blob/master/vllm-benchmarks/upload_benchmark_results.py#L102
This also cleans up the v2 logic, which is no longer used.
cc @yangw-dev Please let me know if you have a better approach in mind from the utilization monitoring project. Essentially, I want to get the device name, i.e. CUDA, ROCm, and the device type, i.e. H100, MI300X, so that they can be displayed on the dashboard. Before this change, these fields were set by the caller; now they can be set automatically by the GHA.
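As a rough sketch of the idea (not the script in this PR), gathering the runner info could look like the code below. The function name `gather_runner_info`, the `"name"` key, the default values, and the bytes-to-GiB conversion are assumptions made for illustration; the other keys mirror the diff under review. It also logs the import failure, as suggested in the review, and reads memory from device 0 only, on the assumption that all devices on a runner are identical.

```python
import logging
from typing import Any, Dict


def gather_runner_info() -> Dict[str, Any]:
    """Return GPU details for the current runner, or defaults without a GPU."""
    # Default values are an assumption for this sketch
    runner_info: Dict[str, Any] = {
        "name": "",
        "type": "",
        "gpu_count": 0,
        "avail_gpu_mem_in_gb": 0,
    }

    try:
        import torch
    except ImportError as error:
        # Log instead of silently passing, to help debugging
        logging.warning("torch is not available, skipping GPU info: %s", error)
        return runner_info

    if not torch.cuda.is_available():
        return runner_info

    # ROCm builds of PyTorch set torch.version.hip; CUDA builds leave it as None
    runner_info["name"] = "rocm" if getattr(torch.version, "hip", None) else "cuda"
    runner_info["type"] = torch.cuda.get_device_name()
    runner_info["gpu_count"] = torch.cuda.device_count()
    # total_memory is reported in bytes; converting to GiB is an assumption here
    runner_info["avail_gpu_mem_in_gb"] = int(
        torch.cuda.get_device_properties(0).total_memory // (1024**3)
    )
    return runner_info
```

On a machine without PyTorch or without a GPU, the function returns the defaults, so the caller can still upload results with empty device fields.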