
Add a script to gather runner info when uploading benchmark results #6425

Open · wants to merge 4 commits into main

Conversation

@huydhn (Contributor) commented Mar 17, 2025

Implement the logic to gather runner info for GPUs. I adopted this logic from https://github.com/pytorch/pytorch-integration-testing/blob/master/vllm-benchmarks/upload_benchmark_results.py#L102

This also cleans up the v2 logic, which is not used anymore.

cc @yangw-dev Please let me know if you have a better approach in mind from the utilization monitoring project. Essentially, I want to get the device name, i.e. CUDA or ROCm, and the device type, i.e. H100 or MI300X, so that they can be displayed on the dashboard. Before this change, these fields were set by the caller; now they can be set automatically by the GHA.
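
As a rough illustration of what that automatic detection could look like (the helper name and the ROCm check below are my own sketch, not the exact code in this PR):

```python
import logging

logger = logging.getLogger(__name__)


def get_runner_info() -> dict:
    # Sketch: detect the device name (cuda/rocm) and device type (e.g. H100,
    # MI300X) automatically instead of relying on the caller to set them
    runner_info = {"name": "cpu", "type": ""}
    try:
        import torch

        if torch.cuda.is_available():
            # torch.version.hip is populated on ROCm builds of PyTorch
            is_rocm = getattr(torch.version, "hip", None) is not None
            runner_info["name"] = "rocm" if is_rocm else "cuda"
            runner_info["type"] = torch.cuda.get_device_name()
    except ImportError as error:
        logger.warning("torch is not available, leaving runner info empty: %s", error)
    return runner_info
```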

@huydhn requested a review from @yangw-dev on March 17, 2025 at 19:41
@huydhn (Contributor Author) commented Mar 17, 2025

This depends on #6429

# (diff excerpt; the enclosing try/import is reconstructed here for context)
try:
    import torch

    device_type = torch.cuda.get_device_name()
except ImportError:
    pass
Contributor: Logging the error info would help with debugging.
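
One minimal way to follow that suggestion (a sketch, not the change as it actually landed):

```python
import logging

logger = logging.getLogger(__name__)

device_type = ""
try:
    import torch

    device_type = torch.cuda.get_device_name()
except ImportError as error:
    # Record why GPU info is missing instead of failing silently
    logger.warning("Skipping GPU runner info, torch could not be imported: %s", error)
```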

runner_info["type"] = device_type
runner_info["gpu_count"] = torch.cuda.device_count()
runner_info["avail_gpu_mem_in_gb"] = int(
torch.cuda.get_device_properties(0).total_memory
Contributor: Does each device have the same amount of memory?

Contributor Author: Yup, that's the regular setup.
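
If a runner ever did mix GPUs with different memory sizes, a small guard like this (purely illustrative, not part of the PR) would catch it before reporting device 0 as representative:

```python
import torch

# Collect the total memory of every visible GPU and confirm they all match
total_mems = {
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
}
assert len(total_mems) <= 1, f"Heterogeneous GPU memory detected: {total_mems}"
```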

huydhn added a commit that referenced this pull request Mar 17, 2025
While working on #6425, I discovered several bugs in the upload scripts:

* If there is an invalid JSON file in the directory, the script returns instead of continuing, skipping all records after it (see the sketch after this list). Covered by
https://github.com/pytorch/test-infra/blob/main/.github/scripts/benchmark-results-dir-for-testing/v3/mock.json
* The script didn't correctly handle the JSONEachRow format with only one record. Covered by a new test JSON from
https://github.com/pytorch/test-infra/pull/6425/files#diff-bff954994eb33173b7119ff8d280f3367117b2daa9b8c54888be5f48f183a280
* The script didn't correctly handle the JSONEachRow format mixed with a list of records. Covered by
https://github.com/pytorch/test-infra/blob/main/.github/scripts/benchmark-results-dir-for-testing/v3/json-each-row.json#L3
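
A rough sketch of the intended parsing behavior (function and variable names here are illustrative, not the actual upload script):

```python
import glob
import json
import logging
import os

logger = logging.getLogger(__name__)


def load_benchmark_records(results_dir: str) -> list:
    records = []
    for path in glob.glob(os.path.join(results_dir, "*.json")):
        with open(path) as f:
            content = f.read()
        try:
            # Plain JSON file: a single record or a list of records
            data = json.loads(content)
            records.extend(data if isinstance(data, list) else [data])
        except json.JSONDecodeError:
            # Fall back to JSONEachRow: one JSON document per line, where each
            # line may itself be a single record or a list of records
            for line in content.splitlines():
                line = line.strip()
                if not line:
                    continue
                try:
                    data = json.loads(line)
                    records.extend(data if isinstance(data, list) else [data])
                except json.JSONDecodeError:
                    # Skip only the invalid content instead of returning early
                    logger.warning("Skipping invalid JSON in %s", path)
    return records
```

This way a single malformed file or line only drops that piece of content instead of silently skipping every record after it.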

### Testing


https://github.com/pytorch/test-infra/actions/runs/13909203687/job/38919334944#step:5:125 looks correct now.