Functions to automatically select Device with most flops/memory #41

Open
wants to merge 1 commit into
base: main

ProjectPhysX

Added utility functions to automatically select the fastest Device or the Device with largest memory capacity from all available Devices.

To select the fastest Device, the TFLOP/s performance of each Device is estimated. For Nvidia and AMD GPUs the estimate is difficult, because the number of cores per compute unit (CU) varies with the microarchitecture and even with the GPU model:

- AMD GCN, CDNA: 64 cores/CU
- AMD RDNA, RDNA2: 128 cores/CU (dual CUs are reported as CUs in OpenCL)
- Nvidia Kepler: 192 cores/CU
- Nvidia Maxwell, Pascal, Ampere: 128 cores/CU
- Nvidia P100, Volta, Turing, A100, A30: 64 cores/CU

The estimate is correct for the vast majority of GPUs, but for some rare or old GPUs it can be off by a factor of 2.
For CPUs without SMT/HT, for very old CPUs with IPC<32, and for very new CPUs with IPC=64 (AVX-512), the estimate is wrong.

Overall, however, the estimated values are good enough to identify the fastest device in systems with one CPU and one or more GPUs.
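
Once each Device has an estimated TFLOP/s value and a known memory capacity, the selection itself reduces to a maximum search. The following is a minimal sketch under that assumption, not the implementation from this pull request. The `DeviceInfo` struct and the functions `select_fastest` and `select_most_memory` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-device summary; in an OpenCL program these fields would be
// filled from device queries and from a TFLOP/s estimate.
struct DeviceInfo {
    std::string name;
    double tflops;            // estimated peak performance in TFLOP/s
    std::uint64_t memory_mb;  // global memory capacity in MB
};

// Returns the device with the highest estimated TFLOP/s, or nullptr if the
// list is empty. On ties, the first such device is returned.
inline const DeviceInfo* select_fastest(const std::vector<DeviceInfo>& devices) {
    if (devices.empty()) return nullptr;
    return &*std::max_element(devices.begin(), devices.end(),
        [](const DeviceInfo& a, const DeviceInfo& b) { return a.tflops < b.tflops; });
}

// Returns the device with the largest memory capacity, or nullptr if the
// list is empty. On ties, the first such device is returned.
inline const DeviceInfo* select_most_memory(const std::vector<DeviceInfo>& devices) {
    if (devices.empty()) return nullptr;
    return &*std::max_element(devices.begin(), devices.end(),
        [](const DeviceInfo& a, const DeviceInfo& b) { return a.memory_mb < b.memory_mb; });
}
```

Because only the ranking matters, a moderate error in the absolute estimate does not change the result as long as the devices are far enough apart, which is typically the case for one CPU next to one or more GPUs.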


CLAassistant commented Feb 10, 2022

CLA assistant check
All committers have signed the CLA.

MathiasMagnus force-pushed the main branch 2 times, most recently from 66f643c to fc64822 (April 1, 2022 14:30)