Releases: ProjectPhysX/OpenCL-Benchmark
OpenCL-Benchmark v1.8
- INT8 benchmark will now measure `dp4a` throughput on all supported AMD/Intel/Nvidia GPUs
- fixed compiling on macOS with new OpenCL headers
- updated OpenCL-Wrapper
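For readers unfamiliar with `dp4a`: it is a 4-way INT8 dot product with 32-bit accumulate. The following is a minimal portable sketch of the signed variant's semantics in plain C++. It is not the benchmark's OpenCL kernel, and the names `dp4a_reference` and `pack4` are hypothetical helpers introduced only for illustration.

```cpp
#include <cstdint>

// Reference semantics of a signed dp4a: treat a and b as four packed int8 lanes,
// multiply them lane by lane, and add the four products to the 32-bit accumulator c.
int32_t dp4a_reference(uint32_t a, uint32_t b, int32_t c) {
    for (int lane = 0; lane < 4; ++lane) {
        const int8_t ai = static_cast<int8_t>((a >> (8 * lane)) & 0xFFu);
        const int8_t bi = static_cast<int8_t>((b >> (8 * lane)) & 0xFFu);
        c += static_cast<int32_t>(ai) * static_cast<int32_t>(bi);
    }
    return c;
}

// Pack four int8 values into one 32-bit word, lane 0 in the lowest byte.
uint32_t pack4(int8_t x0, int8_t x1, int8_t x2, int8_t x3) {
    return  static_cast<uint32_t>(static_cast<uint8_t>(x0))
         | (static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8)
         | (static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16)
         | (static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24);
}
```

Each call performs four multiplies and four adds on 8-bit inputs, which is why hardware that executes this as a single instruction can report much higher INT8 throughput than with scalar 8-bit arithmetic.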
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA H100 80GB HBM3 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 565.57.01 (Linux) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 132 at 1980 MHz (16896 cores, 66.908 TFLOPs/s) |
| Memory, Cache | 81105 MB VRAM, 4224 KB global / 48 KB local |
| Buffer Limits | 20276 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 31.184 TFLOPs/s (1/2 ) |
| FP32 compute 62.908 TFLOPs/s ( 1x ) |
| FP16 compute 123.749 TFLOPs/s ( 2x ) |
| INT64 compute 3.227 TIOPs/s (1/24) |
| INT32 compute 32.946 TIOPs/s (1/2 ) |
| INT16 compute 30.901 TIOPs/s (1/2 ) |
-| INT8 compute 30.582 TIOPs/s (1/2 ) |
+| INT8 compute 103.204 TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read ) 3025.53 GB/s |
| Memory Bandwidth ( coalesced write) 3055.98 GB/s |
| Memory Bandwidth (misaligned read ) 2102.44 GB/s |
| Memory Bandwidth (misaligned write) 314.25 GB/s |
| PCIe Bandwidth (send ) 10.53 GB/s |
| PCIe Bandwidth ( receive ) 11.47 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 10.91 GB/s |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | AMD Instinct MI300X |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3635.0 (HSA1.1,LC) (Linux) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 304 at 2100 MHz (19456 cores, 81.715 TFLOPs/s) |
| Memory, Cache | 196592 MB VRAM, 32 KB global / 64 KB local |
| Buffer Limits | 196592 MB global, 201310208 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 54.944 TFLOPs/s (2/3 ) |
| FP32 compute 130.000 TFLOPs/s ( 2x ) |
| FP16 compute 141.320 TFLOPs/s ( 2x ) |
| INT64 compute 3.666 TIOPs/s (1/24) |
| INT32 compute 47.736 TIOPs/s (2/3 ) |
| INT16 compute 69.022 TIOPs/s ( 1x ) |
-| INT8 compute 43.582 TIOPs/s (1/2 ) |
+| INT8 compute 106.178 TIOPs/s ( 1x ) |
| Memory Bandwidth ( coalesced read ) 3756.64 GB/s |
| Memory Bandwidth ( coalesced write) 4686.31 GB/s |
| Memory Bandwidth (misaligned read ) 3881.24 GB/s |
| Memory Bandwidth (misaligned write) 2491.25 GB/s |
| PCIe Bandwidth (send ) 54.57 GB/s |
| PCIe Bandwidth ( receive ) 55.79 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 55.21 GB/s |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Intel(R) Arc(TM) B580 Graphics |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 32.0.101.6559 (Windows) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 160 at 2850 MHz (2560 cores, 14.592 TFLOPs/s) |
| Memory, Cache | 12187 MB VRAM, 18432 KB global / 128 KB local |
| Buffer Limits | 11944 MB global, 12230900 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.896 TFLOPs/s (1/16) |
| FP32 compute 14.249 TFLOPs/s ( 1x ) |
| FP16 compute 26.547 TFLOPs/s ( 2x ) |
| INT64 compute 0.636 TIOPs/s (1/24) |
| INT32 compute 4.556 TIOPs/s (1/3 ) |
| INT16 compute 37.082 TIOPs/s ( 2x ) |
-| INT8 compute 24.424 TIOPs/s ( 2x ) |
+| INT8 compute 48.668 TIOPs/s ( 4x ) |
| Memory Bandwidth ( coalesced read ) 574.09 GB/s |
| Memory Bandwidth ( coalesced write) 468.07 GB/s |
| Memory Bandwidth (misaligned read ) 796.23 GB/s |
| Memory Bandwidth (misaligned write) 383.15 GB/s |
| PCIe Bandwidth (send ) 4.99 GB/s |
| PCIe Bandwidth ( receive ) 4.87 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen3 x16) 5.11 GB/s |
|-----------------------------------------------------------------------------|
OpenCL-Benchmark v1.7
- faster `enqueueReadBuffer()` on modern CPUs with 64-Byte-aligned `host_buffer`
- updated OpenCL headers
- better OpenCL device specs detection using vendor ID and Nvidia compute capability
- better VRAM capacity reporting correction for Intel dGPUs
- fixed wrong device name reporting for AMD GPUs (unlike every sane GPU vendor they don't report device name as `CL_DEVICE_NAME` but need `CL_DEVICE_BOARD_NAME_AMD` extension instead)
- fixed TFlops estimate for Intel Battlemage GPUs
|----------------.------------------------------------------------------------|
| Device ID | 1 |
-| Device Name | gfx90a:sramecc+:xnack- |
+| Device Name | AMD Instinct MI210 |
| Device Vendor | Advanced Micro Devices, Inc. |
| Device Driver | 3625.0 (HSA1.1,LC) |
| OpenCL Version | OpenCL C 2.0 |
| Compute Units | 104 at 1700 MHz (6656 cores, 22.630 TFLOPs/s) |
| Memory, Cache | 65520 MB, 16 KB global / 64 KB local |
| Buffer Limits | 65520 MB global, 67092480 KB constant |
|----------------'------------------------------------------------------------|
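The 64-Byte-aligned `host_buffer` from the v1.7 notes can be illustrated with standard C++17 aligned allocation. This is a sketch only: the helper names are hypothetical, and how OpenCL-Wrapper actually allocates its host buffer is not stated in the release notes. 64 bytes is the cache-line size of most current x86 CPUs, which is a plausible reason for choosing this alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

constexpr std::size_t kAlignment = 64; // bytes, as named in the release note

// Allocate an uninitialized float array whose start address is a multiple of 64 bytes.
float* allocate_aligned_floats(std::size_t count) {
    return static_cast<float*>(::operator new(count * sizeof(float), std::align_val_t(kAlignment)));
}

// Memory from aligned operator new must be released with the matching aligned delete.
void free_aligned_floats(float* pointer) {
    ::operator delete(pointer, std::align_val_t(kAlignment));
}

// Check whether a pointer is aligned to the given number of bytes.
bool is_aligned(const void* pointer, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(pointer) % alignment == 0;
}
```

A pointer obtained this way could then serve as the destination of a buffer read, and `is_aligned(pointer, 64)` can confirm the alignment at runtime.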
OpenCL-Benchmark v1.6
- automatically use zero-copy buffers on CPUs/iGPUs to reduce memory footprint
- bandwidth kernels now write non-zero data, to avoid hardware optimizations for zero-initialized buffers
OpenCL-Benchmark v1.5
- enabled benchmarking FP16 vector arithmetic on Nvidia Pascal and newer GPUs with Nvidia driver 520 or newer
- removed `wait()` call at the end of the benchmark on Linux
|----------------.------------------------------------------------------------|
| Device ID | 9 |
| Device Name | NVIDIA GeForce RTX 2080 Ti |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 525.89.02 (Linux) |
| OpenCL Version | OpenCL C 1.2 |
| Compute Units | 68 at 1545 MHz (4352 cores, 13.448 TFLOPs/s) |
| Memory, Cache | 11011 MB, 2176 KB global / 48 KB local |
| Buffer Limits | 2752 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 0.517 TFLOPs/s (1/24) |
| FP32 compute 16.597 TFLOPs/s ( 1x ) |
-| FP16 compute not supported |
+| FP16 compute 33.054 TFLOPs/s ( 2x ) |
| INT64 compute 3.563 TIOPs/s (1/4 ) |
| INT32 compute 16.385 TIOPs/s ( 1x ) |
| INT16 compute 13.286 TIOPs/s ( 1x ) |
| INT8 compute 10.502 TIOPs/s (2/3 ) |
| Memory Bandwidth ( coalesced read ) 532.76 GB/s |
| Memory Bandwidth ( coalesced write) 548.88 GB/s |
| Memory Bandwidth (misaligned read ) 534.43 GB/s |
| Memory Bandwidth (misaligned write) 157.78 GB/s |
| PCIe Bandwidth (send ) 12.86 GB/s |
| PCIe Bandwidth ( receive ) 12.99 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 6.30 GB/s |
|-----------------------------------------------------------------------------|
OpenCL-Benchmark v1.4
- updated OpenCL-Wrapper
- GPU Driver and OpenCL Runtime installation instructions will be printed to console if no OpenCL devices are available
OpenCL-Benchmark v1.3
- workaround for Nvidia driver bug: `enqueueFillBuffer` is broken for large buffers on Nvidia GPUs
- fixed slow numeric drift issues
- fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (`fma`) with `a*b+c`
- added automatic OS detection in `make.sh`
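One way to macro-replace `fma` with `a*b+c`, as described in the v1.3 notes, is to prepend a `#define` to the OpenCL C source string before it is compiled. The sketch below assumes that approach. The helper name, the exact macro text, and the point at which the benchmark applies it are all assumptions, not taken from the project's code.

```cpp
#include <string>

// Hypothetical helper: prepend a macro so that every fma(a, b, c) in the
// OpenCL C source is compiled as a separate multiply and add.
std::string replace_fma_with_multiply_add(const std::string& kernel_source) {
    const std::string macro = "#define fma(a, b, c) ((a)*(b)+(c))\n";
    return macro + kernel_source;
}
```

Note that `a*b+c` may round twice where a true fused multiply-add rounds once, so results can differ in the last bits. For a throughput benchmark that trade-off is usually acceptable.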
OpenCL-Benchmark v1.2
- corrected TFlops/s estimate for Intel Data Center GPU Max series
- made correction of wrong memory reporting on Intel Arc more robust
- made CPU/GPU buffer initialization significantly faster with `std::fill` and `enqueueFillBuffer`
- added operating system info to OpenCL device driver version printout
- bug fix in `print_message()` function in `utilities.hpp`
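The host-side half of the faster buffer initialization in the v1.2 notes can be sketched with `std::fill` on a raw array. The function name `make_filled_buffer` is hypothetical, and the device-side `enqueueFillBuffer` counterpart is omitted because it requires an OpenCL runtime.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>

// Allocate a host array of `count` floats and set every element to `value`
// with a single std::fill call.
std::unique_ptr<float[]> make_filled_buffer(std::size_t count, float value) {
    std::unique_ptr<float[]> buffer(new float[count]);
    std::fill(buffer.get(), buffer.get() + count, value);
    return buffer;
}
```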
OpenCL-Benchmark v1.1
- fixed several issues with macOS
OpenCL-Benchmark v1.0
Initial Release. Have fun!