-
Notifications
You must be signed in to change notification settings - Fork 38
Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded #53
Description
The following kernel:
extern "C" __global__ void VkFFT_main(unsigned long long* g, unsigned long long* h) {
__shared__ unsigned long long c[8192];
asm volatile(";x: %0" : : "s"((unsigned)blockDim.x));
unsigned b =
threadIdx.y * (threadIdx.y + threadIdx.x) * 7 + blockIdx.z * 6384;
c[0] = g[b];
h[b] = c[threadIdx.x];
}becomes ~10% slower when the inline assembly for loading blockDim.x is removed.
This seems to happen only on RDNA2 (tested with V620 and RX6650XT) on Linux.
Observations
- Does not seem to reproduce on Windows with the RX6900XT
- The amount of shared memory used by the kernel is relevant, decreasing the size of
cthe slowdown
becomes less without loadingblockDim - The generated assembly with and without
blockDimis almost identical, differing only by the load
of the block size from the dispatch packet and the metadata that flags that the dispatch packet is in use.- What's more if this load instruction is removed from the faster kernel (without changing the meta-data) it becomes as slow as the other one.
Motivation
The kernel above was produced using c-reduce on a kernel extracted from vkFFT.
A possible optimization for VkFFT's HIP backend aimed to replace blockDim with it's values ahead of time as they are known when compilation happens.
This results in big improvements especially for small problem sizes, except select cases in RDNA2, where it leads to a slowdown as much as ~30%. This issue is the result of investigating the cause of this.
Environment
hipconfig
❯ hipconfig
HIP version : 5.3.22061-e8e78f1a
== hipconfig
HIP_PATH : /opt/rocm-5.3.0
ROCM_PATH : /opt/rocm-5.3.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME : rocclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.3.0/include -I/opt/rocm-5.3.0/llvm/bin/../lib/clang/15.0.0 -I/opt/rocm-5.3.0/hsa/include
== hip-clang
HSA_PATH : /opt/rocm-5.3.0/hsa
HIP_CLANG_PATH : /opt/rocm-5.3.0/llvm/bin
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.3.0/llvm/bin
AMD LLVM version 15.0.0git
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver3
Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -std=c++11 -isystem "/opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/.." -isystem /opt/rocm-5.3.0/hsa/include -isystem "/opt/rocm-5.3.0/include" -O3
hip-clang-ldflags : -L"/opt/rocm-5.3.0/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt
=== Environment Variables
PATH=/home/gergely/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
== Linux Kernel
Hostname : nostromo
Linux nostromo 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focalclang version
❯ /opt/rocm/llvm/bin/clang++ --version
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/binrocminfo
❯ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD EPYC 7713P 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7713P 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 0
Internal Node ID: 0
Compute Unit: 128
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 528082872(0x1f79e7b8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 528082872(0x1f79e7b8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 528082872(0x1f79e7b8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1030
Uuid: GPU-abcb45dca7663b11
Marketing Name: AMD Radeon PRO V620
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 4096(0x1000) KB
L3: 131072(0x20000) KB
Chip ID: 29601(0x73a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2570
BDFID: 33536
Internal Node ID: 1
Compute Unit: 72
SIMDs per CU: 2
Shader Engines: 8
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 31440896(0x1dfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1030
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 3
*******
Name: gfx1030
Uuid: GPU-2293a876b6331dff
Marketing Name: AMD Radeon PRO V620
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 4096(0x1000) KB
L3: 131072(0x20000) KB
Chip ID: 29601(0x73a1)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2570
BDFID: 34304
Internal Node ID: 2
Compute Unit: 72
SIMDs per CU: 2
Shader Engines: 8
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 31440896(0x1dfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1030
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***Attachments
Archive file blockdim-faster-linux-rdna2.tar.gz containing the original VkFFT kernel,
the host code used to do the speed tests (based on the launch params done by VkFFT), annotated assembly from the kernel and the script used for the test case reduction.
The script is useful to verify the slowdown. It compiles and runs the kernel reduced kernel source (test.hip) with and without loading blockDim up to 3 times and shows the difference in time taken.