Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded

The following kernel:

```cpp
extern "C" __global__ void VkFFT_main(unsigned long long* g, unsigned long long* h) {
  __shared__ unsigned long long c[8192];
  asm volatile(";x: %0" : : "s"((unsigned)blockDim.x));
  unsigned b =
      threadIdx.y * (threadIdx.y + threadIdx.x) * 7 + blockIdx.z * 6384;
  c[0] = g[b];
  h[b] = c[threadIdx.x];
}
```

becomes ~10% slower when the inline assembly for loading `blockDim.x` is **removed**.
This seems to happen only on RDNA2 (tested with V620 and RX6650XT) on Linux.

## Observations
- Does not seem to reproduce on Windows with the RX6900XT
- The amount of shared memory used by the kernel is relevant, decreasing the size of `c` the slowdown
becomes less without loading `blockDim`
- The generated assembly with and without `blockDim` is almost identical, differing only by the load
of the block size from the dispatch packet and the metadata that flags that the dispatch packet is in use.
  - What's more if this load instruction is removed from the faster kernel (without changing the meta-data) it becomes as slow as the other one.

## Motivation
The kernel above was produced using [c-reduce](https://embed.cs.utah.edu/creduce/) on a kernel extracted from [vkFFT](https://github.com/DTolm/VkFFT).
A possible optimization for VkFFT's HIP backend aimed to replace `blockDim` with it's values ahead of time as they are known when compilation happens.
This results in big improvements especially for small problem sizes, except select cases in RDNA2, where it leads to a slowdown as much as ~30%. This issue is the result of investigating the cause of this.

## Environment
<details>

<summary>hipconfig</summary>

```console
❯ hipconfig
HIP version  : 5.3.22061-e8e78f1a

== hipconfig
HIP_PATH     : /opt/rocm-5.3.0
ROCM_PATH    : /opt/rocm-5.3.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.3.0/include -I/opt/rocm-5.3.0/llvm/bin/../lib/clang/15.0.0 -I/opt/rocm-5.3.0/hsa/include

== hip-clang
HSA_PATH         : /opt/rocm-5.3.0/hsa
HIP_CLANG_PATH   : /opt/rocm-5.3.0/llvm/bin
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.3.0/llvm/bin
AMD LLVM version 15.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver3

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags :  -std=c++11 -isystem "/opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/.." -isystem /opt/rocm-5.3.0/hsa/include -isystem "/opt/rocm-5.3.0/include" -O3
hip-clang-ldflags  :  -L"/opt/rocm-5.3.0/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt

=== Environment Variables
PATH=/home/gergely/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

== Linux Kernel
Hostname     : nostromo
Linux nostromo 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.5 LTS
Release:	20.04
Codename:	focal
```

</details>

<details>
<summary>clang version</summary>

```console
❯ /opt/rocm/llvm/bin/clang++ --version
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.3.0 22362 3cf23f77f8208174a2ee7c616f4be23674d7b081)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
```

</details>

<details>
<summary>rocminfo</summary>

```console
❯ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD EPYC 7713P 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 7713P 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2000
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            128
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    528082872(0x1f79e7b8) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1030
  Uuid:                    GPU-abcb45dca7663b11
  Marketing Name:          AMD Radeon PRO V620
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
    L3:                      131072(0x20000) KB
  Chip ID:                 29601(0x73a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2570
  BDFID:                   33536
  Internal Node ID:        1
  Compute Unit:            72
  SIMDs per CU:            2
  Shader Engines:          8
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx1030
  Uuid:                    GPU-2293a876b6331dff
  Marketing Name:          AMD Radeon PRO V620
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      4096(0x1000) KB
    L3:                      131072(0x20000) KB
  Chip ID:                 29601(0x73a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2570
  BDFID:                   34304
  Internal Node ID:        2
  Compute Unit:            72
  SIMDs per CU:            2
  Shader Engines:          8
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
```

</details>

## Attachments

Archive file [`blockdim-faster-linux-rdna2.tar.gz`](https://github.com/ROCm-Developer-Tools/hipamd/files/9870753/blockdim-faster-linux-rdna2.tar.gz) containing the original VkFFT kernel,
the host code used to do the speed tests (based on the launch params done by VkFFT), annotated assembly from the kernel and the script used for the test case reduction.
The script is useful to verify the slowdown. It compiles and runs the kernel reduced kernel source (`test.hip`) with and without loading `blockDim` up to 3 times and shows the difference in time taken.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded #53

Observations

Motivation

Environment

Attachments

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Slowdown observed on Linux with RDNA2 when blockDim is NOT loaded #53

Description

Observations

Motivation

Environment

Attachments

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions