
Conversation

@mdewing mdewing commented Aug 20, 2021

Fix the radixSort_t test.

Remove the launch bounds, which seem to cause the kernel to silently fail to run (no error is raised).

Add __threadfence() after the update of the ibs variable, which controls the enclosing while loop. Without this threadfence, the loop appears to keep running, and the preceding assert (c[bin] >= 0) will trigger.

This PR replaces #195
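For context, here is a minimal sketch (not the real radixSort kernel) of the block-cooperative loop pattern the fix touches. The kernel and parameter names (loopSketch, size, passes) are made up; only ibs and sb follow the diff below, and ibs is assumed to live in shared memory and to be updated only by thread 0.

    // Minimal sketch of a block-wide while loop driven by a shared counter.
    __global__ void loopSketch(int size, int sb, int* passes) {
      __shared__ int ibs;            // assumed: the loop counter lives in shared memory
      if (threadIdx.x == 0)
        ibs = size - 1;
      __syncthreads();

      while (ibs >= 0) {
        // ... per-pass work on shared counters would go here ...
        __syncthreads();             // everyone has read ibs before thread 0 changes it
        if (threadIdx.x == 0) {
          ibs -= sb;
          __threadfence();           // the fence added by this PR
        }
        __syncthreads();             // publish the new ibs before the next loop test
      }
      if (threadIdx.x == 0 && passes)
        *passes = 1;                 // give the kernel an observable side effect
    }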

@makortel makortel added the hip label Aug 20, 2021
@makortel makortel merged commit 0b625ee into cms-patatrack:master Aug 20, 2021
      template <typename T, int NS = sizeof(T)>
    - __global__ void __launch_bounds__(256, 4)
    + // The launch bounds seems to cause the kernel to silently fail to run (rocm 4.3)
    + //__global__ void __launch_bounds__(256, 4)
Contributor

As I'm not familiar with HIP, I'm wondering: is the problem that these specific launch bounds do not work on the AMD GPU you tested, or that launch bounds are not supported by HIP?

@fwyzard fwyzard commented Aug 22, 2021

OK, sorry, found it in the HIP documentation: launch bounds.

Contributor

What are the launch parameters for radixSortMultiWrapper when it fails?

    -      if (threadIdx.x == 0)
    +      if (threadIdx.x == 0) {
             ibs -= sb;
    +        __threadfence();
Contributor

Question: assuming the goal is to propagate the updated value of ibs to the other threads in the block, __threadfence_block() should achieve the same result as __threadfence(); could you check if that is the case?

Contributor

According to the CUDA documentation for __syncthreads():

void __syncthreads();
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

So the __threadfence() should not be needed.

Is there any documentation of the __syncthreads() semantics for HIP?
The only mention I found in the HIP Programming Guide is just:

The __syncthreads() built-in function is supported in HIP.

Do you have any contacts with AMD to whom you could ask for clarifications ?

Contributor Author

At the GCN ISA level, synchronization of thread execution and synchronization of memory consistency are two separate instructions:

  • S_BARRIER - Synchronize waves within a threadgroup.
  • S_WAITCNT - Wait for memory operations to complete. Vector memory (vmcnt) and global/local/constant/message (lgkmcnt) counts are given separately.

The __syncthreads and __threadfence functions are defined in /opt/rocm-4.3.0/hip/include/hip/amd_detail/device_functions.h

    static void __threadfence()
    {
      __atomic_work_item_fence(0, __memory_order_seq_cst, __memory_scope_device);
    }
    static void __threadfence_block()
    {
      __atomic_work_item_fence(0, __memory_order_seq_cst, __memory_scope_work_group);
    }
    #define __CLK_LOCAL_MEM_FENCE    0x01
    void __syncthreads()
    {
      __work_group_barrier((__cl_mem_fence_flags)__CLK_LOCAL_MEM_FENCE, __memory_scope_work_group);
    }

    static void __work_group_barrier(__cl_mem_fence_flags flags, __memory_scope scope)
    {
        if (flags) {
            __atomic_work_item_fence(flags, __memory_order_release, scope);
            __builtin_amdgcn_s_barrier();  // Produces s_barrier
            __atomic_work_item_fence(flags, __memory_order_acquire, scope);
        } else {
            __builtin_amdgcn_s_barrier();
        }
    }

(I did some manual inlining on the __syncthreads definition to make it more compact.)

And __atomic_work_item_fence is an OpenCL function, https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/atomic_work_item_fence.html

__syncthreads() compiles to

    s_waitcnt vmcnt(0) lgkmcnt(0)
    s_barrier
    s_waitcnt lgkmcnt(0)

__threadfence() compiles to

    s_waitcnt vmcnt(0) lgkmcnt(0)

__threadfence_block() compiles to

    s_waitcnt lgkmcnt(0)

This is all background technical info to try to understand what's going on. I agree that it looks like __syncthreads should be sufficient to create the barrier.

I think I need to create a reproducer simplifying the faulty while loop. Just the while loop without any memory accesses in it works fine.
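As a possible starting point, here is a self-contained sketch of such a reproducer. All names are made up; it only mimics the shape of the loop (a shared counter decremented by thread 0, plus one shared-memory update per pass) and uses the HIP runtime API, so it is a guess at the structure rather than an extract of the real test.

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // Hypothetical stand-alone reproducer: a block-wide while loop controlled by a
    // shared counter, with a single shared-memory update per pass.
    __global__ void loopReproducer(int size, int sb, int* out) {
      __shared__ int ibs;   // loop counter, updated only by thread 0
      __shared__ int work;  // dummy shared-memory access inside the loop
      if (threadIdx.x == 0) {
        ibs = size - 1;
        work = 0;
      }
      __syncthreads();

      int passes = 0;
      while (ibs >= 0) {
        if (threadIdx.x == 0)
          work += 1;
        __syncthreads();
        if (threadIdx.x == 0)
          ibs -= sb;
        __syncthreads();    // should make the new ibs visible to the whole block
        ++passes;
        if (passes > 10000)
          break;            // safety net in case the loop never terminates
      }
      if (threadIdx.x == 0) {
        out[0] = passes;
        out[1] = work;
      }
    }

    int main() {
      int* d_out;
      hipMalloc((void**)&d_out, 2 * sizeof(int));
      loopReproducer<<<1, 256>>>(1000, 256, d_out);
      int h_out[2] = {0, 0};
      hipMemcpy(h_out, d_out, 2 * sizeof(int), hipMemcpyDeviceToHost);
      printf("passes: %d  work: %d  (4 and 4 expected)\n", h_out[0], h_out[1]);
      hipFree(d_out);
      return 0;
    }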

@fwyzard fwyzard commented Aug 28, 2021

@markdewing @makortel the HIP documentation has a section about __launch_bounds__, and specifically the different meaning that the two parameters have in CUDA and HIP:

Porting from CUDA __launch_bounds

CUDA defines a __launch_bounds which is also designed to control occupancy:

__launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)

The second parameter __launch_bounds parameters must be converted to the format used __hip_launch_bounds, which uses warps and execution-units rather than blocks and multi-processors. (This conversion is performed automatically by the clang hipify tools.)

MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK)/32

In the CUDA code we have

    template <typename T, int NS = sizeof(T)>
    __global__ void __launch_bounds__(256, 4)
        radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
      radixSortMulti<T, NS>(v, index, offsets, workspace);
    }

Could you check whether making the change suggested in the documentation, that is, changing the second parameter to 4 * 256 / 32 = 32, fixes the HIP test?

    template <typename T, int NS = sizeof(T)>
    // The second parameter to __launch_bounds__ has a different meaning for CUDA and for HIP
    __global__ void __launch_bounds__(256, 32)
        radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
      radixSortMulti<T, NS>(v, index, offsets, workspace);
    }

@mdewing mdewing commented Sep 3, 2021

Further investigation of the launch_bounds shows non-local effects (i.e., whether the test passes or fails can be affected by the presence of a function that is never called at runtime).

There are two top-level templated kernels:

  • radixSortMultiWrapper (this has the launch bounds, uses shared memory for workspace)
  • radixSortMultiWrapper2 (doesn't have launch bounds, uses global memory for workspace)

Whether the test passes or fails (gets "not ordered at" errors) depends on whether radixSortMultiWrapper2 is compiled in or not.
First I set up the radixSort_t.cu test so that the first set of calls, the ones that use global memory, is commented out (that is, radixSortMultiWrapper2 is not called during test execution).
The difference between the two cases is where the kernels are launched. Since radixSortMultiWrapper2 is never called, commenting out its launch should make no difference to the execution of the program.
However, it does have an effect on whether the remaining tests pass or fail.
Once the launch of radixSortMultiWrapper2 is removed, the compiler knows that the kernel will never be called, so all the templates related to that call can be removed. (This can be verified by looking at the .s output after using -save-temps=cwd to keep the temporary files around.)

The assembly for the code actually being tested (radixSortMultiWrapper) is different between the two cases (or at least the kernel info summary is different)

| is radixSortMultiWrapper2 present? | no | yes |
| --- | --- | --- |
| test result | test passes | test fails |
| NumSgprs | 80 | 84 |
| SGPRBlocks | 9 | 10 |

So apparently, the existence of multiple copies of a template has some effect on the code analysis in the compiler.
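For reference, a sketch of the two wrappers being compared: the first is quoted from the earlier comment in this thread, while radixSortMultiWrapper2 is assumed here to have the same signature minus the launch bounds, as described above; the comments just restate that description.

    template <typename T, int NS = sizeof(T)>
    __global__ void __launch_bounds__(256, 4)   // launch bounds only on this wrapper
        radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
      radixSortMulti<T, NS>(v, index, offsets, workspace);  // used with a shared-memory workspace
    }

    template <typename T, int NS = sizeof(T)>
    __global__ void radixSortMultiWrapper2(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
      radixSortMulti<T, NS>(v, index, offsets, workspace);  // used with a global-memory workspace
    }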

@mdewing mdewing commented Sep 21, 2021

Some more launch bounds investigation.
Main points:

  1. If the same launch_bounds is added to radixSortMultiWrapper2, the compile works. If the launch bounds values differ, it fails.
  2. In the failing cases, the radixSortImpl code is simply not getting compiled into the function.

Expanding on point 2:
I added a result pointer to track progress through execution. That is, int * result was added to the kernel argument list, and it is set at various points in the code (if (result) *result = 2;). In places, the result is set to uniquely identifiable hex values (0xdeadbeef) so that they are easy to locate in the assembly code.

Tracking the identifiable hex values, the code in the templates that translates the radixSort call to radixSortImpl is always compiled in. Depending on the launch bounds, the body of radixSortImpl does not get included in the failing cases.

The device compiler doesn't support function calls, so everything gets inlined. It is still unknown why the template expansion and inlining exclude some code.
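A sketch of the instrumentation described above; the kernel name and structure are illustrative, not the real radixSort code.

    // Marker technique: an optional result pointer is added to the kernel argument list
    // and set to recognizable values at key points, so that the corresponding stores can
    // be located in the .s output produced with -save-temps=cwd.
    __global__ void instrumentedSketch(int* result) {
      if (result) *result = 2;            // progress marker: entered the wrapper body
      // ... templated code that forwards to the sort implementation ...
      if (result) *result = 0xdeadbeef;   // unique marker: reached the implementation body
    }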
