Merged
8 changes: 6 additions & 2 deletions src/hip/CUDACore/radixSort.h
@@ -177,8 +177,10 @@ __device__ __forceinline__ void radixSortImpl(
     __syncthreads();
     if (bin >= 0)
       assert(c[bin] >= 0);
-    if (threadIdx.x == 0)
+    if (threadIdx.x == 0) {
       ibs -= sb;
+      __threadfence();
Contributor

Question: assuming the goal is to propagate the updated value of ibs to the other threads in the block, __threadfence_block() should achieve the same result as __threadfence(); could you check if that is the case?

Contributor


According to the CUDA documentation for __syncthreads():

void __syncthreads();
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

So the __threadfence() should not be needed.

Is there any documentation of the __syncthreads() semantics for HIP ?
The only mention I found in the HIP Programming Guide is just:

The __syncthreads() built-in function is supported in HIP.

Do you have any contacts at AMD whom you could ask for clarifications?
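For reference, the pattern the comments above discuss can be reduced to a minimal sketch (hypothetical code, not the PR's; the kernel name and values are invented). If the quoted __syncthreads() semantics hold, this fence-free version should be correct:

```cuda
// Minimal sketch of the discussed pattern: thread 0 updates a __shared__
// variable and the rest of the block reads it after a barrier. Per the
// CUDA documentation quoted above, __syncthreads() alone should make the
// update visible, with no __threadfence()/__threadfence_block() needed.
__global__ void sharedCounterKernel(int* out) {
  __shared__ int ibs;
  if (threadIdx.x == 0)
    ibs = 256;
  __syncthreads();  // publish the initial value to the whole block

  if (threadIdx.x == 0)
    ibs -= 32;      // only thread 0 modifies the shared counter
  __syncthreads();  // every thread should now observe the updated value

  out[threadIdx.x] = ibs;
}
```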

Contributor Author


At the GCN ISA level, sync on thread execution and sync on memory consistency are two separate instructions

  • S_BARRIER - Synchronize waves within a threadgroup.
  • S_WAITCNT - Wait for memory to complete. Vector memory (vmcnt) and global/local/constant/message (lgkmcnt) counts are given separately.

The __syncthreads and __threadfence functions are defined in /opt/rocm-4.3.0/hip/include/hip/amd_detail/device_functions.h

static void __threadfence()
{
  __atomic_work_item_fence(0, __memory_order_seq_cst, __memory_scope_device);
}
static void __threadfence_block()
{
  __atomic_work_item_fence(0, __memory_order_seq_cst, __memory_scope_work_group);
}
#define __CLK_LOCAL_MEM_FENCE    0x01
void __syncthreads()
{
   __work_group_barrier((__cl_mem_fence_flags)__CLK_LOCAL_MEM_FENCE, __memory_scope_work_group);
}

static void __work_group_barrier(__cl_mem_fence_flags flags, __memory_scope scope)
{
    if (flags) {
        __atomic_work_item_fence(flags, __memory_order_release, scope);
        __builtin_amdgcn_s_barrier();  // Produces s_barrier
        __atomic_work_item_fence(flags, __memory_order_acquire, scope);
    } else {
        __builtin_amdgcn_s_barrier();
    }
}

(I did some manual inlining on the __syncthreads definition to make it more compact)

And __atomic_work_item_fence is an OpenCL function, https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/atomic_work_item_fence.html

__syncthreads compiles to

    s_waitcnt vmcnt(0) lgkmcnt(0)
    s_barrier
    s_waitcnt lgkmcnt(0)

__threadfence() compiles to

    s_waitcnt vmcnt(0) lgkmcnt(0)

__threadfence_block() compiles to

    s_waitcnt lgkmcnt(0)

This is all background technical info to try to understand what's going on. I agree that it looks like __syncthreads should be sufficient to create the barrier.

I think I need to create a reproducer simplifying the faulty while loop. Just the while loop without any memory accesses in it works fine.
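A stripped-down reproducer might look like this (a hypothetical sketch, not tested on ROCm; kernel and variable names invented): it mimics the loop structure of radixSortImpl, with a __shared__ counter decremented only by thread 0 and a memory access inside the loop body:

```cuda
#include <hip/hip_runtime.h>

// Hypothetical reproducer sketch: a while loop whose exit condition depends
// on a __shared__ counter that only thread 0 decrements, synchronized with
// __syncthreads() alone (no __threadfence()).
__global__ void loopReproducer(int* out) {
  __shared__ int ibs;
  if (threadIdx.x == 0)
    ibs = 1024;
  __syncthreads();

  int iterations = 0;
  while (ibs > 0) {
    out[threadIdx.x] = ++iterations;  // memory access inside the loop body
    if (threadIdx.x == 0)
      ibs -= 256;                     // same update pattern as radixSortImpl
    __syncthreads();                  // expected to publish the new ibs value
  }
  // if the barrier semantics hold, all threads exit the loop after the
  // same number of iterations
}
```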

+    }
     __syncthreads();
   }

@@ -260,7 +262,9 @@ namespace cms {
namespace hip {

   template <typename T, int NS = sizeof(T)>
-  __global__ void __launch_bounds__(256, 4)
+  // The launch bounds seems to cause the kernel to silently fail to run (rocm 4.3)
+  //__global__ void __launch_bounds__(256, 4)
Contributor


As I'm not familiar with HIP, I'm wondering: is the problem that these specific launch bounds do not work on the AMD GPU you tested -- or that launch bounds are not supported by HIP?

Contributor
@fwyzard Aug 22, 2021


OK, sorry, found it in the HIP documentation: launch bounds.
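For context, a sketch of how the annotation is spelled (illustrative only; the kernel below is invented). Note that the HIP documentation describes the second argument of __launch_bounds__ as the minimum number of warps per execution unit, whereas CUDA defines it as the minimum number of blocks per multiprocessor, so identical values can mean different occupancy hints on the two platforms:

```cuda
// __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT)
// promises the compiler the kernel will never be launched with more than
// MAX_THREADS_PER_BLOCK threads per block; launching with a larger block
// makes the kernel launch fail.
template <typename T>
__global__ void __launch_bounds__(256, 4) scaleKernel(T* data, T factor) {
  data[blockIdx.x * blockDim.x + threadIdx.x] *= factor;
}
```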

Contributor


What are the launch parameters for radixSortMultiWrapper when it fails?

  __global__ void radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
    radixSortMulti<T, NS>(v, index, offsets, workspace);
  }