Workarounds for the HIP port #210
Conversation
Fix the radixSort_t test.
- Remove the launch bounds, which seem to cause the kernel to silently fail to run (no error is raised).
- Add __threadfence after the update of the ibs variable, which controls the enclosing while loop. Without this threadfence, the loop appears to keep running, and the preceding assert (c[bin] >= 0) will trigger.

This PR replaces #195
 template <typename T, int NS = sizeof(T)>
-__global__ void __launch_bounds__(256, 4)
+// The launch bounds seems to cause the kernel to silently fail to run (rocm 4.3)
+//__global__ void __launch_bounds__(256, 4)
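Since the failure mode here is a kernel that silently does not run, checking the launch status explicitly is one way to make it visible. A minimal sketch, assuming the HIP runtime API (the wrapper name and arguments are from this PR; the helper itself and the launch configuration variables are hypothetical):

#include <cstdio>
#include <hip/hip_runtime.h>

// Hypothetical helper: launch the wrapper and surface both launch-time and
// execution-time errors, so a silently-dropped launch does not go unnoticed.
template <typename T>
void launchAndCheck(dim3 blocks, dim3 threads, T const* v, uint16_t* index,
                    uint32_t const* offsets, uint16_t* workspace) {
  hipLaunchKernelGGL((radixSortMultiWrapper<T>), blocks, threads, 0, 0,
                     v, index, offsets, workspace);
  hipError_t err = hipGetLastError();  // launch-time errors (invalid configuration, ...)
  if (err == hipSuccess)
    err = hipDeviceSynchronize();      // execution-time errors (asserts, faults, ...)
  if (err != hipSuccess)
    std::printf("radixSortMultiWrapper failed: %s\n", hipGetErrorString(err));
}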
As I'm not familiar with HIP, I'm wondering: is the problem that these specific launch bounds do not work on the AMD GPU you tested, or that launch bounds are not supported by HIP?
OK, sorry, found it in the HIP documentation: launch bounds.
What are the launch parameters for radixSortMultiWrapper when it fails?
-if (threadIdx.x == 0)
+if (threadIdx.x == 0) {
+  ibs -= sb;
+  __threadfence();
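For context, a condensed sketch of the surrounding pattern (the loop body and initialization are simplified placeholders; only ibs, sb, and the fence placement reflect the real kernel):

#include <hip/hip_runtime.h>

// ibs is a block-shared loop counter decremented only by thread 0; every
// thread re-reads it in the while condition after the barrier. Assumes sb > 0.
__global__ void loopCounterSketch(int n, int sb) {
  __shared__ int ibs;
  if (threadIdx.x == 0)
    ibs = n - 1;
  __syncthreads();
  while (ibs >= 0) {
    // ... one sorting pass over the current batch ...
    __syncthreads();
    if (threadIdx.x == 0) {
      ibs -= sb;
      __threadfence();  // the workaround: without it the loop keeps running on rocm 4.3
    }
    __syncthreads();
  }
}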
Question: assuming the goal is to propagate the updated value of ibs to the other threads in the block, __threadfence_block() should achieve the same result as __threadfence(); could you check if that is the case?
According to the CUDA documentation for __syncthreads():
void __syncthreads();
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
So the __threadfence() should not be needed.
Is there any documentation of the __syncthreads() semantics for HIP?
The only mention I found in the HIP Programming Guide is just:
The __syncthreads() built-in function is supported in HIP.
Do you have any contacts at AMD whom you could ask for clarification?
At the GCN ISA level, sync on thread execution and sync on memory consistency are two separate instructions:
- S_BARRIER - Synchronize waves within a threadgroup.
- S_WAITCNT - Wait for memory operations to complete. Vector memory (vmcnt) and global/local/constant/message (lgkmcnt) counts are given separately.
The __syncthreads and __threadfence functions are defined in /opt/rocm-4.3.0/hip/include/hip/amd_detail/device_functions.h
static void __threadfence()
{
__atomic_work_item_fence(0, __memory_order_seq_cst, __memory_scope_device);
}
static void __threadfence_block()
{
__atomic_work_item_fence(0, __memory_order_seq_cst, __memory_scope_work_group);
}
#define __CLK_LOCAL_MEM_FENCE 0x01
void __syncthreads()
{
__work_group_barrier((__cl_mem_fence_flags)__CLK_LOCAL_MEM_FENCE, __memory_scope_work_group);
}
static void __work_group_barrier(__cl_mem_fence_flags flags, __memory_scope scope)
{
if (flags) {
__atomic_work_item_fence(flags, __memory_order_release, scope);
__builtin_amdgcn_s_barrier(); // Produces s_barrier
__atomic_work_item_fence(flags, __memory_order_acquire, scope);
} else {
__builtin_amdgcn_s_barrier();
}
}
(I did some manual inlining on the __syncthreads definition to make it more compact.)
And __atomic_work_item_fence is an OpenCL function: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/atomic_work_item_fence.html
__syncthreads compiles to
s_waitcnt vmcnt(0) lgkmcnt(0)
s_barrier
s_waitcnt lgkmcnt(0)
__threadfence() compiles to
s_waitcnt vmcnt(0) lgkmcnt(0)
__threadfence_block() compiles to
s_waitcnt lgkmcnt(0)
This is all background technical info to try to understand what's going on. I agree that it looks like __syncthreads should be sufficient to create the barrier.
I think I need to create a reproducer simplifying the faulty while loop. Just the while loop without any memory accesses in it works fine.
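A standalone reproducer along these lines might look like the following (a hypothetical sketch of the stripped-down variant without memory accesses, i.e. the one that works fine; assumes sb > 0):

#include <hip/hip_runtime.h>

// Just the while loop, with no global-memory traffic in the body; this
// version terminates correctly even without the __threadfence().
__global__ void whileLoopOnly(int n, int sb) {
  __shared__ int ibs;
  if (threadIdx.x == 0)
    ibs = n - 1;
  __syncthreads();
  while (ibs >= 0) {
    __syncthreads();
    if (threadIdx.x == 0)
      ibs -= sb;
    __syncthreads();
  }
}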
@markdewing @makortel the HIP documentation has a section about __launch_bounds__.
In the CUDA code we have

template <typename T, int NS = sizeof(T)>
__global__ void __launch_bounds__(256, 4)
radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
  radixSortMulti<T, NS>(v, index, offsets, workspace);
}

Could you try making the change suggested in the documentation, that is, changing the second parameter:

template <typename T, int NS = sizeof(T)>
// The second parameter to __launch_bounds__ has a different meaning for CUDA and for HIP
__global__ void __launch_bounds__(256, 32)
radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
  radixSortMulti<T, NS>(v, index, offsets, workspace);
}
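For what it's worth, one way to see what the compiler actually derived from the launch bounds is to query the compiled kernel's attributes. A sketch, assuming the HIP runtime API and a float instantiation (the helper function itself is hypothetical):

#include <cstdio>
#include <hip/hip_runtime.h>

// Query the attributes of the compiled wrapper; maxThreadsPerBlock and
// numRegs reflect how the launch bounds constrained code generation.
void printKernelAttributes() {
  hipFuncAttributes attr;
  hipError_t err = hipFuncGetAttributes(
      &attr, reinterpret_cast<const void*>(&radixSortMultiWrapper<float>));
  if (err == hipSuccess)
    std::printf("maxThreadsPerBlock=%d numRegs=%d\n", attr.maxThreadsPerBlock, attr.numRegs);
}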
Further investigation of the launch_bounds shows non-local effects (i.e. the test passing or failing can be affected by the presence of a function that is never called at runtime). There are two top-level templated kernels, and whether the test passes or fails (getting "not ordered at" errors) depends on which of them are present in the translation unit; the assembly for the code actually being tested changes as well.
So apparently, the existence of multiple copies of a template has some effect on the code analysis in the compiler.
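A sketch of the situation being described (the name and exact form of the second wrapper are assumptions, not taken from this thread):

template <typename T, int NS = sizeof(T)>
__global__ void __launch_bounds__(256, 4)
radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
  radixSortMulti<T, NS>(v, index, offsets, workspace);
}

// A second top-level templated kernel over the same device function; even if
// it is never launched, its presence can change whether the test passes.
template <typename T, int NS = sizeof(T)>
__global__ void radixSortMultiWrapper2(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
  radixSortMulti<T, NS>(v, index, offsets, workspace);
}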
Some more launch bounds investigation.
Expanding on point 2: tracking the identifiable hex values shows that some of the code in the templates does not make it into the generated assembly. The device compiler doesn't support function calls, so everything gets inlined. It is still unknown why the template expansion and inlining excludes some code.