Workarounds for the HIP port #210
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -177,8 +177,10 @@ __device__ __forceinline__ void radixSortImpl( |
| | | __syncthreads(); |
| | | if (bin >= 0) |
| | | assert(c[bin] >= 0); |
| | | - if (threadIdx.x == 0) |
| | | + if (threadIdx.x == 0) { |
| | | ibs -= sb; |
| | | + __threadfence(); |
| | | + } |
| | | __syncthreads(); |
| | | } |
| | | @@ -260,7 +262,9 @@ namespace cms { |
| | | namespace hip { |
| | | template <typename T, int NS = sizeof(T)> |
| | | - __global__ void __launch_bounds__(256, 4) |
| | | + // The launch bounds seems to cause the kernel to silently fail to run (rocm 4.3) |
| | | + //__global__ void __launch_bounds__(256, 4) |
Contributor
As I'm not familiar with HIP, I'm wondering: is the problem that these specific launch bounds do not work on the AMD GPU you tested, or that launch bounds are not supported by HIP?
Contributor
OK, sorry, found it in the HIP documentation: launch bounds.
Contributor
What are the launch parameters for
| | | + __global__ void |
| | | radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) { |
| | | radixSortMulti<T, NS>(v, index, offsets, workspace); |
| | | } |
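One way to narrow down the launch-bounds question raised in the comments is to check the runtime error state right after the launch. The following is a minimal sketch, not code from this PR: `boundedKernel` and `launchAndCheck` are hypothetical names, and it is written against the CUDA runtime (under HIP the `cuda*` calls would be replaced by their `hip*` counterparts and `<hip/hip_runtime.h>`). It relies on the documented CUDA behaviour that a kernel fails to launch when it is given more threads per block than its `__launch_bounds__` maximum; whether ROCm 4.3 reports the failure the same way is an assumption to be verified.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel under discussion:
// at most 256 threads per block, at least 4 resident blocks per multiprocessor.
__global__ void __launch_bounds__(256, 4) boundedKernel(int* flag) {
  if (threadIdx.x == 0 && blockIdx.x == 0)
    *flag = 1;  // proof that the kernel actually ran
}

// Launches boundedKernel with the requested block size and returns the first
// error reported by the runtime. *hostFlag is 1 only if the kernel ran.
cudaError_t launchAndCheck(int threadsPerBlock, int* hostFlag) {
  *hostFlag = 0;
  int* devFlag = nullptr;
  cudaError_t err = cudaMalloc(&devFlag, sizeof(int));
  if (err != cudaSuccess)
    return err;
  err = cudaMemset(devFlag, 0, sizeof(int));
  if (err == cudaSuccess) {
    boundedKernel<<<1, threadsPerBlock>>>(devFlag);
    err = cudaGetLastError();  // launch-time errors, e.g. an invalid configuration
  }
  if (err == cudaSuccess)
    err = cudaDeviceSynchronize();  // errors raised while the kernel executes
  if (err == cudaSuccess)
    err = cudaMemcpy(hostFlag, devFlag, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(devFlag);
  return err;
}
```

Calling `launchAndCheck(256, &flag)` and `launchAndCheck(512, &flag)` and printing `cudaGetErrorString` of each result would show whether the "silent" failure is in fact a rejected launch whose error was never queried.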
Question: assuming the goal is to propagate the updated value of `ibs` to the other threads in the block, `__threadfence_block()` should achieve the same result as `__threadfence()`; could you check if that is the case?
According to the CUDA documentation for `__syncthreads()`:
So the `__threadfence()` should not be needed.
Is there any documentation of the `__syncthreads()` semantics for HIP?
The only mention I found in the HIP Programming Guide is just:
Do you have any contacts at AMD whom you could ask for clarification?
At the GCN ISA level, sync on thread execution and sync on memory consistency are two separate instructions
vmcnt) and global/local/constant/message (lgkmcnt) counts are given separately.
The `__syncthreads` and `__threadfence` functions are defined in `/opt/rocm-4.3.0/hip/include/hip/amd_detail/device_functions.h`
(I did some manual inlining on the `__syncthreads` definition to make it more compact.)
And `__atomic_work_item_fence` is an OpenCL function: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/atomic_work_item_fence.html
`__syncthreads` compiles to
`__threadfence()` compiles to
`__threadfence_block()` compiles to
This is all background technical info to try to understand what's going on. I agree that it looks like `__syncthreads` should be sufficient to create the barrier.
I think I need to create a reproducer simplifying the faulty while loop. Just the while loop without any memory accesses in it works fine.
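A reproducer along these lines could look like the sketch below. It is a hypothetical simplification, not the actual `radixSortImpl` loop: the names `loopReproducer`, `runReproducer`, `kThreads` and `kStep` are made up for illustration, `kStep` plays the role of `sb`, and the block size of 256 is an assumption. It is written against the CUDA runtime; for HIP the `cuda*` calls would be swapped for their `hip*` equivalents. The kernel keeps one global memory write inside the loop and relies only on `__syncthreads()` to publish the update of `ibs`.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // hypothetical block size
constexpr int kStep = 256;     // hypothetical decrement, standing in for sb

// Every thread loops while the shared counter is positive; thread 0 alone
// updates it between two barriers, with no explicit __threadfence().
__global__ void loopReproducer(int start, int* out, int* iterations) {
  __shared__ int ibs;
  if (threadIdx.x == 0)
    ibs = start;
  __syncthreads();
  int n = 0;
  while (ibs > 0) {
    int i = ibs - threadIdx.x;
    if (i >= 0)
      out[i] = n;  // memory access inside the loop
    ++n;
    __syncthreads();  // all threads have finished reading ibs
    if (threadIdx.x == 0)
      ibs -= kStep;
    __syncthreads();  // the update is expected to be visible to the whole block
  }
  iterations[threadIdx.x] = n;
}

// Returns the number of threads whose iteration count differs from the
// expected one, or -1 if a runtime call fails. Requires start >= 0.
int runReproducer(int start) {
  int* out = nullptr;
  int* iterations = nullptr;
  int host[kThreads];
  bool ok = cudaMalloc(&out, (start + 1) * sizeof(int)) == cudaSuccess &&
            cudaMalloc(&iterations, kThreads * sizeof(int)) == cudaSuccess;
  if (ok) {
    loopReproducer<<<1, kThreads>>>(start, out, iterations);
    ok = cudaGetLastError() == cudaSuccess &&
         cudaMemcpy(host, iterations, sizeof(host), cudaMemcpyDeviceToHost) == cudaSuccess;
  }
  cudaFree(out);
  cudaFree(iterations);
  if (!ok)
    return -1;
  int expected = (start + kStep - 1) / kStep;  // ceil(start / kStep)
  int mismatches = 0;
  for (int n : host)
    if (n != expected)
      ++mismatches;
  return mismatches;
}
```

If `__syncthreads()` behaves as the CUDA documentation describes, `runReproducer(1000)` should return 0. A non-zero count, a hang, or a result that changes once `__threadfence()` or `__threadfence_block()` is added after the `ibs` update would point at the barrier not publishing the shared-memory write.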