
Conversation

@mdewing mdewing commented Aug 20, 2021

Fix the radixSort_t test.

Remove the launch bounds, which seem to cause the kernel to silently fail to run (no error is raised).

Add __threadfence() after the update of the ibs variable, which controls the enclosing while loop. Without this threadfence, the loop appears to keep running, and the preceding assert (c[bin] >= 0) will trigger.

This PR replaces #195
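For context, here is a minimal sketch (not the real radixSort kernel) of the block-cooperative loop pattern the fix touches. The kernel and parameter names (loopSketch, size, passes) are made up; only ibs and sb follow the diff below, and ibs is assumed to live in shared memory and to be updated only by thread 0.

    // Minimal sketch of a block-wide while loop driven by a shared counter.
    __global__ void loopSketch(int size, int sb, int* passes) {
      __shared__ int ibs;            // assumed: the loop counter lives in shared memory
      if (threadIdx.x == 0)
        ibs = size - 1;
      __syncthreads();

      while (ibs >= 0) {
        // ... per-pass work on shared counters would go here ...
        __syncthreads();             // everyone has read ibs before thread 0 changes it
        if (threadIdx.x == 0) {
          ibs -= sb;
          __threadfence();           // the fence added by this PR
        }
        __syncthreads();             // publish the new ibs before the next loop test
      }
      if (threadIdx.x == 0 && passes)
        *passes = 1;                 // give the kernel an observable side effect
    }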

@makortel makortel added the hip label Aug 20, 2021
@makortel makortel merged commit 0b625ee into cms-patatrack:master Aug 20, 2021
      template <typename T, int NS = sizeof(T)>
    - __global__ void __launch_bounds__(256, 4)
    + // The launch bounds seems to cause the kernel to silently fail to run (rocm 4.3)
    + //__global__ void __launch_bounds__(256, 4)
Contributor

As I'm not familiar with HIP, I'm wondering: is the problem that these specific launch bounds do not work on the AMD GPU you tested, or that launch bounds are not supported by HIP?

@fwyzard fwyzard commented Aug 22, 2021

OK, sorry, found it in the HIP documentation: launch bounds.

Contributor

What are the launch parameters for radixSortMultiWrapper when it fails?

    -      if (threadIdx.x == 0)
    +      if (threadIdx.x == 0) {
             ibs -= sb;
    +        __threadfence();
Contributor

Question: assuming the goal is to propagate the updated value of ibs to the other threads in the block, __threadfence_block() should achieve the same result as __threadfence(); could you check if that is the case?

Contributor

According to the CUDA documentation for __syncthreads():

void __syncthreads();
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

So the __threadfence() should not be needed.

Is there any documentation of the __syncthreads() semantics for HIP?
The only mention I found in the HIP Programming Guide is just:

The __syncthreads() built-in function is supported in HIP.

Do you have any contacts with AMD to whom you could ask for clarifications ?

Contributor Author

At the GCN ISA level, synchronization of thread execution and synchronization of memory consistency are two separate instructions:

  • S_BARRIER - Synchronize waves within a threadgroup.
  • S_WAITCNT - Wait for memory operations to complete. Vector memory (vmcnt) and global/local/constant/message (lgkmcnt) counts are given separately.

The __syncthreads and __threadfence functions are defined in /opt/rocm-4.3.0/hip/include/hip/amd_detail/device_functions.h

    static void __threadfence()
    {
      __atomic_work_item_fence(0, __memory_order_seq_cst, __memory_scope_device);
    }
    static void __threadfence_block()
    {
      __atomic_work_item_fence(0, __memory_order_seq_cst, __memory_scope_work_group);
    }
    #define __CLK_LOCAL_MEM_FENCE    0x01
    void __syncthreads()
    {
      __work_group_barrier((__cl_mem_fence_flags)__CLK_LOCAL_MEM_FENCE, __memory_scope_work_group);
    }

    static void __work_group_barrier(__cl_mem_fence_flags flags, __memory_scope scope)
    {
        if (flags) {
            __atomic_work_item_fence(flags, __memory_order_release, scope);
            __builtin_amdgcn_s_barrier();  // Produces s_barrier
            __atomic_work_item_fence(flags, __memory_order_acquire, scope);
        } else {
            __builtin_amdgcn_s_barrier();
        }
    }

(I did some manual inlining on the __syncthreads definition to make it more compact.)

And __atomic_work_item_fence is an OpenCL function, https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/atomic_work_item_fence.html

__syncthreads() compiles to

    s_waitcnt vmcnt(0) lgkmcnt(0)
    s_barrier
    s_waitcnt lgkmcnt(0)

__threadfence() compiles to

    s_waitcnt vmcnt(0) lgkmcnt(0)

__threadfence_block() compiles to

    s_waitcnt lgkmcnt(0)

This is all background technical info to try to understand what's going on. I agree that it looks like __syncthreads should be sufficient to create the barrier.

I think I need to create a reproducer simplifying the faulty while loop. Just the while loop without any memory accesses in it works fine.
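As a possible starting point, here is a self-contained sketch of such a reproducer. All names are made up; it only mimics the shape of the loop (a shared counter decremented by thread 0, plus one shared-memory update per pass) and uses the HIP runtime API, so it is a guess at the structure rather than an extract of the real test.

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // Hypothetical stand-alone reproducer: a block-wide while loop controlled by a
    // shared counter, with a single shared-memory update per pass.
    __global__ void loopReproducer(int size, int sb, int* out) {
      __shared__ int ibs;   // loop counter, updated only by thread 0
      __shared__ int work;  // dummy shared-memory access inside the loop
      if (threadIdx.x == 0) {
        ibs = size - 1;
        work = 0;
      }
      __syncthreads();

      int passes = 0;
      while (ibs >= 0) {
        if (threadIdx.x == 0)
          work += 1;
        __syncthreads();
        if (threadIdx.x == 0)
          ibs -= sb;
        __syncthreads();    // should make the new ibs visible to the whole block
        ++passes;
        if (passes > 10000)
          break;            // safety net in case the loop never terminates
      }
      if (threadIdx.x == 0) {
        out[0] = passes;
        out[1] = work;
      }
    }

    int main() {
      int* d_out;
      hipMalloc((void**)&d_out, 2 * sizeof(int));
      loopReproducer<<<1, 256>>>(1000, 256, d_out);
      int h_out[2] = {0, 0};
      hipMemcpy(h_out, d_out, 2 * sizeof(int), hipMemcpyDeviceToHost);
      printf("passes: %d  work: %d  (4 and 4 expected)\n", h_out[0], h_out[1]);
      hipFree(d_out);
      return 0;
    }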

@fwyzard fwyzard commented Aug 28, 2021

@markdewing @makortel the HIP documentation has a section about __launch_bounds__, and specifically the different meaning that the two parameters have in CUDA and HIP:

Porting from CUDA __launch_bounds

CUDA defines a __launch_bounds which is also designed to control occupancy:

__launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)

The second parameter __launch_bounds parameters must be converted to the format used __hip_launch_bounds, which uses warps and execution-units rather than blocks and multi-processors. (This conversion is performed automatically by the clang hipify tools.)

MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK)/32

In the CUDA code we have

    template <typename T, int NS = sizeof(T)>
    __global__ void __launch_bounds__(256, 4)
        radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
      radixSortMulti<T, NS>(v, index, offsets, workspace);
    }

Could you check whether making the change suggested in the documentation, that is, changing the second parameter to 4 * 256 / 32 = 32, fixes the HIP test?

    template <typename T, int NS = sizeof(T)>
    // The second parameter to __launch_bounds__ has a different meaning for CUDA and for HIP
    __global__ void __launch_bounds__(256, 32)
        radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
      radixSortMulti<T, NS>(v, index, offsets, workspace);
    }

@mdewing mdewing commented Sep 3, 2021

Further investigation of the launch_bounds shows non-local effects (i.e., whether the test passes or fails can be affected by the presence of a function that is never called at runtime).

There are two top-level templated kernels:

  • radixSortMultiWrapper (this has the launch bounds, uses shared memory for workspace)
  • radixSortMultiWrapper2 (doesn't have launch bounds, uses global memory for workspace)

Whether the test passes or fails (gets "not ordered at" errors) depends on whether radixSortMultiWrapper2 is compiled in or not.
First I set up the radixSort_t.cu test so that the first set of calls, the ones that use global memory, is commented out (that is, radixSortMultiWrapper2 is not called during test execution).
The difference between the two cases is where the kernels are launched. Since radixSortMultiWrapper2 is never called, commenting out its launch should make no difference to the execution of the program.
However, it does have an effect on whether the remaining tests pass or fail.
Once the launch of radixSortMultiWrapper2 is removed, the compiler knows that the kernel will never be called, so all the templates related to that call can be removed. (This can be verified by looking at the .s output after using -save-temps=cwd to keep the temporary files around.)

The assembly for the code actually being tested (radixSortMultiWrapper) is different between the two cases (or at least the kernel info summary is different)

| is radixSortMultiWrapper2 present? | no | yes |
| --- | --- | --- |
| test result | test passes | test fails |
| NumSgprs | 80 | 84 |
| SGPRBlocks | 9 | 10 |

So apparently, the existence of multiple copies of a template has some effect on the code analysis in the compiler.
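For reference, a sketch of the two wrappers being compared: the first is quoted from the earlier comment in this thread, while radixSortMultiWrapper2 is assumed here to have the same signature minus the launch bounds, as described above; the comments just restate that description.

    template <typename T, int NS = sizeof(T)>
    __global__ void __launch_bounds__(256, 4)   // launch bounds only on this wrapper
        radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
      radixSortMulti<T, NS>(v, index, offsets, workspace);  // used with a shared-memory workspace
    }

    template <typename T, int NS = sizeof(T)>
    __global__ void radixSortMultiWrapper2(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
      radixSortMulti<T, NS>(v, index, offsets, workspace);  // used with a global-memory workspace
    }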

@mdewing mdewing commented Sep 21, 2021

Some more launch bounds investigation.
Main points:

  1. If the same launch_bounds is added to radixSortMultiWrapper2, the compile works. If the launch bounds values differ, it fails.
  2. In the failing cases, the radixSortImpl code is simply not getting compiled into the function.

Expanding on point 2:
I added a result pointer to track progress through execution. That is, int * result was added to the kernel argument list, and it is set at various points in the code (if (result) *result = 2;). In places, the result is set to uniquely identifiable hex values (0xdeadbeef) so that they are easy to locate in the assembly code.

Tracking the identifiable hex values, the code in the templates that translates the radixSort call to radixSortImpl is always compiled in. Depending on the launch bounds, the body of radixSortImpl does not get included in the failing cases.

The device compiler doesn't support function calls, so everything gets inlined. It is still unknown why the template expansion and inlining exclude some code.
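A sketch of the instrumentation described above; the kernel name and structure are illustrative, not the real radixSort code.

    // Marker technique: an optional result pointer is added to the kernel argument list
    // and set to recognizable values at key points, so that the corresponding stores can
    // be located in the .s output produced with -save-temps=cwd.
    __global__ void instrumentedSketch(int* result) {
      if (result) *result = 2;            // progress marker: entered the wrapper body
      // ... templated code that forwards to the sort implementation ...
      if (result) *result = 0xdeadbeef;   // unique marker: reached the implementation body
    }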
