Fix shorten function and add HIP workarounds #195

mdewing · 2021-07-02T21:04:52Z

Add the workarounds listed in #178

Assign the ibs variable to another variable - fixes radixSort_t (ibs2) and gpuVertexFinder (ibs3)

Remove launch bounds - fixes radixSort_t where the kernel silently fails to run

Fix the shorten function to avoid overwriting other memory if the type size is less than 8 bytes.

If the size of type T is smaller than 8 bytes, the existing shorten function modifies memory beyond the end of its argument. The fix creates an 8-byte temporary inside the function and performs the operations on that. It still reads too much memory (for sizeof(T) < 8), but it no longer writes to that memory.

To be completely memory safe, the unsigned temporary type could be chosen with std::conditional, for example

    using utype = std::conditional<sizeof(float) == 4, uint32_t, uint64_t>::type;

It would need more branches to handle the 1-byte and 2-byte cases as well. The casts to and from uint64_t go through pointers rather than values because we want to transfer the bit patterns, not the values.
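As an illustration of the std::conditional idea described above, here is a minimal sketch, not the code from this PR. The names same_size_uint_t, shorten_sketch and shorten_sketch_demo are hypothetical, and std::memcpy is used in place of the pointer casts to copy the bit pattern without reading or writing outside the argument.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not part of the PR): an unsigned integer type with the
// same size as T, covering the 1-, 2-, 4- and 8-byte cases.
template <typename T>
using same_size_uint_t =
    std::conditional_t<sizeof(T) == 1, uint8_t,
    std::conditional_t<sizeof(T) == 2, uint16_t,
    std::conditional_t<sizeof(T) == 4, uint32_t, uint64_t>>>;

// Keep the NS most significant bytes of t and set the others to zero.
template <int NS, typename T>
void shorten_sketch(T& t) {
  using U = same_size_uint_t<T>;
  static_assert(sizeof(U) == sizeof(T), "T must be 1, 2, 4 or 8 bytes wide");
  static_assert(NS >= 1 && NS <= static_cast<int>(sizeof(T)), "NS out of range");
  constexpr unsigned shift = 8 * (sizeof(T) - NS);
  U bits;
  std::memcpy(&bits, &t, sizeof(T));                // copy the bit pattern, not the value
  bits = static_cast<U>((bits >> shift) << shift);  // clear the low-order bits
  std::memcpy(&t, &bits, sizeof(T));                // touches exactly sizeof(T) bytes
}

// Small self-check of the sketch.
inline void shorten_sketch_demo() {
  uint32_t a = 0x12345678u;
  shorten_sketch<2>(a);
  assert(a == 0x12340000u);

  float f = 1.0f;  // bit pattern 0x3F800000, low two bytes already zero
  shorten_sketch<2>(f);
  assert(f == 1.0f);
}
```

Because the temporary has exactly the size of T, neither the read nor the write extends past the argument, at the cost of the extra type-selection boilerplate.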

fwyzard · 2021-07-03T02:28:02Z

src/hip/CUDACore/radixSort.h


    // broadcast
    ibs = size - 1;
+    ibs3 = ibs;


Could you add comments in the code about why the ibs2/ibs3 workaround is needed?

Where is ibs3 used?

fwyzard · 2021-07-03T02:35:32Z

src/hip/test/radixSort_t.cu

        auto sh = sizeof(uint64_t) - NS;
        sh *= 8;
        auto shorten = [sh](T& t) {
-          auto k = (uint64_t*)(&t);


Is this approach used only in the test, or also in some of the application code?

The shorten function is only used in the radixSort_t test in the cuda* ports and the hip port. The radixSort function is used in plugin-PixelVertexFinding/gpuSortByPt2.h, and it doesn't seem that any such bit-level processing of the numbers is needed there.

(apologies for long delay)

I'm trying to understand what this lambda is supposed to do

    auto sh = sizeof(uint64_t) - NS;
    sh *= 8;
    auto shorten = [sh](T& t) {
      auto k = (uint64_t*)(&t);
      *k = (*k >> sh) << sh;
    };

It gets run for T with sizeof equal to that of int8_t, int16_t, int32_t and int64_t, leading to sh values of 56, 48, 32 and 0. The code zeroes the lowest sh bits of the uint64_t read at the address of t, which (on a little-endian machine) is exactly the part of that uint64_t occupied by T. Adding a printout here (for the cuda program) confirms that whenever sh != 0, both k1 and k2 are 0.
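The effect described above can be reproduced on the host with a small sketch. This is an emulation, not the test code itself: the hypothetical helper original_shorten_emulated copies the value into the start of an 8-byte buffer so that the 8-byte access stays in bounds, and it assumes a little-endian machine and NS defaulting to sizeof(T).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: apply the original "(k >> sh) << sh" to a T stored at
// the start of an 8-byte buffer and return what ends up in the T-sized part.
template <typename T, int NS = sizeof(T)>
T original_shorten_emulated(T value) {
  static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__,
                "This sketch assumes a little-endian architecture");
  static_assert(NS >= 1 && NS <= 8, "NS out of range");
  uint64_t k = 0;
  std::memcpy(&k, &value, sizeof(T));  // T occupies the low-order bytes of k
  const auto sh = 8 * (sizeof(uint64_t) - NS);
  k = (k >> sh) << sh;                 // clears the lowest sh bits
  T result;
  std::memcpy(&result, &k, sizeof(T));
  return result;
}

// Small self-check: everything narrower than 8 bytes collapses to zero.
inline void original_shorten_demo() {
  assert(original_shorten_emulated<int32_t>(0x12345678) == 0);
  assert(original_shorten_emulated<int64_t>(0x1122334455667788LL) == 0x1122334455667788LL);
}
```

In the real lambda the upper bytes come from whatever memory follows t rather than from a zeroed buffer, but the T-sized part is cleared in the same way.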

@VinInn, could you help here with the intention of this lambda? (the code is exactly the same in CMSSW)

To test sorting? One way to speed up radix sort is to consider only the MSBs; for instance, in ptsort only the 16 MSBs are used.
Most probably this lambda is used to verify that the result is the same (I need to check).

> most probably this lambda is used to verify that the result is the same (I need to check)

Right, but setting the compared-to values to zero for all types T shorter than uint64_t doesn't sound very useful for that.

OK. Most probably it was written for 64-bit values and never fixed...

Here is a version that keeps GCC 11 happy and seems to work as intended for arbitrary types:

    auto shorten = [](T& t) {
      // byte representation of t
      char* bytes = reinterpret_cast<char*>(&t);
      // bytes to zero out
      const int zeroes = static_cast<int>(sizeof(T)) - NS;
      // zero out the least significant bytes (assuming a little endian architecture)
      static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "This test assumes a little-endian architecture");
      for (int i = 0; i < zeroes; ++i) {
        bytes[i] = 0x00;
      }
    };
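To make the intended result concrete, the byte-zeroing variant above can be exercised outside the test with a small sketch. The free function zero_low_bytes and the demo function are hypothetical names introduced only for this illustration, with NS passed as a template argument instead of being taken from the enclosing template.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-alone form of the byte-zeroing lambda: keep the NS most
// significant bytes of t and zero the rest (little-endian only).
template <int NS, typename T>
void zero_low_bytes(T& t) {
  static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__,
                "This sketch assumes a little-endian architecture");
  static_assert(NS >= 0 && NS <= static_cast<int>(sizeof(T)), "NS out of range");
  // unsigned char access is always allowed by the aliasing rules
  unsigned char* bytes = reinterpret_cast<unsigned char*>(&t);
  const int zeroes = static_cast<int>(sizeof(T)) - NS;
  for (int i = 0; i < zeroes; ++i) {
    bytes[i] = 0x00;  // the lowest addresses hold the least significant bytes
  }
}

// Small self-check of the sketch.
inline void zero_low_bytes_demo() {
  uint32_t u = 0x12345678u;
  zero_low_bytes<3>(u);    // keep 3 bytes, clear the lowest one
  assert(u == 0x12345600u);

  float f = 3.14159f;      // bit pattern starts with 0x4049
  zero_low_bytes<2>(f);    // 0x40490000 is exactly 3.140625
  assert(f == 3.140625f);
}
```

Unlike the original lambda, a value narrower than 8 bytes keeps its most significant bytes, which is what a comparison against a radix sort on the leading bytes needs.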

And here is a version that is more similar to the original, and avoids the loop, but adds a lot of boilerplate to keep the compiler happy about "aliasing rules":

    // A templated unsigned integer type with N bits
    template <int N>
    struct uintN;

    template <>
    struct uintN<8> {
      using type = uint8_t;
    };

    template <>
    struct uintN<16> {
      using type = uint16_t;
    };

    template <>
    struct uintN<32> {
      using type = uint32_t;
    };

    template <>
    struct uintN<64> {
      using type = uint64_t;
    };

    template <int N>
    using uintN_t = typename uintN<N>::type;

    // A templated unsigned integer type with the same size as T
    template <typename T>
    using uintT_t = uintN_t<sizeof(T) * 8>;

    // Keep only the `N` most significant bytes of `t`, and set the others to zero
    template <int N, typename T, typename SFINAE = std::enable_if_t<N <= sizeof(T)>>
    void shorten(T& t) {
      const int shift = 8 * (sizeof(T) - N);
      union {
        T t;
        uintT_t<T> u;
      } c;
      c.t = t;
      c.u = c.u >> shift << shift;
      t = c.t;
    }

PR'ed in #209.

VinInn · 2021-07-28T08:14:35Z

src/hip/CUDACore/radixSort.h


    template <typename T, int NS = sizeof(T)>
-    __global__ void __launch_bounds__(256, 4)
+    // The launch bounds seems to cause the kernel to silently fail to run (rocm 4.2)


Known issue?
What does hipify tell you about it?

VinInn · 2021-07-28T08:14:55Z

src/hip/CUDACore/radixSort.h

+      if (threadIdx.x == 0) {
        ibs -= sb;
+        // Workaround for problems in radixSort_t.
+        ibs2 = ibs;


Which problem?

Where is ibs2 used?

I suspect what is needed is a threadfence

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
The ROCm memory model may be even weaker than CUDA's.

Using a threadfence instead of the temporary variable works.

VinInn · 2021-07-28T09:42:30Z

From the HIP documentation:

The __syncthreads() built-in function is supported in HIP. The __syncthreads_count(int), __syncthreads_and(int) and __syncthreads_or(int) functions are under development.

Not sure what exactly that means.

The workaround involving ibs3 for a hang in gpuVertexFinder is no longer needed in ROCm 4.3. The workaround involving ibs2 can be replaced with a __threadfence();

makortel · 2021-08-19T01:18:50Z

@markdewing Could you check whether the version of shorten() merged in #209 works for HIP too?

mdewing · 2021-08-19T16:37:28Z

@makortel The version of shorten from #209 works on HIP.

The remaining changes needed to make the radixSort_t test pass on HIP are

remove __launch_bounds__ from the function declaration near the end of radixSort.h
add __threadfence(); around line 182 of radixSort.h

makortel · 2021-08-19T17:24:52Z

Thanks @markdewing. Could you change this PR (or close this one and open a new one) to contain

The remaining changes needed to make the radixSort_t test pass on HIP are

remove __launch_bounds__ from the function declaration near the end of radixSort.h

add __threadfence(); around line 182 of radixSort.h

and then I'll merge?

mdewing · 2021-08-19T17:34:22Z

I might open a new PR with the simplified changes.

mdewing added 2 commits July 2, 2021 15:50

Workarounds for HIP port

b53d646

Workarounds listed in cms-patatrack#178 Assign the 'ibs' variable to another variable - fixes radixSort_t (ibs2) and gpuVertexFinder_t (ibs3) Remove launch bounds - fixes radixSort_t where the kernel silently fails to run

fwyzard reviewed Jul 3, 2021

Add comments about the workarounds

bc5dfc6

makortel added the hip label Jul 28, 2021

VinInn reviewed Jul 28, 2021

Updates to workarounds

3d57cac

The workaround involving ibs3 for a hang in gpuVertexFinder is no longer needed in ROCm 4.3. The workaround involving ibs2 can be replaced with a threadfence();

makortel mentioned this pull request Aug 17, 2021

[GCC11] HeterogeneousCore/CUDAUtilities: array subscript is partly outside array bounds cms-sw/cmssw#34917

Closed

mdewing mentioned this pull request Aug 20, 2021

Workarounds for the HIP port #210

Merged

mdewing closed this Aug 20, 2021
