Skip to content

Conversation

@mdewing
Copy link
Contributor

@mdewing mdewing commented Jul 2, 2021

Add the workarounds listed in #178

Assign the ibs variable to another variable - fixes radixSort_t (ibs2) and gpuVertexFinder (ibs3)

Remove launch bounds - fixes radixSort_t where the kernel silently fails to run

Fix the shorten function to avoid overwriting other memory if the type size is less than 8 bytes.

mdewing added 2 commits July 2, 2021 15:50
If the size of type T is smaller than 8 bytes, the existing shorten
function will modify memory beyond the location of the existing argument.

The fix creates a temporary value of 8 bytes in the function to perform the
operations on.  It still reads too much memory (for T < 8 bytes), but it
will no longer write to that memory.

To be completely memory safe, the unsigned temporary type could be
chosen with std::conditional, like
  using utype = std::conditional<sizeof(float) == 4, uint32_t, uint64_t>::type;
It would need more branches to handle the size 1 and 2 case as well.

The casts to and from uint64_t involve pointers rather than values
because we want to transfer the bit patterns, not the values.
Workarounds listed in cms-patatrack#178

Assign the 'ibs' variable to another variable - fixes radixSort_t (ibs2) and
gpuVertexFinder_t (ibs3)

Remove launch bounds - fixes radixSort_t where the kernel silently fails
to run

// broadcast
ibs = size - 1;
ibs3 = ibs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add comments on the code about why the ibs2/ibs3 workaround is needed ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where ibs3 is used?

auto sh = sizeof(uint64_t) - NS;
sh *= 8;
auto shorten = [sh](T& t) {
auto k = (uint64_t*)(&t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this approach used only in the test ? or also in some of the application code ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shorten function is only used in the radixSort_t test in the cuda* ports and the hip port. The radixSort function is used in plugin-PixelVertexFinding/gpuSortByPt2.h, and it doesn't seem that any such bit-level processing of the numbers is needed there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(apologies for long delay)

I'm trying to understand what this lambda is supposed to do

        auto sh = sizeof(uint64_t) - NS;
        sh *= 8;
        auto shorten = [sh](T& t) {
          auto k = (uint64_t*)(&t);
          *k = (*k >> sh) << sh;
        };

It gets run for T with sizeof equal to that of int8_t, int16_t, int32_t, int64_t, leading to sh values of 56, 48, 32. The code appears to zero the lowest sh number of bits, i.e. the portion of T of the uint64_t piece of memory. Adding a printout here (for cuda program) confirms that whenever sh != 0, both k1 and k2 are 0.

@VinInn, could you help here with the intention of this lambda? (the code is exactly the same in CMSSW)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to test sorting? (one way to speed up radix sort is to consider only the MSBs. for instance in ptsort only the 16MSBs are used.
most probably this lambda is used to verify that the result is the same (I need to check)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most probably this lambda is used to verify that the result is the same (I need to check)

Right, but setting the compared-to values to zero for all types T shorter than uint64_t doesn't sound very useful for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. most probably was written for 64bit values and never fix it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a version that keep GCC 11 happy and seems to work as intended for arbitrary types:

auto shorten = [](T& t) {
    // byte representation of t
    char* bytes = reinterpret_cast<char*>(&t);
    // bytes to zero out
    const int zeroes = static_cast<int>(sizeof(T)) - NS;
    // zero out the least significant bytes (assuming a little endian architecture)
    static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "This test assumes a little-endian architecture");
    for (int i = 0; i < zeroes; ++i) {
        bytes[i] = 0x00;
    }
};

Copy link
Contributor

@fwyzard fwyzard Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here is a version that is more similar to the original, and avoids the loop, but adds a lot of boilerplate to keep the compiler happy about "aliasing rules":

// A templated unsigned integer type with N bytes
template <int N>
struct uintN;

template <>
struct uintN<8> {
  using type = uint8_t;
};

template <>
struct uintN<16> {
  using type = uint16_t;
};

template <>
struct uintN<32> {
  using type = uint32_t;
};

template <>
struct uintN<64> {
  using type = uint64_t;
};

template <int N>
using uintN_t = typename uintN<N>::type;

// A templated unsigned integer type with the same size as T
template <typename T>
using uintT_t = uintN_t<sizeof(T) * 8>;

// Keep only the `N` most significant bytes of `t`, and set the others to zero
template <int N, typename T, typename SFINAE = std::enable_if_t<N <= sizeof(T)>>
void shorten(T& t) {
  const int shift = 8 * (sizeof(T) - N);
  union {
    T t;
    uintT_t<T> u;
  } c;
  c.t = t;
  c.u = c.u >> shift << shift;
  t = c.t;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR'ed in #209.

@makortel makortel added the hip label Jul 28, 2021

template <typename T, int NS = sizeof(T)>
__global__ void __launch_bounds__(256, 4)
// The launch bounds seems to cause the kernel to silently fail to run (rocm 4.2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

known issue?
what hippify tells you about?

if (threadIdx.x == 0) {
ibs -= sb;
// Workaround for problems in radixSort_t.
ibs2 = ibs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where ibs2 is used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect what is needed is a threadfence

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a threadfence instead of the temporary variable works.

@VinInn
Copy link
Contributor

VinInn commented Jul 28, 2021

from HIP doc

The __syncthreads() built-in function is supported in HIP. The __syncthreads_count(int), __syncthreads_and(int) and __syncthreads_or(int) functions are under development.

not sure what exaclly means

The workaround involving ibs3 for a hang in gpuVertexFinder is no
longer needed in ROCm 4.3.
The workaround invovling ibs2 can be replaced with a threadfence();
@makortel
Copy link
Collaborator

@markdewing Could you try if the version of shorten() merged in #209 would work for HIP too?

@mdewing
Copy link
Contributor Author

mdewing commented Aug 19, 2021

@makortel The version of shorten from #209 works on HIP.

The remaining changes needed to make the radixSort_t test pass on HIP are

  • remove __launch_bounds__ from the function declaration near the end of radixSort.h
  • add __threadfence(); around line 182 of radixSort.h

@makortel
Copy link
Collaborator

Thanks @markdewing. Could you change this PR (or close this one and open a new one) to contain

The remaining changes needed to make the radixSort_t test pass on HIP are

  • remove __launch_bounds__ from the function declaration near the end of radixSort.h
  • add __threadfence(); around line 182 of radixSort.h

and then I'll merge?

@mdewing
Copy link
Contributor Author

mdewing commented Aug 19, 2021

I might open a new PR with the simplified changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants