Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 11 additions & 2 deletions src/hip/CUDACore/radixSort.h
Original file line number Diff line number Diff line change
Expand Up @@ -148,6 +148,8 @@ __device__ __forceinline__ void radixSortImpl(

// broadcast
ibs = size - 1;


__syncthreads();
while (__syncthreads_and(ibs > 0)) {
int i = ibs - threadIdx.x;
Expand Down Expand Up @@ -177,8 +179,13 @@ __device__ __forceinline__ void radixSortImpl(
__syncthreads();
if (bin >= 0)
assert(c[bin] >= 0);
if (threadIdx.x == 0)
if (threadIdx.x == 0) {
ibs -= sb;

// Fix for problems in radixSort_t.
// Without this, this c[bin] >=0 assert above will be triggered.
__threadfence();
}
__syncthreads();
}

Expand Down Expand Up @@ -260,7 +267,9 @@ namespace cms {
namespace hip {

template <typename T, int NS = sizeof(T)>
__global__ void __launch_bounds__(256, 4)
// The launch bounds seems to cause the kernel to silently fail to run (rocm 4.2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

known issue?
what hippify tells you about?

//__global__ void __launch_bounds__(256, 4)
__global__ void
radixSortMultiWrapper(T const* v, uint16_t* index, uint32_t const* offsets, uint16_t* workspace) {
radixSortMulti<T, NS>(v, index, offsets, workspace);
}
Expand Down
5 changes: 3 additions & 2 deletions src/hip/test/radixSort_t.cu
Original file line number Diff line number Diff line change
Expand Up @@ -140,8 +140,9 @@ void go(bool useShared) {
auto sh = sizeof(uint64_t) - NS;
sh *= 8;
auto shorten = [sh](T& t) {
auto k = (uint64_t*)(&t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this approach used only in the test ? or also in some of the application code ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shorten function is only used in the radixSort_t test in the cuda* ports and the hip port. The radixSort function is used in plugin-PixelVertexFinding/gpuSortByPt2.h, and it doesn't seem that any such bit-level processing of the numbers is needed there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(apologies for long delay)

I'm trying to understand what this lambda is supposed to do

        auto sh = sizeof(uint64_t) - NS;
        sh *= 8;
        auto shorten = [sh](T& t) {
          auto k = (uint64_t*)(&t);
          *k = (*k >> sh) << sh;
        };

It gets run for T with sizeof equal to that of int8_t, int16_t, int32_t, int64_t, leading to sh values of 56, 48, 32. The code appears to zero the lowest sh number of bits, i.e. the portion of T of the uint64_t piece of memory. Adding a printout here (for cuda program) confirms that whenever sh != 0, both k1 and k2 are 0.

@VinInn, could you help here with the intention of this lambda? (the code is exactly the same in CMSSW)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to test sorting? (one way to speed up radix sort is to consider only the MSBs. for instance in ptsort only the 16MSBs are used.
most probably this lambda is used to verify that the result is the same (I need to check)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most probably this lambda is used to verify that the result is the same (I need to check)

Right, but setting the compared-to values to zero for all types T shorter than uint64_t doesn't sound very useful for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. most probably was written for 64bit values and never fix it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a version that keep GCC 11 happy and seems to work as intended for arbitrary types:

auto shorten = [](T& t) {
    // byte representation of t
    char* bytes = reinterpret_cast<char*>(&t);
    // bytes to zero out
    const int zeroes = static_cast<int>(sizeof(T)) - NS;
    // zero out the least significant bytes (assuming a little endian architecture)
    static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "This test assumes a little-endian architecture");
    for (int i = 0; i < zeroes; ++i) {
        bytes[i] = 0x00;
    }
};

Copy link
Contributor

@fwyzard fwyzard Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here is a version that is more similar to the original, and avoids the loop, but adds a lot of boilerplate to keep the compiler happy about "aliasing rules":

// A templated unsigned integer type with N bytes
template <int N>
struct uintN;

template <>
struct uintN<8> {
  using type = uint8_t;
};

template <>
struct uintN<16> {
  using type = uint16_t;
};

template <>
struct uintN<32> {
  using type = uint32_t;
};

template <>
struct uintN<64> {
  using type = uint64_t;
};

template <int N>
using uintN_t = typename uintN<N>::type;

// A templated unsigned integer type with the same size as T
template <typename T>
using uintT_t = uintN_t<sizeof(T) * 8>;

// Keep only the `N` most significant bytes of `t`, and set the others to zero
template <int N, typename T, typename SFINAE = std::enable_if_t<N <= sizeof(T)>>
void shorten(T& t) {
  const int shift = 8 * (sizeof(T) - N);
  union {
    T t;
    uintT_t<T> u;
  } c;
  c.t = t;
  c.u = c.u >> shift << shift;
  t = c.t;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR'ed in #209.

*k = (*k >> sh) << sh;
uint64_t k = *(uint64_t *)(&t);
k = (k >> sh) << sh;
t = *(T*)(&k);
};
shorten(k1);
shorten(k2);
Expand Down