-
Notifications
You must be signed in to change notification settings - Fork 47
Fix shorten function and add HIP workarounds #195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
If the size of type T is smaller than 8 bytes, the existing shorten function will modify memory beyond the location of the existing argument. The fix creates a temporary value of 8 bytes in the function to perform the operations on. It still reads too much memory (for T < 8 bytes), but it will no longer write to that memory. To be completely memory safe, the unsigned temporary type could be chosen with std::conditional, like using utype = std::conditional<sizeof(float) == 4, uint32_t, uint64_t>::type; It would need more branches to handle the size 1 and 2 case as well. The casts to and from uint64_t involve pointers rather than values because we want to transfer the bit patterns, not the values.
Workarounds listed in cms-patatrack#178 Assign the 'ibs' variable to another variable - fixes radixSort_t (ibs2) and gpuVertexFinder_t (ibs3) Remove launch bounds - fixes radixSort_t where the kernel silently fails to run
src/hip/CUDACore/radixSort.h
Outdated
|
|
||
| // broadcast | ||
| ibs = size - 1; | ||
| ibs3 = ibs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add comments on the code about why the ibs2/ibs3 workaround is needed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where ibs3 is used?
| auto sh = sizeof(uint64_t) - NS; | ||
| sh *= 8; | ||
| auto shorten = [sh](T& t) { | ||
| auto k = (uint64_t*)(&t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this approach used only in the test ? or also in some of the application code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shorten function is only used in the radixSort_t test in the cuda* ports and the hip port. The radixSort function is used in plugin-PixelVertexFinding/gpuSortByPt2.h, and it doesn't seem that any such bit-level processing of the numbers is needed there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(apologies for long delay)
I'm trying to understand what this lambda is supposed to do
auto sh = sizeof(uint64_t) - NS;
sh *= 8;
auto shorten = [sh](T& t) {
auto k = (uint64_t*)(&t);
*k = (*k >> sh) << sh;
};It gets run for T with sizeof equal to that of int8_t, int16_t, int32_t, int64_t, leading to sh values of 56, 48, 32. The code appears to zero the lowest sh number of bits, i.e. the portion of T of the uint64_t piece of memory. Adding a printout here (for cuda program) confirms that whenever sh != 0, both k1 and k2 are 0.
@VinInn, could you help here with the intention of this lambda? (the code is exactly the same in CMSSW)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to test sorting? (one way to speed up radix sort is to consider only the MSBs. for instance in ptsort only the 16MSBs are used.
most probably this lambda is used to verify that the result is the same (I need to check)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most probably this lambda is used to verify that the result is the same (I need to check)
Right, but setting the compared-to values to zero for all types T shorter than uint64_t doesn't sound very useful for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. most probably was written for 64bit values and never fix it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a version that keep GCC 11 happy and seems to work as intended for arbitrary types:
auto shorten = [](T& t) {
// byte representation of t
char* bytes = reinterpret_cast<char*>(&t);
// bytes to zero out
const int zeroes = static_cast<int>(sizeof(T)) - NS;
// zero out the least significant bytes (assuming a little endian architecture)
static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "This test assumes a little-endian architecture");
for (int i = 0; i < zeroes; ++i) {
bytes[i] = 0x00;
}
};There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here is a version that is more similar to the original, and avoids the loop, but adds a lot of boilerplate to keep the compiler happy about "aliasing rules":
// A templated unsigned integer type with N bytes
template <int N>
struct uintN;
template <>
struct uintN<8> {
using type = uint8_t;
};
template <>
struct uintN<16> {
using type = uint16_t;
};
template <>
struct uintN<32> {
using type = uint32_t;
};
template <>
struct uintN<64> {
using type = uint64_t;
};
template <int N>
using uintN_t = typename uintN<N>::type;
// A templated unsigned integer type with the same size as T
template <typename T>
using uintT_t = uintN_t<sizeof(T) * 8>;
// Keep only the `N` most significant bytes of `t`, and set the others to zero
template <int N, typename T, typename SFINAE = std::enable_if_t<N <= sizeof(T)>>
void shorten(T& t) {
const int shift = 8 * (sizeof(T) - N);
union {
T t;
uintT_t<T> u;
} c;
c.t = t;
c.u = c.u >> shift << shift;
t = c.t;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR'ed in #209.
|
|
||
| template <typename T, int NS = sizeof(T)> | ||
| __global__ void __launch_bounds__(256, 4) | ||
| // The launch bounds seems to cause the kernel to silently fail to run (rocm 4.2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
known issue?
what hippify tells you about?
src/hip/CUDACore/radixSort.h
Outdated
| if (threadIdx.x == 0) { | ||
| ibs -= sb; | ||
| // Workaround for problems in radixSort_t. | ||
| ibs2 = ibs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where ibs2 is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect what is needed is a threadfence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
RocM memory model maybe even weaker than CUDA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a threadfence instead of the temporary variable works.
|
from HIP doc not sure what exaclly means |
The workaround involving ibs3 for a hang in gpuVertexFinder is no longer needed in ROCm 4.3. The workaround invovling ibs2 can be replaced with a threadfence();
|
@markdewing Could you try if the version of |
|
Thanks @markdewing. Could you change this PR (or close this one and open a new one) to contain
and then I'll merge? |
|
I might open a new PR with the simplified changes. |
Add the workarounds listed in #178
Assign the
ibsvariable to another variable - fixes radixSort_t (ibs2) and gpuVertexFinder (ibs3)Remove launch bounds - fixes radixSort_t where the kernel silently fails to run
Fix the shorten function to avoid overwriting other memory if the type size is less than 8 bytes.