
Description
Currently, HIP implements atomicMin/Max for single and double precision floating-point values as CAS loops. However, in fast-math scenarios, on architectures with hardware support for signed/unsigned integer atomicMin/Max, a better implementation is possible. As per https://stackoverflow.com/a/72461459, for single precision:
__device__ __forceinline__ float atomicMinFloat(float* addr, float value) {
    float old;
    old = !signbit(value) ? __int_as_float(atomicMin((int*)addr, __float_as_int(value))) :
                            __uint_as_float(atomicMax((unsigned int*)addr, __float_as_uint(value)));
    return old;
}

__device__ __forceinline__ float atomicMaxFloat(float* addr, float value) {
    float old;
    old = !signbit(value) ? __int_as_float(atomicMax((int*)addr, __float_as_int(value))) :
                            __uint_as_float(atomicMin((unsigned int*)addr, __float_as_uint(value)));
    return old;
}
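
Since double precision is mentioned above as well, a possible analogue (my sketch, not taken from the linked answer) would apply the same bit-pattern ordering trick with 64-bit integer atomics. This assumes the target provides signed and unsigned 64-bit integer atomicMin/Max, and it shares the single-precision version's fast-math caveats around NaN and -0.0:

__device__ __forceinline__ double atomicMinDouble(double* addr, double value) {
    double old;
    // Non-negative doubles order like signed 64-bit integers; negative doubles
    // order in reverse like unsigned 64-bit integers, hence min -> unsigned max.
    old = !signbit(value)
        ? __longlong_as_double(atomicMin((long long*)addr,
                                         __double_as_longlong(value)))
        : __longlong_as_double((long long)atomicMax((unsigned long long*)addr,
                                         (unsigned long long)__double_as_longlong(value)));
    return old;
}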
Still better implementations are possible on NVIDIA using Opportunistic Warp-level Programming, wherein one first checks whether any other active threads in the warp target the same addr and, if so, performs the reduction at the warp level before issuing the atomic. This greatly cuts down the number of RMW operations that leave the core when there is contention; a sketch of the idea follows. I suspect a similar idea can carry over to AMD GPUs.
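
For illustration, here is a minimal sketch of that warp-aggregation idea on NVIDIA, assuming Volta or newer (for __match_any_sync), one-dimensional thread blocks, and the atomicMinFloat helper above; warpAggregatedAtomicMinFloat is an illustrative name, not an existing API:

__device__ float warpAggregatedAtomicMinFloat(float* addr, float value) {
    unsigned mask  = __activemask();
    // Group the active lanes by the address they intend to update (Volta+ intrinsic).
    unsigned peers = __match_any_sync(mask, (unsigned long long)addr);
    int lane   = threadIdx.x & 31;   // assumes 1-D blocks sized in multiples of 32
    int leader = __ffs(peers) - 1;   // lowest-numbered lane in the peer group

    // Reduce 'value' across the peer group; every peer walks the same bitmask,
    // so the __shfl_sync calls stay convergent within the group.
    float v = value;
    for (unsigned remaining = peers; remaining; remaining &= remaining - 1) {
        int src = __ffs(remaining) - 1;
        v = fminf(v, __shfl_sync(peers, value, src));
    }

    // Only the leader issues the RMW that leaves the core.
    float old = 0.0f;
    if (lane == leader)
        old = atomicMinFloat(addr, v);
    // Broadcast the returned 'old' so every caller in the group gets a result.
    return __shfl_sync(peers, old, leader);
}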