[alpaka] add element_stride class and test #190
makortel left a comment
Some notes from first read(s).
The code should also be formatted with clang-format (in principle, I can even do that myself just before the merge).
     * Class which simplifies "for" loops over elements index
     */
    template <typename T, typename T_Acc>
    class elements_with_stride {
The relationship between elements_with_stride and elements_with_stride_<N>d is not clear to me.
I missed the addition of the dimIndex argument, never mind.
The idea is that elements_with_stride should loop over a single index using a scalar variable; usually this is the 0th index (assuming a one-dimensional kernel), but it can be chosen with dimIndex.
elements_with_stride_<N>d, on the other hand, should loop over an N-dimensional space using a Vec3D variable.
In fact, now that it is clear that the platform and device are independent of the dimensionality, it makes sense to change elements_with_stride_<N>d to use a Vec<N>D instead of always using a Vec3D.
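To illustrate the single-index case described above, here is a minimal host-only sketch of a range that can be used in a range-based for loop. It is not the PR's implementation and does not use alpaka: the names `strided_range` and `visited` are hypothetical, and the first index, stride, and extent are passed in explicitly, whereas the real class would presumably derive them from the accelerator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for a 1D strided element loop:
// visits first, first + stride, first + 2 * stride, ... while below extent.
class strided_range {
public:
  class iterator {
  public:
    iterator(std::size_t index, std::size_t stride, std::size_t extent)
        : index_(index < extent ? index : extent), stride_(stride), extent_(extent) {}

    std::size_t operator*() const { return index_; }

    iterator& operator++() {
      // Clamp to extent so that every exhausted iterator compares equal to end().
      index_ = (extent_ - index_ > stride_) ? index_ + stride_ : extent_;
      return *this;
    }

    bool operator!=(iterator const& other) const { return index_ != other.index_; }

  private:
    std::size_t index_;
    std::size_t stride_;
    std::size_t extent_;
  };

  strided_range(std::size_t first, std::size_t stride, std::size_t extent)
      : first_(first), stride_(stride), extent_(extent) {
    assert(stride > 0);  // a zero stride would never terminate
  }

  iterator begin() const { return iterator(first_, stride_, extent_); }
  iterator end() const { return iterator(extent_, stride_, extent_); }

private:
  std::size_t first_;
  std::size_t stride_;
  std::size_t extent_;
};

// Collect the indices one "thread" would visit (for illustration only).
std::vector<std::size_t> visited(std::size_t first, std::size_t stride, std::size_t extent) {
  std::vector<std::size_t> out;
  for (std::size_t i : strided_range(first, stride, extent)) {
    out.push_back(i);
  }
  return out;
}
```

With these assumptions, a thread starting at index 1 with a stride of 4 over 10 elements visits indices 1, 5 and 9. Clamping the iterator to the extent keeps the end comparison a plain inequality, which is what the range-based for loop requires.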
Here is the comparison between Alpaka-CUDA and native CUDA for atomics and barriers. Each result is the average and standard deviation over 10 runs. The tests used were added in this PR as well.
For atomics, I used 256 threads/block.
NVIDIA V100: (results not preserved)
NVIDIA T4: (results not preserved)
For the syncThreads test, I used 1024 threads/block. For the threadfence test, I used 256 threads/block.
NVIDIA V100: (results not preserved)
NVIDIA T4: (results not preserved)
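For reference, the average and standard deviation over a set of runs can be computed as in the sketch below. This is only an illustration: the `Stats` struct and `summarize` function are hypothetical names, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the comment does not say which estimator was used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of repeated timing measurements.
struct Stats {
  double mean;
  double stddev;
};

// Mean and sample standard deviation (n - 1 in the denominator, an assumption).
Stats summarize(std::vector<double> const& samples) {
  assert(!samples.empty());
  double sum = 0.0;
  for (double s : samples) {
    sum += s;
  }
  double const mean = sum / static_cast<double>(samples.size());

  // With a single sample the spread is undefined; report 0 here.
  if (samples.size() < 2) {
    return Stats{mean, 0.0};
  }

  double squares = 0.0;
  for (double s : samples) {
    squares += (s - mean) * (s - mean);
  }
  double const variance = squares / static_cast<double>(samples.size() - 1);
  return Stats{mean, std::sqrt(variance)};
}
```

Passing the 10 measured times of one configuration to `summarize` would give one mean and one standard deviation per device and test.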
Rebased and fixed conflicts.
    // increment the 3rd index and check its value
    index_[2u] += 1;
    if (index_[2u] == old_index_[2u] + blockDim[2u])
      index_[2u] = old_index_[2u];

    // if the 3rd index was reset, increment the 2nd index
    if (index_[2u] == old_index_[2u])
      index_[1u] += 1;
    if (index_[1u] == old_index_[1u] + blockDim[1u])
      index_[1u] = old_index_[1u];

    // if the 3rd and 2nd indices were set, increment the first coordinate
    if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
      index_[0u] += 1;

    if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u]) {
      return *this;
    }
This part seems inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above: there the iteration is only over the 0th index, while here it is over all three indices.
    // increment the 3rd index and check its value
    index_[2u] += 1;
    if (index_[2u] == old_index_[2u] + blockDim[2u])
      index_[2u] = old_index_[2u];

    // if the 3rd index was reset, increment the 2nd index
    if (index_[2u] == old_index_[2u])
      index_[1u] += 1;
    if (index_[1u] == old_index_[1u] + blockDim[1u] || index_[1u] == extent_[1u])
      index_[1u] = old_index_[1u];

    // if the 3rd and 2nd indices were set, increment the first coordinate
    if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
      index_[0u] += 1;

    if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u] && index_[1u] < extent_[1u]) {
      return *this;
    }
This part seems inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above: there the iteration is only over the 0th and 1st indices, while here it is over all three indices.
The order of the increments also seems different.
    // increment the 3rd index and check its value
    index_[2u] += 1;
    if (index_[2u] == old_index_[2u] + blockDim[2u] || index_[2u] == extent_[2u])
      index_[2u] = old_index_[2u];

    // if the 3rd index was reset, increment the 2nd index
    if (index_[2u] == old_index_[2u])
      index_[1u] += 1;
    if (index_[1u] == old_index_[1u] + blockDim[1u] || index_[1u] == extent_[1u])
      index_[1u] = old_index_[1u];

    // if the 3rd and 2nd indices were set, increment the first coordinate
    if (index_[1u] == old_index_[1u] && index_[2u] == old_index_[2u])
      index_[0u] += 1;
    if (index_[0u] < old_index_[0u] + blockDim[0u] && index_[0u] < extent_[0u] && index_[1u] < extent_[1u] &&
        index_[2u] < extent_[2u]) {
      return *this;
    }
The order of the increments is inconsistent with the ALPAKA_ACC_GPU_CUDA_ENABLED case above.
Is this intended?
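One way to make the increment order explicit, and the same in every backend, is an odometer-style carry loop. The host-only sketch below is not the code under review: `Idx3`, `advance` and `enumerate` are hypothetical names, and letting the last index run fastest is an assumption made only for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Idx3 = std::array<std::size_t, 3>;

// Advance `index` inside the box [first, min(first + block, extent)) in each
// dimension, with the last dimension running fastest (an assumption).
// Returns false once the box is exhausted.
bool advance(Idx3& index, Idx3 const& first, Idx3 const& block, Idx3 const& extent) {
  for (std::size_t d = 3; d-- > 0;) {
    std::size_t const upper = std::min(first[d] + block[d], extent[d]);
    if (++index[d] < upper) {
      return true;  // no carry needed
    }
    index[d] = first[d];  // wrap this dimension and carry into the next one
  }
  return false;
}

// List every index of the box in visiting order (for illustration only).
std::vector<Idx3> enumerate(Idx3 const& first, Idx3 const& block, Idx3 const& extent) {
  std::vector<Idx3> out;
  for (std::size_t d = 0; d < 3; ++d) {
    if (first[d] >= std::min(first[d] + block[d], extent[d])) {
      return out;  // the box is empty in this dimension
    }
  }
  Idx3 index = first;
  do {
    out.push_back(index);
  } while (advance(index, first, block, extent));
  return out;
}
```

Because the bound in each dimension is the minimum of the block end and the extent, a single comparison per dimension covers both reset conditions, and the order of the carries is fixed by the loop direction alone.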
Rebased etc.
The new classes implement a range-based for loop over element indices. In addition, I added a test for the new classes.