@PaulGannay
Contributor

No description provided.

@PaulGannay PaulGannay marked this pull request as draft December 10, 2025 15:33
Member

@pzehner pzehner left a comment

This is better with the tables, indeed. I think that the layout of the slides should be modified to separate the code from its execution.

Comment on lines 175 to 734
% Trainees can experiment with the following program to check that it really presents a race condition:
%#include <iostream>
%#include <Kokkos_Core.hpp>
%
%int main(int argc, char *argv[]) {
% Kokkos::initialize(argc, argv);
% {
% const int N = 10000;
% Kokkos::View<double*> v("v", N);
% Kokkos::deep_copy(v, 4);
%
% Kokkos::View<double> res("res");
%
% Kokkos::parallel_for(Kokkos::RangePolicy(0, N),
% KOKKOS_LAMBDA(int i) {
% //Kokkos::atomic_add(&res(), v(i));
% res() = res() + v(i);
% });
%
% double res_;
%
% Kokkos::deep_copy(res_, res);
%
% std::cout << "res_:" << res_ << std::endl;
% std::cout << "4*N:" << 4*N << std::endl;
% }
% Kokkos::finalize();
%}
Member

It's a good idea for an exercise.
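
As a possible starting point for such an exercise, here is a minimal sketch of the same lost-update effect in standard C++ threads rather than Kokkos. The names `atomic_add` and `accumulate` are hypothetical helpers, not Kokkos API. The unprotected variant splits the update into a separate load and store on a `std::atomic<double>`, which keeps the program well defined while still letting two threads overwrite each other's contribution.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: adds `value` to `target` as one indivisible
// read-modify-write, implemented with a compare-and-swap loop.
inline void atomic_add(std::atomic<double>& target, double value) {
  double expected = target.load();
  // On failure, compare_exchange_weak refreshes `expected` with the current value.
  while (!target.compare_exchange_weak(expected, expected + value)) {
  }
}

// Sums n copies of 4.0 into one shared accumulator using `num_threads` threads.
// With use_atomic == false, the load and the store are two separate steps, so
// two threads can read the same old value and one of the updates is lost.
inline double accumulate(int n, int num_threads, bool use_atomic) {
  std::atomic<double> res{0.0};
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&res, n, num_threads, t, use_atomic] {
      for (int i = t; i < n; i += num_threads) {
        if (use_atomic) {
          atomic_add(res, 4.0);
        } else {
          res.store(res.load() + 4.0);  // another thread may store in between
        }
      }
    });
  }
  for (auto& w : workers) w.join();
  return res.load();
}
```

Calling `accumulate(10000, 4, true)` always returns `4 * N`, whereas the result with `use_atomic == false` can vary from run to run and never exceeds `4 * N`.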

\item they bypass and invalidate cache lines.
\end{itemize}

=> Atomics should be used with care and only when strictly necessary.\linebreak
Member

Add that sometimes, developers should change an algorithm that depends on atomics.

This is especially true for algorithms that iterate over faces of a mesh, then over the two cells neighboring the face. This pattern is very common for unstructured CFD codes that run on CPU, because then you compute the flux between the two cells only once. This can be ported to GPU as is, but sometimes the best strategy is actually to rewrite the algorithm to iterate over the cells directly.
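
To make the two loop structures concrete, here is a small serial sketch on a 1D mesh, where face `f` sits between cell `f` and cell `f + 1`. The flux formula and the names `face_flux`, `residual_by_faces` and `residual_by_cells` are assumptions chosen for illustration, not taken from any CFD code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flux through the face between cell `left` and cell `left + 1`.
inline double face_flux(const std::vector<double>& u, std::size_t left) {
  return u[left] - u[left + 1];
}

// Face-based loop: each flux is computed once and scattered to both neighbors.
// If this loop ran in parallel over faces, two adjacent faces would update the
// same cell, so both updates would have to be atomic.
inline std::vector<double> residual_by_faces(const std::vector<double>& u) {
  std::vector<double> r(u.size(), 0.0);
  for (std::size_t f = 0; f + 1 < u.size(); ++f) {
    const double flux = face_flux(u, f);
    r[f] -= flux;
    r[f + 1] += flux;
  }
  return r;
}

// Cell-based loop: each cell gathers the fluxes of its own faces and writes
// only to its own entry, so a parallel version needs no atomics. The price is
// that every interior flux is computed twice.
inline std::vector<double> residual_by_cells(const std::vector<double>& u) {
  std::vector<double> r(u.size(), 0.0);
  for (std::size_t c = 0; c < u.size(); ++c) {
    if (c > 0) r[c] += face_flux(u, c - 1);
    if (c + 1 < u.size()) r[c] -= face_flux(u, c);
  }
  return r;
}
```

Both functions return the same residual; the cell-based version trades redundant flux evaluations for independent writes.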

Contributor Author

I think you are right, but it deserves its own set of slides. I didn't have time today; I'll look into it later.
Thank you for the proposed example.

Comment on lines 317 to 318
For some of your needs, more performant alternatives exist, like \texttt{parallel\_reduce} or \texttt{Kokkos::ScatterView}.
\end{frame}
Member

Beware, because on GPU a ScatterView will not give you any performance gain.

Member

Following a discussion I had this morning, maybe it's better not to talk about ScatterView in this tutorial. It's still experimental and may be less suited to today's CPUs. In particular, since it relies on data duplication on CPU, it may be counterproductive on CPUs with a very large number of threads (say, more than 100).
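
For context, the data-duplication strategy mentioned above can be sketched in standard C++ threads. This is only an illustration of the general idea, not ScatterView's actual implementation, and `histogram_duplicated` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Histogram of `bins` (each entry in [0, num_bins)) built with one private
// copy of the counters per thread.
inline std::vector<long> histogram_duplicated(const std::vector<int>& bins,
                                              int num_bins, int num_threads) {
  // Duplication: num_threads * num_bins counters instead of num_bins.
  std::vector<std::vector<long>> copies(
      num_threads, std::vector<long>(num_bins, 0));
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&copies, &bins, num_threads, t] {
      // Each thread writes only to its own copy, so no atomics are needed.
      for (std::size_t i = static_cast<std::size_t>(t); i < bins.size();
           i += static_cast<std::size_t>(num_threads)) {
        ++copies[t][bins[i]];
      }
    });
  }
  for (auto& w : workers) w.join();

  // Final combine step: its cost grows with the number of copies.
  std::vector<long> histo(num_bins, 0);
  for (const auto& copy : copies) {
    for (int b = 0; b < num_bins; ++b) histo[b] += copy[b];
  }
  return histo;
}
```

Both the memory footprint and the combine step scale with the number of threads, which is the reason for the concern about CPUs with more than 100 threads.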

@pzehner pzehner mentioned this pull request Dec 18, 2025
@PaulGannay PaulGannay marked this pull request as ready for review December 19, 2025 09:44
Member

@pzehner pzehner left a comment

I think it's almost done. Just some fixes.

\begin{column}{0.5\linewidth}
\begin{minted}{C++}
Kokkos::View<double*> histo("histo", 5);
Kokkos::deep_copy(histo, 0);
Member

I don't think you need to manually initialize to 0; the view does it by default.

(Same remark for the other slides.)

Contributor Author

Are you sure this is guaranteed and not a side effect of memory allocation?
I find the doc not very clear on this subject. All allocating constructors have this text:

The initialization is executed on the default instance of the execution space corresponding to memory_space and fences it.

but it doesn't explain what kind of initialisation takes place for default types.

Contributor Author

@PaulGannay PaulGannay Dec 19, 2025

I asked Adrien (he worked on View initialisation), and he confirmed that you are right. I will delete the extra deep_copy.

Comment on lines 72 to 73
\colorlet{thread1}{gray!25}
\colorlet{thread6}{example!25}
Member

I would select two tones of gray instead

Suggested change
- \colorlet{thread1}{gray!25}
- \colorlet{thread6}{example!25}
+ \colorlet{thread1}{gray!20}
+ \colorlet{thread6}{gray!40}

Or plainly use colors:

Suggested change
- \colorlet{thread1}{gray!25}
- \colorlet{thread6}{example!25}
+ \colorlet{thread1}{lightalert}
+ \colorlet{thread6}{lightexample}

Contributor Author

I initially tried with the different levels of gray but found it hard to read, especially on slide 30.

The light red + light blue looks nice in colour but is harder to differentiate in B&W.
I'll make the change; we can revert it if you think readability in B&W is important.


2 participants