This is the text of the third and fourth weekly assignments for the PMPH edition 2024-2025.
- Assignment 3 consists of tasks 1, 2, and 4 (total: 10 pts); tasks 1 and 2 are pen-and-paper exercises, and task 4 is a programming task.
- Assignment 4 consists of tasks 3, 5, and 6 (total: 11 pts); all are programming tasks.
- The due date is the same for both assignments; you are encouraged to submit once.
- The programming tasks can be solved in groups of two, but the report writing and submission should be individual. Please write on the first page the names and KU-ids of all group members.
Hand in your solutions in the form of a short report in text or PDF format, along with the whole codebase completed with your solutions. We hand in the incomplete code in the archive `w3-code-handin.tar.gz`. You are supposed to fill in the missing code such that all the tests validate, and to report performance results. Please implement the missing parts directly in the provided files and submit the whole code (not only the missing parts). There are comments in the source files that are supposed to guide you (together with the text of this assignment).
Unzipping the handed-in archive `w3-code-handin.tar.gz` will create the `w3-code-handin` folder, which contains:
- A `helper.h` file that contains helper functionality used by all exercises.
- A `README.md` that provides a short rationale about the CUDA-coding exercises and presents the code structure; make sure to read it.
- Folder `histo-L3-thrashing` provides the code base for the first coding exercise, which refers to optimizing last-level cache (LLC) thrashing in the context of a histogram-like computation.
- Folder `gpu-coalescing` provides the code base for the second coding exercise, which refers to optimizing GPU spatial locality (coalescing) by means of transposition.
- Folder `mmm` provides the code base for the third coding exercise, which refers to optimizing temporal locality for matrix-matrix multiplication.
- Folder `batch-mmm` provides the code base for the fourth coding exercise, which refers to optimizing temporal locality for a batch instance of matrix multiplication, in which the same matrices are multiplied but under a different mask.
Write a neat and short report containing the solutions to the first two theoretical questions, as well as the missing code and short explanations for the four CUDA coding exercises. Also provide comments regarding the performance behavior of your programs:
- What is the performance (in GB/sec or Gflops), and what is the speedup generated by your improvement with respect to the best-performing GPU version provided?
- Give a short and human-understandable rationale justifying the speedup, for example:
  - for `histo-L3-thrashing` and `gpu-coalescing`: why is the optimized GPU program several times faster than the GPU baseline, in spite of performing a factor of 4x and 3x (respectively) more accesses to global memory than the baseline?
  - for matrix multiplication and batch matrix multiplication: how does the optimization improve temporal reuse?
Consider the C-like pseudocode below:
```
float A[2*M];
for (int i = 0; i < N; i++) {
    A[0] = i+1;
    for (int k = 1; k < 2*M; k++) {
        A[k] = sqrt(A[k-1] * i * k);
    }
    for (int j = 0; j < M; j++) {
        B[i+1, j+1] = B[i, j] * A[2*j  ];
        C[i, j+1]   = C[i, j] * A[2*j+1];
    }
}
```
Your task is to apply privatization, array expansion, loop distribution and loop interchange in order to parallelize as many loops as possible. Answer the following in your report:
1. Explain why in the (original) code above neither the outer loop (of index `i`) nor the inner loops (of indices `k` and `j`) are parallel.
2. Explain why it is safe to privatize array `A`.
3. Once privatized, explain "informally" why it is safe to distribute the outermost loop across the `A[0] = i+1;` statement and across the other two inner loops. Then perform (safely!) the loop distribution, while remembering to perform array expansion for `A`. Show the resulting code.
   - Clarification: "informally" above means that you should use human reasoning to show the dependency graph between three (compound) statements/nodes: (1) the statement `A[0] = i+1;`, (2) the compound statement that denotes the (entire) loop of index `k`, and (3) the compound statement that denotes the loop of index `j`. Then show the distribution according to the graph. We do not ask you to use direction vectors here because we have not applied them to imperfectly nested loops.
4. On the code resulting from the previous step, reason in terms of direction vectors to determine which loops in each of the (resulting) three loop nests are parallel. Explain why, and annotate each loop with the comment `// parallel` or `// sequential`.
5. Explain in terms of direction vectors why it is legal to apply loop interchange on the third (last) loop nest. After the interchange, re-analyze the parallelism of the two loops in the third nest: has something changed? Show the code after interchange and annotate each loop with the comment `// parallel` or `// sequential`.
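For orientation, here is a minimal toy sketch, on unrelated code with hypothetical names (`f`, `g`, `tmpE`), of how privatization followed by array expansion removes the cross-iteration dependences on a temporary; it is not the assignment's code:

```c
#include <stdlib.h>

/* Toy illustration: the scalar tmp is written before being read in
   every iteration, so it is privatizable; expanding it into the array
   tmpE removes the cross-iteration output/anti dependences on tmp,
   after which the loop carries no dependences at all.
   f and g stand for arbitrary pure functions. */
static float f(int i)   { return (float)(i + 1); }
static float g(float x) { return x * x; }

void beforeExpansion(float* out, int n) {
    float tmp;
    for (int i = 0; i < n; i++) { // sequential: all iterations share tmp
        tmp    = f(i);            // write ...
        out[i] = g(tmp);          // ... then read: tmp is privatizable
    }
}

void afterExpansion(float* out, int n) {
    float* tmpE = (float*) malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { // parallel: iteration i owns tmpE[i]
        tmpE[i] = f(i);
        out[i]  = g(tmpE[i]);
    }
    free(tmpE);
}
```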
Assume that both `A` and `B` are matrices with `N` rows and 64 columns. Consider the pseudocode below:
```
float A[N,64];
float B[N,64];
float accum, tmpA;
for (int i = 0; i < N; i++) { // outer loop
    accum = 0;
    for (int j = 0; j < 64; j++) { // inner loop
        tmpA   = A[i, j];
        accum  = sqrt(accum) + tmpA*tmpA; // (**)
        B[i,j] = accum;
    }
}
```
Reason about the loop-level parallelism of the code above and answer the following in your report:
1. Why is the outer loop not parallel?
2. What technique can be used to make it parallel, and why is it safe to apply? Rewrite the code such that the outer loop is parallel, i.e., it does not carry any dependencies.
3. Explain why the inner loop is not parallel.
4. Assume the line marked with `(**)` is rewritten as `accum = accum + tmpA*tmpA;`. Now it is possible to rewrite both the inner and the outer loop as a nested composition of parallel operators! Write in your report a semantically equivalent, nested-parallel Futhark program.
   - Clarification: at step 4 above, we ask you to use human reasoning to derive a semantically equivalent program that is fully parallel and is written in terms of the basic blocks of functional programming, such as map, reduce, scan, etc.
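As a reminder of one of the building blocks (this is deliberately not the requested Futhark program): an inclusive scan with addition is a parallel operator whose sequential specification is the familiar accumulation loop. A minimal sketch of that specification, with hypothetical names:

```c
// Sequential specification of an inclusive scan with (+) and neutral
// element 0: over [x0, x1, x2] it yields [x0, x0+x1, x0+x1+x2].
// The point of the exercise is that this admits a parallel formulation.
void scanIncPlus(const float* in, float* out, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc   += in[i];
        out[i] = acc;
    }
}
```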
See section "LL$ thrashing: Histogram-like computation" in the companion lecture slides `L6-locality.pdf`.
The programming task refers to implementing the missing code in files `main-gpu.cu` and `kernels.cu.h`; search for keyword "Exercise" in those files and follow the instructions.
Program arguments are as follows (e.g., see the Makefile):
- The first argument of the program is the size `N` of the array of indices/values.
- The second argument of the program is the size of the last-level cache (LL$) in bytes. Please make sure to adjust it to the hardware you are running on (both CPU and GPU); otherwise you will not observe much. The sizes used in the Makefile are particularized to the `hendrixfut01fl` and `hendrixfut03fl` machines.
- The size of the histogram is computed internally such that four passes over the input are always performed.
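For intuition only, here is a sequential sketch of the multi-pass idea (hypothetical names; the handed-in code organizes this differently and in parallel): each pass updates only a sub-range of bins small enough to stay resident in the LL$, at the price of traversing the input once per pass.

```c
#include <stdint.h>

// Hypothetical sketch: split the H histogram bins into num_passes
// contiguous chunks, each sized to fit in the LL$, and traverse the
// whole input once per chunk, touching only the bins of that chunk.
void histoMultiPass(const uint32_t* inds, const float* vals, uint64_t N,
                    float* hist, uint32_t H, uint32_t num_passes) {
    uint32_t chunk = (H + num_passes - 1) / num_passes;
    for (uint32_t p = 0; p < num_passes; p++) {
        uint32_t lo = p * chunk;
        uint32_t hi = (lo + chunk < H) ? (lo + chunk) : H;
        for (uint64_t i = 0; i < N; i++) {
            uint32_t ind = inds[i];
            if (lo <= ind && ind < hi)  // bin stays LL$-resident this pass
                hist[ind] += vals[i];
        }
    }
}
```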
Briefly comment in your report on:
- the code implementing your solution, i.e., present the code and comment on its correctness and on how it optimizes locality; for example, why do you expect a speedup when the improved implementation performs a factor of 3-4x more accesses to global memory (since it traverses the input four times)?
- whether your implementation validates;
- the GB/sec achieved by your implementation and by the GPU baseline, together with the speedup in comparison with the GPU baseline (i.e., the other provided implementation).
See section "Optimizing Spatial Locality by Transposition" in companion lecture slides L6-locality.pdf
.
The programming task refers to implementing in folder gpu-coalescing
:
- In file `goldeSeq.h`, correctly parallelize by means of OpenMP the outer loop `for(uint64_t i = 0; i < num_rows; i++) …`.
  - Hint: Task 2 should have taught you what "correctly" means. If the parallel runtime is in the same ballpark as the sequential one, it probably means that the loop was incorrectly parallelized. (Yes, the incorrect code still validates; can you figure out why?)
- The code of the CUDA kernel `transKernel` in file `kernels.cu.h`, which works on the transposed versions of A and B, named `A_tr` and `B_tr`, respectively. Please search for keyword "Exercise" in file `kernels.cu.h` to find the implementation place.
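For orientation, here is a minimal CUDA sketch (hypothetical kernel, not the `transKernel` you must write) of why working on transposed data yields coalesced access when each thread owns one row:

```cuda
// Hypothetical sketch: one thread per row, sequential loop over columns.
// Reading A[i*num_cols + j] would be uncoalesced: at a fixed j, threads
// i and i+1 of a warp touch addresses num_cols elements apart. Reading
// the transposed A_tr[j*num_rows + i] is coalesced: at every j,
// consecutive threads touch consecutive addresses.
__global__ void rowSumsTransposed(const float* A_tr, float* out,
                                  int num_rows, int num_cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_rows) return;
    float acc = 0.0f;
    for (int j = 0; j < num_cols; j++)
        acc += A_tr[(size_t)j * num_rows + i];  // coalesced across warp
    out[i] = acc;
}
```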
Please include in your report:
- the OpenMP (parallel) code of `goldeSeq.h`; report the speedup obtained by parallelizing the golden sequential version, i.e., the sequential CPU runtime divided by the parallel runtime;
- the CUDA-kernel code implementing your solution, i.e., present the code and comment on its correctness and on how it optimizes spatial locality (i.e., coalesced access to global memory); for example, why do you expect a speedup when your implementation performs a factor of 3x more accesses to global memory than the baseline?
- whether your CUDA implementation validates (the handin does not, since that kernel is for you to implement);
- the GB/sec achieved by your GPU implementation and by the GPU baseline, together with the speedup w.r.t. the baseline;
- a brief explanation of why the CPU implementation that uses GPU-like coalescing has abysmal performance (i.e., is much slower than the baseline);
- BONUS: a brief, very high-level explanation of why/how "the Optimal-GPU Program" is about 2x faster than your implementation ("the Optimal-GPU Program" is the last GPU program run by the Makefile).
See section "L1$ and Register: Matrix-Matrix Multiplication" in companion lecture slides L6-locality.pdf
.
The programming task refers to implementing in folder mmm
some of the code of Cuda kernel mmmSymBlkRegInnSeqKer
in file kernels.cu.h
. Please search for keyword "Exercise" in file kernels.cu.h
to find the implementation place, and follow the instructions there. Also look around to see how it is called from the CPU (host) code.
Please be aware that Section 6.4 of lecture notes presents a different tiling strategy for matrix-matrix multiplication; i.e., it is related but it is not what you have to do.
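For orientation only, below is a minimal sketch of the classic shared-memory tiling of MMM (all names hypothetical). Note that it is exactly the kind of "related but different" strategy the warning above refers to: the kernel you must complete (judging by its name, a block-plus-register tiling variant) is organized differently, so do not submit this sketch as your solution; it only illustrates how tiling cuts global-memory traffic.

```cuda
#define T 16

// Hypothetical kernel, classic square tiling: each thread computes one
// element of C; the k-loop is staged in T-wide steps, with the current
// A- and B-tiles held in shared memory.
__global__ void mmmTiledSketch(const float* A, const float* B,
                               float* C, int N) {
    __shared__ float Ash[T][T];
    __shared__ float Bsh[T][T];
    int row = blockIdx.y * T + threadIdx.y;
    int col = blockIdx.x * T + threadIdx.x;
    float acc = 0.0f;
    for (int kk = 0; kk < N; kk += T) {
        // Each global-memory element is now read once per tile instead
        // of once per multiply-add: traffic shrinks by a factor of T.
        Ash[threadIdx.y][threadIdx.x] =
            (row < N && kk + threadIdx.x < N) ? A[row*N + kk + threadIdx.x] : 0.0f;
        Bsh[threadIdx.y][threadIdx.x] =
            (kk + threadIdx.y < N && col < N) ? B[(kk + threadIdx.y)*N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < T; k++)
            acc += Ash[threadIdx.y][k] * Bsh[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row*N + col] = acc;
}
```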
Briefly comment in your report on:
- the code implementing your solution, i.e., show the code and comment on it, e.g., explaining why the accesses to global memory are coalesced;
- whether your implementation validates;
- the performance in Gflops achieved by your GPU implementation and by the GPU baseline, together with the speedup w.r.t. the baseline;
- finally, the high-level reasons for obtaining this speedup, i.e., how your implementation improved temporal locality (e.g., by what factor has the number of accesses to global memory decreased).
See section "L1$ and Register: Batch Matrix Multiplication under a Mask" in companion lecture slides L6-locality.pdf
.
The programming task refers to implementing in folder batch-mmm
the code of the Cuda kernel bmmmTiledKer
in file kernels.cu.h
. Please search for keyword "Exercise" in file kernels.cu.h
to find the implementation place, and follow the instructions there. Remember to flatten the indices to all arrays hold in global memory. Also look around to see how it is called from the CPU (host) code.
Briefly comment in your report on:
- the code implementing your solution;
- whether your implementation validates;
- the performance in Gflops achieved by your GPU implementation and by the GPU baseline, together with the speedup w.r.t. the baseline;
- finally, the high-level reasons for obtaining this speedup, i.e., how your implementation improved temporal locality (e.g., by what factor has the number of accesses to global memory decreased).