diff --git a/docs/sphinx/user_guide/feature/loop_basic.rst b/docs/sphinx/user_guide/feature/loop_basic.rst index 8918335b02..db93432cff 100644 --- a/docs/sphinx/user_guide/feature/loop_basic.rst +++ b/docs/sphinx/user_guide/feature/loop_basic.rst @@ -344,22 +344,25 @@ the same team. The ``RAJA::launch`` interface has three main concepts: device execution environment, which enables run time selection of kernel execution. - * ``RAJA::LaunchParams`` type. This type takes a number of teams and and a - number of threads as arguments. + * ``RAJA::LaunchParams`` type. This type takes a number of teams, threads + per team, and optionally the size of dynamic shared memory in bytes. * ``RAJA::loop`` template. These are used to define hierarchical parallel execution of a kernel. Operations within a loop are mapped to either teams or threads based on the execution policy template parameter provided. -Team shared memory is available by using the ``RAJA_TEAM_SHARED`` macro. Team -shared memory enables threads in a given team to share data. In practice, -team policies are typically aliases for RAJA GPU block policies in the -x,y,z dimensions, while thread policies are aliases for RAJA GPU thread -policies in the x,y,z dimensions. In a host execution environment, teams and -threads may be mapped to sequential loop execution or OpenMP threaded regions. -Often, the ``RAJA::LaunchParams`` method can take an empty argument list for -host execution. +Team shared memory can be allocated either statically, by declaring a fixed-size +array with the ``RAJA_TEAM_SHARED`` macro, or dynamically, by passing a shared +memory size to the ``RAJA::LaunchParams`` type. Team shared memory enables threads in a given team to share data. +Loops are then assigned to either teams or threads based on the GPU execution +policy. In CUDA/HIP nomenclature, teams correspond to thread blocks, while +in SYCL nomenclature, teams correspond to work-groups. + +In a host execution environment, the team and thread parameters in the +``RAJA::LaunchParams`` struct have no effect on execution and may be +omitted when running only on the host. + Please see the following tutorial sections for detailed examples that use ``RAJA::launch``: diff --git a/docs/sphinx/user_guide/feature/policies.rst b/docs/sphinx/user_guide/feature/policies.rst index 03df35df8e..28c4c40bb2 100644 --- a/docs/sphinx/user_guide/feature/policies.rst +++ b/docs/sphinx/user_guide/feature/policies.rst @@ -13,10 +13,10 @@ Policies ================== RAJA kernel execution methods take an execution policy type template parameter -to specialize execution behavior. Typically, the policy indicates which -programming model back-end to use and other information about the execution -pattern, such as number of CUDA threads per thread block, whether execution is -synchronous or asynchronous, etc. This section describes RAJA policies for +to specialize execution behavior. Typically, the policy indicates which +programming model back-end to use and other information about the execution +pattern, such as number of CUDA threads per thread block, whether execution is +synchronous or asynchronous, etc. This section describes RAJA policies for loop kernel execution, scans, sorts, reductions, atomics, etc. Please detailed examples in :ref:`tutorial-label` for a variety of use cases. @@ -24,7 +24,7 @@ As RAJA functionality evolves, new policies are added and some may be redefined and to work in new ways. .. note:: * All RAJA policies are in the namespace ``RAJA``.
- * All RAJA policies have a prefix indicating the back-end + * All RAJA policies have a prefix indicating the back-end implementation that they use; e.g., ``omp_`` for OpenMP, ``cuda_`` for CUDA, etc. @@ -47,6 +47,8 @@ apply during code compilation. ====================================== ============= ========================== Sequential/SIMD Execution Policies Works with Brief description ====================================== ============= ========================== + seq_launch_t launch Creates a sequential + execution space. seq_exec forall, Strictly sequential kernel (For), execution. scan, @@ -55,10 +57,10 @@ apply during code compilation. kernel (For), SIMD instructions via scan compiler hints in RAJA's internal implementation. - loop_exec forall, Allow the compiler to + loop_exec forall, Allow the compiler to kernel (For), generate any optimizations scan, that its heuristics deem - sort beneficial according; + sort beneficial; i.e., no loop decorations (pragmas or intrinsics) in RAJA implementation. @@ -71,65 +73,69 @@ OpenMP Parallel CPU Policies For the OpenMP CPU multithreading back-end, RAJA has policies that can be used by themselves to execute kernels. In particular, they create an OpenMP parallel region and execute a kernel within it. To distinguish these in this discussion, -we refer to these as **full policies**. These policies are provided -to users for convenience in common use cases. +we refer to these as **full policies**. These policies are provided +to users for convenience in common use cases. -RAJA also provides other OpenMP policies, which we refer to as -**partial policies**, since they need to be used in combination with other -policies. Typically, they work by providing an *outer policy* and an -*inner policy* as a template parameter to the outer policy. These give users +RAJA also provides other OpenMP policies, which we refer to as +**partial policies**, since they need to be used in combination with other +policies. Typically, they work by providing an *outer policy* and an +*inner policy* as a template parameter to the outer policy. These give users flexibility to create more complex execution patterns. .. note:: To control the number of threads used by OpenMP policies, set the value of the environment variable 'OMP_NUM_THREADS' (which is fixed for duration of run), or call the OpenMP routine - 'omp_set_num_threads(nthreads)' in your application, which allows + 'omp_set_num_threads(nthreads)' in your application, which allows one to change the number of threads at run time. The full policies are described in the following table. Partial policies are described in other tables below. 
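+As a point of reference before the table, here is a minimal sketch (assuming a
+host-allocated array ``a`` of length ``N``) of how one of the full policies is
+used with ``RAJA::forall``::
+
+  // Equivalent to applying 'omp parallel for' to the loop.
+  RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::TypedRangeSegment<int>(0, N),
+    [=] (int i) {
+      a[i] += 1.0;
+    }
+  );
+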
- ========================================= ============= ======================= - OpenMP CPU Full Policies Works with Brief description - ========================================= ============= ======================= - omp_parallel_for_exec forall, Same as applying - kernel (For), 'omp parallel for' - scan, pragma + ========================================= ============== ====================== + OpenMP CPU Full Policies Works with Brief description + ========================================= ============== ====================== + omp_parallel_for_exec forall, Same as applying + kernel (For), 'omp parallel for' + launch (loop), pragma + scan, sort - omp_parallel_for_static_exec forall, Same as applying - kernel (For) 'omp parallel for - schedule(static, - ChunkSize)' - omp_parallel_for_dynamic_exec forall, Same as applying - kernel (For) 'omp parallel for - schedule(dynamic, - ChunkSize)' - omp_parallel_for_guided_exec forall, Same as applying - kernel (For) 'omp parallel for - schedule(guided, - ChunkSize)' - omp_parallel_for_runtime_exec forall, Same as applying - kernel (For) 'omp parallel for - schedule(runtime)' - ========================================= ============= ======================= + omp_parallel_for_static_exec forall, Same as applying + kernel (For) 'omp parallel for + schedule(static, + ChunkSize)' + omp_parallel_for_dynamic_exec forall, Same as applying + kernel (For) 'omp parallel for + schedule(dynamic, + ChunkSize)' + omp_parallel_for_guided_exec forall, Same as applying + kernel (For) 'omp parallel for + schedule(guided, + ChunkSize)' + omp_parallel_for_runtime_exec forall, Same as applying + kernel (For) 'omp parallel for + schedule(runtime)' + ========================================= ============== ====================== .. note:: For the OpenMP scheduling policies above that take a ``ChunkSize`` - parameter, the chunk size is optional. If not provided, the + parameter, the chunk size is optional. If not provided, the default chunk size that OpenMP applies will be used, which may be specific to the OpenMP implementation in use. For this case, - the RAJA policy syntax is - ``omp_parallel_for_{static|dynamic|guided}_exec< >``, which will - result in the OpenMP pragma - ``omp parallel for schedule({static|dynamic|guided})`` being applied. + the RAJA policy syntax is + ``omp_parallel_for_{static|dynamic|guided}_exec< >``, which will + result in the OpenMP pragma + ``omp parallel for schedule({static|dynamic|guided})`` being applied. -RAJA provides an (outer) OpenMP CPU policy to create a parallel region in -which to execute a kernel. It requires an inner policy that defines how a +RAJA provides an (outer) OpenMP CPU policy to create a parallel region in +which to execute a kernel. It requires an inner policy that defines how a kernel will execute in parallel inside the region. ====================================== ============= ========================== OpenMP CPU Outer Policies Works with Brief description ====================================== ============= ========================== + omp_launch_t launch Creates an OpenMP parallel + region. Same as applying + 'omp parallel pragma' omp_parallel_exec forall, Creates OpenMP parallel kernel (For), region and requires an scan **InnerPolicy**. Same as @@ -138,16 +144,16 @@ kernel will execute in parallel inside the region. ====================================== ============= ========================== Finally, we summarize the inner policies that RAJA provides for OpenMP. 
-These policies are passed to the RAJA ``omp_parallel_exec`` outer policy as +These policies are passed to the RAJA ``omp_parallel_exec`` outer policy as a template argument as described above. ====================================== ============= ========================== OpenMP CPU Inner Policies Works with Brief description ====================================== ============= ========================== omp_for_exec forall, Parallel execution within - kernel (For), *existing parallel - scan region*; i.e., - apply 'omp for' pragma. + kernel (For), existing parallel + launch (loop) region; i.e., + scan apply 'omp for' pragma. omp_for_static_exec forall, Same as applying kernel (For) 'omp for schedule(static, @@ -173,37 +179,37 @@ a template argument as described above. indicated using ArgList ====================================== ============= ========================== -.. important:: **RAJA only provides a nowait policy option for static - scheduling** since that is the only schedule case that can be - used with nowait and be correct in general when executing - multiple loops in a single parallel region. Paraphrasing the +.. important:: **RAJA only provides a nowait policy option for static + scheduling** since that is the only schedule case that can be + used with nowait and be correct in general when executing + multiple loops in a single parallel region. Paraphrasing the OpenMP standard: *programs that depend on which thread executes a particular loop iteration under any circumstance other than static schedule are non-conforming.* .. note:: As in the RAJA full policies for OpenMP scheduling, the ``ChunkSize`` - is optional. If not provided, the default chunk size that the OpenMP + is optional. If not provided, the default chunk size that the OpenMP implementation applies will be used. .. note:: As noted above, RAJA inner OpenMP policies must only be used within an - **existing** parallel region to work properly. Embedding an inner - policy inside the RAJA outer ``omp_parallel_exec`` will allow you to - apply the OpenMP execution prescription specified by the policies to + **existing** parallel region to work properly. Embedding an inner + policy inside the RAJA outer ``omp_parallel_exec`` will allow you to + apply the OpenMP execution prescription specified by the policies to a single kernel. To support use cases with multiple kernels inside an - OpenMP parallel region, RAJA provides a **region** construct that - takes a template argument to specify the execution back-end. For + OpenMP parallel region, RAJA provides a **region** construct that + takes a template argument to specify the execution back-end. For example:: RAJA::region<RAJA::omp_parallel_region>([=]() { - RAJA::forall<RAJA::omp_for_nowait_static_exec< > >(segment, + RAJA::forall<RAJA::omp_for_nowait_static_exec< > >(segment, [=] (int idx) { // do something at iterate 'idx' } ); - RAJA::forall<RAJA::omp_for_static_exec< > >(segment, + RAJA::forall<RAJA::omp_for_static_exec< > >(segment, [=] (int idx) { // do something else at iterate 'idx' } @@ -213,22 +219,22 @@ a template argument as described above. Here, the ``RAJA::region`` method call creates an OpenMP parallel region, which contains two ``RAJA::forall`` - kernels. The first uses the ``RAJA::omp_for_nowait_static_exec< >`` - policy, meaning that no thread synchronization is needed after the - kernel. Thus, threads can start working on the second kernel while + kernels. The first uses the ``RAJA::omp_for_nowait_static_exec< >`` + policy, meaning that no thread synchronization is needed after the + kernel. Thus, threads can start working on the second kernel while others are still working on the first kernel.
I general, this will be correct when the segments used in the two kernels are the same, each loop is data parallel, and static scheduling is applied to both - loops. The second kernel uses the ``RAJA::omp_for_static_exec`` - policy, which means that all threads will complete before the kernel - exits. In this example, this is not really needed since there is no - more code to execute in the parallel region and there is an implicit + loops. The second kernel uses the ``RAJA::omp_for_static_exec`` + policy, which means that all threads will complete before the kernel + exits. In this example, this is not really needed since there is no + more code to execute in the parallel region and there is an implicit barrier at the end of it. Threading Building Block (TBB) Parallel CPU Policies ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -RAJA provides a basic set of TBB execution policies for use with the +RAJA provides a basic set of TBB execution policies for use with the RAJA TBB back-end, which supports a subset of RAJA features. ====================================== ============= ========================== @@ -265,77 +271,84 @@ RAJA TBB back-end, which supports a subset of RAJA features. GPU Policies for CUDA and HIP ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -RAJA policies for GPU execution using CUDA or HIP are essentially identical. -The only difference is that CUDA policies have the prefix ``cuda_`` and HIP +RAJA policies for GPU execution using CUDA or HIP are essentially identical. +The only difference is that CUDA policies have the prefix ``cuda_`` and HIP policies have the prefix ``hip_``. - ======================================== ============= ======================== + ======================================== ============= ======================================= CUDA/HIP Execution Policies Works with Brief description - ======================================== ============= ======================== + ======================================== ============= ======================================= cuda/hip_exec forall, Execute loop iterations scan, in a GPU kernel launched sort with given thread-block - size. Note that the + size. Note that the thread-block size must be provided, there is no default provided. + cuda/hip_launch_t launch Launches a device kernel, + any code expressed within + the lambda is executed + on the device. 
cuda/hip_thread_x_direct kernel (For) Map loop iterates - directly to GPU threads + launch (loop) directly to GPU threads in x-dimension, one iterate per thread (see note below about limitations) cuda/hip_thread_y_direct kernel (For) Same as above, but map - to threads in y-dim + launch (loop) to threads in y-dim cuda/hip_thread_z_direct kernel (For) Same as above, but map - to threads in z-dim - cuda/hip_thread_x_loop kernel (For) Similar to - thread-x-direct + launch (loop) to threads in z-dim + cuda/hip_thread_x_loop kernel (For) Similar to + launch (loop) thread-x-direct policy, but use a block-stride loop which doesn't limit number of loop iterates cuda/hip_thread_y_loop kernel (For) Same as above, but for - threads in y-dimension + launch (loop) threads in y-dimension cuda/hip_thread_z_loop kernel (For) Same as above, but for - threads in z-dimension - cuda/hip_flatten_block_threads_{xyz} Launch (Loop) Reshapes threads in a + launch (loop) threads in z-dimension + cuda/hip_flatten_block_threads_{xyz} launch (loop) Reshapes threads in a multi-dimensional thread team into one-dimension, accepts any permutation of dimensions (expt namespace) cuda/hip_block_x_direct kernel (For) Map loop iterates - directly to GPU thread + launch (loop) directly to GPU thread blocks in x-dimension, one iterate per block cuda/hip_block_y_direct kernel (For) Same as above, but map - to blocks in y-dimension + launch (loop) to blocks in y-dimension cuda/hip_block_z_direct kernel (For) Same as above, but map - to blocks in z-dimension - cuda/hip_block_x_loop kernel (For) Similar to - block-x-direct policy, - but use a grid-stride + launch (loop) to blocks in z-dimension + cuda/hip_block_x_loop kernel (For) Similar to + launch (loop) block-x-direct policy, + but use a grid-stride loop. cuda/hip_block_y_loop kernel (For) Same as above, but use - blocks in y-dimension + launch (loop) blocks in y-dimension cuda/hip_block_z_loop kernel (For) Same as above, but use - blocks in z-dimension - cuda/hip_global_thread_x Launch (Loop) Creates a unique thread - id for each thread on - x-dimension of the grid - (expt namespace) - cuda/hip_global_thread_y Launch (Loop) Same as above, but uses - threads in y-dimension - (expt namespace) - cuda/hip_global_thread_z Launch (Loop) Same as above, but uses - threads in z-dimension - (expt namespace) + launch (loop) blocks in z-dimension + cuda/hip_global_thread_x launch (loop) Creates a unique thread + id for each thread on + x-dimension of the grid. + Same as computing + threadIdx.x + blockDim.x * blockIdx.x. + cuda/hip_global_thread_y launch (loop) Same as above, but uses + threads in y-dimension. + Same as computing + threadIdx.y + blockDim.y * blockIdx.y. + cuda/hip_global_thread_z launch (loop) Same as above, but uses + threads in z-dimension. + Same as computing + threadIdx.z + blockDim.z * blockIdx.z. cuda/hip_warp_direct kernel (For) Map work to threads in a warp directly. Cannot be used in conjunction with - cuda/hip_thread_x_* + cuda/hip_thread_x_* policies. Multiple warps can be created by using @@ -346,7 +359,7 @@ policies have the prefix ``hip_``. a warp-stride loop. Cannot be used in conjunction with - cuda/hip_thread_x_* + cuda/hip_thread_x_* policies. Multiple warps can be created by using @@ -357,7 +370,7 @@ policies have the prefix ``hip_``. warp using a bit mask. Cannot be used in conjunction with - cuda/hip_thread_x_* + cuda/hip_thread_x_* policies. Multiple warps can be created by using @@ -365,11 +378,12 @@ policies have the prefix ``hip_``. policies.
cuda/hip_warp_masked_loop> kernel (For) Policy to map work to threads in a warp using - a bit mask and a + a bit mask and a warp-stride loop. Cannot - be used in conjunction + be used in conjunction with cuda/hip_thread_x_* - policies. Multiple warps can be created by using + policies. Multiple warps + can be created by using cuda/hip_thread_y/z_* policies. cuda/hip_block_reduce kernel Perform a reduction (Reduce) across a single GPU thread block. cuda/_warp_reduce kernel Perform a reduction (Reduce) across a single GPU thread warp. - ======================================== ============= ======================== + ======================================== ============= ======================================= Several notable constraints apply to RAJA CUDA/HIP *thread-direct* policies. @@ -415,77 +429,100 @@ Finally GPU Policies for SYCL ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - ======================================== ============= ======================== + ======================================== ============= ============================== SYCL Execution Policies Works with Brief description - ======================================== ============= ======================== + ======================================== ============= ============================== sycl_exec forall, Execute loop iterations in a GPU kernel launched with given work group size. + sycl_launch_t launch Launches a SYCL kernel; + any code expressed within + the lambda is executed + on the device. sycl_global_0 kernel (For) Map loop iterates directly to GPU global ids in first - dimension, one iterate + dimension, one iterate per work item. Group execution into work - groups of given size. + groups of given size. sycl_global_1 kernel (For) Same as above, but map to global ids in second dim sycl_global_2 kernel (For) Same as above, but map - to global ids in third + to global ids in third dim + sycl_global_item_0 launch (loop) Creates a unique thread + id for each thread for + dimension 0 of the grid. + Same as computing + itm.get_group(0) * + itm.get_local_range(0) + + itm.get_local_id(0). + sycl_global_item_1 launch (loop) Same as above, but uses + threads in dimension 1. + Same as computing + itm.get_group(1) * + itm.get_local_range(1) + + itm.get_local_id(1). + sycl_global_item_2 launch (loop) Same as above, but uses + threads in dimension 2. + Same as computing + itm.get_group(2) * + itm.get_local_range(2) + + itm.get_local_id(2).
sycl_local_0_direct kernel (For) Map loop iterates - directly to GPU work + launch (loop) directly to GPU work items in first - dimension, one iterate - per work item (see note + dimension, one iterate + per work item (see note below about limitations) sycl_local_1_direct kernel (For) Same as above, but map - to work items in second + launch (loop) to work items in second dim sycl_local_2_direct kernel (For) Same as above, but map - to work items in third + launch (loop) to work items in third dim - sycl_local_0_loop kernel (For) Similar to - local-1-direct policy, - but use a work + sycl_local_0_loop kernel (For) Similar to + launch (loop) local-1-direct policy, + but use a work group-stride loop which doesn't limit number of loop iterates sycl_local_1_loop kernel (For) Same as above, but for - work items in second + launch (loop) work items in second dimension sycl_local_2_loop kernel (For) Same as above, but for - work items in third + launch (loop) work items in third dimension sycl_group_0_direct kernel (For) Map loop iterates - directly to GPU group - ids in first dimension, + launch (loop) directly to GPU group + ids in first dimension, one iterate per group sycl_group_1_direct kernel (For) Same as above, but map - to groups in second + launch (loop) to groups in second dimension sycl_group_2_direct kernel (For) Same as above, but map - to groups in third + launch (loop) to groups in third dimension - sycl_group_0_loop kernel (For) Similar to - group-1-direct policy, - but use a group-stride + sycl_group_0_loop kernel (For) Similar to + launch (loop) group-1-direct policy, + but use a group-stride loop. sycl_group_1_loop kernel (For) Same as above, but use - groups in second + launch (loop) groups in second dimension sycl_group_2_loop kernel (For) Same as above, but use - groups in third + launch (loop) groups in third dimension - ======================================== ============= ======================== + ======================================== ============= ============================== -OpenMP Target Offload Policies +OpenMP Target Offload Policies ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -RAJA provides policies to use OpenMP to offload kernel execution to a GPU +RAJA provides policies to use OpenMP to offload kernel execution to a GPU device, for example. They are summarized in the following table. ====================================== ============= ========================== @@ -520,10 +557,10 @@ RAJA IndexSet Execution Policies ----------------------------------------------------- When an IndexSet iteration space is used in RAJA by passing an IndexSet -to a ``RAJA::forall`` method, for example, an index set execution policy is -required. An index set execution policy is a **two-level policy**: an 'outer' -policy for iterating over segments in the index set, and an 'inner' policy -used to execute the iterations defined by each segment. An index set execution +to a ``RAJA::forall`` method, for example, an index set execution policy is +required. An index set execution policy is a **two-level policy**: an 'outer' +policy for iterating over segments in the index set, and an 'inner' policy +used to execute the iterations defined by each segment. An index set execution policy type has the form:: RAJA::ExecPolicy< segment_iteration_policy, segment_execution_policy > @@ -556,8 +593,8 @@ Parallel Region Policies ------------------------- Earlier, we discussed using the ``RAJA::region`` construct to -execute multiple kernels in an OpenMP parallel region. 
To support source code -portability, RAJA provides a sequential region concept that can be used to +execute multiple kernels in an OpenMP parallel region. To support source code +portability, RAJA provides a sequential region concept that can be used to surround code that uses execution back-ends other than OpenMP. For example:: RAJA::region([=]() { @@ -574,7 +611,7 @@ surround code that uses execution back-ends other than OpenMP. For example:: .. note:: The sequential region specialization is essentially a *pass through* operation. It is provided so that if you want to turn off OpenMP in - your code, for example, you can simply replace the region policy + your code, for example, you can simply replace the region policy type and you do not have to change your algorithm source code. @@ -615,8 +652,8 @@ cuda/hip_reduce any CUDA/HIP Parallel reduction in a CUDA/HIP kernel reduction value is finalized). cuda/hip_reduce_atomic any CUDA/HIP Same as above, but reduction may use CUDA policy atomic operations. -sycl_reduce any SYCL Reduction in a SYCL kernel (device - policy synchronization will occur when the +sycl_reduce any SYCL Reduction in a SYCL kernel (device + policy synchronization will occur when the reduction value is finalized). ======================= ============= ========================================== @@ -644,17 +681,17 @@ Atomic Policy Loop Policies Brief description seq_atomic seq_exec, Atomic operation performed in a loop_exec non-parallel (sequential) kernel. omp_atomic any OpenMP Atomic operation in OpenM kernel.P - policy multithreading or target kernel; + policy multithreading or target kernel; i.e., apply ``omp atomic`` pragma. cuda/hip/sycl_atomic any Atomic operation performed in a CUDA/HIP/SYCL CUDA/HIP/SYCL kernel. - policy + policy cuda/hip_atomic_explicit any CUDA/HIP Atomic operation performed in a CUDA/HIP policy kernel that may also be used in a host execution context. The atomic policy takes a host atomic policy template - argument. See additional explanation + argument. See additional explanation and example below. builtin_atomic seq_exec, Compiler *builtin* atomic operation. loop_exec, @@ -664,7 +701,7 @@ auto_atomic seq_exec, Atomic operation *compatible* with l loop_exec, execution policy. See example below. any OpenMP Can not be used inside cuda/hip policy, explicit atomic policies. - any + any CUDA/HIP/SYCL policy ============================= ============= ======================================== @@ -743,19 +780,19 @@ of Statements that are composed in the order that they appear in the kernel policy to construct a kernel. A Statement may contain an enclosed StatmentList. Thus, a ``RAJA::KernelPolicy`` type is really just a StatementList. The main Statement types provided by RAJA are ``RAJA::statement::For`` and -``RAJA::statement::Lambda``, that we discussed in -:ref:`loop_elements-kernel-label`. -A ``RAJA::statement::For`` type +``RAJA::statement::Lambda``, that we discussed in +:ref:`loop_elements-kernel-label`. +A ``RAJA::statement::For`` type indicates a for-loop structure. The ``ArgID`` parameter is an integral constant -that identifies the position of the iteration space in the iteration space -tuple passed to the ``RAJA::kernel`` method to be used for the loop. The -``ExecPolicy`` is the RAJA execution policy to use on the loop, which is -similar to ``RAJA::forall`` usage. The ``EnclosedStatements`` type is a -nested template parameter that contains whatever is needed to execute the -kernel and which forms a valid StatementList. 
The +that identifies the position of the iteration space in the iteration space +tuple passed to the ``RAJA::kernel`` method to be used for the loop. The +``ExecPolicy`` is the RAJA execution policy to use on the loop, which is +similar to ``RAJA::forall`` usage. The ``EnclosedStatements`` type is a +nested template parameter that contains whatever is needed to execute the +kernel and which forms a valid StatementList. The ``RAJA::statement::Lambda`` -type invokes the lambda expression corresponding to its position 'LambdaID' -in the sequence of lambda expressions in the ``RAJA::kernel`` argument list. +type invokes the lambda expression corresponding to its position 'LambdaID' +in the sequence of lambda expressions in the ``RAJA::kernel`` argument list. For example, a simple sequential for-loop:: for (int i = 0; i < N; ++i) { @@ -791,13 +828,13 @@ can be used with ``RAJA::kernel`` and ``RAJA::kernel_param``. More detailed explanation along with examples of how they are used can be found in the ``RAJA::kernel`` examples in :ref:`tutorial-label`. -.. note:: All of the statement types described below are in the namespace +.. note:: All of the statement types described below are in the namespace ``RAJA::statement``. For brevity, we omit the namespaces in the discussion in this section. -.. note:: ``RAJA::kernel_param`` functions similarly to ``RAJA::kernel`` - except that the second argument is a *tuple of parameters* used - in a kernel for local arrays, thread local variables, tiling +.. note:: ``RAJA::kernel_param`` functions similarly to ``RAJA::kernel`` + except that the second argument is a *tuple of parameters* used + in a kernel for local arrays, thread local variables, tiling information, etc. Several RAJA statements can be specialized with auxilliary types, which are @@ -813,12 +850,12 @@ The following list contains the most commonly used statement types. * ``Collapse< ExecPolicy, ArgList<...>, EnclosedStatements >`` collapses multiple perfectly nested loops specified by tuple iteration space indices in ``ArgList``, using the ``ExecPolicy`` execution policy, and places ``EnclosedStatements`` inside the collapsed loops which are executed for each iteration. **Note that this only works for CPU execution policies (e.g., sequential, OpenMP).** It may be available for CUDA in the future if such use cases arise. -There is one statement specific to OpenMP kernels. +There is one statement specific to OpenMP kernels. * ``OmpSyncThreads`` applies the OpenMP ``#pragma omp barrier`` directive. -Statement types that launch CUDA or HIP GPU kernels are listed next. They work -similarly for each back-end and their names are distinguished by the prefix +Statement types that launch CUDA or HIP GPU kernels are listed next. They work +similarly for each back-end and their names are distinguished by the prefix ``Cuda`` or ``Hip``. For example, ``CudaKernel`` or ``HipKernel``. * ``Cuda/HipKernel< EnclosedStatements>`` launches ``EnclosedStatements`` as a GPU kernel; e.g., a loop nest where the iteration spaces of each loop level are associated with threads and/or thread blocks as described by the execution policies applied to them. This kernel launch is synchronous. @@ -845,14 +882,14 @@ similarly for each back-end and their names are distinguished by the prefix * ``Cuda/HipSyncWarp`` invokes CUDA ``__syncwarp()`` barrier. Warp sync is not supported in HIP, so the HIP variant is a no-op. -Statement types that launch SYCL kernels are listed next. +Statement types that launch SYCL kernels are listed next. 
* ``SyclKernel`` launches ``EnclosedStatements`` as a SYCL kernel. This kernel launch is synchronous. * ``SyclKernelAsync`` asynchronous version of SyclKernel. -RAJA provides statements to define loop tiling which can improve performance; -e.g., by allowing CPU cache blocking or use of GPU shared memory. +RAJA provides statements to define loop tiling which can improve performance; +e.g., by allowing CPU cache blocking or use of GPU shared memory. * ``Tile< ArgId, TilePolicy, ExecPolicy, EnclosedStatements >`` abstracts an outer tiling loop containing an inner for-loop over each tile. The ``ArgId`` indicates which entry in the iteration space tuple to which the tiling loop applies and the ``TilePolicy`` specifies the tiling pattern to use, including its dimension. The ``ExecPolicy`` and ``EnclosedStatements`` are similar to what they represent in a ``statement::For`` type. @@ -884,7 +921,7 @@ The following list summarizes auxilliary types used in the above statements. The types live in the ``RAJA`` namespace. * ``tile_fixed`` tile policy argument to a ``Tile`` or ``TileTCount`` statement; partitions loop iterations into tiles of a fixed size specified by ``TileSize``. This statement type can be used as the ``TilePolicy`` template parameter in the ``Tile`` statements above. - + * ``tile_dynamic`` TilePolicy argument to a Tile or TileTCount statement; partitions loop iterations into tiles of a size specified by a ``TileSize{}`` positional parameter argument. This statement type can be used as the ``TilePolicy`` template paramter in the ``Tile`` statements above. * ``Segs<...>`` argument to a Lambda statement; used to specify which segments in a tuple will be used as lambda arguments. diff --git a/docs/sphinx/user_guide/feature/tiling.rst b/docs/sphinx/user_guide/feature/tiling.rst index 3d327f8c04..b42ac964c0 100644 --- a/docs/sphinx/user_guide/feature/tiling.rst +++ b/docs/sphinx/user_guide/feature/tiling.rst @@ -13,13 +13,13 @@ Loop Tiling =========== In this section, we discuss RAJA statements that can be used to tile nested -loops. Typical loop tiling involves partitioning an iteration space into -a collection of "tiles" and then iterating over tiles in outer loops and -indices within each tile in inner loops. Many scientific computing algorithms +loops. Typical loop tiling involves partitioning an iteration space into +a collection of "tiles" and then iterating over tiles in outer loops and +indices within each tile in inner loops. Many scientific computing algorithms can benefit from loop tiling due to more efficient cache usage on a CPU or use of GPU shared memory. -For example, consider an operation performed using a C-style for-loop with +For example, consider an operation performed using a C-style for-loop with a range of [0, 10):: for (int i=0; i<10; ++i) { @@ -52,35 +52,67 @@ Here is a way to write the tiled loop kernel above using ``RAJA::kernel``:: >; RAJA::kernel<KERNEL_EXEC_POL>( - RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0,10)), + RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0,10)), [=] (int i) { // kernel body using index 'i' } ); -In RAJA, the simplest way to tile an iteration space is to use +In ``RAJA::kernel``, the simplest way to tile an iteration space is to use ``RAJA::statement::Tile`` and ``RAJA::statement::For`` statement types. A -``RAJA::statement::Tile`` type is similar to a ``RAJA::statement::For`` type, -but takes a tile size as the second template argument.
The -``RAJA::statement::Tile`` type generates the outer loop over tiles and -the ``RAJA::statement::For`` type iterates over each tile. Nested together, -these statements will pass the global index ('i' in the example) to the +``RAJA::statement::Tile`` type is similar to a ``RAJA::statement::For`` type, +but takes a tile size as the second template argument. The +``RAJA::statement::Tile`` type generates the outer loop over tiles and +the ``RAJA::statement::For`` type iterates over each tile. Nested together, +these statements will pass the global index ('i' in the example) to the lambda expression as (kernel body) in a non-tiled version above. -.. note:: When using ``RAJA::statement::Tile`` and ``RAJA::statement::For`` - types together to define a tiled loop structure, the integer passed - as the first template argument to each statement type must be the +.. note:: When using ``RAJA::statement::Tile`` and ``RAJA::statement::For`` + types together to define a tiled loop structure, the integer passed + as the first template argument to each statement type must be the same. This indicates that they both apply to the same iteration space in the space tuple passed to the ``RAJA::kernel`` method. -RAJA also provides alternative statements that provide the tile number and -local tile index, if needed inside the kernel body, as shown below:: + +The ``RAJA::launch`` API also supports loop tiling through specialized +methods. The launch version of the code above is :: + + using launch_t = RAJA::LaunchPolicy<RAJA::seq_launch_t>; + using loop_t = RAJA::LoopPolicy<RAJA::seq_exec>; + + RAJA::launch<launch_t>( + RAJA::LaunchParams(), [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) { + + RAJA::tile<loop_t>( + ctx, tile_size, RAJA::TypedRangeSegment<int>(0, 10), [&] (RAJA::TypedRangeSegment<int> const &tile) { + + RAJA::loop<loop_t>( + ctx, tile, [&] (int i) { + + // kernel body using index 'i' + + } + ); + } + ); + } + ); + +In the example above, the ``RAJA::tile`` method partitions the larger iteration space into tiles and passes each tile segment to its lambda. + +This approach requires source code changes if the developer wants to remove tiling, while ``RAJA::kernel`` enables +switching between tiled and non-tiled variants via execution policies and recompilation. The tile size in ``RAJA::launch`` +can be selected dynamically at run time, and tiles are created on the device. + +For both kernel and launch, RAJA also provides alternatives that expose the tile number and +local tile index. Using ``RAJA::kernel``, we illustrate usage below:: using KERNEL_EXEC_POL2 = RAJA::KernelPolicy< - RAJA::statement::TileTCount<0, RAJA::statement::Param<0>, + RAJA::statement::TileTCount<0, RAJA::statement::Param<0>, RAJA::tile_fixed<2>, RAJA::seq_exec, - RAJA::statement::ForICount<0, RAJA::statement::Param<1>, + RAJA::statement::ForICount<0, RAJA::statement::Param<1>, RAJA::seq_exec, RAJA::statement::Lambda<0> > @@ -101,19 +133,46 @@ local tile index, if needed inside the kernel body, as shown below:: } ); -The ``RAJA::statement::TileTCount`` type indicates that the tile number will -be passed to the lambda expression and the ``RAJA::statement::ForICount`` type -indicates that the local tile loop index will be passed to the lambda -expression. Storage for these values is specified in the parameter tuple, the -second argument passed to the ``RAJA::kernel_param`` method. The -``RAJA::statement::Param<#>`` type appearing as the second -template parameter for each statement type indicates which parameter tuple -entry, the tile number or local tile loop index, is passed to the lambda and -in which order.
Here, the tile number is the second lambda argument (tuple -parameter '0') and the local tile loop index is the third lambda argument +The ``RAJA::statement::TileTCount`` type indicates that the tile number will +be passed to the lambda expression and the ``RAJA::statement::ForICount`` type +indicates that the local tile loop index will be passed to the lambda +expression. Storage for these values is specified in the parameter tuple, the +second argument passed to the ``RAJA::kernel_param`` method. The +``RAJA::statement::Param<#>`` type appearing as the second +template parameter for each statement type indicates which parameter tuple +entry, the tile number or local tile loop index, is passed to the lambda and +in which order. Here, the tile number is the second lambda argument (tuple +parameter '0') and the local tile loop index is the third lambda argument (tuple parameter '1'). .. note:: The global loop indices always appear as the first lambda expression - arguments. Then, the parameter tuples identified by the integers - in the ``RAJA::Param`` statement types given for the loop statement + arguments. Then, the parameter tuples identified by the integers + in the ``RAJA::Param`` statement types given for the loop statement types follow. + +The launch API uses the ``RAJA::tile_tcount`` and ``RAJA::loop_icount`` methods, +which take lambdas with a second argument for the tile number or local tile index, respectively. We illustrate usage below:: + + using launch_t = RAJA::LaunchPolicy<RAJA::seq_launch_t>; + using loop_t = RAJA::LoopPolicy<RAJA::seq_exec>; + + RAJA::launch<launch_t>( + RAJA::LaunchParams(), [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) { + + RAJA::tile_tcount<loop_t>( + ctx, tile_size, RAJA::TypedRangeSegment<int>(0, 10), [&] (RAJA::TypedRangeSegment<int> const &tile, int t) { + + RAJA::loop_icount<loop_t>( + ctx, tile, [&] (int idx, int i) { + + // idx - global index + // t - tile number + // i - index within tile + // Then, idx = i + tile_size*t + + } + ); + } + ); + } + ); diff --git a/docs/sphinx/user_guide/tutorial.rst b/docs/sphinx/user_guide/tutorial.rst index 2d81586c9f..9a8c04b419 100644 --- a/docs/sphinx/user_guide/tutorial.rst +++ b/docs/sphinx/user_guide/tutorial.rst @@ -20,12 +20,12 @@ RAJA Tutorial =============== This section contains a self-paced tutorial that shows how to use many RAJA -features by way of a sequence of examples and exercises. Each exercise is -located in files in the ``RAJA/exercises`` directory, one *exercise* file with -code sections removed and comments containing instructions to fill in the -missing code parts and one *solution* file containing complete working code to +features by way of a sequence of examples and exercises. Each exercise is +located in files in the ``RAJA/exercises`` directory, one *exercise* file with +code sections removed and comments containing instructions to fill in the +missing code parts and one *solution* file containing complete working code to compare with and for guidance if you get stuck working on the exercise file. -You are encouraged to build and run the exercises and modify them to try out +You are encouraged to build and run the exercises and modify them to try out different variations. We also maintain a repository of tutorial slide presentations @@ -40,12 +40,12 @@ difference between CPU (host) and GPU (device) memory allocations and how transfers between those memory spaces work. For a detailed discussion, see `Device Memory `_. -It is important to note that RAJA does not provide a memory model. This is by -design as application developers who use RAJA prefer to manage memory -in different ways.
Thus, users are responsible for ensuring that data is -properly allocated and initialized on a GPU device when running GPU code. -This can be done using explicit host and device allocation and copying between -host and device memory spaces or via unified memory (UM), if available. +It is important to note that RAJA does not provide a memory model. This is by +design as application developers who use RAJA prefer to manage memory +in different ways. Thus, users are responsible for ensuring that data is +properly allocated and initialized on a GPU device when running GPU code. +This can be done using explicit host and device allocation and copying between +host and device memory spaces or via unified memory (UM), if available. The RAJA Portability Suite contains other libraries, namely `CHAI `_ and `Umpire `_, that complement RAJA by @@ -68,20 +68,20 @@ templates and lambda expressions is required. So, before we begin, we provide a bit of background discussion of basic aspects of how RAJA use employs C++ templates and lambda expressions, which is essential to use RAJA successfully. -RAJA is almost an entirely header-only library that makes heavy use of -C++ templates. Using RAJA most easily and effectively is done by representing -the bodies of loop kernels as C++ lambda expressions. Alternatively, C++ -functors can be used, but they make application source code more complex, -potentially placing a significant negative burden on source code readability +RAJA is almost an entirely header-only library that makes heavy use of +C++ templates. Using RAJA most easily and effectively is done by representing +the bodies of loop kernels as C++ lambda expressions. Alternatively, C++ +functors can be used, but they make application source code more complex, +potentially placing a significant negative burden on source code readability and maintainability. ----------------------------------- C++ Templates ----------------------------------- -C++ templates enable one to write type-generic code and have the compiler +C++ templates enable one to write type-generic code and have the compiler generate an implementation for each set of template parameter types specified. -For example, the ``RAJA::forall`` method to execute loop kernels is +For example, the ``RAJA::forall`` method to execute loop kernels is essentially method defined as:: template `` method calls, which +Code written inside the lambda expression body will execute in the chosen +execution environment. Within that environment, a user executes +kernel operations using ``RAJA::loop`` method calls, which take lambda expressions to express loop body operations. -.. note:: A key difference between the ``RAJA::kernel`` and - ``RAJA::expt::launch`` approaches is that almost all of the - kernel execution pattern is expressed in the execution policy - when using ``RAJA::kernel``, whereas with ``RAJA::expt::launch`` the +.. note:: A key difference between the ``RAJA::kernel`` and + ``RAJA::launch`` approaches is that almost all of the + kernel execution pattern is expressed in the execution policy + when using ``RAJA::kernel``, whereas with ``RAJA::launch`` the kernel execution pattern is expressed mostly in the lambda - expression kernel body. + expression kernel body. One may argue that ``RAJA::kernel`` is more portable and flexible in that -the execution policy enables compile time code transformations without -changing kernel body code. 
On the other hand, ``RAJA::expt::launch`` is +the execution policy enables compile time code transformations without +changing kernel body code. On the other hand, ``RAJA::launch`` is less opaque and more intuitive, but may require kernel body code changes for algorithm changes. Which interface to use depends on personal preference -and other concerns, such as portability requirements, the need for run time -execution selection, etc. Kernel structure is more explicit in application -source code with ``RAJA::expt::launch``, and more concise and arguably more -opaque with ``RAJA::kernel``. There is a large overlap of algorithms that can -be expressed with either interface. However, there are things that one can do +and other concerns, such as portability requirements, the need for run time +execution selection, etc. Kernel structure is more explicit in application +source code with ``RAJA::launch``, and more concise and arguably more +opaque with ``RAJA::kernel``. There is a large overlap of algorithms that can +be expressed with either interface. However, there are things that one can do with one or the other but not both. In the following sections, we introduce the basic mechanics and features of both APIs with examples and exercises. We also present a sequence of -execution policy examples and matrix transpose examples using both -``RAJA::kernel`` and ``RAJA::expt::launch`` to compare and contrast the +execution policy examples and matrix transpose examples using both +``RAJA::kernel`` and ``RAJA::launch`` to compare and contrast the two interfaces. =========================================================================== @@ -409,10 +408,10 @@ mechanisms to transform loop patterns. More information can be found in tutorial/offset-layout-5pt-stencil.rst ================================================================= -Nested Loops with ``RAJA::expt::launch`` +Nested Loops with ``RAJA::launch`` ================================================================= -The examples in this section illustrate how to use ``RAJA::expt::launch`` +The examples in this section illustrate how to use ``RAJA::launch`` to create an run time selectable execution space for expressing algorithms as nested loops. @@ -426,12 +425,12 @@ as nested loops. .. _tutorialmatrixtranspose-label: =============================================================================== -Comparing ``RAJA::kernel`` and ``RAJA::expt::launch``: Matrix-Transpose +Comparing ``RAJA::kernel`` and ``RAJA::launch``: Matrix-Transpose =============================================================================== -In this section, we compare ``RAJA::kernel`` and ``RAJA::expt::launch`` -implementations of a matrix transpose algorithm. We illustrate -implementation differences of the two interfaces as we build upon each +In this section, we compare ``RAJA::kernel`` and ``RAJA::launch`` +implementations of a matrix transpose algorithm. We illustrate +implementation differences of the two interfaces as we build upon each example with more complex features. .. 
toctree:: @@ -450,5 +449,3 @@ Other RAJA Features and Usage Examples tutorial/halo-exchange.rst tutorial/matrix_multiply.rst - - diff --git a/docs/sphinx/user_guide/tutorial/matrix_transpose_local_array.rst b/docs/sphinx/user_guide/tutorial/matrix_transpose_local_array.rst index cbaabd459d..0bb70fa94e 100644 --- a/docs/sphinx/user_guide/tutorial/matrix_transpose_local_array.rst +++ b/docs/sphinx/user_guide/tutorial/matrix_transpose_local_array.rst @@ -14,12 +14,12 @@ Tiled Matrix Transpose with Local Array This section extends the discussion in :ref:`tut-tiledmatrixtranspose-label` by adding *local array* objects which are used to store data for each tile in -CPU stack-allocated arrays or GPU thread local and shared memory to be used +CPU stack-allocated arrays or GPU thread local and shared memory to be used within kernels. There are exercise files ``RAJA/exercises/kernel-matrix-transpose-local-array.cpp`` and -``RAJA/exercises/launch-matrix-transpose-local-array.cpp`` for you to work +``RAJA/exercises/launch-matrix-transpose-local-array.cpp`` for you to work through if you wish to get some practice with RAJA. The files ``RAJA/exercises/kernel-matrix-transpose-local-array._solutioncpp`` and ``RAJA/exercises/launch-matrix-transpose-local-array_solution.cpp`` contain @@ -35,24 +35,24 @@ Key RAJA features shown in this example are: * ``RAJA::statement::ForICount`` type for generating local tile indices * ``RAJA::LocalArray`` type for thread-local tile memory arrays * ``RAJA::launch`` kernel execution interface - * ``RAJA::expt::tile`` type for loop tiling - * ``RAJA::expt::loop_icount`` method to generate local tile indices for Launch + * ``RAJA::tile`` type for loop tiling + * ``RAJA::loop_icount`` method to generate local tile indices for Launch * ``RAJA_TEAM_SHARED`` macro for thread-local tile memory arrays -As in :ref:`tut-tiledmatrixtranspose-label`, this example computes the -transpose of an input matrix :math:`A` of size :math:`N_r \times N_c` and +As in :ref:`tut-tiledmatrixtranspose-label`, this example computes the +transpose of an input matrix :math:`A` of size :math:`N_r \times N_c` and stores the result in a second matrix :math:`At` of size :math:`N_c \times N_r`. -The operation uses a local memory tiling algorithm, which tiles the outer -loops and iterates over tiles in inner loops. The algorithm first loads -input matrix entries into a local two-dimensional array for a tile, and then -reads from the tile swapping the row and column indices to generate the output -matrix. +The operation uses a local memory tiling algorithm, which tiles the outer +loops and iterates over tiles in inner loops. The algorithm first loads +input matrix entries into a local two-dimensional array for a tile, and then +reads from the tile swapping the row and column indices to generate the output +matrix. -We choose tile dimensions smaller than the dimensions of the matrix and note +We choose tile dimensions smaller than the dimensions of the matrix and note that it is not necessary for the tile dimensions to divide evenly the number -of rows and columns in the matrix. As in the -:ref:`tut-tiledmatrixtranspose-label` example, we start by defining the number -of rows and columns in the matrices, the tile dimensions, and the number of +of rows and columns in the matrix. As in the +:ref:`tut-tiledmatrixtranspose-label` example, we start by defining the number +of rows and columns in the matrices, the tile dimensions, and the number of tiles. .. 
literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp @@ -68,7 +68,7 @@ as in the :ref:`tut-tiledmatrixtranspose-label` example. :end-before: // _mattranspose_localarray_views_end :language: C++ -The complete sequential C-style implementation of the tiled transpose operation +The complete sequential C-style implementation of the tiled transpose operation using a stack-allocated local array for the tiles is: .. literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp @@ -87,38 +87,38 @@ using a stack-allocated local array for the tiles is: ``RAJA::kernel`` Variants ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The ``RAJA::kernel`` interface provides mechanisms to tile loops and use -*local arrays* in kernels so that algorithm patterns like the C-style kernel -above can be implemented with RAJA. When using ``RAJA::kernel``, a -``RAJA::LocalArray`` type specifies an object whose memory is created inside -a kernel using a statement type in a RAJA kernel execution policy. The local -array data is only usable within the kernel. See :ref:`feat-local_array-label` -for more information. +The ``RAJA::kernel`` interface provides mechanisms to tile loops and use +*local arrays* in kernels so that algorithm patterns like the C-style kernel +above can be implemented with RAJA. When using ``RAJA::kernel``, a +``RAJA::LocalArray`` type specifies an object whose memory is created inside +a kernel using a statement type in a RAJA kernel execution policy. The local +array data is only usable within the kernel. See :ref:`feat-local_array-label` +for more information. -``RAJA::kernel`` methods also support loop tiling statements which determine +``RAJA::kernel`` methods also support loop tiling statements which determine the number of tiles needed to perform an operation based on tile size and extent of the corresponding iteration space. Moreover, lambda expressions for -the kernel will not be invoked for iterations outside the bounds of an +the kernel will not be invoked for iterations outside the bounds of an iteration space when tile dimensions do not divide evenly the size of the iteration space; thus, no conditional checks on loop bounds are needed inside inner loops. For the RAJA version of the matrix transpose kernel above, we define the type of the ``RAJA::LocalArray`` used for matrix entries in a tile and -create an object to represent it: +create an object to represent it: .. literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp :start-after: // _mattranspose_localarray_start :end-before: // _mattranspose_localarray_end :language: C++ -The template parameters that define the type are: the array data type, the -data stride permutation for the array indices (here the identity permutation -is given, so the default RAJA conventions apply; i.e., the rightmost array -index will be stride-1), and the array dimensions. Next, we compare two +The template parameters that define the type are: the array data type, the +data stride permutation for the array indices (here the identity permutation +is given, so the default RAJA conventions apply; i.e., the rightmost array +index will be stride-1), and the array dimensions. Next, we compare two ``RAJA::kernel`` implementations of the matrix transpose operation. -The complete RAJA sequential CPU variant with kernel execution policy and +The complete RAJA sequential CPU variant with kernel execution policy and kernel is: .. 
literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp @@ -127,73 +127,73 @@ kernel is: :language: C++ In the execution policy, the ``RAJA::statement::Tile`` types define -tiling of the outer 'row' (iteration space tuple index '1') and 'col' -(iteration space tuple index '0') loops, as well as tile sizes -(``RAJA::tile_fixed`` types) and loop execution policies. Next, +tiling of the outer 'row' (iteration space tuple index '1') and 'col' +(iteration space tuple index '0') loops, as well as tile sizes +(``RAJA::tile_fixed`` types) and loop execution policies. Next, the ``RAJA::statement::InitLocalMem`` type allocates the local tile array based on the memory policy type (here, we use ``RAJA::cpu_tile_mem`` for -a CPU stack-allocated array). The ``RAJA::ParamList<2>`` parameter indicates -that the local array object is associated with position '2' in the parameter +a CPU stack-allocated array). The ``RAJA::ParamList<2>`` parameter indicates +that the local array object is associated with position '2' in the parameter tuple argument passed to the ``RAJA::kernel_param`` method. The first two -entries in the parameter tuple indicate storage for the local tile indices +entries in the parameter tuple indicate storage for the local tile indices that are used in the two lambda expressions that comprise the kernel body. -Finally, we have two sets of nested inner loops for reading the input matrix -entries into the local tile array and writing them out to the output matrix -transpose. The inner bodies of each of these loop nests are identified by +Finally, we have two sets of nested inner loops for reading the input matrix +entries into the local tile array and writing them out to the output matrix +transpose. The inner bodies of each of these loop nests are identified by lambda expression invocation statements ``RAJA::statement::Lambda<0>`` for the first lambda passed as an argument to the ``RAJA::kernel_param`` method and ``RAJA::statement::Lambda<1>`` for the second lambda argument. -Note that the loops within tiles use ``RAJA::statement::ForICount`` types -rather than ``RAJA::statement::For`` types that we saw in the +Note that the loops within tiles use ``RAJA::statement::ForICount`` types +rather than ``RAJA::statement::For`` types that we saw in the tiled matrix transpose example in :ref:`tut-tiledmatrixtranspose-label`. -The ``RAJA::statement::ForICount`` type generates local tile indices that +The ``RAJA::statement::ForICount`` type generates local tile indices that are passed to lambda loop body expressions to index into the local tile -memory array. As the reader will observe, there is no local tile index -computation needed in the lambdas for the RAJA version of the kernel as a -result. The first integer template parameter for each -``RAJA::statement::ForICount`` type indicates the item in the iteration space -tuple passed to the ``RAJA::kernel_param`` method to which it applies. -The second template parameter for each -``RAJA::statement::ForICount`` type indicates the position in the parameter -tuple passed to the ``RAJA::kernel_param`` method that will hold the -associated local tile index. For more detailed discussion of RAJA loop tiling +memory array. As the reader will observe, there is no local tile index +computation needed in the lambdas for the RAJA version of the kernel as a +result. 
The first integer template parameter for each +``RAJA::statement::ForICount`` type indicates the item in the iteration space +tuple passed to the ``RAJA::kernel_param`` method to which it applies. +The second template parameter for each +``RAJA::statement::ForICount`` type indicates the position in the parameter +tuple passed to the ``RAJA::kernel_param`` method that will hold the +associated local tile index. For more detailed discussion of RAJA loop tiling statement types, please see :ref:`feat-tiling-label`. -Now that we have described the execution policy in some detail, let's pull -everything together by briefly walking though the call to the +Now that we have described the execution policy in some detail, let's pull +everything together by briefly walking through the call to the ``RAJA::kernel_param`` method, which is similar to ``RAJA::kernel`` but takes additional arguments needed to execute the operations involving local -tile indices and the local memory array. The first argument is a tuple of -iteration spaces that define the iteration ranges for the levels in the loop +tile indices and the local memory array. The first argument is a tuple of +iteration spaces that define the iteration ranges for the levels in the loop nest. Again, the first integer parameters given to the ``RAJA::statement::Tile`` -and ``RAJA::statement::ForICount`` types identify the tuple entry to which +and ``RAJA::statement::ForICount`` types identify the tuple entry to which they apply. The second argument:: RAJA::make_tuple((int)0, (int)0, Tile_Array) -is a tuple of data parameters that will hold the local tile indices and -``RAJA::LocalArray`` tile memory. The tuple entries are +is a tuple of data parameters that will hold the local tile indices and +``RAJA::LocalArray`` tile memory. The tuple entries are associated with various statements in the execution policy as we described -earlier. Next, two lambda expression arguments are passed to the -``RAJA::kernel_param`` method for reading and writing the input and output +earlier. Next, two lambda expression arguments are passed to the +``RAJA::kernel_param`` method for reading and writing the input and output matrix entries, respectively. .. note:: ``RAJA::kernel_param`` accepts a parameter tuple argument after - the iteration space tuple, which enables the parameters to be + the iteration space tuple, which enables the parameters to be used in multiple lambda expressions in a kernel. -In the kernel, both lambda expressions take the same five arguments. The first -two are the matrix global column and row indices associated with the iteration +In the kernel, both lambda expressions take the same five arguments. The first +two are the matrix global column and row indices associated with the iteration space tuple. The next three arguments correspond to the parameter tuple entries. -The first two of these are the local tile indices used to access entries in the -``RAJA::LocalArray`` object memory. The last argument is a reference to the +The first two of these are the local tile indices used to access entries in the +``RAJA::LocalArray`` object memory. The last argument is a reference to the ``RAJA::LocalArray`` object itself. -The next ``RAJA::kernel_param`` variant we present works the same as the one -above. It is different from the previous version since we include -additional template parameters in the ``RAJA::statement::Lambda`` types to -indicate which arguments each lambda expression takes and in which order.
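+
+To tie these pieces together, here is a condensed sketch of what the
+sequential policy and the ``RAJA::kernel_param`` call described above can
+look like. It is illustrative only, not the exercise code verbatim: the
+``TILE_MEM`` and ``EXEC_POL`` alias names and the particular ``Param``
+positions are choices made for this sketch, and ``TILE_DIM``, ``Aview``,
+``Atview``, ``N_r``, and ``N_c`` are assumed to be defined as earlier in
+the example::
+
+  // Local array type for one tile; identity permutation, TILE_DIM x TILE_DIM.
+  using TILE_MEM =
+    RAJA::LocalArray<int, RAJA::Perm<0, 1>, RAJA::SizeList<TILE_DIM, TILE_DIM>>;
+  TILE_MEM Tile_Array;
+
+  using EXEC_POL =
+    RAJA::KernelPolicy<
+      RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,   // 'row' tiles
+        RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec, // 'col' tiles
+          // Allocate the local array bound to parameter tuple position 2.
+          RAJA::statement::InitLocalMem<RAJA::cpu_tile_mem, RAJA::ParamList<2>,
+            // Read a tile of the input matrix into the local array.
+            RAJA::statement::ForICount<1, RAJA::statement::Param<1>, RAJA::seq_exec,
+              RAJA::statement::ForICount<0, RAJA::statement::Param<0>, RAJA::seq_exec,
+                RAJA::statement::Lambda<0>
+              >
+            >,
+            // Write the tile out to the transposed matrix.
+            RAJA::statement::ForICount<0, RAJA::statement::Param<0>, RAJA::seq_exec,
+              RAJA::statement::ForICount<1, RAJA::statement::Param<1>, RAJA::seq_exec,
+                RAJA::statement::Lambda<1>
+              >
+            >
+          >
+        >
+      >
+    >;
+
+  RAJA::kernel_param<EXEC_POL>(
+    // Iteration space tuple: entry 0 is 'col', entry 1 is 'row'.
+    RAJA::make_tuple(RAJA::RangeSegment(0, N_c), RAJA::RangeSegment(0, N_r)),
+    // Parameter tuple: local tile indices (positions 0 and 1) and the tile array (position 2).
+    RAJA::make_tuple((int)0, (int)0, Tile_Array),
+
+    // Lambda 0: global (col, row), local (tx, ty), and the local tile array.
+    [=](int col, int row, int tx, int ty, TILE_MEM& Tile_Array) {
+      Tile_Array(ty, tx) = Aview(row, col);
+    },
+
+    // Lambda 1: write the transposed entries back out.
+    [=](int col, int row, int tx, int ty, TILE_MEM& Tile_Array) {
+      Atview(col, row) = Tile_Array(ty, tx);
+    });
+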
+The next ``RAJA::kernel_param`` variant we present works the same as the one +above. It is different from the previous version since we include +additional template parameters in the ``RAJA::statement::Lambda`` types to +indicate which arguments each lambda expression takes and in which order. Here is the complete version including execution policy and kernel: .. literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp @@ -202,25 +202,25 @@ Here is the complete version including execution policy and kernel: :language: C++ Here, the two ``RAJA::statement::Lambda`` types in the execution policy show -two different ways to specify the segments (``RAJA::Segs``) -associated with the matrix column and row indices. That is, we can use a +two different ways to specify the segments (``RAJA::Segs``) +associated with the matrix column and row indices. That is, we can use a ``Segs`` statement for each argument, or include multiple segment ids in one -statement. +statement. Note that we are using ``RAJA::statement::For`` types for the inner tile loops instead of `RAJA::statement::ForICount`` types used in the first variant. As a consequence of specifying lambda arguments, there are two main differences. -The local tile indices are properly computed and passed to the lambda +The local tile indices are properly computed and passed to the lambda expressions as a result of the ``RAJA::Offsets`` types that appear in the lambda statement types. The ``RAJA::statement::Lambda`` type for each -lambda shows the two ways to specify the local tile index arguments; we can -use an ``Offsets`` statement for each argument, or include multiple segment +lambda shows the two ways to specify the local tile index arguments; we can +use an ``Offsets`` statement for each argument, or include multiple segment ids in one statement. Lastly, there is only one entry in the parameter tuple in this case, the local tile array. The placeholders in the previous example are not needed. -.. note:: In this example, we need all five arguments in each lambda - expression so the lambda expression argument lists are +.. note:: In this example, we need all five arguments in each lambda + expression so the lambda expression argument lists are the same. Another use case for the template parameter argument specification described here is to be able to pass only the arguments used in a lambda expression. In particular when we use @@ -228,24 +228,30 @@ previous example are not needed. can have a different argument lists from the others. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -``RAJA::expt::launch`` Variants +``RAJA::launch`` Variants ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The ``RAJA::expt::launch`` interface provides mechanisms to tile loops and use -*local arrays* in kernels to support algorithm patterns like the C-style kernel -above. When, using ``RAJA::expt::launch``, the ``RAJA_TEAM_SHARED`` macro is -used to create a GPU shared memory array or a CPU stack memory array inside -a kernel. - -``RAJA::expt::launch`` support methods for tiling over an iteration space -using ``RAJA::expt::tile`` and ``RAJA::expt::loop_icount`` methods to tile -loops and generate global iteration indices and local tile offsets. 
-Moreover, lambda expressions for these methods will not be invoked for -iterations outside the bounds of an iteration space when tile dimensions -do not divide evenly the size of the iteration space; thus, no conditional -checks on loop bounds are needed inside inner loops. - -A complete RAJA sequential CPU variant with kernel execution policy and +The ``RAJA::launch`` interface provides mechanisms to tile loops and use +*local arrays* in kernels to support algorithm patterns like the C-style kernel +above. When using ``RAJA::launch``, the ``RAJA_TEAM_SHARED`` macro is +used to create a statically sized GPU shared memory array with the CUDA and HIP +back-ends; static shared memory is not currently supported with the SYCL back-end. +On the CPU, allocating with ``RAJA_TEAM_SHARED`` corresponds to allocating memory +on the stack. Alternatively, one can allocate dynamic shared memory by specifying +the amount of shared memory in the ``RAJA::LaunchParams`` struct. Dynamic shared +memory is supported by all back-ends and is demonstrated in our second example. +On the CPU, dynamic shared memory is mapped to heap memory allocated via malloc +at kernel launch. + +As a first example, we illustrate the use of static shared memory +and the ``RAJA::launch`` tiling methods. The ``RAJA::tile`` method +takes an iteration space and partitions it into tiles, which +the ``RAJA::loop_icount`` method can iterate over to generate +global iteration indices and local tile index offsets. Moreover, lambda expressions for these +methods will not be invoked for iterations outside the bounds of an iteration +space when tile dimensions do not evenly divide the size of the iteration space; +thus, no conditional checks on loop bounds are needed inside inner loops. + +A complete RAJA sequential CPU variant with kernel execution policy and kernel is: .. literalinclude:: ../../../../exercises/launch-matrix-transpose-local-array_solution.cpp @@ -253,15 +259,35 @@ kernel is: :end-before: // _mattranspose_localarray_raja_end :language: C++ -Here, the ``RAJA::expt::tile`` method is used to create tilings of the outer -'row' and 'col' iteration spaces. The ``RAJA::expt::tile`` method -takes an additional argument specifying the tile size for the corresponding -loop. To traverse the tile, we use the ``RAJA::expt::loop_icount`` method, -which is similar to the ``RAJA::ForICount`` statement used in a -``RAJA::kernel`` execution policy as shown above. A -``RAJA::expt::loop_icount`` method call +Here, the ``RAJA::tile`` method is used to create tilings of the outer +'row' and 'col' iteration spaces. The ``RAJA::tile`` method +takes an additional argument specifying the tile size for the corresponding +loop. To traverse the tile, we use the ``RAJA::loop_icount`` method, +which is similar to the ``RAJA::ForICount`` statement used in a +``RAJA::kernel`` execution policy as shown above. A +``RAJA::loop_icount`` method call will generate local tile index associated with the outer global index. -The local tile index is necessary as we use it to read and write entries +The local tile index is necessary as we use it to read and write entries from/to global memory to ``RAJA_TEAM_SHARED`` memory array. + +As an alternative to static shared memory, the matrix transpose kernel may be +expressed using dynamic shared memory.
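+
+Before walking through the complete example, the following condensed sketch
+shows the overall shape of the dynamic shared memory pattern. It is a sketch
+only: the ``launch_pol``, ``teams_*``, and ``threads_*`` policy aliases are
+placeholder names chosen here for a host-only configuration, and the
+``getSharedMemory`` call shape is an assumption that should be checked
+against the complete example included below::
+
+  // Placeholder policy aliases for a sequential (host) configuration.
+  using launch_pol = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
+  using teams_x    = RAJA::LoopPolicy<RAJA::loop_exec>;
+  using teams_y    = RAJA::LoopPolicy<RAJA::loop_exec>;
+  using threads_x  = RAJA::LoopPolicy<RAJA::loop_exec>;
+  using threads_y  = RAJA::LoopPolicy<RAJA::loop_exec>;
+
+  // Request one TILE_DIM x TILE_DIM integer tile of dynamic shared memory.
+  constexpr size_t shared_mem_bytes = TILE_DIM * TILE_DIM * sizeof(int);
+
+  RAJA::launch<launch_pol>(
+    RAJA::LaunchParams(RAJA::Teams(outer_Dimr, outer_Dimc),
+                       RAJA::Threads(TILE_DIM, TILE_DIM),
+                       shared_mem_bytes),   // dynamic shared memory, in bytes
+    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
+
+      RAJA::loop<teams_y>(ctx, RAJA::RangeSegment(0, outer_Dimr), [&](int by) {
+        RAJA::loop<teams_x>(ctx, RAJA::RangeSegment(0, outer_Dimc), [&](int bx) {
+
+          // Carve one tile out of the dynamic shared memory buffer.
+          int *tile_mem = ctx.getSharedMemory<int>(TILE_DIM * TILE_DIM);
+
+          // Read a tile of the input matrix into shared memory.
+          RAJA::loop<threads_y>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&](int ty) {
+            RAJA::loop<threads_x>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&](int tx) {
+              int row = by * TILE_DIM + ty;
+              int col = bx * TILE_DIM + tx;
+              if (row < N_r && col < N_c) {
+                tile_mem[ty * TILE_DIM + tx] = Aview(row, col);
+              }
+            });
+          });
+
+          ctx.teamSync();  // ensure the tile is fully loaded before reading it back
+
+          // Write the tile out to the transposed matrix.
+          RAJA::loop<threads_x>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&](int tx) {
+            RAJA::loop<threads_y>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&](int ty) {
+              int row = by * TILE_DIM + ty;
+              int col = bx * TILE_DIM + tx;
+              if (row < N_r && col < N_c) {
+                Atview(col, row) = tile_mem[ty * TILE_DIM + tx];
+              }
+            });
+          });
+
+          // Reset the bump allocator so the next tile starts at the
+          // beginning of the shared memory buffer.
+          ctx.releaseSharedMemory();
+        });
+      });
+    });
+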
+Prior to invoking the kernel, the amount of shared memory must be specified: + +.. literalinclude:: ../../../../examples/dynamic_mat_transpose.cpp + :start-after: // _dynamic_mattranspose_shared_mem_start + :end-before: // _dynamic_mattranspose_shared_mem_end + :language: C++ + +The amount of shared memory is then specified in the ``RAJA::LaunchParams`` struct +and accessed within the kernel using the launch context's ``getSharedMemory`` method. +The ``getSharedMemory`` method may be invoked multiple times, each time returning an +offset into the shared memory buffer. Since the offset advances with each ``getSharedMemory`` +call (also known as bump-style allocation), it is necessary to reset the allocator offset +counter at the end of the shared memory scope to avoid going beyond the buffer size. +The full example of matrix transpose with dynamic shared memory is provided below: + +.. literalinclude:: ../../../../examples/dynamic_mat_transpose.cpp + :start-after: // _dynamic_mattranspose_kernel_start + :end-before: // _dynamic_mattranspose_kernel_end + :language: C++ diff --git a/examples/dynamic_mat_transpose.cpp b/examples/dynamic_mat_transpose.cpp index 26deb9c142..4a03b33c39 100644 --- a/examples/dynamic_mat_transpose.cpp +++ b/examples/dynamic_mat_transpose.cpp @@ -199,7 +199,7 @@ int main(int argc, char *argv[]) // // Define num rows/cols in matrix, tile dimensions, and number of tiles // - // _mattranspose_localarray_dims_start + // _dynamic_mattranspose_localarray_dims_start const int N_r = 267; const int N_c = 251; @@ -207,7 +207,7 @@ int main(int argc, char *argv[]) const int outer_Dimc = (N_c - 1) / TILE_DIM + 1; const int outer_Dimr = (N_r - 1) / TILE_DIM + 1; - // _mattranspose_localarray_dims_end + // _dynamic_mattranspose_localarray_dims_end // // Allocate matrix data @@ -221,10 +221,10 @@ int main(int argc, char *argv[]) // holds a pointer to a data array and enables multi-dimensional indexing // into the data. // - // _mattranspose_localarray_views_start + // _dynamic_mattranspose_localarray_views_start RAJA::View> Aview(A, N_r, N_c); RAJA::View> Atview(At, N_c, N_r); - // _mattranspose_localarray_views_end + // _dynamic_mattranspose_localarray_views_end // // Initialize matrix data @@ -241,7 +241,7 @@ int main(int argc, char *argv[]) std::memset(At, 0, N_r * N_c * sizeof(int)); - // _mattranspose_localarray_cstyle_start + // _dynamic_mattranspose_localarray_cstyle_start // // (0) Outer loops to iterate over tiles // @@ -291,7 +291,7 @@ int main(int argc, char *argv[]) } } - // _mattranspose_localarray_cstyle_end + // _dynamic_mattranspose_localarray_cstyle_end checkResult(Atview, N_c, N_r); // printResult(Atview, N_c, N_r); @@ -317,9 +317,11 @@ int main(int argc, char *argv[]) } #endif - + // _dynamic_mattranspose_shared_mem_start constexpr size_t dynamic_shared_mem_size = TILE_DIM * TILE_DIM * sizeof(int); + // _dynamic_mattranspose_shared_mem_end + // _dynamic_mattranspose_kernel_start RAJA::launch (select_cpu_or_gpu, RAJA::LaunchParams(RAJA::Teams(outer_Dimr, outer_Dimc), @@ -368,15 +370,17 @@ int main(int argc, char *argv[]) }); }); - //The launch context uses bump style allocator to return different segments of shared memory - //to avoid requesting beyond the pre-allocated memory quantity we reset the allocator offset counter - //effectively releasing shared memory. + //The launch context uses a bump-style allocator in which calls + //to getSharedMemory move a memory buffer pointer to return + //different segments of shared memory.
To avoid requesting beyond + //the pre-allocated memory quantity, we reset the allocator offset counter + //in the launch context, effectively releasing the shared memory. ctx.releaseSharedMemory(); - }); }); }); + // _dynamic_mattranspose_kernel_end #if defined(RAJA_ENABLE_HIP)