diff --git a/docs/sphinx/user_guide/feature/loop_basic.rst b/docs/sphinx/user_guide/feature/loop_basic.rst index 8918335b02..db93432cff 100644 --- a/docs/sphinx/user_guide/feature/loop_basic.rst +++ b/docs/sphinx/user_guide/feature/loop_basic.rst @@ -344,22 +344,25 @@ the same team. The ``RAJA::launch`` interface has three main concepts: device execution environment, which enables run time selection of kernel execution. - * ``RAJA::LaunchParams`` type. This type takes a number of teams and and a - number of threads as arguments. + * ``RAJA::LaunchParams`` type. This type takes a number of teams, threads + per team, and optionally the size of dynamic shared memory in bytes. * ``RAJA::loop`` template. These are used to define hierarchical parallel execution of a kernel. Operations within a loop are mapped to either teams or threads based on the execution policy template parameter provided. -Team shared memory is available by using the ``RAJA_TEAM_SHARED`` macro. Team -shared memory enables threads in a given team to share data. In practice, -team policies are typically aliases for RAJA GPU block policies in the -x,y,z dimensions, while thread policies are aliases for RAJA GPU thread -policies in the x,y,z dimensions. In a host execution environment, teams and -threads may be mapped to sequential loop execution or OpenMP threaded regions. -Often, the ``RAJA::LaunchParams`` method can take an empty argument list for -host execution. +Team shared memory can be allocated either statically, by declaring a fixed-size +array with the ``RAJA_TEAM_SHARED`` macro, or dynamically, by passing a shared +memory size to the ``RAJA::LaunchParams`` type. Team shared memory enables threads in a given team to share data. +Loops are then assigned to either teams or threads based on the GPU execution +policy. In CUDA/HIP nomenclature, teams correspond to thread blocks, while +in SYCL nomenclature, teams correspond to work-groups. + +In a host execution environment, the team and thread parameters in the +``RAJA::LaunchParams`` struct have no effect on execution and may be +omitted when running only on the host. + Please see the following tutorial sections for detailed examples that use ``RAJA::launch``: diff --git a/docs/sphinx/user_guide/feature/policies.rst b/docs/sphinx/user_guide/feature/policies.rst index 03df35df8e..28c4c40bb2 100644 --- a/docs/sphinx/user_guide/feature/policies.rst +++ b/docs/sphinx/user_guide/feature/policies.rst @@ -13,10 +13,10 @@ Policies ================== RAJA kernel execution methods take an execution policy type template parameter -to specialize execution behavior. Typically, the policy indicates which -programming model back-end to use and other information about the execution -pattern, such as number of CUDA threads per thread block, whether execution is -synchronous or asynchronous, etc. This section describes RAJA policies for +to specialize execution behavior. Typically, the policy indicates which +programming model back-end to use and other information about the execution +pattern, such as number of CUDA threads per thread block, whether execution is +synchronous or asynchronous, etc. This section describes RAJA policies for loop kernel execution, scans, sorts, reductions, atomics, etc. Please detailed examples in :ref:`tutorial-label` for a variety of use cases. @@ -24,7 +24,7 @@ As RAJA functionality evolves, new policies are added and some may be redefined and to work in new ways. .. note:: * All RAJA policies are in the namespace ``RAJA``.
- * All RAJA policies have a prefix indicating the back-end + * All RAJA policies have a prefix indicating the back-end implementation that they use; e.g., ``omp_`` for OpenMP, ``cuda_`` for CUDA, etc. @@ -47,6 +47,8 @@ apply during code compilation. ====================================== ============= ========================== Sequential/SIMD Execution Policies Works with Brief description ====================================== ============= ========================== + seq_launch_t launch Creates a sequential + execution space. seq_exec forall, Strictly sequential kernel (For), execution. scan, @@ -55,10 +57,10 @@ apply during code compilation. kernel (For), SIMD instructions via scan compiler hints in RAJA's internal implementation. - loop_exec forall, Allow the compiler to + loop_exec forall, Allow the compiler to kernel (For), generate any optimizations scan, that its heuristics deem - sort beneficial according; + sort beneficial; i.e., no loop decorations (pragmas or intrinsics) in RAJA implementation. @@ -71,65 +73,69 @@ OpenMP Parallel CPU Policies For the OpenMP CPU multithreading back-end, RAJA has policies that can be used by themselves to execute kernels. In particular, they create an OpenMP parallel region and execute a kernel within it. To distinguish these in this discussion, -we refer to these as **full policies**. These policies are provided -to users for convenience in common use cases. +we refer to these as **full policies**. These policies are provided +to users for convenience in common use cases. -RAJA also provides other OpenMP policies, which we refer to as -**partial policies**, since they need to be used in combination with other -policies. Typically, they work by providing an *outer policy* and an -*inner policy* as a template parameter to the outer policy. These give users +RAJA also provides other OpenMP policies, which we refer to as +**partial policies**, since they need to be used in combination with other +policies. Typically, they work by providing an *outer policy* and an +*inner policy* as a template parameter to the outer policy. These give users flexibility to create more complex execution patterns. .. note:: To control the number of threads used by OpenMP policies, set the value of the environment variable 'OMP_NUM_THREADS' (which is fixed for duration of run), or call the OpenMP routine - 'omp_set_num_threads(nthreads)' in your application, which allows + 'omp_set_num_threads(nthreads)' in your application, which allows one to change the number of threads at run time. The full policies are described in the following table. Partial policies are described in other tables below. 
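+As a point of reference before the table, here is a minimal sketch (assuming a
+host-allocated array ``a`` of length ``N``) of how one of the full policies is
+used with ``RAJA::forall``::
+
+  // Equivalent to applying 'omp parallel for' to the loop.
+  RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::TypedRangeSegment<int>(0, N),
+    [=] (int i) {
+      a[i] += 1.0;
+    }
+  );
+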
- ========================================= ============= ======================= - OpenMP CPU Full Policies Works with Brief description - ========================================= ============= ======================= - omp_parallel_for_exec forall, Same as applying - kernel (For), 'omp parallel for' - scan, pragma + ========================================= ============== ====================== + OpenMP CPU Full Policies Works with Brief description + ========================================= ============== ====================== + omp_parallel_for_exec forall, Same as applying + kernel (For), 'omp parallel for' + launch (loop), pragma + scan, sort - omp_parallel_for_static_exec forall, Same as applying - kernel (For) 'omp parallel for - schedule(static, - ChunkSize)' - omp_parallel_for_dynamic_exec forall, Same as applying - kernel (For) 'omp parallel for - schedule(dynamic, - ChunkSize)' - omp_parallel_for_guided_exec forall, Same as applying - kernel (For) 'omp parallel for - schedule(guided, - ChunkSize)' - omp_parallel_for_runtime_exec forall, Same as applying - kernel (For) 'omp parallel for - schedule(runtime)' - ========================================= ============= ======================= + omp_parallel_for_static_exec forall, Same as applying + kernel (For) 'omp parallel for + schedule(static, + ChunkSize)' + omp_parallel_for_dynamic_exec forall, Same as applying + kernel (For) 'omp parallel for + schedule(dynamic, + ChunkSize)' + omp_parallel_for_guided_exec forall, Same as applying + kernel (For) 'omp parallel for + schedule(guided, + ChunkSize)' + omp_parallel_for_runtime_exec forall, Same as applying + kernel (For) 'omp parallel for + schedule(runtime)' + ========================================= ============== ====================== .. note:: For the OpenMP scheduling policies above that take a ``ChunkSize`` - parameter, the chunk size is optional. If not provided, the + parameter, the chunk size is optional. If not provided, the default chunk size that OpenMP applies will be used, which may be specific to the OpenMP implementation in use. For this case, - the RAJA policy syntax is - ``omp_parallel_for_{static|dynamic|guided}_exec< >``, which will - result in the OpenMP pragma - ``omp parallel for schedule({static|dynamic|guided})`` being applied. + the RAJA policy syntax is + ``omp_parallel_for_{static|dynamic|guided}_exec< >``, which will + result in the OpenMP pragma + ``omp parallel for schedule({static|dynamic|guided})`` being applied. -RAJA provides an (outer) OpenMP CPU policy to create a parallel region in -which to execute a kernel. It requires an inner policy that defines how a +RAJA provides an (outer) OpenMP CPU policy to create a parallel region in +which to execute a kernel. It requires an inner policy that defines how a kernel will execute in parallel inside the region. ====================================== ============= ========================== OpenMP CPU Outer Policies Works with Brief description ====================================== ============= ========================== + omp_launch_t launch Creates an OpenMP parallel + region. Same as applying + 'omp parallel pragma' omp_parallel_exec forall, Creates OpenMP parallel kernel (For), region and requires an scan **InnerPolicy**. Same as @@ -138,16 +144,16 @@ kernel will execute in parallel inside the region. ====================================== ============= ========================== Finally, we summarize the inner policies that RAJA provides for OpenMP. 
-These policies are passed to the RAJA ``omp_parallel_exec`` outer policy as +These policies are passed to the RAJA ``omp_parallel_exec`` outer policy as a template argument as described above. ====================================== ============= ========================== OpenMP CPU Inner Policies Works with Brief description ====================================== ============= ========================== omp_for_exec forall, Parallel execution within - kernel (For), *existing parallel - scan region*; i.e., - apply 'omp for' pragma. + kernel (For), existing parallel + launch (loop) region; i.e., + scan apply 'omp for' pragma. omp_for_static_exec forall, Same as applying kernel (For) 'omp for schedule(static, @@ -173,37 +179,37 @@ a template argument as described above. indicated using ArgList ====================================== ============= ========================== -.. important:: **RAJA only provides a nowait policy option for static - scheduling** since that is the only schedule case that can be - used with nowait and be correct in general when executing - multiple loops in a single parallel region. Paraphrasing the +.. important:: **RAJA only provides a nowait policy option for static + scheduling** since that is the only schedule case that can be + used with nowait and be correct in general when executing + multiple loops in a single parallel region. Paraphrasing the OpenMP standard: *programs that depend on which thread executes a particular loop iteration under any circumstance other than static schedule are non-conforming.* .. note:: As in the RAJA full policies for OpenMP scheduling, the ``ChunkSize`` - is optional. If not provided, the default chunk size that the OpenMP + is optional. If not provided, the default chunk size that the OpenMP implementation applies will be used. .. note:: As noted above, RAJA inner OpenMP policies must only be used within an - **existing** parallel region to work properly. Embedding an inner - policy inside the RAJA outer ``omp_parallel_exec`` will allow you to - apply the OpenMP execution prescription specified by the policies to + **existing** parallel region to work properly. Embedding an inner + policy inside the RAJA outer ``omp_parallel_exec`` will allow you to + apply the OpenMP execution prescription specified by the policies to a single kernel. To support use cases with multiple kernels inside an - OpenMP parallel region, RAJA provides a **region** construct that - takes a template argument to specify the execution back-end. For + OpenMP parallel region, RAJA provides a **region** construct that + takes a template argument to specify the execution back-end. For example:: RAJA::region<RAJA::omp_parallel_region>([=]() { - RAJA::forall<RAJA::omp_for_nowait_static_exec< > >(segment, + RAJA::forall<RAJA::omp_for_nowait_static_exec< > >(segment, [=] (int idx) { // do something at iterate 'idx' } ); - RAJA::forall<RAJA::omp_for_static_exec< > >(segment, + RAJA::forall<RAJA::omp_for_static_exec< > >(segment, [=] (int idx) { // do something else at iterate 'idx' } @@ -213,22 +219,22 @@ a template argument as described above. Here, the ``RAJA::region`` method call creates an OpenMP parallel region, which contains two ``RAJA::forall`` - kernels. The first uses the ``RAJA::omp_for_nowait_static_exec< >`` - policy, meaning that no thread synchronization is needed after the - kernel. Thus, threads can start working on the second kernel while + kernels. The first uses the ``RAJA::omp_for_nowait_static_exec< >`` + policy, meaning that no thread synchronization is needed after the + kernel. Thus, threads can start working on the second kernel while others are still working on the first kernel.
I general, this will be correct when the segments used in the two kernels are the same, each loop is data parallel, and static scheduling is applied to both - loops. The second kernel uses the ``RAJA::omp_for_static_exec`` - policy, which means that all threads will complete before the kernel - exits. In this example, this is not really needed since there is no - more code to execute in the parallel region and there is an implicit + loops. The second kernel uses the ``RAJA::omp_for_static_exec`` + policy, which means that all threads will complete before the kernel + exits. In this example, this is not really needed since there is no + more code to execute in the parallel region and there is an implicit barrier at the end of it. Threading Building Block (TBB) Parallel CPU Policies ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -RAJA provides a basic set of TBB execution policies for use with the +RAJA provides a basic set of TBB execution policies for use with the RAJA TBB back-end, which supports a subset of RAJA features. ====================================== ============= ========================== @@ -265,77 +271,84 @@ RAJA TBB back-end, which supports a subset of RAJA features. GPU Policies for CUDA and HIP ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -RAJA policies for GPU execution using CUDA or HIP are essentially identical. -The only difference is that CUDA policies have the prefix ``cuda_`` and HIP +RAJA policies for GPU execution using CUDA or HIP are essentially identical. +The only difference is that CUDA policies have the prefix ``cuda_`` and HIP policies have the prefix ``hip_``. - ======================================== ============= ======================== + ======================================== ============= ======================================= CUDA/HIP Execution Policies Works with Brief description - ======================================== ============= ======================== + ======================================== ============= ======================================= cuda/hip_exec forall, Execute loop iterations scan, in a GPU kernel launched sort with given thread-block - size. Note that the + size. Note that the thread-block size must be provided, there is no default provided. + cuda/hip_launch_t launch Launches a device kernel, + any code expressed within + the lambda is executed + on the device. 
cuda/hip_thread_x_direct kernel (For) Map loop iterates - directly to GPU threads + launch (loop) directly to GPU threads in x-dimension, one iterate per thread (see note below about limitations) cuda/hip_thread_y_direct kernel (For) Same as above, but map - to threads in y-dim + launch (loop) to threads in y-dim cuda/hip_thread_z_direct kernel (For) Same as above, but map - to threads in z-dim - cuda/hip_thread_x_loop kernel (For) Similar to - thread-x-direct + launch (loop) to threads in z-dim + cuda/hip_thread_x_loop kernel (For) Similar to + launch (loop) thread-x-direct policy, but use a block-stride loop which doesn't limit number of loop iterates cuda/hip_thread_y_loop kernel (For) Same as above, but for - threads in y-dimension + launch (loop) threads in y-dimension cuda/hip_thread_z_loop kernel (For) Same as above, but for - threads in z-dimension - cuda/hip_flatten_block_threads_{xyz} Launch (Loop) Reshapes threads in a + launch (loop) threads in z-dimension + cuda/hip_flatten_block_threads_{xyz} launch (loop) Reshapes threads in a multi-dimensional thread team into one-dimension, accepts any permutation of dimensions (expt namespace) cuda/hip_block_x_direct kernel (For) Map loop iterates - directly to GPU thread + launch (loop) directly to GPU thread blocks in x-dimension, one iterate per block cuda/hip_block_y_direct kernel (For) Same as above, but map - to blocks in y-dimension + launch (loop) to blocks in y-dimension cuda/hip_block_z_direct kernel (For) Same as above, but map - to blocks in z-dimension - cuda/hip_block_x_loop kernel (For) Similar to - block-x-direct policy, - but use a grid-stride + launch (loop) to blocks in z-dimension + cuda/hip_block_x_loop kernel (For) Similar to + launch (loop) block-x-direct policy, + but use a grid-stride loop. cuda/hip_block_y_loop kernel (For) Same as above, but use - blocks in y-dimension + launch (loop) blocks in y-dimension cuda/hip_block_z_loop kernel (For) Same as above, but use - blocks in z-dimension - cuda/hip_global_thread_x Launch (Loop) Creates a unique thread - id for each thread on - x-dimension of the grid - (expt namespace) - cuda/hip_global_thread_y Launch (Loop) Same as above, but uses - threads in y-dimension - (expt namespace) - cuda/hip_global_thread_z Launch (Loop) Same as above, but uses - threads in z-dimension - (expt namespace) + launch (loop) blocks in z-dimension + cuda/hip_global_thread_x launch (loop) Creates a unique thread + id for each thread on + x-dimension of the grid. + Same as computing + threadIdx.x + blockDim.x * blockIdx.x. + cuda/hip_global_thread_y launch (loop) Same as above, but uses + threads in y-dimension. + Same as computing + threadIdx.y + blockDim.y * blockIdx.y. + cuda/hip_global_thread_z launch (loop) Same as above, but uses + threads in z-dimension. + Same as computing + threadIdx.z + blockDim.z * blockIdx.z. cuda/hip_warp_direct kernel (For) Map work to threads in a warp directly. Cannot be used in conjunction with - cuda/hip_thread_x_* + cuda/hip_thread_x_* policies. Multiple warps can be created by using @@ -346,7 +359,7 @@ policies have the prefix ``hip_``. a warp-stride loop. Cannot be used in conjunction with - cuda/hip_thread_x_* + cuda/hip_thread_x_* policies. Multiple warps can be created by using @@ -357,7 +370,7 @@ policies have the prefix ``hip_``. warp using a bit mask. Cannot be used in conjunction with - cuda/hip_thread_x_* + cuda/hip_thread_x_* policies. Multiple warps can be created by using @@ -365,11 +378,12 @@ policies have the prefix ``hip_``. policies.
cuda/hip_warp_masked_loop> kernel (For) Policy to map work to threads in a warp using - a bit mask and a + a bit mask and a warp-stride loop. Cannot - be used in conjunction + be used in conjunction with cuda/hip_thread_x_* - policies. Multiple warps can be created by using + policies. Multiple warps + can be created by using cuda/hip_thread_y/z_* policies. cuda/hip_block_reduce kernel Perform a reduction (Reduce) across a single GPU thread block. cuda/_warp_reduce kernel Perform a reduction (Reduce) across a single GPU thread warp. - ======================================== ============= ======================== + ======================================== ============= ======================================= Several notable constraints apply to RAJA CUDA/HIP *thread-direct* policies. @@ -415,77 +429,100 @@ Finally GPU Policies for SYCL ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - ======================================== ============= ======================== + ======================================== ============= ============================== SYCL Execution Policies Works with Brief description - ======================================== ============= ======================== + ======================================== ============= ============================== sycl_exec forall, Execute loop iterations in a GPU kernel launched with given work group size. + sycl_launch_t launch Launches a SYCL kernel; + any code expressed within + the lambda is executed + on the device. sycl_global_0 kernel (For) Map loop iterates directly to GPU global ids in first - dimension, one iterate + dimension, one iterate per work item. Group execution into work - groups of given size. + groups of given size. sycl_global_1 kernel (For) Same as above, but map to global ids in second dim sycl_global_2 kernel (For) Same as above, but map - to global ids in third + to global ids in third dim + sycl_global_item_0 launch (loop) Creates a unique thread + id for each thread for + dimension 0 of the grid. + Same as computing + itm.get_group(0) * + itm.get_local_range(0) + + itm.get_local_id(0). + sycl_global_item_1 launch (loop) Same as above, but uses + threads in dimension 1. + Same as computing + itm.get_group(1) * + itm.get_local_range(1) + + itm.get_local_id(1). + sycl_global_item_2 launch (loop) Same as above, but uses + threads in dimension 2. + Same as computing + itm.get_group(2) * + itm.get_local_range(2) + + itm.get_local_id(2).
sycl_local_0_direct kernel (For) Map loop iterates - directly to GPU work + launch (loop) directly to GPU work items in first - dimension, one iterate - per work item (see note + dimension, one iterate + per work item (see note below about limitations) sycl_local_1_direct kernel (For) Same as above, but map - to work items in second + launch (loop) to work items in second dim sycl_local_2_direct kernel (For) Same as above, but map - to work items in third + launch (loop) to work items in third dim - sycl_local_0_loop kernel (For) Similar to - local-1-direct policy, - but use a work + sycl_local_0_loop kernel (For) Similar to + launch (loop) local-1-direct policy, + but use a work group-stride loop which doesn't limit number of loop iterates sycl_local_1_loop kernel (For) Same as above, but for - work items in second + launch (loop) work items in second dimension sycl_local_2_loop kernel (For) Same as above, but for - work items in third + launch (loop) work items in third dimension sycl_group_0_direct kernel (For) Map loop iterates - directly to GPU group - ids in first dimension, + launch (loop) directly to GPU group + ids in first dimension, one iterate per group sycl_group_1_direct kernel (For) Same as above, but map - to groups in second + launch (loop) to groups in second dimension sycl_group_2_direct kernel (For) Same as above, but map - to groups in third + launch (loop) to groups in third dimension - sycl_group_0_loop kernel (For) Similar to - group-1-direct policy, - but use a group-stride + sycl_group_0_loop kernel (For) Similar to + launch (loop) group-1-direct policy, + but use a group-stride loop. sycl_group_1_loop kernel (For) Same as above, but use - groups in second + launch (loop) groups in second dimension sycl_group_2_loop kernel (For) Same as above, but use - groups in third + launch (loop) groups in third dimension - ======================================== ============= ======================== + ======================================== ============= ============================== -OpenMP Target Offload Policies +OpenMP Target Offload Policies ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -RAJA provides policies to use OpenMP to offload kernel execution to a GPU +RAJA provides policies to use OpenMP to offload kernel execution to a GPU device, for example. They are summarized in the following table. ====================================== ============= ========================== @@ -520,10 +557,10 @@ RAJA IndexSet Execution Policies ----------------------------------------------------- When an IndexSet iteration space is used in RAJA by passing an IndexSet -to a ``RAJA::forall`` method, for example, an index set execution policy is -required. An index set execution policy is a **two-level policy**: an 'outer' -policy for iterating over segments in the index set, and an 'inner' policy -used to execute the iterations defined by each segment. An index set execution +to a ``RAJA::forall`` method, for example, an index set execution policy is +required. An index set execution policy is a **two-level policy**: an 'outer' +policy for iterating over segments in the index set, and an 'inner' policy +used to execute the iterations defined by each segment. An index set execution policy type has the form:: RAJA::ExecPolicy< segment_iteration_policy, segment_execution_policy > @@ -556,8 +593,8 @@ Parallel Region Policies ------------------------- Earlier, we discussed using the ``RAJA::region`` construct to -execute multiple kernels in an OpenMP parallel region. 
To support source code -portability, RAJA provides a sequential region concept that can be used to +execute multiple kernels in an OpenMP parallel region. To support source code +portability, RAJA provides a sequential region concept that can be used to surround code that uses execution back-ends other than OpenMP. For example:: RAJA::region([=]() { @@ -574,7 +611,7 @@ surround code that uses execution back-ends other than OpenMP. For example:: .. note:: The sequential region specialization is essentially a *pass through* operation. It is provided so that if you want to turn off OpenMP in - your code, for example, you can simply replace the region policy + your code, for example, you can simply replace the region policy type and you do not have to change your algorithm source code. @@ -615,8 +652,8 @@ cuda/hip_reduce any CUDA/HIP Parallel reduction in a CUDA/HIP kernel reduction value is finalized). cuda/hip_reduce_atomic any CUDA/HIP Same as above, but reduction may use CUDA policy atomic operations. -sycl_reduce any SYCL Reduction in a SYCL kernel (device - policy synchronization will occur when the +sycl_reduce any SYCL Reduction in a SYCL kernel (device + policy synchronization will occur when the reduction value is finalized). ======================= ============= ========================================== @@ -644,17 +681,17 @@ Atomic Policy Loop Policies Brief description seq_atomic seq_exec, Atomic operation performed in a loop_exec non-parallel (sequential) kernel. omp_atomic any OpenMP Atomic operation in OpenM kernel.P - policy multithreading or target kernel; + policy multithreading or target kernel; i.e., apply ``omp atomic`` pragma. cuda/hip/sycl_atomic any Atomic operation performed in a CUDA/HIP/SYCL CUDA/HIP/SYCL kernel. - policy + policy cuda/hip_atomic_explicit any CUDA/HIP Atomic operation performed in a CUDA/HIP policy kernel that may also be used in a host execution context. The atomic policy takes a host atomic policy template - argument. See additional explanation + argument. See additional explanation and example below. builtin_atomic seq_exec, Compiler *builtin* atomic operation. loop_exec, @@ -664,7 +701,7 @@ auto_atomic seq_exec, Atomic operation *compatible* with l loop_exec, execution policy. See example below. any OpenMP Can not be used inside cuda/hip policy, explicit atomic policies. - any + any CUDA/HIP/SYCL policy ============================= ============= ======================================== @@ -743,19 +780,19 @@ of Statements that are composed in the order that they appear in the kernel policy to construct a kernel. A Statement may contain an enclosed StatmentList. Thus, a ``RAJA::KernelPolicy`` type is really just a StatementList. The main Statement types provided by RAJA are ``RAJA::statement::For`` and -``RAJA::statement::Lambda``, that we discussed in -:ref:`loop_elements-kernel-label`. -A ``RAJA::statement::For`` type +``RAJA::statement::Lambda``, that we discussed in +:ref:`loop_elements-kernel-label`. +A ``RAJA::statement::For`` type indicates a for-loop structure. The ``ArgID`` parameter is an integral constant -that identifies the position of the iteration space in the iteration space -tuple passed to the ``RAJA::kernel`` method to be used for the loop. The -``ExecPolicy`` is the RAJA execution policy to use on the loop, which is -similar to ``RAJA::forall`` usage. The ``EnclosedStatements`` type is a -nested template parameter that contains whatever is needed to execute the -kernel and which forms a valid StatementList. 
The +that identifies the position of the iteration space in the iteration space +tuple passed to the ``RAJA::kernel`` method to be used for the loop. The +``ExecPolicy`` is the RAJA execution policy to use on the loop, which is +similar to ``RAJA::forall`` usage. The ``EnclosedStatements`` type is a +nested template parameter that contains whatever is needed to execute the +kernel and which forms a valid StatementList. The ``RAJA::statement::Lambda`` -type invokes the lambda expression corresponding to its position 'LambdaID' -in the sequence of lambda expressions in the ``RAJA::kernel`` argument list. +type invokes the lambda expression corresponding to its position 'LambdaID' +in the sequence of lambda expressions in the ``RAJA::kernel`` argument list. For example, a simple sequential for-loop:: for (int i = 0; i < N; ++i) { @@ -791,13 +828,13 @@ can be used with ``RAJA::kernel`` and ``RAJA::kernel_param``. More detailed explanation along with examples of how they are used can be found in the ``RAJA::kernel`` examples in :ref:`tutorial-label`. -.. note:: All of the statement types described below are in the namespace +.. note:: All of the statement types described below are in the namespace ``RAJA::statement``. For brevity, we omit the namespaces in the discussion in this section. -.. note:: ``RAJA::kernel_param`` functions similarly to ``RAJA::kernel`` - except that the second argument is a *tuple of parameters* used - in a kernel for local arrays, thread local variables, tiling +.. note:: ``RAJA::kernel_param`` functions similarly to ``RAJA::kernel`` + except that the second argument is a *tuple of parameters* used + in a kernel for local arrays, thread local variables, tiling information, etc. Several RAJA statements can be specialized with auxilliary types, which are @@ -813,12 +850,12 @@ The following list contains the most commonly used statement types. * ``Collapse< ExecPolicy, ArgList<...>, EnclosedStatements >`` collapses multiple perfectly nested loops specified by tuple iteration space indices in ``ArgList``, using the ``ExecPolicy`` execution policy, and places ``EnclosedStatements`` inside the collapsed loops which are executed for each iteration. **Note that this only works for CPU execution policies (e.g., sequential, OpenMP).** It may be available for CUDA in the future if such use cases arise. -There is one statement specific to OpenMP kernels. +There is one statement specific to OpenMP kernels. * ``OmpSyncThreads`` applies the OpenMP ``#pragma omp barrier`` directive. -Statement types that launch CUDA or HIP GPU kernels are listed next. They work -similarly for each back-end and their names are distinguished by the prefix +Statement types that launch CUDA or HIP GPU kernels are listed next. They work +similarly for each back-end and their names are distinguished by the prefix ``Cuda`` or ``Hip``. For example, ``CudaKernel`` or ``HipKernel``. * ``Cuda/HipKernel< EnclosedStatements>`` launches ``EnclosedStatements`` as a GPU kernel; e.g., a loop nest where the iteration spaces of each loop level are associated with threads and/or thread blocks as described by the execution policies applied to them. This kernel launch is synchronous. @@ -845,14 +882,14 @@ similarly for each back-end and their names are distinguished by the prefix * ``Cuda/HipSyncWarp`` invokes CUDA ``__syncwarp()`` barrier. Warp sync is not supported in HIP, so the HIP variant is a no-op. -Statement types that launch SYCL kernels are listed next. +Statement types that launch SYCL kernels are listed next. 
* ``SyclKernel`` launches ``EnclosedStatements`` as a SYCL kernel. This kernel launch is synchronous. * ``SyclKernelAsync`` asynchronous version of SyclKernel. -RAJA provides statements to define loop tiling which can improve performance; -e.g., by allowing CPU cache blocking or use of GPU shared memory. +RAJA provides statements to define loop tiling which can improve performance; +e.g., by allowing CPU cache blocking or use of GPU shared memory. * ``Tile< ArgId, TilePolicy, ExecPolicy, EnclosedStatements >`` abstracts an outer tiling loop containing an inner for-loop over each tile. The ``ArgId`` indicates which entry in the iteration space tuple to which the tiling loop applies and the ``TilePolicy`` specifies the tiling pattern to use, including its dimension. The ``ExecPolicy`` and ``EnclosedStatements`` are similar to what they represent in a ``statement::For`` type. @@ -884,7 +921,7 @@ The following list summarizes auxilliary types used in the above statements. The types live in the ``RAJA`` namespace. * ``tile_fixed`` tile policy argument to a ``Tile`` or ``TileTCount`` statement; partitions loop iterations into tiles of a fixed size specified by ``TileSize``. This statement type can be used as the ``TilePolicy`` template parameter in the ``Tile`` statements above. - + * ``tile_dynamic`` TilePolicy argument to a Tile or TileTCount statement; partitions loop iterations into tiles of a size specified by a ``TileSize{}`` positional parameter argument. This statement type can be used as the ``TilePolicy`` template paramter in the ``Tile`` statements above. * ``Segs<...>`` argument to a Lambda statement; used to specify which segments in a tuple will be used as lambda arguments. diff --git a/docs/sphinx/user_guide/feature/tiling.rst b/docs/sphinx/user_guide/feature/tiling.rst index 3d327f8c04..b42ac964c0 100644 --- a/docs/sphinx/user_guide/feature/tiling.rst +++ b/docs/sphinx/user_guide/feature/tiling.rst @@ -13,13 +13,13 @@ Loop Tiling =========== In this section, we discuss RAJA statements that can be used to tile nested -loops. Typical loop tiling involves partitioning an iteration space into -a collection of "tiles" and then iterating over tiles in outer loops and -indices within each tile in inner loops. Many scientific computing algorithms +loops. Typical loop tiling involves partitioning an iteration space into +a collection of "tiles" and then iterating over tiles in outer loops and +indices within each tile in inner loops. Many scientific computing algorithms can benefit from loop tiling due to more efficient cache usage on a CPU or use of GPU shared memory. -For example, consider an operation performed using a C-style for-loop with +For example, consider an operation performed using a C-style for-loop with a range of [0, 10):: for (int i=0; i<10; ++i) { @@ -52,35 +52,67 @@ Here is a way to write the tiled loop kernel above using ``RAJA::kernel``:: >; RAJA::kernel<KERNEL_EXEC_POL>( - RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0,10)), + RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0,10)), [=] (int i) { // kernel body using index 'i' } ); -In RAJA, the simplest way to tile an iteration space is to use +In ``RAJA::kernel``, the simplest way to tile an iteration space is to use ``RAJA::statement::Tile`` and ``RAJA::statement::For`` statement types. A -``RAJA::statement::Tile`` type is similar to a ``RAJA::statement::For`` type, -but takes a tile size as the second template argument.
The -``RAJA::statement::Tile`` type generates the outer loop over tiles and -the ``RAJA::statement::For`` type iterates over each tile. Nested together, -these statements will pass the global index ('i' in the example) to the +``RAJA::statement::Tile`` type is similar to a ``RAJA::statement::For`` type, +but takes a tile size as the second template argument. The +``RAJA::statement::Tile`` type generates the outer loop over tiles and +the ``RAJA::statement::For`` type iterates over each tile. Nested together, +these statements will pass the global index ('i' in the example) to the lambda expression as (kernel body) in a non-tiled version above. -.. note:: When using ``RAJA::statement::Tile`` and ``RAJA::statement::For`` - types together to define a tiled loop structure, the integer passed - as the first template argument to each statement type must be the +.. note:: When using ``RAJA::statement::Tile`` and ``RAJA::statement::For`` + types together to define a tiled loop structure, the integer passed + as the first template argument to each statement type must be the same. This indicates that they both apply to the same iteration space in the space tuple passed to the ``RAJA::kernel`` method. -RAJA also provides alternative statements that provide the tile number and -local tile index, if needed inside the kernel body, as shown below:: + +The ``RAJA::launch`` API also supports loop tiling through specialized +methods. The launch version of the code above is :: + + using launch_t = RAJA::LaunchPolicy<RAJA::seq_launch_t>; + using loop_t = RAJA::LoopPolicy<RAJA::seq_exec>; + + RAJA::launch<launch_t>( + RAJA::LaunchParams(), [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) { + + RAJA::tile<loop_t>( + ctx, tile_size, RAJA::TypedRangeSegment<int>(0, 10), [&] (RAJA::TypedRangeSegment<int> const &tile) { + + RAJA::loop<loop_t>( + ctx, tile, [&] (int i) { + + // kernel body using index 'i' + + } + ); + } + ); + } + ); + +In the example above, the ``RAJA::tile`` method partitions the larger iteration space into tiles and passes each tile segment to its lambda. + +This approach requires source code changes if the developer wants to remove tiling, while ``RAJA::kernel`` enables +switching between tiled and non-tiled variants via execution policies and recompilation. The tile size in ``RAJA::launch`` +can be selected dynamically at run time, and tiles are created on the device. + +For both kernel and launch, RAJA also provides alternatives that expose the tile number and +local tile index. Using ``RAJA::kernel``, we illustrate usage below:: using KERNEL_EXEC_POL2 = RAJA::KernelPolicy< - RAJA::statement::TileTCount<0, RAJA::statement::Param<0>, + RAJA::statement::TileTCount<0, RAJA::statement::Param<0>, RAJA::tile_fixed<2>, RAJA::seq_exec, - RAJA::statement::ForICount<0, RAJA::statement::Param<1>, + RAJA::statement::ForICount<0, RAJA::statement::Param<1>, RAJA::seq_exec, RAJA::statement::Lambda<0> > @@ -101,19 +133,46 @@ local tile index, if needed inside the kernel body, as shown below:: } ); -The ``RAJA::statement::TileTCount`` type indicates that the tile number will -be passed to the lambda expression and the ``RAJA::statement::ForICount`` type -indicates that the local tile loop index will be passed to the lambda -expression. Storage for these values is specified in the parameter tuple, the -second argument passed to the ``RAJA::kernel_param`` method. The -``RAJA::statement::Param<#>`` type appearing as the second -template parameter for each statement type indicates which parameter tuple -entry, the tile number or local tile loop index, is passed to the lambda and -in which order.
Here, the tile number is the second lambda argument (tuple -parameter '0') and the local tile loop index is the third lambda argument +The ``RAJA::statement::TileTCount`` type indicates that the tile number will +be passed to the lambda expression and the ``RAJA::statement::ForICount`` type +indicates that the local tile loop index will be passed to the lambda +expression. Storage for these values is specified in the parameter tuple, the +second argument passed to the ``RAJA::kernel_param`` method. The +``RAJA::statement::Param<#>`` type appearing as the second +template parameter for each statement type indicates which parameter tuple +entry, the tile number or local tile loop index, is passed to the lambda and +in which order. Here, the tile number is the second lambda argument (tuple +parameter '0') and the local tile loop index is the third lambda argument (tuple parameter '1'). .. note:: The global loop indices always appear as the first lambda expression - arguments. Then, the parameter tuples identified by the integers - in the ``RAJA::Param`` statement types given for the loop statement + arguments. Then, the parameter tuples identified by the integers + in the ``RAJA::Param`` statement types given for the loop statement types follow. + +The launch API uses the ``RAJA::tile_tcount`` and ``RAJA::loop_icount`` methods, +which take lambdas with a second argument for the tile number or local tile index, respectively. We illustrate usage below:: + + using launch_t = RAJA::LaunchPolicy<RAJA::seq_launch_t>; + using loop_t = RAJA::LoopPolicy<RAJA::seq_exec>; + + RAJA::launch<launch_t>( + RAJA::LaunchParams(), [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) { + + RAJA::tile_tcount<loop_t>( + ctx, tile_size, RAJA::TypedRangeSegment<int>(0, 10), [&] (RAJA::TypedRangeSegment<int> const &tile, int t) { + + RAJA::loop_icount<loop_t>( + ctx, tile, [&] (int idx, int i) { + + // idx - global index + // t - tile number + // i - index within tile + // Then, idx = i + tile_size*t + + } + ); + } + ); + } + ); diff --git a/docs/sphinx/user_guide/tutorial.rst b/docs/sphinx/user_guide/tutorial.rst index 2d81586c9f..9a8c04b419 100644 --- a/docs/sphinx/user_guide/tutorial.rst +++ b/docs/sphinx/user_guide/tutorial.rst @@ -20,12 +20,12 @@ RAJA Tutorial =============== This section contains a self-paced tutorial that shows how to use many RAJA -features by way of a sequence of examples and exercises. Each exercise is -located in files in the ``RAJA/exercises`` directory, one *exercise* file with -code sections removed and comments containing instructions to fill in the -missing code parts and one *solution* file containing complete working code to +features by way of a sequence of examples and exercises. Each exercise is +located in files in the ``RAJA/exercises`` directory, one *exercise* file with +code sections removed and comments containing instructions to fill in the +missing code parts and one *solution* file containing complete working code to compare with and for guidance if you get stuck working on the exercise file. -You are encouraged to build and run the exercises and modify them to try out +You are encouraged to build and run the exercises and modify them to try out different variations. We also maintain a repository of tutorial slide presentations @@ -40,12 +40,12 @@ difference between CPU (host) and GPU (device) memory allocations and how transfers between those memory spaces work. For a detailed discussion, see `Device Memory `_. -It is important to note that RAJA does not provide a memory model. This is by -design as application developers who use RAJA prefer to manage memory -in different ways.
Thus, users are responsible for ensuring that data is -properly allocated and initialized on a GPU device when running GPU code. -This can be done using explicit host and device allocation and copying between -host and device memory spaces or via unified memory (UM), if available. +It is important to note that RAJA does not provide a memory model. This is by +design as application developers who use RAJA prefer to manage memory +in different ways. Thus, users are responsible for ensuring that data is +properly allocated and initialized on a GPU device when running GPU code. +This can be done using explicit host and device allocation and copying between +host and device memory spaces or via unified memory (UM), if available. The RAJA Portability Suite contains other libraries, namely `CHAI `_ and `Umpire `_, that complement RAJA by @@ -68,20 +68,20 @@ templates and lambda expressions is required. So, before we begin, we provide a bit of background discussion of basic aspects of how RAJA use employs C++ templates and lambda expressions, which is essential to use RAJA successfully. -RAJA is almost an entirely header-only library that makes heavy use of -C++ templates. Using RAJA most easily and effectively is done by representing -the bodies of loop kernels as C++ lambda expressions. Alternatively, C++ -functors can be used, but they make application source code more complex, -potentially placing a significant negative burden on source code readability +RAJA is almost an entirely header-only library that makes heavy use of +C++ templates. Using RAJA most easily and effectively is done by representing +the bodies of loop kernels as C++ lambda expressions. Alternatively, C++ +functors can be used, but they make application source code more complex, +potentially placing a significant negative burden on source code readability and maintainability. ----------------------------------- C++ Templates ----------------------------------- -C++ templates enable one to write type-generic code and have the compiler +C++ templates enable one to write type-generic code and have the compiler generate an implementation for each set of template parameter types specified. -For example, the ``RAJA::forall`` method to execute loop kernels is +For example, the ``RAJA::forall`` method to execute loop kernels is essentially method defined as:: template `` method calls, which +Code written inside the lambda expression body will execute in the chosen +execution environment. Within that environment, a user executes +kernel operations using ``RAJA::loop`` method calls, which take lambda expressions to express loop body operations. -.. note:: A key difference between the ``RAJA::kernel`` and - ``RAJA::expt::launch`` approaches is that almost all of the - kernel execution pattern is expressed in the execution policy - when using ``RAJA::kernel``, whereas with ``RAJA::expt::launch`` the +.. note:: A key difference between the ``RAJA::kernel`` and + ``RAJA::launch`` approaches is that almost all of the + kernel execution pattern is expressed in the execution policy + when using ``RAJA::kernel``, whereas with ``RAJA::launch`` the kernel execution pattern is expressed mostly in the lambda - expression kernel body. + expression kernel body. One may argue that ``RAJA::kernel`` is more portable and flexible in that -the execution policy enables compile time code transformations without -changing kernel body code. 
On the other hand, ``RAJA::expt::launch`` is +the execution policy enables compile time code transformations without +changing kernel body code. On the other hand, ``RAJA::launch`` is less opaque and more intuitive, but may require kernel body code changes for algorithm changes. Which interface to use depends on personal preference -and other concerns, such as portability requirements, the need for run time -execution selection, etc. Kernel structure is more explicit in application -source code with ``RAJA::expt::launch``, and more concise and arguably more -opaque with ``RAJA::kernel``. There is a large overlap of algorithms that can -be expressed with either interface. However, there are things that one can do +and other concerns, such as portability requirements, the need for run time +execution selection, etc. Kernel structure is more explicit in application +source code with ``RAJA::launch``, and more concise and arguably more +opaque with ``RAJA::kernel``. There is a large overlap of algorithms that can +be expressed with either interface. However, there are things that one can do with one or the other but not both. In the following sections, we introduce the basic mechanics and features of both APIs with examples and exercises. We also present a sequence of -execution policy examples and matrix transpose examples using both -``RAJA::kernel`` and ``RAJA::expt::launch`` to compare and contrast the +execution policy examples and matrix transpose examples using both +``RAJA::kernel`` and ``RAJA::launch`` to compare and contrast the two interfaces. =========================================================================== @@ -409,10 +408,10 @@ mechanisms to transform loop patterns. More information can be found in tutorial/offset-layout-5pt-stencil.rst ================================================================= -Nested Loops with ``RAJA::expt::launch`` +Nested Loops with ``RAJA::launch`` ================================================================= -The examples in this section illustrate how to use ``RAJA::expt::launch`` +The examples in this section illustrate how to use ``RAJA::launch`` to create an run time selectable execution space for expressing algorithms as nested loops. @@ -426,12 +425,12 @@ as nested loops. .. _tutorialmatrixtranspose-label: =============================================================================== -Comparing ``RAJA::kernel`` and ``RAJA::expt::launch``: Matrix-Transpose +Comparing ``RAJA::kernel`` and ``RAJA::launch``: Matrix-Transpose =============================================================================== -In this section, we compare ``RAJA::kernel`` and ``RAJA::expt::launch`` -implementations of a matrix transpose algorithm. We illustrate -implementation differences of the two interfaces as we build upon each +In this section, we compare ``RAJA::kernel`` and ``RAJA::launch`` +implementations of a matrix transpose algorithm. We illustrate +implementation differences of the two interfaces as we build upon each example with more complex features. .. 
toctree:: @@ -450,5 +449,3 @@ Other RAJA Features and Usage Examples tutorial/halo-exchange.rst tutorial/matrix_multiply.rst - - diff --git a/docs/sphinx/user_guide/tutorial/matrix_transpose_local_array.rst b/docs/sphinx/user_guide/tutorial/matrix_transpose_local_array.rst index cbaabd459d..0bb70fa94e 100644 --- a/docs/sphinx/user_guide/tutorial/matrix_transpose_local_array.rst +++ b/docs/sphinx/user_guide/tutorial/matrix_transpose_local_array.rst @@ -14,12 +14,12 @@ Tiled Matrix Transpose with Local Array This section extends the discussion in :ref:`tut-tiledmatrixtranspose-label` by adding *local array* objects which are used to store data for each tile in -CPU stack-allocated arrays or GPU thread local and shared memory to be used +CPU stack-allocated arrays or GPU thread local and shared memory to be used within kernels. There are exercise files ``RAJA/exercises/kernel-matrix-transpose-local-array.cpp`` and -``RAJA/exercises/launch-matrix-transpose-local-array.cpp`` for you to work +``RAJA/exercises/launch-matrix-transpose-local-array.cpp`` for you to work through if you wish to get some practice with RAJA. The files ``RAJA/exercises/kernel-matrix-transpose-local-array._solutioncpp`` and ``RAJA/exercises/launch-matrix-transpose-local-array_solution.cpp`` contain @@ -35,24 +35,24 @@ Key RAJA features shown in this example are: * ``RAJA::statement::ForICount`` type for generating local tile indices * ``RAJA::LocalArray`` type for thread-local tile memory arrays * ``RAJA::launch`` kernel execution interface - * ``RAJA::expt::tile`` type for loop tiling - * ``RAJA::expt::loop_icount`` method to generate local tile indices for Launch + * ``RAJA::tile`` type for loop tiling + * ``RAJA::loop_icount`` method to generate local tile indices for Launch * ``RAJA_TEAM_SHARED`` macro for thread-local tile memory arrays -As in :ref:`tut-tiledmatrixtranspose-label`, this example computes the -transpose of an input matrix :math:`A` of size :math:`N_r \times N_c` and +As in :ref:`tut-tiledmatrixtranspose-label`, this example computes the +transpose of an input matrix :math:`A` of size :math:`N_r \times N_c` and stores the result in a second matrix :math:`At` of size :math:`N_c \times N_r`. -The operation uses a local memory tiling algorithm, which tiles the outer -loops and iterates over tiles in inner loops. The algorithm first loads -input matrix entries into a local two-dimensional array for a tile, and then -reads from the tile swapping the row and column indices to generate the output -matrix. +The operation uses a local memory tiling algorithm, which tiles the outer +loops and iterates over tiles in inner loops. The algorithm first loads +input matrix entries into a local two-dimensional array for a tile, and then +reads from the tile swapping the row and column indices to generate the output +matrix. -We choose tile dimensions smaller than the dimensions of the matrix and note +We choose tile dimensions smaller than the dimensions of the matrix and note that it is not necessary for the tile dimensions to divide evenly the number -of rows and columns in the matrix. As in the -:ref:`tut-tiledmatrixtranspose-label` example, we start by defining the number -of rows and columns in the matrices, the tile dimensions, and the number of +of rows and columns in the matrix. As in the +:ref:`tut-tiledmatrixtranspose-label` example, we start by defining the number +of rows and columns in the matrices, the tile dimensions, and the number of tiles. .. 
literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp @@ -68,7 +68,7 @@ as in the :ref:`tut-tiledmatrixtranspose-label` example. :end-before: // _mattranspose_localarray_views_end :language: C++ -The complete sequential C-style implementation of the tiled transpose operation +The complete sequential C-style implementation of the tiled transpose operation using a stack-allocated local array for the tiles is: .. literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp @@ -87,38 +87,38 @@ using a stack-allocated local array for the tiles is: ``RAJA::kernel`` Variants ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The ``RAJA::kernel`` interface provides mechanisms to tile loops and use -*local arrays* in kernels so that algorithm patterns like the C-style kernel -above can be implemented with RAJA. When using ``RAJA::kernel``, a -``RAJA::LocalArray`` type specifies an object whose memory is created inside -a kernel using a statement type in a RAJA kernel execution policy. The local -array data is only usable within the kernel. See :ref:`feat-local_array-label` -for more information. +The ``RAJA::kernel`` interface provides mechanisms to tile loops and use +*local arrays* in kernels so that algorithm patterns like the C-style kernel +above can be implemented with RAJA. When using ``RAJA::kernel``, a +``RAJA::LocalArray`` type specifies an object whose memory is created inside +a kernel using a statement type in a RAJA kernel execution policy. The local +array data is only usable within the kernel. See :ref:`feat-local_array-label` +for more information. -``RAJA::kernel`` methods also support loop tiling statements which determine +``RAJA::kernel`` methods also support loop tiling statements which determine the number of tiles needed to perform an operation based on tile size and extent of the corresponding iteration space. Moreover, lambda expressions for -the kernel will not be invoked for iterations outside the bounds of an +the kernel will not be invoked for iterations outside the bounds of an iteration space when tile dimensions do not divide evenly the size of the iteration space; thus, no conditional checks on loop bounds are needed inside inner loops. For the RAJA version of the matrix transpose kernel above, we define the type of the ``RAJA::LocalArray`` used for matrix entries in a tile and -create an object to represent it: +create an object to represent it: .. literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp :start-after: // _mattranspose_localarray_start :end-before: // _mattranspose_localarray_end :language: C++ -The template parameters that define the type are: the array data type, the -data stride permutation for the array indices (here the identity permutation -is given, so the default RAJA conventions apply; i.e., the rightmost array -index will be stride-1), and the array dimensions. Next, we compare two +The template parameters that define the type are: the array data type, the +data stride permutation for the array indices (here the identity permutation +is given, so the default RAJA conventions apply; i.e., the rightmost array +index will be stride-1), and the array dimensions. Next, we compare two ``RAJA::kernel`` implementations of the matrix transpose operation. -The complete RAJA sequential CPU variant with kernel execution policy and +The complete RAJA sequential CPU variant with kernel execution policy and kernel is: .. 
literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp @@ -127,73 +127,73 @@ kernel is: :language: C++ In the execution policy, the ``RAJA::statement::Tile`` types define -tiling of the outer 'row' (iteration space tuple index '1') and 'col' -(iteration space tuple index '0') loops, as well as tile sizes -(``RAJA::tile_fixed`` types) and loop execution policies. Next, +tiling of the outer 'row' (iteration space tuple index '1') and 'col' +(iteration space tuple index '0') loops, as well as tile sizes +(``RAJA::tile_fixed`` types) and loop execution policies. Next, the ``RAJA::statement::InitLocalMem`` type allocates the local tile array based on the memory policy type (here, we use ``RAJA::cpu_tile_mem`` for -a CPU stack-allocated array). The ``RAJA::ParamList<2>`` parameter indicates -that the local array object is associated with position '2' in the parameter +a CPU stack-allocated array). The ``RAJA::ParamList<2>`` parameter indicates +that the local array object is associated with position '2' in the parameter tuple argument passed to the ``RAJA::kernel_param`` method. The first two -entries in the parameter tuple indicate storage for the local tile indices +entries in the parameter tuple indicate storage for the local tile indices that are used in the two lambda expressions that comprise the kernel body. -Finally, we have two sets of nested inner loops for reading the input matrix -entries into the local tile array and writing them out to the output matrix -transpose. The inner bodies of each of these loop nests are identified by +Finally, we have two sets of nested inner loops for reading the input matrix +entries into the local tile array and writing them out to the output matrix +transpose. The inner bodies of each of these loop nests are identified by lambda expression invocation statements ``RAJA::statement::Lambda<0>`` for the first lambda passed as an argument to the ``RAJA::kernel_param`` method and ``RAJA::statement::Lambda<1>`` for the second lambda argument. -Note that the loops within tiles use ``RAJA::statement::ForICount`` types -rather than ``RAJA::statement::For`` types that we saw in the +Note that the loops within tiles use ``RAJA::statement::ForICount`` types +rather than ``RAJA::statement::For`` types that we saw in the tiled matrix transpose example in :ref:`tut-tiledmatrixtranspose-label`. -The ``RAJA::statement::ForICount`` type generates local tile indices that +The ``RAJA::statement::ForICount`` type generates local tile indices that are passed to lambda loop body expressions to index into the local tile -memory array. As the reader will observe, there is no local tile index -computation needed in the lambdas for the RAJA version of the kernel as a -result. The first integer template parameter for each -``RAJA::statement::ForICount`` type indicates the item in the iteration space -tuple passed to the ``RAJA::kernel_param`` method to which it applies. -The second template parameter for each -``RAJA::statement::ForICount`` type indicates the position in the parameter -tuple passed to the ``RAJA::kernel_param`` method that will hold the -associated local tile index. For more detailed discussion of RAJA loop tiling +memory array. As the reader will observe, there is no local tile index +computation needed in the lambdas for the RAJA version of the kernel as a +result. 
The first integer template parameter for each +``RAJA::statement::ForICount`` type indicates the item in the iteration space +tuple passed to the ``RAJA::kernel_param`` method to which it applies. +The second template parameter for each +``RAJA::statement::ForICount`` type indicates the position in the parameter +tuple passed to the ``RAJA::kernel_param`` method that will hold the +associated local tile index. For more detailed discussion of RAJA loop tiling statement types, please see :ref:`feat-tiling-label`. -Now that we have described the execution policy in some detail, let's pull -everything together by briefly walking though the call to the +Now that we have described the execution policy in some detail, let's pull +everything together by briefly walking through the call to the ``RAJA::kernel_param`` method, which is similar to ``RAJA::kernel`` but takes additional arguments needed to execute the operations involving local -tile indices and the local memory array. The first argument is a tuple of -iteration spaces that define the iteration ranges for the levels in the loop +tile indices and the local memory array. The first argument is a tuple of +iteration spaces that define the iteration ranges for the levels in the loop nest. Again, the first integer parameters given to the ``RAJA::statement::Tile`` -and ``RAJA::statement::ForICount`` types identify the tuple entry to which +and ``RAJA::statement::ForICount`` types identify the tuple entry to which they apply. The second argument:: RAJA::make_tuple((int)0, (int)0, Tile_Array) -is a tuple of data parameters that will hold the local tile indices and -``RAJA::LocalArray`` tile memory. The tuple entries are +is a tuple of data parameters that will hold the local tile indices and +``RAJA::LocalArray`` tile memory. The tuple entries are associated with various statements in the execution policy as we described -earlier. Next, two lambda expression arguments are passed to the -``RAJA::kernel_param`` method for reading and writing the input and output +earlier. Next, two lambda expression arguments are passed to the +``RAJA::kernel_param`` method for reading and writing the input and output matrix entries, respectively. .. note:: ``RAJA::kernel_param`` accepts a parameter tuple argument after - the iteration space tuple, which enables the parameters to be + the iteration space tuple, which enables the parameters to be used in multiple lambda expressions in a kernel. -In the kernel, both lambda expressions take the same five arguments. The first -two are the matrix global column and row indices associated with the iteration +In the kernel, both lambda expressions take the same five arguments. The first +two are the matrix global column and row indices associated with the iteration space tuple. The next three arguments correspond to the parameter tuple entries. -The first two of these are the local tile indices used to access entries in the -``RAJA::LocalArray`` object memory. The last argument is a reference to the +The first two of these are the local tile indices used to access entries in the +``RAJA::LocalArray`` object memory. The last argument is a reference to the ``RAJA::LocalArray`` object itself. -The next ``RAJA::kernel_param`` variant we present works the same as the one -above. It is different from the previous version since we include -additional template parameters in the ``RAJA::statement::Lambda`` types to -indicate which arguments each lambda expression takes and in which order.
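+
+To tie these pieces together, here is a condensed sketch of what the
+sequential policy and the ``RAJA::kernel_param`` call described above can
+look like. It is illustrative only, not the exercise code verbatim: the
+``TILE_MEM`` and ``EXEC_POL`` alias names and the particular ``Param``
+positions are choices made for this sketch, and ``TILE_DIM``, ``Aview``,
+``Atview``, ``N_r``, and ``N_c`` are assumed to be defined as earlier in
+the example::
+
+  // Local array type for one tile; identity permutation, TILE_DIM x TILE_DIM.
+  using TILE_MEM =
+    RAJA::LocalArray<int, RAJA::Perm<0, 1>, RAJA::SizeList<TILE_DIM, TILE_DIM>>;
+  TILE_MEM Tile_Array;
+
+  using EXEC_POL =
+    RAJA::KernelPolicy<
+      RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,   // 'row' tiles
+        RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec, // 'col' tiles
+          // Allocate the local array bound to parameter tuple position 2.
+          RAJA::statement::InitLocalMem<RAJA::cpu_tile_mem, RAJA::ParamList<2>,
+            // Read a tile of the input matrix into the local array.
+            RAJA::statement::ForICount<1, RAJA::statement::Param<1>, RAJA::seq_exec,
+              RAJA::statement::ForICount<0, RAJA::statement::Param<0>, RAJA::seq_exec,
+                RAJA::statement::Lambda<0>
+              >
+            >,
+            // Write the tile out to the transposed matrix.
+            RAJA::statement::ForICount<0, RAJA::statement::Param<0>, RAJA::seq_exec,
+              RAJA::statement::ForICount<1, RAJA::statement::Param<1>, RAJA::seq_exec,
+                RAJA::statement::Lambda<1>
+              >
+            >
+          >
+        >
+      >
+    >;
+
+  RAJA::kernel_param<EXEC_POL>(
+    // Iteration space tuple: entry 0 is 'col', entry 1 is 'row'.
+    RAJA::make_tuple(RAJA::RangeSegment(0, N_c), RAJA::RangeSegment(0, N_r)),
+    // Parameter tuple: local tile indices (positions 0 and 1) and the tile array (position 2).
+    RAJA::make_tuple((int)0, (int)0, Tile_Array),
+
+    // Lambda 0: global (col, row), local (tx, ty), and the local tile array.
+    [=](int col, int row, int tx, int ty, TILE_MEM& Tile_Array) {
+      Tile_Array(ty, tx) = Aview(row, col);
+    },
+
+    // Lambda 1: write the transposed entries back out.
+    [=](int col, int row, int tx, int ty, TILE_MEM& Tile_Array) {
+      Atview(col, row) = Tile_Array(ty, tx);
+    });
+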
+The next ``RAJA::kernel_param`` variant we present works the same as the one +above. It is different from the previous version since we include +additional template parameters in the ``RAJA::statement::Lambda`` types to +indicate which arguments each lambda expression takes and in which order. Here is the complete version including execution policy and kernel: .. literalinclude:: ../../../../exercises/kernel-matrix-transpose-local-array_solution.cpp @@ -202,25 +202,25 @@ Here is the complete version including execution policy and kernel: :language: C++ Here, the two ``RAJA::statement::Lambda`` types in the execution policy show -two different ways to specify the segments (``RAJA::Segs``) -associated with the matrix column and row indices. That is, we can use a +two different ways to specify the segments (``RAJA::Segs``) +associated with the matrix column and row indices. That is, we can use a ``Segs`` statement for each argument, or include multiple segment ids in one -statement. +statement. Note that we are using ``RAJA::statement::For`` types for the inner tile loops instead of `RAJA::statement::ForICount`` types used in the first variant. As a consequence of specifying lambda arguments, there are two main differences. -The local tile indices are properly computed and passed to the lambda +The local tile indices are properly computed and passed to the lambda expressions as a result of the ``RAJA::Offsets`` types that appear in the lambda statement types. The ``RAJA::statement::Lambda`` type for each -lambda shows the two ways to specify the local tile index arguments; we can -use an ``Offsets`` statement for each argument, or include multiple segment +lambda shows the two ways to specify the local tile index arguments; we can +use an ``Offsets`` statement for each argument, or include multiple segment ids in one statement. Lastly, there is only one entry in the parameter tuple in this case, the local tile array. The placeholders in the previous example are not needed. -.. note:: In this example, we need all five arguments in each lambda - expression so the lambda expression argument lists are +.. note:: In this example, we need all five arguments in each lambda + expression so the lambda expression argument lists are the same. Another use case for the template parameter argument specification described here is to be able to pass only the arguments used in a lambda expression. In particular when we use @@ -228,24 +228,30 @@ previous example are not needed. can have a different argument lists from the others. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -``RAJA::expt::launch`` Variants +``RAJA::launch`` Variants ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The ``RAJA::expt::launch`` interface provides mechanisms to tile loops and use -*local arrays* in kernels to support algorithm patterns like the C-style kernel -above. When, using ``RAJA::expt::launch``, the ``RAJA_TEAM_SHARED`` macro is -used to create a GPU shared memory array or a CPU stack memory array inside -a kernel. - -``RAJA::expt::launch`` support methods for tiling over an iteration space -using ``RAJA::expt::tile`` and ``RAJA::expt::loop_icount`` methods to tile -loops and generate global iteration indices and local tile offsets. 
-Moreover, lambda expressions for these methods will not be invoked for -iterations outside the bounds of an iteration space when tile dimensions -do not divide evenly the size of the iteration space; thus, no conditional -checks on loop bounds are needed inside inner loops. - -A complete RAJA sequential CPU variant with kernel execution policy and +The ``RAJA::launch`` interface provides mechanisms to tile loops and use +*local arrays* in kernels to support algorithm patterns like the C-style kernel +above. When using ``RAJA::launch``, the ``RAJA_TEAM_SHARED`` macro is +used to create a statically sized GPU shared memory array with the CUDA and HIP +back-ends; static shared memory is not currently supported with the SYCL back-end. +On the CPU, allocating with ``RAJA_TEAM_SHARED`` corresponds to allocating memory +on the stack. Alternatively, one can allocate dynamic shared memory by specifying +the amount of shared memory in the ``RAJA::LaunchParams`` struct. Dynamic shared +memory is supported by all back-ends and is demonstrated in our second example. +On the CPU, dynamic shared memory is mapped to heap memory allocated via malloc +at kernel launch. + +As a first example, we illustrate the use of static shared memory +and the ``RAJA::launch`` tiling methods. The ``RAJA::tile`` method +takes an iteration space and partitions it into tiles, which +the ``RAJA::loop_icount`` method can iterate over to generate +global iteration indices and local tile index offsets. Moreover, lambda expressions for these +methods will not be invoked for iterations outside the bounds of an iteration +space when tile dimensions do not evenly divide the size of the iteration space; +thus, no conditional checks on loop bounds are needed inside inner loops. + +A complete RAJA sequential CPU variant with kernel execution policy and kernel is: .. literalinclude:: ../../../../exercises/launch-matrix-transpose-local-array_solution.cpp @@ -253,15 +259,35 @@ kernel is: :end-before: // _mattranspose_localarray_raja_end :language: C++ -Here, the ``RAJA::expt::tile`` method is used to create tilings of the outer -'row' and 'col' iteration spaces. The ``RAJA::expt::tile`` method -takes an additional argument specifying the tile size for the corresponding -loop. To traverse the tile, we use the ``RAJA::expt::loop_icount`` method, -which is similar to the ``RAJA::ForICount`` statement used in a -``RAJA::kernel`` execution policy as shown above. A -``RAJA::expt::loop_icount`` method call +Here, the ``RAJA::tile`` method is used to create tilings of the outer +'row' and 'col' iteration spaces. The ``RAJA::tile`` method +takes an additional argument specifying the tile size for the corresponding +loop. To traverse the tile, we use the ``RAJA::loop_icount`` method, +which is similar to the ``RAJA::ForICount`` statement used in a +``RAJA::kernel`` execution policy as shown above. A +``RAJA::loop_icount`` method call will generate local tile index associated with the outer global index. -The local tile index is necessary as we use it to read and write entries +The local tile index is necessary as we use it to read and write entries from/to global memory to ``RAJA_TEAM_SHARED`` memory array. + +As an alternative to static shared memory, the matrix transpose kernel may be +expressed using dynamic shared memory.
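+
+Before walking through the complete example, the following condensed sketch
+shows the overall shape of the dynamic shared memory pattern. It is a sketch
+only: the ``launch_pol``, ``teams_*``, and ``threads_*`` policy aliases are
+placeholder names chosen here for a host-only configuration, and the
+``getSharedMemory`` call shape is an assumption that should be checked
+against the complete example included below::
+
+  // Placeholder policy aliases for a sequential (host) configuration.
+  using launch_pol = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
+  using teams_x    = RAJA::LoopPolicy<RAJA::loop_exec>;
+  using teams_y    = RAJA::LoopPolicy<RAJA::loop_exec>;
+  using threads_x  = RAJA::LoopPolicy<RAJA::loop_exec>;
+  using threads_y  = RAJA::LoopPolicy<RAJA::loop_exec>;
+
+  // Request one TILE_DIM x TILE_DIM integer tile of dynamic shared memory.
+  constexpr size_t shared_mem_bytes = TILE_DIM * TILE_DIM * sizeof(int);
+
+  RAJA::launch<launch_pol>(
+    RAJA::LaunchParams(RAJA::Teams(outer_Dimr, outer_Dimc),
+                       RAJA::Threads(TILE_DIM, TILE_DIM),
+                       shared_mem_bytes),   // dynamic shared memory, in bytes
+    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
+
+      RAJA::loop<teams_y>(ctx, RAJA::RangeSegment(0, outer_Dimr), [&](int by) {
+        RAJA::loop<teams_x>(ctx, RAJA::RangeSegment(0, outer_Dimc), [&](int bx) {
+
+          // Carve one tile out of the dynamic shared memory buffer.
+          int *tile_mem = ctx.getSharedMemory<int>(TILE_DIM * TILE_DIM);
+
+          // Read a tile of the input matrix into shared memory.
+          RAJA::loop<threads_y>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&](int ty) {
+            RAJA::loop<threads_x>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&](int tx) {
+              int row = by * TILE_DIM + ty;
+              int col = bx * TILE_DIM + tx;
+              if (row < N_r && col < N_c) {
+                tile_mem[ty * TILE_DIM + tx] = Aview(row, col);
+              }
+            });
+          });
+
+          ctx.teamSync();  // ensure the tile is fully loaded before reading it back
+
+          // Write the tile out to the transposed matrix.
+          RAJA::loop<threads_x>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&](int tx) {
+            RAJA::loop<threads_y>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&](int ty) {
+              int row = by * TILE_DIM + ty;
+              int col = bx * TILE_DIM + tx;
+              if (row < N_r && col < N_c) {
+                Atview(col, row) = tile_mem[ty * TILE_DIM + tx];
+              }
+            });
+          });
+
+          // Reset the bump allocator so the next tile starts at the
+          // beginning of the shared memory buffer.
+          ctx.releaseSharedMemory();
+        });
+      });
+    });
+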
+Prior to invoking the kernel, the amount of shared memory must be specified: + +.. literalinclude:: ../../../../examples/dynamic_mat_transpose.cpp + :start-after: // _dynamic_mattranspose_shared_mem_start + :end-before: // _dynamic_mattranspose_shared_mem_end + :language: C++ + +The amount of shared memory is then specified in the ``RAJA::LaunchParams`` struct +and accessed within the kernel using the launch context's ``getSharedMemory`` method. +The ``getSharedMemory`` method may be invoked multiple times, each time returning an +offset into the shared memory buffer. Since the offset advances with each ``getSharedMemory`` +call (also known as bump-style allocation), it is necessary to reset the allocator offset +counter at the end of the shared memory scope to avoid going beyond the buffer size. +The full example of matrix transpose with dynamic shared memory is provided below: + +.. literalinclude:: ../../../../examples/dynamic_mat_transpose.cpp + :start-after: // _dynamic_mattranspose_kernel_start + :end-before: // _dynamic_mattranspose_kernel_end + :language: C++ diff --git a/examples/dynamic_mat_transpose.cpp b/examples/dynamic_mat_transpose.cpp index 26deb9c142..4a03b33c39 100644 --- a/examples/dynamic_mat_transpose.cpp +++ b/examples/dynamic_mat_transpose.cpp @@ -199,7 +199,7 @@ int main(int argc, char *argv[]) // // Define num rows/cols in matrix, tile dimensions, and number of tiles // - // _mattranspose_localarray_dims_start + // _dynamic_mattranspose_localarray_dims_start const int N_r = 267; const int N_c = 251; @@ -207,7 +207,7 @@ int main(int argc, char *argv[]) const int outer_Dimc = (N_c - 1) / TILE_DIM + 1; const int outer_Dimr = (N_r - 1) / TILE_DIM + 1; - // _mattranspose_localarray_dims_end + // _dynamic_mattranspose_localarray_dims_end // // Allocate matrix data @@ -221,10 +221,10 @@ int main(int argc, char *argv[]) // holds a pointer to a data array and enables multi-dimensional indexing // into the data. // - // _mattranspose_localarray_views_start + // _dynamic_mattranspose_localarray_views_start RAJA::View> Aview(A, N_r, N_c); RAJA::View> Atview(At, N_c, N_r); - // _mattranspose_localarray_views_end + // _dynamic_mattranspose_localarray_views_end // // Initialize matrix data @@ -241,7 +241,7 @@ int main(int argc, char *argv[]) std::memset(At, 0, N_r * N_c * sizeof(int)); - // _mattranspose_localarray_cstyle_start + // _dynamic_mattranspose_localarray_cstyle_start // // (0) Outer loops to iterate over tiles // @@ -291,7 +291,7 @@ int main(int argc, char *argv[]) } } - // _mattranspose_localarray_cstyle_end + // _dynamic_mattranspose_localarray_cstyle_end checkResult(Atview, N_c, N_r); // printResult(Atview, N_c, N_r); @@ -317,9 +317,11 @@ int main(int argc, char *argv[]) } #endif - + // _dynamic_mattranspose_shared_mem_start constexpr size_t dynamic_shared_mem_size = TILE_DIM * TILE_DIM * sizeof(int); + // _dynamic_mattranspose_shared_mem_end + // _dynamic_mattranspose_kernel_start RAJA::launch (select_cpu_or_gpu, RAJA::LaunchParams(RAJA::Teams(outer_Dimr, outer_Dimc), @@ -368,15 +370,17 @@ int main(int argc, char *argv[]) }); }); - //The launch context uses bump style allocator to return different segments of shared memory - //to avoid requesting beyond the pre-allocated memory quantity we reset the allocator offset counter - //effectively releasing shared memory. + //The launch context uses a bump-style allocator in which calls + //to getSharedMemory move a memory buffer pointer to return + //different segments of shared memory.
To avoid requesting beyond + //the pre-allocated memory quantity, we reset the allocator offset counter + //in the launch context, effectively releasing the shared memory. ctx.releaseSharedMemory(); - }); }); }); + // _dynamic_mattranspose_kernel_end #if defined(RAJA_ENABLE_HIP)