# Notes and scripts for AMD profiling of dycore (#1047)
Draft: iomaganaris wants to merge 51 commits from `amd_profiling` into `main`.
Commits (51):

- `be4cee4` update gt4py version (edopao)
- `6ecff32` switch gt4py branch (edopao)
- `1c9c744` update uv lock (edopao)
- `a1e753f` edit import metrics (edopao)
- `b45b9b1` switch gt4py branch (edopao)
- `517d122` edit import metrics (edopao)
- `672b4f0` edit import metrics (edopao)
- `9b2662d` Merge branch 'main' into update_dace_version (iomaganaris)
- `532c125` Update DaCe version (iomaganaris)
- `991b6b8` Update the gt4py commit (iomaganaris)
- `f194d83` Initial amd notes and scripts (iomaganaris)
- `1eb4708` Pre-compilation fix with_backend (havogt)
- `30fe86c` Fixes to the notes (iomaganaris)
- `4d13d82` Additional comments in the scripts (iomaganaris)
- `81e7a24` Fix gtx_metrics (iomaganaris)
- `47e5e48` Clean up setup script (iomaganaris)
- `cfc5d89` Move scripts in amd_scripts and renamed instructions' file (iomaganaris)
- `adae364` Added quickstart guide (iomaganaris)
- `d7a6aa2` Added goals section (iomaganaris)
- `8ed9403` Added note about scratch directory (iomaganaris)
- `ffc0d51` Use revised `with_compilation_option` naming (tehrengruber)
- `6ded3a9` Merge remote-tracking branch 'origin/update_dace_version_pre_compile_…' (iomaganaris)
- `634ddfe` Cleaned up scripts (iomaganaris)
- `31271fe` Edited notes of instructions (iomaganaris)
- `f450589` Fix GT4PY_BUILD_CACHE_DIR in solver script (iomaganaris)
- `aa13236` Update gt4py branch to fix gtir indeterminism (iomaganaris)
- `1016363` Update branch in pyproject.toml as well (iomaganaris)
- `290c4d0` GT4Py 1.1.4: Pre-compilation fix with_backend (#1048) (havogt)
- `42cd8e4` update gt4py version (havogt)
- `acb1d3b` Merge remote-tracking branch 'upstream/main' into update_dace_version (havogt)
- `f37daab` Use gt4py main branch (iomaganaris)
- `2ae0c61` Updated comment in the benchmark script (iomaganaris)
- `17b41d4` fix typo (havogt)
- `126090e` more typos and fix test (havogt)
- `4c19ce5` Fix print_gt4py_timers script (iomaganaris)
- `e235735` Merge remote-tracking branch 'origin/update_dace_version' into amd_pr… (iomaganaris)
- `7650867` Add rocm7_0 extra (havogt)
- `15cf58d` add missing uv.lock (havogt)
- `def7749` Updated text regarding very slow kernels (iomaganaris)
- `c5b8669` Added profiling of whole dycore with rocprofv3 (iomaganaris)
- `c463434` Mention in the Notes about the TODOs (iomaganaris)
- `cb43ccc` Use gt4py amd_staging_branch that includes fix_indeterministic_get_cl… (iomaganaris)
- `e8f4142` Updated to the introduction notes (iomaganaris)
- `1326d6e` Refactor scripts for setting the benchmarked grid (iomaganaris)
- `e6a5c9f` Updated kernel times in the md file (iomaganaris)
- `e87273f` Commented out problematic profiling command for dycore (iomaganaris)
- `e12a182` Updated results with persistent memory (iomaganaris)
- `f2e3c56` Mention new uenv that enables thread tracing with rocprofv3 (iomaganaris)
- `6029f6c` Merge remote-tracking branch 'origin/main' into amd_profiling (iomaganaris)
- `aa5a195` Fix metrics discovery after new gt4py v1.1.4 release (iomaganaris)
- `0a5d2d2` Update uv.lock with gt4py version (iomaganaris)
# Icon4py performance on MI300

## Intro to icon4py and GT4Py

In the following text we will give an overview of [icon4py](https://github.com/C2SM/icon4py), [GT4Py](https://github.com/GridTools/gt4py) and [DaCe](https://github.com/spcl/dace) and how they interact to compile our Python ICON implementation.

### icon4py

`icon4py` is a Python port of `ICON` implemented using the GT4Py DSL. Currently only certain parts of `ICON` are implemented in `icon4py`, the most important being the `dycore`, which is the `ICON` component that takes most of the execution time. For this reason we think it makes most sense to focus on this component.
The `icon4py` dycore implementation consists of ~20 `GT4Py Programs`, or stencils. Each of these programs comprises multiple GPU (CUDA or HIP) kernels and memory allocations/deallocations, while the full `icon4py` code also performs MPI/NCCL communication. For now we will focus on single-node execution, so no communication is conducted.

### GT4Py

`GT4Py` is a compilation framework that provides a DSL used as the frontend to write the stencil computations. As stated above, in `icon4py` this DSL is embedded in Python code.
Here is an example of a `GT4Py Program` from `icon4py`: [vertically_implicit_solver_at_predictor_step](https://github.com/C2SM/icon4py/blob/e88b14d8be6eed814faf14c5e8a96aca6dfa991e/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/vertically_implicit_dycore_solver.py#L219).
`GT4Py` supports multiple backends: `embedded` (numpy/JAX execution), `GTFN` (GridTools C++ implementation) and `DaCe`. At the moment the most efficient one is `DaCe`, so we will focus on it exclusively. The code from the frontend is lowered from the GT4Py DSL to CUDA/HIP code after numerous transformations, first through the `GT4Py IR (GTIR)` and then through `DaCe Stateful Dataflow Graphs (SDFGs)`. The lowering from `GTIR` to a `DaCe SDFG` is done using the low-level `DaCe` API.

### DaCe

`DaCe` is a programming framework that can take Python code and transform it into an SDFG, a representation to which dataflow optimizations can easily be applied in order to achieve good performance on modern CPUs and GPUs. For more information on what SDFGs look like, see the following [link](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html).
`DaCe` also includes a code generator from SDFG to C++, HIP and CUDA code. The generated HIP code is basically hipified CUDA code, so there are no big differences between the generated code for CUDA and HIP.

## Benchmarking

For the benchmarking we have focused on the `dycore` component of `icon4py`. Below we compare the runtimes of the different `GT4Py Programs` executed in it on an `MI300A` and a `GH200` GPU:

| GT4Py Programs | MI300A Time (s) | GH200 Time (s) | Acceleration of GH200 over MI300A (MI300A time / GH200 time) |
|---|---|---|---|
| compute_diagnostics_from_normal_wind | 0.000268 | 0.000150 | 1.79 |
| compute_advection_in_predictor_vertical_momentum | 0.000195 | 0.000129 | 1.51 |
| compute_advection_in_horizontal_momentum | 0.004871 | 0.000174 | 27.98 |
| compute_perturbed_quantities_and_interpolation | 0.000433 | 0.000255 | 1.70 |
| compute_hydrostatic_correction_term | 0.000034 | 0.000026 | 1.30 |
| compute_rho_theta_pgrad_and_update_vn | 0.105237 | 0.000404 | 260.40 |
| compute_horizontal_velocity_quantities_and_fluxes | 0.000562 | 0.000324 | 1.73 |
| vertically_implicit_solver_at_predictor_step | 0.011691 | 0.000601 | 19.46 |
| compute_advection_in_corrector_vertical_momentum | 0.010325 | 0.000209 | 49.51 |
| compute_interpolation_and_nonhydro_buoy | 0.000253 | 0.000135 | 1.87 |
| apply_divergence_damping_and_update_vn | 0.000208 | 0.000114 | 1.83 |
| vertically_implicit_solver_at_corrector_step | 0.002938 | 0.000592 | 4.96 |

Some of the programs show a dramatic slowdown on the `MI300A`, and for all of them the standard deviation on the `MI300A` is much higher than on the `GH200`. The values above are median runtimes over 100 iterations (excluding the first, slow one), measured with a C++ timer placed as close as possible to the kernel launches.

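The median-over-iterations aggregation described above can be sketched as follows; the timing lists here are illustrative placeholders, not the measured data:

```
# Illustrative sketch of the aggregation behind the table above: median
# runtime per GT4Py Program, dropping the first (slow, compilation-heavy)
# iteration. The sample numbers are made up, not the measured data.
import statistics

def summarize(times_s: list[float]) -> tuple[float, float]:
    """Return (median, stdev) of the runtimes, skipping the first call."""
    steady = times_s[1:]  # the first call includes compilation/warm-up cost
    return statistics.median(steady), statistics.stdev(steady)

mi300a_times = [0.010, 0.00027, 0.00026, 0.00028, 0.00027]
gh200_times = [0.004, 0.00015, 0.00015, 0.00015, 0.00016]

mi300a_median, _ = summarize(mi300a_times)
gh200_median, _ = summarize(gh200_times)
print(f"GH200 speedup over MI300A: {mi300a_median / gh200_median:.2f}")
```
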
While looking at all of them, and especially at the ones that are much slower on the `MI300A`, is useful, we think that starting from a specific `GT4Py Program` and looking at the performance of each kernel it launches is more interesting as a first step.
To that end, we selected one of the `GT4Py Programs` that takes most of the time in a production simulation and whose kernels cover different representative patterns: neighbor reductions, 2D maps and scans.
This is the `vertically_implicit_solver_at_predictor_step` `GT4Py Program`, and here is the comparison of its kernels:

| Name | MI300A Mean Time (μs) | GH200 Mean Time (μs) | Acceleration GH200 over MI300A (MI300A time / GH200 time) |
|---|---|---|---|
| map_100_fieldop_1_0_0_514 | 225.20 | 123.20 | 1.83 |
| map_115_fieldop_1_0_0_518 | 197.40 | 113.04 | 1.75 |
| map_60_fieldop_0_0_504 | 142.10 | 86.66 | 1.64 |
| map_85_fieldop_0_0_506 | 80.45 | 81.28 | 0.99 |
| map_0_fieldop_0_0_500 | 63.02 | 31.68 | 1.99 |
| map_31_fieldop_0_0_0_512 | 54.46 | 28.56 | 1.91 |
| map_90_fieldop_0_0_508 | 25.57 | 18.62 | 1.37 |
| map_91_fieldop_0_0_510 | 7.99 | 3.49 | 2.29 |
| map_100_fieldop_0_0_0_0_520 | 5.59 | 5.07 | 1.10 |
| map_13_fieldop_0_0_498 | 5.32 | 3.70 | 1.44 |
| map_115_fieldop_0_0_0_516 | 4.99 | 5.28 | 0.95 |
| map_35_fieldop_0_0_503 | 3.62 | 1.87 | 1.93 |

The runtimes of the individual kernels were collected using `nsys` (on the GH200) and `rocprofv3` (on the MI300A).

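As a sketch of how such per-kernel means can be derived from a trace, the snippet below groups kernel launch durations by name. The CSV column names (`Kernel_Name`, `Duration_ns`) and the sample rows are hypothetical; real `rocprofv3`/`nsys` exports use their own schemas, which vary by tool version and output format:

```
# Hedged sketch: compute mean kernel time (in microseconds) from a kernel
# trace exported as CSV. Column names and rows are assumed for illustration,
# not actual rocprofv3/nsys output.
import csv
import io
from collections import defaultdict

trace_csv = """Kernel_Name,Duration_ns
map_0_fieldop_0_0_500,63000
map_0_fieldop_0_0_500,63040
map_91_fieldop_0_0_510,8000
map_91_fieldop_0_0_510,7980
"""

durations_ns = defaultdict(list)
for row in csv.DictReader(io.StringIO(trace_csv)):
    durations_ns[row["Kernel_Name"]].append(int(row["Duration_ns"]))

# Mean time per kernel in microseconds, slowest first, as in the table above.
mean_us = {name: sum(ns) / len(ns) / 1e3 for name, ns in durations_ns.items()}
for name, mean in sorted(mean_us.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {mean:10.2f}")
```
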
The benchmarks were run on `Santis` (`GH200` GPU) and `Beverin` (`MI300A` GPU) using the following uenv images:
- GH200: `icon/25.2:v3` (CUDA 12.6)
- MI300A: `build::prgenv-gnu/25.12:2288359995` (ROCm 7.1.0)

To reproduce the benchmark results on `Beverin` you can follow the instructions below:

```
# Pull the correct `uenv` image. *!* NECESSARY ONLY ONCE *!*
uenv image pull build::prgenv-gnu/25.12:2288359995

# Start the uenv and mount the ROCm 7.1.0 environment. *!* This needs to be executed every time, before running anything *!*
uenv start --view default prgenv-gnu/25.12:2288359995

# Run the whole `dycore` granule and gather the runtimes of the `GT4Py Programs`
sbatch benchmark_dycore.sh
# The script above generates a JSON file with the names of the `GT4Py Programs` and their runtimes. The first run is always slow, so we skip it in our analysis.
# With the following Python script you can parse the JSON file and print the runtimes in a readable form:
# python read_gt4py_timers.py dycore_gt4py_program_metrics.json  # passing --csv will also save them to a CSV file

# Run the `vertically_implicit_solver_at_predictor_step` GT4Py program standalone. Notice the `GT4Py Timer Report` table printed by the first `pytest` invocation. The timers reported in this table are as close as possible to the kernel launches of the GT4Py program.
# The following script benchmarks the solver, runs `rocprofv3` to collect a trace, and runs the `rocprof-compute` tool for all its kernels.
sbatch benchmark_solver.sh
```

## Notes

- To understand the code, apart from the analysis with the profilers, there are the following sources:
  1. Look at the generated HIP code for the `GT4Py Program` `vertically_implicit_solver_at_predictor_step` in `amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_<HASH>/src/cuda/vertically_implicit_solver_at_predictor_step.cpp`. The code is generated automatically by DaCe and is a bit too verbose. It would be good to have some feedback on whether the generated code is in a good form for the HIP compiler to optimize.
  2. Look at the `icon4py` frontend code for the `vertically_implicit_solver_at_predictor_step` [here](https://github.com/C2SM/icon4py/blob/e88b14d8be6eed814faf14c5e8a96aca6dfa991e/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/vertically_implicit_dycore_solver.py#L219).
  3. Look at the SDFG generated by DaCe. This can give a nice overview of the computations and the generated kernels. [The DaCe documentation](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html) can help you understand what is expressed in the SDFG. The generated SDFG is saved in `amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_<HASH>/program.sdfg`. To view the SDFG there is a VSCode plugin (`DaCe IOE`), or you can download it locally and open it in https://spcl.github.io/dace-webclient/.

- In the `amd_profiling_solver/.gt4py_cache` directory you may see several `vertically_implicit_solver_at_predictor_step_<HASH>` directories. There are currently issues with caching the compiled programs, so running the profilers might take longer than necessary and cause problems. We should look into this together to figure out a solution.

- Installing the AMD HIP/ROCm packages for our uenv with Spack required various changes, which are done [here](https://github.com/eth-cscs/alps-uenv/pull/273).
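
Related to the caching note above, picking the most recently compiled cache entry (the Python analogue of the `ls -td ... | head -1` line used in the solver script) can be sketched like this; the directories are created in a temporary location purely for illustration, and the hashes are fake:

```
# Sketch: find the newest `vertically_implicit_solver_at_predictor_step_<HASH>`
# entry in a GT4Py build cache by modification time. The cache layout follows
# the notes above; the hash suffixes below are made up.
import pathlib
import tempfile
import time

def latest_build_dir(cache_root: pathlib.Path) -> pathlib.Path:
    """Return the most recently modified solver cache directory."""
    entries = cache_root.glob("vertically_implicit_solver_at_predictor_step_*")
    return max(entries, key=lambda p: p.stat().st_mtime)

cache = pathlib.Path(tempfile.mkdtemp())
for fake_hash in ("aaa111", "bbb222"):
    (cache / f"vertically_implicit_solver_at_predictor_step_{fake_hash}").mkdir()
    time.sleep(0.05)  # ensure distinct modification times

print(latest_build_dir(cache).name)
```
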
---

```
#!/bin/bash
#SBATCH --job-name=dycore_granule_profile
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=mi300

source setup_env.sh
source .venv/bin/activate

export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1"
export GT4PY_BUILD_CACHE_LIFETIME=persistent
export GT4PY_BUILD_CACHE_DIR=amd_profiling_granule
export GT4PY_DYCORE_ENABLE_METRICS="1"
export GT4PY_ADD_GPU_TRACE_MARKERS="1"
export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage"

pytest -sv \
    -m continuous_benchmarking \
    -p no:tach \
    --benchmark-only \
    --benchmark-warmup=on \
    --benchmark-warmup-iterations=30 \
    --backend=dace_gpu \
    --grid=icon_benchmark_regional \
    --benchmark-time-unit=ms \
    --benchmark-min-rounds 100 \
    model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[True-False]

python read_gt4py_timers.py dycore_gt4py_program_metrics.json
```

---

```
#!/bin/bash
#SBATCH --job-name=solver_benchmark
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=mi300

source setup_env.sh
source .venv/bin/activate

export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1"
export GT4PY_BUILD_CACHE_LIFETIME=persistent
export GT4PY_BUILD_CACHE_DIR=amd_profiling_solver
export ORIGINAL_GT4PY_BUILD_CACHE_DIR=$GT4PY_BUILD_CACHE_DIR
export GT4PY_COLLECT_METRICS_LEVEL=10
export GT4PY_ADD_GPU_TRACE_MARKERS="1"
export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=3
export ICON4PY_STENCIL_TEST_ITERATIONS=10
export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=100
export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage"

# Run the benchmark and collect the runtime of the whole GT4Py program (see `GT4Py Timer Report` in the output)
# The compiled GT4Py programs will be cached in the directory specified by `GT4PY_BUILD_CACHE_DIR` to be reused for running the `rocprof-compute` later
pytest -sv \
    -m continuous_benchmarking \
    -p no:tach \
    --backend=dace_gpu \
    --grid=icon_benchmark_regional \
    model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \
    -k "test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]"

# Run the benchmark and collect its trace
# TODO(AMD/CSCS): Figure out why reusing the cached compiled stencils doesn't work under rocprofv3 and the GT4Py programs get recompiled every time we rerun the profiler
# TODO(AMD): Generating `rocpd` output fails with segfaults
export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=0
export ICON4PY_STENCIL_TEST_ITERATIONS=1
export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=10
export GT4PY_BUILD_CACHE_DIR=${GT4PY_BUILD_CACHE_DIR}_rocprofv3  # Separate cache directory for the rocprofv3 run to avoid clashes with kernel names
rocprofv3 --kernel-trace on --hip-trace on --marker-trace on --memory-copy-trace on --memory-allocation-trace on --output-format pftrace -o rocprofv3_${GT4PY_BUILD_CACHE_DIR} -- \
    $(which python3.12) -m pytest -sv \
    -m continuous_benchmarking \
    -p no:tach \
    --backend=dace_gpu \
    --grid=icon_benchmark_regional \
    model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \
    -k "test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]"

# Get the kernel names of the GT4Py program so that we can filter them with rocprof-compute
LAST_COMPILED_DIRECTORY=$(realpath $(ls -td ${ORIGINAL_GT4PY_BUILD_CACHE_DIR}/.gt4py_cache/*/ | head -1))
echo "# Last compiled GT4Py directory: $LAST_COMPILED_DIRECTORY"
LAST_COMPILED_KERNEL_NAMES=$(grep -r -e "__global__ void.*map.*(" ${LAST_COMPILED_DIRECTORY}/src/cuda -o | sed 's/.*\s\([a-zA-Z_][a-zA-Z0-9_]*\)(.*/\1/')
echo "# Last compiled GT4Py kernel names:"
echo "$LAST_COMPILED_KERNEL_NAMES"
ROCPROF_COMPUTE_KERNEL_NAME_FILTER="-k $LAST_COMPILED_KERNEL_NAMES"

# Run rocprof-compute filtering the kernels of interest
# TODO(AMD/CSCS): Figure out why reusing the cached compiled stencils doesn't work under rocprofv3 and the GT4Py programs get recompiled every time we rerun the profiler
# This is problematic when gathering the data for the rocprof-compute analysis as different compilations may result in different kernel names
export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=0
export ICON4PY_STENCIL_TEST_ITERATIONS=1
export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=1
export GT4PY_BUILD_CACHE_DIR=${ORIGINAL_GT4PY_BUILD_CACHE_DIR}  # Reuse the compiled stencils of the first run
rocprof-compute profile --name rcu_${GT4PY_BUILD_CACHE_DIR} ${ROCPROF_COMPUTE_KERNEL_NAME_FILTER} --format-rocprof-output rocpd --kernel-names -R FP64 -- \
    $(which python3.12) -m pytest -sv \
    -m continuous_benchmarking \
    -p no:tach \
    --backend=dace_gpu \
    --grid=icon_benchmark_regional \
    model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \
    -k "test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]"

# TODO(AMD): Roofline generation fails with
#   File "/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/roofline.py", line 998, in standalone_roofline
#     self.empirical_roofline(ret_df=t_df)
#   File "/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/utils/logger.py", line 66, in wrap_function
#     result = function(*args, **kwargs)
#              ^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/roofline.py", line 463, in empirical_roofline
#     flops_figure.write_image(
#   File "/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/.venv/lib/python3.12/site-packages/plotly/basedatatypes.py", line 3895, in write_image
#     return pio.write_image(self, *args, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/.venv/lib/python3.12/site-packages/plotly/io/_kaleido.py", line 555, in write_image
#     path.write_bytes(img_data)
#   File "/user-environment/linux-zen3/python-3.12.12-jpkfwhqo6njvbpw7gjcs22qkvxwexnv5/lib/python3.12/pathlib.py", line 1036, in write_bytes
#     with self.open(mode='wb') as f:
#          ^^^^^^^^^^^^^^^^^^^^
#   File "/user-environment/linux-zen3/python-3.12.12-jpkfwhqo6njvbpw7gjcs22qkvxwexnv5/lib/python3.12/pathlib.py", line 1013, in open
#     return io.open(self, mode, buffering, encoding, errors, newline)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   OSError: [Errno 36] File name too long: '/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/workloads/rcu_amd_profiling_solver/MI300A_A1/empirRoof_gpu-0_FP64_map_0_fieldop_0_0_500_map_100_fieldop_0_0_0_514_map_100_fieldop_1_0_0_0_520_map_115_fieldop_0_0_0_516_map_115_fieldop_1_0_0_518_map_13_fieldop_0_0_498_map_31_fieldop_0_0_0_512_map_35_fieldop_0_0_503_map_60_fieldop_0_0_504_map_85_fieldop_0_0_506_map_90_fieldop_0_0_508_map_91_fieldop_0_0_510.pdf'
```
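
For reference, the kernel-name extraction done with `grep`/`sed` in the script above can be expressed in Python as follows; the HIP snippet is an invented example shaped like the DaCe-generated code, not real output:

```
# Extract generated kernel names from DaCe-emitted HIP source, the Python
# analogue of the grep/sed pipeline in the solver script. The source text
# below is a made-up example, not actual generated code.
import re

hip_source = """
__global__ void map_0_fieldop_0_0_500(double *a, const double *b) { }
__global__ void map_115_fieldop_1_0_0_518(double *out) { }
"""

kernel_names = re.findall(r"__global__ void\s+([A-Za-z_]\w*)\s*\(", hip_source)
print(kernel_names)
```
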