Changes from 27 commits
Commits
51 commits
be4cee4
update gt4py version
edopao Jan 29, 2026
6ecff32
switch gt4py branch
edopao Jan 29, 2026
1c9c744
update uv lock
edopao Jan 29, 2026
a1e753f
edit import metrics
edopao Jan 29, 2026
b45b9b1
switch gt4py branch
edopao Jan 29, 2026
517d122
edit import metrics
edopao Jan 29, 2026
672b4f0
edit import metrics
edopao Jan 29, 2026
9b2662d
Merge branch 'main' into update_dace_version
iomaganaris Feb 4, 2026
532c125
Update DaCe version
iomaganaris Feb 5, 2026
991b6b8
Update the gt4py commit
iomaganaris Feb 5, 2026
f194d83
Initial amd notes and scripts
iomaganaris Feb 5, 2026
1eb4708
Pre-compilation fix with_backend
havogt Feb 5, 2026
30fe86c
Fixes to the notes
iomaganaris Feb 5, 2026
4d13d82
Additional comments in the scripts
iomaganaris Feb 5, 2026
81e7a24
Fix gtx_metrics
iomaganaris Feb 5, 2026
47e5e48
Clean up setup script
iomaganaris Feb 5, 2026
cfc5d89
Move scripts in amd_scripts and renamed instructions' file
iomaganaris Feb 6, 2026
adae364
Added quickstart guide
iomaganaris Feb 6, 2026
d7a6aa2
Added goals section
iomaganaris Feb 6, 2026
8ed9403
Added note about scratch directory
iomaganaris Feb 6, 2026
ffc0d51
Use revised `with_compilation_option` naming
tehrengruber Feb 6, 2026
6ded3a9
Merge remote-tracking branch 'origin/update_dace_version_pre_compile_…
iomaganaris Feb 6, 2026
634ddfe
Cleaned up scripts
iomaganaris Feb 6, 2026
31271fe
Edited notes of instructions
iomaganaris Feb 6, 2026
f450589
Fix GT4PY_BUILD_CACHE_DIR in solver script
iomaganaris Feb 6, 2026
aa13236
Update gt4py branch to fix gtir indeterminism
iomaganaris Feb 6, 2026
1016363
Update branch in pyproject.toml as well
iomaganaris Feb 6, 2026
290c4d0
GT4Py 1.1.4: Pre-compilation fix with_backend (#1048)
havogt Feb 6, 2026
42cd8e4
update gt4py version
havogt Feb 6, 2026
acb1d3b
Merge remote-tracking branch 'upstream/main' into update_dace_version
havogt Feb 6, 2026
f37daab
Use gt4py main branch
iomaganaris Feb 6, 2026
2ae0c61
Updated comment in the benchmark script
iomaganaris Feb 6, 2026
17b41d4
fix typo
havogt Feb 6, 2026
126090e
more typos and fix test
havogt Feb 6, 2026
4c19ce5
Fix print_gt4py_timers script
iomaganaris Feb 6, 2026
e235735
Merge remote-tracking branch 'origin/update_dace_version' into amd_pr…
iomaganaris Feb 6, 2026
7650867
Add rocm7_0 extra
havogt Feb 6, 2026
15cf58d
add missing uv.lock
havogt Feb 6, 2026
def7749
Updated text regarding very slow kernels
iomaganaris Feb 6, 2026
c5b8669
Added profiling of whole dycore with rocprofv3
iomaganaris Feb 6, 2026
c463434
Mention in the Notes about the TODOs
iomaganaris Feb 6, 2026
cb43ccc
Use gt4py amd_staging_branch that includes fix_indeterministic_get_cl…
iomaganaris Feb 6, 2026
e8f4142
Updated to the introduction notes
iomaganaris Feb 9, 2026
1326d6e
Refactor scripts for setting the benchmarked grid
iomaganaris Feb 9, 2026
e6a5c9f
Updated kernel times in the md file
iomaganaris Feb 9, 2026
e87273f
Commented out problematic profiling command for dycore
iomaganaris Feb 9, 2026
e12a182
Updated results with persistent memory
iomaganaris Feb 9, 2026
f2e3c56
Mention new uenv that enables thread tracing with rocprofv3
iomaganaris Feb 19, 2026
6029f6c
Merge remote-tracking branch 'origin/main' into amd_profiling
iomaganaris Feb 19, 2026
aa5a195
Fix metrics discovery after new gt4py v1.1.4 release
iomaganaris Feb 16, 2026
0a5d2d2
Update uv.lock with gt4py version
iomaganaris Feb 19, 2026
175 changes: 175 additions & 0 deletions AMD_INTRODUCTION.md
@@ -0,0 +1,175 @@
# Icon4py performance on MI300

## Quickstart

```
# Connect to Beverin (CSCS system with MI300A)
ssh beverin.cscs.ch
```

On Beverin:
```
# Enter scratch directory
cd $SCRATCH

# Clone icon4py and checkout the correct branch
git clone git@github.com:C2SM/icon4py.git
cd icon4py
git checkout amd_profiling

# Pull the correct `uenv` image. *!* NECESSARY ONLY ONCE *!*
uenv image pull build::prgenv-gnu/25.12:2288359995

# Start the uenv and mount the ROCm 7.1.0 environment. *!* This must be run every time, before running anything else *!*
uenv start --view default prgenv-gnu/25.12:2288359995

# Install the necessary venv
bash amd_scripts/install_icon4py_venv.sh

# Source venv
source .venv/bin/activate

# Source other necessary environment variables
source amd_scripts/setup_env.sh

# Set GT4Py related environment variables
export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1"
export GT4PY_BUILD_CACHE_LIFETIME=persistent
export GT4PY_BUILD_CACHE_DIR=amd_profiling_granule
export GT4PY_COLLECT_METRICS_LEVEL=10
export GT4PY_DYCORE_ENABLE_METRICS="1"
export GT4PY_ADD_GPU_TRACE_MARKERS="1"
export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage"

# Benchmark dycore
pytest -sv \
-m continuous_benchmarking \
-p no:tach \
--benchmark-only \
--benchmark-warmup=on \
--benchmark-warmup-iterations=30 \
--benchmark-json=pytest_benchmark_results.json \
--backend=dace_gpu \
--grid=icon_benchmark_regional \
--benchmark-time-unit=ms \
--benchmark-min-rounds 100 \
model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[True-False]

# Print GT4Py timers
python amd_scripts/read_gt4py_timers.py dycore_gt4py_program_metrics.json
```

For more information about benchmarking, see the [Benchmarking](#benchmarking) section.

## Intro to icon4py and GT4Py

This section gives an overview of [icon4py](https://github.com/C2SM/icon4py), [GT4Py](https://github.com/GridTools/gt4py) and [DaCe](https://github.com/spcl/dace), and of how they interact to compile our Python ICON implementation.

### icon4py

`icon4py` is a Python port of `ICON` implemented using the `GT4Py DSL`. Currently only certain parts of `ICON` are implemented in `icon4py`. The most important one is the `dycore`, the `ICON` component that takes most of the execution time.
For this reason we think it makes the most sense to focus on this component.
The `icon4py` dycore implementation consists of ~20 `GT4Py Programs` (stencils). Each of these programs consists of multiple GPU (CUDA or HIP) kernels and memory allocations/deallocations, while the full `icon4py` code also contains MPI/NCCL communication. For now we focus on single-node execution, so no communication takes place.

### GT4Py

`GT4Py` is a compilation framework that provides a DSL, embedded in Python, which serves as the frontend for writing stencil computations such as the ones in `icon4py`.
Here is an example of a `GT4Py Program` from `icon4py`: [vertically_implicit_solver_at_predictor_step](https://github.com/C2SM/icon4py/blob/e88b14d8be6eed814faf14c5e8a96aca6dfa991e/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/vertically_implicit_dycore_solver.py#L219).
`GT4Py` supports multiple backends: `embedded` (with numpy/JAX execution), `GTFN` (GridTools C++ implementation) and `DaCe`. At the moment `DaCe` is the most efficient one, so we focus on it exclusively. The frontend code is lowered from the `GT4Py DSL` to CUDA/HIP code through numerous transformations, first in `GT4Py IR (GTIR)` and then in `DaCe Stateful Dataflow Graphs (SDFG)`. The lowering from `GTIR` to `DaCe SDFG` uses the low-level `DaCe` API.

### DaCe

`DaCe` is a programming framework that can take Python code and transform it into an SDFG, a representation to which dataflow optimizations are easy to apply in order to achieve good performance on modern CPUs and GPUs. For more information on what SDFGs look like, see the [DaCe documentation](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html).
`DaCe` also includes a code generator from SDFG to C++, HIP and CUDA code. The generated HIP code is essentially hipified CUDA code, so there are no big differences between the code generated for CUDA and for HIP.


## Benchmarking

For the benchmarking we focused on the `dycore` component of `icon4py`. The table below compares the runtimes of the different `GT4Py Programs` executed in it on an `MI300A` and on a `GH200 GPU`:

```
+--------------------------------------------------------+-----------------+----------------+--------------------------------------------------------------+
| GT4Py Programs | MI300A Time (s) | GH200 Time (s) | Acceleration of GH200 over MI300A (MI300A time / GH200 time) |
+--------------------------------------------------------+-----------------+----------------+--------------------------------------------------------------+
| compute_diagnostics_from_normal_wind | 0.000268 | 0.000150 | 1.79 |
| compute_advection_in_predictor_vertical_momentum | 0.000195 | 0.000129 | 1.51 |
| compute_advection_in_horizontal_momentum | 0.004871 | 0.000174 | 27.98 |
| compute_perturbed_quantities_and_interpolation | 0.000433 | 0.000255 | 1.70 |
| compute_hydrostatic_correction_term | 0.000034 | 0.000026 | 1.30 |
| compute_rho_theta_pgrad_and_update_vn | 0.105237 | 0.000404 | 260.40 |
| compute_horizontal_velocity_quantities_and_fluxes | 0.000562 | 0.000324 | 1.73 |
| vertically_implicit_solver_at_predictor_step | 0.011691 | 0.000601 | 19.46 |
| compute_advection_in_corrector_vertical_momentum | 0.010325 | 0.000209 | 49.51 |
| compute_interpolation_and_nonhydro_buoy | 0.000253 | 0.000135 | 1.87 |
| apply_divergence_damping_and_update_vn | 0.000208 | 0.000114 | 1.83 |
| vertically_implicit_solver_at_corrector_step | 0.002938 | 0.000592 | 4.96 |
+--------------------------------------------------------+-----------------+----------------+--------------------------------------------------------------+
```

Some programs are dramatically slower on the `MI300A` (up to ~260x for `compute_rho_theta_pgrad_and_update_vn`), and for all of them the standard deviation on the `MI300A` is much higher than on the `GH200`. The values above are the median runtimes over 100 iterations (excluding the first, slow one), measured with a C++ timer placed as close as possible to the kernel launches.
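
To make the reported statistic concrete, here is a minimal Python sketch of how a median that skips the first (slow) iteration, and the ratio shown in the last table column, could be computed. The sample list and the helper name `summarize` are hypothetical and only serve as illustration; the two times used for the ratio (0.011691 s and 0.000601 s for `vertically_implicit_solver_at_predictor_step`) are taken from the table above.

```python
import statistics


def summarize(samples):
    """Return (median, standard deviation) of the timings, skipping the first call."""
    steady = samples[1:]  # the first iteration is slow, so it is excluded
    return statistics.median(steady), statistics.stdev(steady)


# Hypothetical timings in seconds for one GT4Py program (not measured data).
samples = [0.0500, 0.0117, 0.0116, 0.0118, 0.0117]
median, stdev = summarize(samples)
print(f"median={median:.4f} s, stdev={stdev:.6f} s")

# Ratio as defined in the table header: MI300A time / GH200 time.
mi300a_time, gh200_time = 0.011691, 0.000601
ratio = mi300a_time / gh200_time
print(f"MI300A time / GH200 time = {ratio:.2f}")
```

Ratios recomputed from the rounded table entries can differ from the printed column in the last digit, presumably because the column was derived from unrounded timings.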

Looking at all of the programs, and especially at the ones that are much slower on the `MI300A`, is useful. As a first step, however, we think it is more interesting to start from a specific `GT4Py Program` and look at the performance of each kernel it launches.
To that end, we selected one of the `GT4Py Programs` that takes most of the time in a production simulation and whose kernels cover different representative patterns: neighbor reductions, 2D maps and scans.
This is the `vertically_implicit_solver_at_predictor_step` `GT4Py Program`, and here is the comparison of its kernels:

```
+-----------------------------+-----------------------+------------------------+-----------------------------------------------------------+
| Name | MI300A Avg Time (μs) | GH200 Mean Time (μs) | Acceleration GH200 over MI300A (MI300A time / GH200 time) |
+-----------------------------+-----------------------+------------------------+-----------------------------------------------------------+
| map_100_fieldop_1_0_0_514 | 225.20 | 123.20 | 1.83 |
| map_115_fieldop_1_0_0_518 | 197.40 | 113.04 | 1.75 |
| map_60_fieldop_0_0_504 | 142.10 | 86.66 | 1.64 |
| map_85_fieldop_0_0_506 | 80.45 | 81.28 | 0.99 |
| map_0_fieldop_0_0_500 | 63.02 | 31.68 | 1.99 |
| map_31_fieldop_0_0_0_512 | 54.46 | 28.56 | 1.91 |
| map_90_fieldop_0_0_508 | 25.57 | 18.62 | 1.37 |
| map_91_fieldop_0_0_510 | 7.99 | 3.49 | 2.29 |
| map_100_fieldop_0_0_0_0_520 | 5.59 | 5.07 | 1.10 |
| map_13_fieldop_0_0_498 | 5.32 | 3.70 | 1.44 |
| map_115_fieldop_0_0_0_516 | 4.99 | 5.28 | 0.95 |
| map_35_fieldop_0_0_503 | 3.62 | 1.87 | 1.93 |
+-----------------------------+-----------------------+------------------------+-----------------------------------------------------------+
```

The runtimes of the individual kernels were collected using `nsys` (on the `GH200`) and `rocprofv3` (on the `MI300A`).
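
For readers unfamiliar with the three kernel patterns named above (neighbor reductions, 2D maps and scans), the following plain-Python sketch illustrates what each one computes. It is not GT4Py or generated code: the function names, the toy connectivity table and the field values are all hypothetical, and fields are indexed as `field[horizontal][vertical]`.

```python
from itertools import accumulate


def neighbor_reduction(edge_field, cell_to_edges):
    """For every cell and level, sum the values on the edges adjacent to the cell."""
    n_levels = len(edge_field[0])
    return [
        [sum(edge_field[e][k] for e in edges) for k in range(n_levels)]
        for edges in cell_to_edges
    ]


def map_2d(a, b):
    """Pointwise operation over the (horizontal, vertical) index space."""
    return [[x + 2.0 * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def vertical_scan(field):
    """Running sum along the vertical: level k depends on level k - 1."""
    return [list(accumulate(column)) for column in field]


# Toy mesh: 4 edges, 3 cells, 2 vertical levels (hypothetical values).
edge_field = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
cell_to_edges = [[0, 1, 2], [1, 2, 3], [0, 2, 3]]

cell_field = neighbor_reduction(edge_field, cell_to_edges)
print(cell_field)
print(map_2d(cell_field, cell_field))
print(vertical_scan(cell_field))
```

The neighbor reduction reads through an indirection table, the 2D map is fully parallel, and the scan is sequential along the vertical, so the three patterns typically stress the GPU in different ways.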

The benchmarks were run on `Santis` (`GH200 GPU`) and `Beverin` (`MI300A GPU`) using the following uenv images:
- GH200: `icon/25.2:v3` (CUDA 12.6)
- MI300A: `build::prgenv-gnu/25.12:2288359995` (ROCm 7.1.0)
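
As a rough cross-check of the two tables above, the per-kernel times can be summed and compared with the timer of the whole `vertically_implicit_solver_at_predictor_step` program. The sketch below uses only numbers copied from the tables; the variable names are illustrative. The comparison is approximate, since the kernel times come from profiler runs and the program times from the C++ timer.

```python
# (MI300A time in μs, GH200 time in μs) per kernel, copied from the kernel table.
kernel_times_us = {
    "map_100_fieldop_1_0_0_514": (225.20, 123.20),
    "map_115_fieldop_1_0_0_518": (197.40, 113.04),
    "map_60_fieldop_0_0_504": (142.10, 86.66),
    "map_85_fieldop_0_0_506": (80.45, 81.28),
    "map_0_fieldop_0_0_500": (63.02, 31.68),
    "map_31_fieldop_0_0_0_512": (54.46, 28.56),
    "map_90_fieldop_0_0_508": (25.57, 18.62),
    "map_91_fieldop_0_0_510": (7.99, 3.49),
    "map_100_fieldop_0_0_0_0_520": (5.59, 5.07),
    "map_13_fieldop_0_0_498": (5.32, 3.70),
    "map_115_fieldop_0_0_0_516": (4.99, 5.28),
    "map_35_fieldop_0_0_503": (3.62, 1.87),
}

# Whole-program times from the first table, converted from seconds to μs.
program_time_us = {"MI300A": 0.011691 * 1e6, "GH200": 0.000601 * 1e6}

kernel_sum_us = {
    "MI300A": sum(times[0] for times in kernel_times_us.values()),
    "GH200": sum(times[1] for times in kernel_times_us.values()),
}

for gpu in ("MI300A", "GH200"):
    share = kernel_sum_us[gpu] / program_time_us[gpu]
    print(
        f"{gpu}: kernels {kernel_sum_us[gpu]:.2f} us, "
        f"program {program_time_us[gpu]:.0f} us, kernel share {share:.0%}"
    )
```

With these numbers the kernels add up to a small fraction of the program time on the `MI300A` but to most of it on the `GH200`. If the two measurements are comparable, this would suggest that much of the `MI300A` program time is spent outside the profiled kernels, which may be a useful starting point for the analysis.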

To reproduce the benchmark results on `Beverin` you can follow the instructions below:

```
# Pull the correct `uenv` image. *!* NECESSARY ONLY ONCE *!*
uenv image pull build::prgenv-gnu/25.12:2288359995

# Start the uenv and mount the ROCm 7.1.0 environment. *!* This must be run every time, before running anything else *!*
uenv start --view default prgenv-gnu/25.12:2288359995

# Run the whole `dycore` granule and gather the runtimes of the `GT4Py Programs`
sbatch amd_scripts/benchmark_dycore.sh
# The script above generates a json file with the names of the `GT4Py Programs` and their runtimes. The first run is always slow, so we exclude it from our analysis
# With the following python script you can parse the json file and print the runtimes in a readable form
# python amd_scripts/read_gt4py_timers.py dycore_gt4py_program_metrics.json # passing --csv will save them in a csv file

# Run the `vertically_implicit_solver_at_predictor_step` GT4Py program standalone. Notice the `GT4Py Timer Report` table printed from the first `pytest` invocation. The reported timers on this table are as close as possible to the kernel launches of the GT4Py program.
# The following script will benchmark the solver, run `rocprofv3` and collect a trace of it as well as run the `rocprof-compute` tool for all its kernels
sbatch benchmark_solver.sh
```

## Hackathon goals

- Understand what the bottlenecks are in our currently generated code
- Discuss what changes we can make in the code generation, kernel configuration or memory layout to address these bottlenecks
- Identify what further code changes are needed to take advantage of the full MI300A performance (shared memory, warp shuffling, etc.)
- Fix any issues with the ROCm profilers and learn how to use them effectively

## Notes

- Apart from the profiler analysis, the following sources help to understand the code:
1. Look at the generated HIP code for the `GT4Py program` `vertically_implicit_solver_at_predictor_step` in `<icon4py_root_dir>/amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_<HASH>/src/cuda/vertically_implicit_solver_at_predictor_step.cpp`. The code is generated from DaCe automatically and it's a bit too verbose. It would be good to have some feedback on whether the generated code is in a good form for the HIP compiler to optimize.
2. Look at the generated assembly and HIP kernel characteristics (outputs of `-save-temps -Rpass-analysis=kernel-resource-usage`) in `<icon4py_root_dir>/amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_<HASH>/build/vertically_implicit_solver_at_predictor_step_cuda-hip-amdgcn-amd-amdhsa-gfx942.s`.
3. Look at the `icon4py` frontend code for the `vertically_implicit_solver_at_predictor_step` [here](https://github.com/C2SM/icon4py/blob/e88b14d8be6eed814faf14c5e8a96aca6dfa991e/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/vertically_implicit_dycore_solver.py#L219)
4. Look at the generated SDFG by DaCe. This can give a nice overview of the computations and kernels generated. Using [the DaCe documentation](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html) can help you understand what is expressed in the SDFG. The generated SDFG is saved in `<icon4py_root_dir>/amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_<HASH>/program.sdfg`. To view the SDFG there is a VSCode plugin (`DaCe IOE`) or you can download it locally and open it in https://spcl.github.io/dace-webclient/.

- Installing the AMD HIP/ROCm packages for our UENV with Spack required various changes which are done [here](https://github.com/eth-cscs/alps-uenv/pull/273)
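
Because the cache directory names in the notes above contain a `<HASH>` that is not known in advance, a small helper can locate the generated artifacts. This is a sketch under stated assumptions: the function name `find_artifacts` is hypothetical, and the directory layout is taken from the paths listed in the notes, so adjust `cache_dir` if a different `GT4PY_BUILD_CACHE_DIR` was used.

```python
from pathlib import Path


def find_artifacts(
    icon4py_root,
    program="vertically_implicit_solver_at_predictor_step",
    cache_dir="amd_profiling_solver",
):
    """Collect generated HIP sources, assembly files and SDFGs for one GT4Py program."""
    cache = Path(icon4py_root) / cache_dir / ".gt4py_cache"
    found = {"hip_source": [], "assembly": [], "sdfg": []}
    # One directory per compiled variant, named <program>_<HASH>.
    for build_dir in sorted(cache.glob(f"{program}_*")):
        found["hip_source"] += sorted(build_dir.glob(f"src/cuda/{program}.cpp"))
        found["assembly"] += sorted(build_dir.glob("build/*gfx942.s"))
        found["sdfg"] += sorted(build_dir.glob("program.sdfg"))
    return found


if __name__ == "__main__":
    # Run from the icon4py root directory; prints nothing if the cache is empty.
    for kind, paths in find_artifacts(".").items():
        for path in paths:
            print(kind, path)
```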
36 changes: 36 additions & 0 deletions amd_scripts/benchmark_dycore.sh
@@ -0,0 +1,36 @@
#!/bin/bash
#SBATCH --job-name=dycore_granule_profile
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=mi300

# Go to the root of the icon4py repository to run the script from there
ICON4PY_GIT_ROOT=$(git rev-parse --show-toplevel)
cd "$ICON4PY_GIT_ROOT"

# Set necessary flags for compilation
source amd_scripts/setup_env.sh

source .venv/bin/activate

export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1"
export GT4PY_BUILD_CACHE_LIFETIME=persistent
export GT4PY_BUILD_CACHE_DIR=amd_profiling_granule
export GT4PY_COLLECT_METRICS_LEVEL=10
export GT4PY_ADD_GPU_TRACE_MARKERS="1"
export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage"

pytest -sv \
-m continuous_benchmarking \
-p no:tach \
--benchmark-only \
--benchmark-warmup=on \
--benchmark-warmup-iterations=30 \
--backend=dace_gpu \
--grid=icon_benchmark_regional \
Review comment (Contributor) on `--grid=icon_benchmark_regional \`:

Suggested change: replace `--grid=icon_benchmark_regional \` with `--grid=icon_benchmark_global \`.

Since global is our main target for now, maybe we can switch to that.

--benchmark-time-unit=ms \
--benchmark-min-rounds 100 \
model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[True-False]

python amd_scripts/read_gt4py_timers.py dycore_gt4py_program_metrics.json