diff --git a/AMD_INTRODUCTION.md b/AMD_INTRODUCTION.md
new file mode 100644
index 0000000000..ca4a408a1c
--- /dev/null
+++ b/AMD_INTRODUCTION.md
@@ -0,0 +1,204 @@
+# Icon4py performance on MI300
+
+## Quickstart
+
+```
+# Connect to Beverin (CSCS system with MI300A)
+ssh beverin.cscs.ch
+```
+
+On Beverin:
+```
+# Enter scratch directory
+cd $SCRATCH
+
+# Clone icon4py and check out the correct branch
+git clone git@github.com:C2SM/icon4py.git
+cd icon4py
+git checkout amd_profiling
+
+# Pull the correct `uenv` image. *!* NECESSARY ONLY ONCE *!*
+uenv image pull build::prgenv-gnu/25.12:2333839235
+
+# Start the uenv and mount the ROCm 7.1.0 environment. *!* This needs to be executed every time before running anything *!*
+uenv start --view default prgenv-gnu/25.12:2333839235
+
+# Install the necessary venv
+bash amd_scripts/install_icon4py_venv.sh
+
+# Source venv
+source .venv/bin/activate
+
+# Source other necessary environment variables
+source amd_scripts/setup_env.sh
+
+# Set GT4Py-related environment variables
+export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1"
+export GT4PY_BUILD_CACHE_LIFETIME=persistent
+export GT4PY_BUILD_CACHE_DIR=amd_profiling_granule
+export GT4PY_COLLECT_METRICS_LEVEL=10
+export GT4PY_DYCORE_ENABLE_METRICS="1"
+export GT4PY_ADD_GPU_TRACE_MARKERS="1"
+export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage"
+
+# Benchmark dycore
+pytest -sv \
+    -m continuous_benchmarking \
+    -p no:tach \
+    --benchmark-only \
+    --benchmark-warmup=on \
+    --benchmark-warmup-iterations=30 \
+    --backend=dace_gpu \
+    --grid=icon_benchmark_regional \
+    --benchmark-time-unit=ms \
+    --benchmark-min-rounds 100 \
+    model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[True-False]
+
+# Print GT4Py timers
+python print_gt4py_timers.py dycore_gt4py_program_metrics.json
+```
+
+For more information regarding benchmarking, read the 
[Benchmarking](#benchmarking) chapter.
+
+## Intro to icon4py and GT4Py
+
+In the following text we give an overview of [icon4py](https://github.com/C2SM/icon4py), [GT4Py](https://github.com/GridTools/gt4py) and [DaCe](https://github.com/spcl/dace), and of how they interact to compile our Python ICON implementation.
+
+### icon4py
+
+`icon4py` is a Python port of `ICON` implemented in the `GT4Py DSL`. Currently only certain parts of `ICON` are implemented in `icon4py`, the most important being the `dycore`, the `ICON` component that accounts for most of the execution time. For this reason we think it makes the most sense to focus on this component.
+The `icon4py` dycore implementation consists of ~20 `GT4Py Programs`, or stencils. Each of these programs consists of multiple GPU (CUDA or HIP) kernels and memory allocations/deallocations, while the full `icon4py` code additionally performs MPI/NCCL communication. For now we focus on single-node execution, so no communication takes place.
+
+### GT4Py
+
+`GT4Py` is a compilation framework that provides a DSL used as the frontend for writing stencil computations. As stated above, in `icon4py` this DSL is embedded in Python code.
+Here is an example of a `GT4Py Program` from `icon4py`: [vertically_implicit_solver_at_predictor_step](https://github.com/C2SM/icon4py/blob/e88b14d8be6eed814faf14c5e8a96aca6dfa991e/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/vertically_implicit_dycore_solver.py#L219).
+`GT4Py` supports multiple backends: `embedded` (numpy/JAX execution), `GTFN` (GridTools C++ implementation) and `DaCe`. At the moment the most efficient is `DaCe`, so we focus on this one only. The frontend code is lowered from the `GT4Py DSL` to CUDA/HIP code after numerous transformations in the `GT4Py IR (GTIR)` and then in `DaCe Stateful Dataflow Graphs (SDFG)`. The lowering from `GTIR` to a `DaCe SDFG` is done using the low-level `DaCe` API. 
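To give a flavor of the patterns these stencils express, here is a plain-`numpy` sketch of a neighbor reduction over an unstructured grid. The connectivity table and field below are invented for illustration; this is not the actual `GT4Py` API, which expresses the same idea declaratively:

```python
import numpy as np

# Hypothetical edge-to-cell connectivity of a tiny unstructured grid:
# each of the 3 edges has 2 neighboring cells (indices into the cell field).
edge_to_cell = np.array([[0, 1], [1, 2], [2, 0]])

# A cell-centered field with one value per cell and per vertical level.
num_levels = 4
cell_field = np.arange(3 * num_levels, dtype=np.float64).reshape(3, num_levels)

# Neighbor reduction: for every edge, sum the field over its neighboring
# cells, independently on each vertical level.
edge_field = cell_field[edge_to_cell].sum(axis=1)

print(edge_field.shape)  # (3, 4): one value per edge and vertical level
```

On the GPU backends such a reduction becomes a 2D map over edges and levels with indirect loads through the connectivity table, which is one reason memory layout matters so much for these kernels.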
+
+### DaCe
+
+`DaCe` is a programming framework that can take Python code and transform it into an SDFG, a representation on which it is easy to apply dataflow optimizations and achieve good performance on modern CPUs and GPUs. For more information on what SDFGs look like, see the following [link](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html).
+`DaCe` also includes a code generator from SDFG to C++, HIP and CUDA code. The generated HIP code is essentially hipified CUDA code, so there are no big differences between the code generated for CUDA and for HIP.
+
+## Benchmarking
+
+For the benchmarking we have focused on the `dycore` component of `icon4py`. Below we compare the runtimes of the different `GT4Py Programs` executed in it on an `MI300A` and on a `GH200 GPU`:
+
+```
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| GT4Py Program                                         | MI300A [persistent] | GH200 (s)   | GH200 speedup   |
+|                                                       | (s)                 |             | (MI300A/GH200)  |
++=======================================================+=====================+=============+=================+
+| compute_advection_in_horizontal_momentum              | 0.00440             | 0.000176    | 25.01           |
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| vertically_implicit_solver_at_corrector_step          | 0.00088             | 0.000578    | 1.53            |
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| vertically_implicit_solver_at_predictor_step          | 0.00080             | 0.000555    | 1.44            |
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| compute_rho_theta_pgrad_and_update_vn                 | 0.00074             | 0.000410    | 1.81            |
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| compute_horizontal_velocity_quantities_and_fluxes     | 0.00049             | 0.000329    | 1.51            | 
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| compute_perturbed_quantities_and_interpolation        | 0.00039             | 0.000265    | 1.47            |
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| compute_advection_in_corrector_vertical_momentum      | 0.00030             | 0.000212    | 1.43            |
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| compute_interpolation_and_nonhydro_buoy               | 0.00023             | 0.000137    | 1.71            |
++-------------------------------------------------------+---------------------+-------------+-----------------+
+| compute_hydrostatic_correction_term                   | 0.00003             | 0.000029    | 1.03            |
++-------------------------------------------------------+---------------------+-------------+-----------------+
+```
+
+**Warning** By default, some of the `GT4Py Programs` executed on the `MI300A` show a dramatic slowdown, and in all of them the standard deviation on the `MI300A` is much higher than on the `GH200`. We found that the source of the issue is a call to `hipMallocAsync`, which allocates the temporaries each program needs. This HIP API call has very high runtime variability and sometimes takes much longer (100x) to execute. For the `MI300A` results above we disabled these per-program allocations/deallocations in order to see the runtimes more clearly; however, this is not feasible for real simulations, where the temporary data of all `GT4Py Programs` cannot be preallocated because of the memory footprint. Since the `GH200` timings do include the memory allocations and deallocations, the above results should be taken with a grain of salt.
+
+We have not yet had time to look into the `compute_advection_in_horizontal_momentum` regression. 
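The persistent-memory workaround can be illustrated with a toy Python sketch: instead of allocating and freeing temporaries on every program call (the `hipMallocAsync` path), each buffer is allocated once and reused on subsequent calls. The pool class below is hypothetical and only illustrates the idea, not the actual DaCe mechanism:

```python
import numpy as np

class PersistentPool:
    """Toy stand-in for persistent temporaries: allocate once, reuse forever."""

    def __init__(self):
        self._buffers = {}

    def temporary(self, name, shape):
        # Allocate the buffer only on the first request; later calls return
        # the same array, avoiding a per-call allocation (cf. hipMallocAsync).
        if name not in self._buffers:
            self._buffers[name] = np.empty(shape, dtype=np.float64)
        return self._buffers[name]

pool = PersistentPool()
a = pool.temporary("z_tmp", (1024, 80))
b = pool.temporary("z_tmp", (1024, 80))
print(a is b)  # True: the second call reuses the first allocation
```

The trade-off noted above is the memory footprint: every program keeps its temporaries alive for the whole run, which does not scale to all ~20 dycore programs at once.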
+
+For both machines, the table above reports median runtimes over 100 iterations (excluding the first, slow one), measured with a C++ timer placed as close as possible to the kernel launches.
+
+What we find interesting to look into next is the performance of the kernels of a specific `GT4Py Program`.
+To that end, we selected one of the `GT4Py Programs` that takes most of the time in a production simulation and whose kernels cover several representative patterns: neighbor reductions, 2D maps and scans.
+This is the `vertically_implicit_solver_at_predictor_step` `GT4Py Program`; here is the comparison of its kernels:
+
+```
++------------------------------------------+--------------------------+--------------------------+---------------------------------------------------------------+
+| Kernel Name                              | MI300A Median Time (ns)  | GH200 Median Time (ns)   | Acceleration of GH200 over MI300A (MI300A time / GH200 time)  |
++------------------------------------------+--------------------------+--------------------------+---------------------------------------------------------------+
+| map_100_fieldop_1_0_0_514                | 240026                   | 125024                   | 1.92                                                          |
+| map_115_fieldop_1_0_0_518                | 208261                   | 113056                   | 1.84                                                          |
+| map_60_fieldop_0_0_504                   | 148626                   | 86624                    | 1.72                                                          |
+| map_0_fieldop_0_0_500                    | 66400                    | 31456                    | 2.11                                                          |
+| map_85_fieldop_0_0_506                   | 59680                    | 66336                    | 0.90                                                          |
+| map_31_fieldop_0_0_0_512                 | 46960                    | 28768                    | 1.63                                                          |
+| map_90_fieldop_0_0_508                   | 26960                    | 20832                    | 1.29                                                          |
+| map_91_fieldop_0_0_510                   | 8560                     | 3552                     | 2.41                                                          |
+| map_13_fieldop_0_0_498                   | 5600                     | 3744                     | 1.50                                                          |
+| map_100_fieldop_0_0_0_0_520              | 5560                     | 5472                     | 1.02                                                          |
+| map_115_fieldop_0_0_0_516                | 4560                     | 5184                     | 0.88                                                          |
+| map_35_fieldop_0_0_503                   | 3360                     | 1856                     | 1.81                                                          |
++------------------------------------------+--------------------------+--------------------------+---------------------------------------------------------------+
+```
+
+The runtimes of the individual kernels are collected using `nsys` and `rocprofv3`. 
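The aggregation applied to the timing samples (drop the first iteration, then report the median over the rest) can be sketched as follows; the sample values are invented:

```python
import statistics

# Hypothetical per-iteration runtimes in seconds; the first call is slow
# because it includes compilation and first-touch allocation costs.
samples = [0.0350, 0.00081, 0.00079, 0.00080, 0.00082, 0.00078]

steady_state = samples[1:]  # exclude the first, slow iteration
median = statistics.median(steady_state)
stdev = statistics.stdev(steady_state)

print(f"median={median:.5f}s stdev={stdev:.2e}s")
```

Reporting the median rather than the mean keeps an occasional slow `hipMallocAsync` call from dominating the comparison, while the standard deviation still exposes the variability.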
+
+The benchmarks were run on `Santis` (`GH200 GPU`) and `Beverin` (`MI300A GPU`) using the following uenv images:
+- GH200: `icon/25.2:v3` (CUDA 12.6)
+- MI300A: `build::prgenv-gnu/25.12:2333839235` (ROCm 7.1.0)
+
+To reproduce the benchmark results on `Beverin` you can follow the instructions below:
+
+```
+# Pull the correct `uenv` image. *!* NECESSARY ONLY ONCE *!*
+uenv image pull build::prgenv-gnu/25.12:2333839235
+
+# Start the uenv and mount the ROCm 7.1.0 environment. *!* This needs to be executed every time before running anything *!*
+uenv start --view default prgenv-gnu/25.12:2333839235
+
+# Run the whole `dycore` granule and gather the runtimes of the `GT4Py Programs`
+sbatch benchmark_dycore.sh
+# The script above generates a json file with the names of the `GT4Py Programs` and their runtimes. The first sample is always slow, so we exclude it from our analysis.
+# The following python script parses the json file and prints the runtimes in a readable form
+# python print_gt4py_timers.py dycore_gt4py_program_metrics.json # passing --csv will save them to a csv file
+
+# Run the `vertically_implicit_solver_at_predictor_step` GT4Py Program standalone. Notice the `GT4Py Timer Report` table printed by the first `pytest` invocation. The timers reported in this table are as close as possible to the kernel launches of the GT4Py Program. 
+# The following script benchmarks the solver, runs `rocprofv3` to collect a trace of it, and runs the `rocprof-compute` tool on all its kernels
+sbatch benchmark_solver.sh
+```
+
+The generated code for the results above can be found on `Beverin` in:
+
+`dycore`
+```
+/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/amd_profiling_solver_persistent_mem # GT4Py cache folder
+/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/slurm-247696.out # Slurm output
+```
+
+`solver`
+```
+/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/amd_profiling_solver_regional # GT4Py cache folder
+/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/slurm-247518.out # Slurm output
+```
+
+## Hackathon goals
+
+- Understand what the bottleneck is in our currently generated kernel code
+- Discuss what changes we can make, in the code generation, kernel configuration or memory layout, to address these bottlenecks and reach performance comparable with GH200
+- Determine what further code changes we need to make to exploit the full MI300A performance (shared memory, warp shuffling, etc.)
+- Fix any issues with the ROCm profilers and learn how to use them effectively
+
+## Notes
+
+- To understand the code, apart from the profiler analysis, there are the following sources:
+  1. Look at the generated HIP code for the `GT4Py Program` `vertically_implicit_solver_at_predictor_step` in `/amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_/src/cuda/vertically_implicit_solver_at_predictor_step.cpp`. The code is generated automatically by DaCe and is rather verbose. It would be good to get feedback on whether the generated code is in a good form for the HIP compiler to optimize.
+  2. 
Look at the generated assembly and HIP kernel characteristics (outputs of `-save-temps -Rpass-analysis=kernel-resource-usage`) in `/amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_/build/vertically_implicit_solver_at_predictor_step_cuda-hip-amdgcn-amd-amdhsa-gfx942.s`.
+  3. Look at the `icon4py` frontend code for `vertically_implicit_solver_at_predictor_step` [here](https://github.com/C2SM/icon4py/blob/e88b14d8be6eed814faf14c5e8a96aca6dfa991e/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/vertically_implicit_dycore_solver.py#L219).
+  4. Look at the SDFG generated by DaCe. This gives a nice overview of the computations and of the generated kernels. [The DaCe documentation](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html) can help you understand what is expressed in the SDFG. The generated SDFG is saved in `/amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_/program.sdfg`. To view the SDFG there is a VSCode plugin (`DaCe IOE`), or you can download it locally and open it in https://spcl.github.io/dace-webclient/.
+
+- Installing the AMD HIP/ROCm packages for our uenv with Spack required various changes, done [here](https://github.com/eth-cscs/alps-uenv/pull/273). It may be worth discussing with the packaging team how to streamline the Spack installation of some of these packages.
+
+- There are some TODOs in the scripts that mention some issues with the profilers. 
It would be great if you could help us fix them.
+
+- The kernel names may vary from execution to execution, so in some cases differences in the kernel names are expected
+
+- The provided scripts are meant as guidance and should be handled with care
diff --git a/amd_scripts/benchmark_dycore.sh b/amd_scripts/benchmark_dycore.sh
new file mode 100644
index 0000000000..d2680d779b
--- /dev/null
+++ b/amd_scripts/benchmark_dycore.sh
@@ -0,0 +1,56 @@
+#!/bin/bash
+#SBATCH --job-name=dycore_granule_profile
+#SBATCH --ntasks=1
+#SBATCH --time=08:00:00
+#SBATCH --gres=gpu:1
+#SBATCH --partition=mi300
+
+# Go to the root of the icon4py repository to run the script from there
+ICON4PY_GIT_ROOT=$(git rev-parse --show-toplevel)
+cd $ICON4PY_GIT_ROOT
+
+# Set necessary flags for compilation
+source amd_scripts/setup_env.sh
+
+source .venv/bin/activate
+
+export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1"
+export GT4PY_BUILD_CACHE_LIFETIME=persistent
+export GT4PY_BUILD_CACHE_DIR=amd_profiling_granule_regional
+export GT4PY_COLLECT_METRICS_LEVEL=10
+export GT4PY_ADD_GPU_TRACE_MARKERS="1"
+export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage"
+
+export ICON_GRID="icon_benchmark_regional" # TODO(CSCS): Fix `icon_benchmark_global` GPU memory issue: `Memory access fault by GPU node-4 (Agent handle: 0x5514890) on address 0x1463a8000000. Reason: Unknown. 
Failed to allocate file: Bad file descriptor`
+
+export DYCORE_GT4PY_PROGRAMS_TIMER_FILE="dycore_gt4py_program_metrics.json"
+
+rm ${DYCORE_GT4PY_PROGRAMS_TIMER_FILE} || true
+
+pytest -sv \
+    -m continuous_benchmarking \
+    -p no:tach \
+    --benchmark-only \
+    --benchmark-warmup=on \
+    --benchmark-warmup-iterations=30 \
+    --backend=dace_gpu \
+    --grid=${ICON_GRID} \
+    --benchmark-time-unit=ms \
+    --benchmark-min-rounds 100 \
+    model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[False-False]
+
+python amd_scripts/print_gt4py_timers.py ${DYCORE_GT4PY_PROGRAMS_TIMER_FILE}
+
+# TODO(AMD): The trace generated by the following command doesn't include the GPU activity. The Perfetto UI outputs warnings about import errors and data loss.
+# rocprofv3 --kernel-trace on --hip-trace on --marker-trace on --memory-copy-trace on --memory-allocation-trace on --output-format pftrace -o rocprofv3_${GT4PY_BUILD_CACHE_DIR} -- \
+#     $(which python3.12) -m pytest -sv \
+#     -m continuous_benchmarking \
+#     -p no:tach \
+#     --benchmark-only \
+#     --benchmark-warmup=on \
+#     --benchmark-warmup-iterations=30 \
+#     --backend=dace_gpu \
+#     --grid=${ICON_GRID} \
+#     --benchmark-time-unit=ms \
+#     --benchmark-min-rounds 10 \
+#     model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[False-False]
diff --git a/amd_scripts/benchmark_solver.sh b/amd_scripts/benchmark_solver.sh
new file mode 100644
index 0000000000..5cf17c0389
--- /dev/null
+++ b/amd_scripts/benchmark_solver.sh
@@ -0,0 +1,96 @@
+#!/bin/bash
+#SBATCH --job-name=solver_benchmark
+#SBATCH --ntasks=1
+#SBATCH --time=08:00:00
+#SBATCH --gres=gpu:1
+#SBATCH --partition=mi300
+
+# Go to the root of the icon4py repository to run the script from there
+ICON4PY_GIT_ROOT=$(git rev-parse --show-toplevel)
+cd $ICON4PY_GIT_ROOT
+
+# Set necessary flags for compilation
+source amd_scripts/setup_env.sh
+
+source .venv/bin/activate
+ 
+export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1" +export GT4PY_BUILD_CACHE_LIFETIME=persistent +export GT4PY_BUILD_CACHE_DIR=amd_profiling_solver_regional +export GT4PY_COLLECT_METRICS_LEVEL=10 +export GT4PY_ADD_GPU_TRACE_MARKERS="1" +export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=3 +export ICON4PY_STENCIL_TEST_ITERATIONS=10 +export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=100 +export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage" + +export ICON_GRID="icon_benchmark_regional" # TODO(CSCS): Check also `icon_benchmark_global` when the dycore GPU memory issue is fixed + +# Run the benchmark and collect the runtime of the whole GT4Py program (see `GT4Py Timer Report` in the output) +# The compiled GT4Py programs will be cached in the directory specified by `GT4PY_BUILD_CACHE_DIR` to be reused for running the profilers afterwards +pytest -sv \ + -m continuous_benchmarking \ + -p no:tach \ + --backend=dace_gpu \ + --grid=${ICON_GRID} \ + model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \ + -k "test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]" + +# Run the benchmark and collect its trace +# TODO(AMD): Generating `rocpd` output fails with segfaults +export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=30 +export ICON4PY_STENCIL_TEST_ITERATIONS=10 +export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=100 +# Can also add `--att` for thread tracing +rocprofv3 --kernel-trace on --hip-trace on --marker-trace on --memory-copy-trace on --memory-allocation-trace on --output-format pftrace -o rocprofv3_${GT4PY_BUILD_CACHE_DIR} -- \ + $(which python3.12) -m pytest -sv \ + -m continuous_benchmarking \ + -p no:tach \ + --backend=dace_gpu \ + --grid=${ICON_GRID} \ + model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \ + -k 
"test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]" +# Alternatively, export the data to `csv` and print kernel runtimes with the following command +# python amd_scripts/median_rocprof_csv.py rocprofv3_${GT4PY_BUILD_CACHE_DIR}_kernel_trace.csv + +# Get the kernel names of the GT4Py program so that we can filter them with rocprof-compute +LAST_COMPILED_DIRECTORY=$(realpath $(ls -td ${GT4PY_BUILD_CACHE_DIR}/.gt4py_cache/*/ | head -1)) +echo "# Last compiled GT4Py directory: $LAST_COMPILED_DIRECTORY" +LAST_COMPILED_KERNEL_NAMES=$(grep -r -e "__global__ void.*map.*(" ${LAST_COMPILED_DIRECTORY}/src/cuda -o | sed 's/.*\s\([a-zA-Z_][a-zA-Z0-9_]*\)(.*/\1/') +echo "# Last compiled GT4Py kernel names:" +echo "$LAST_COMPILED_KERNEL_NAMES" +ROCPROF_COMPUTE_KERNEL_NAME_FILTER="-k $LAST_COMPILED_KERNEL_NAMES" + +# Run rocprof-compute filtering the kernels of interest +export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=0 +export ICON4PY_STENCIL_TEST_ITERATIONS=1 +export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=1 +rocprof-compute profile --name rcu_${GT4PY_BUILD_CACHE_DIR} ${ROCPROF_COMPUTE_KERNEL_NAME_FILTER} --format-rocprof-output rocpd --kernel-names -R FP64 -- \ + $(which python3.12) -m pytest -sv \ + -m continuous_benchmarking \ + -p no:tach \ + --backend=dace_gpu \ + --grid=${ICON_GRID} \ + model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \ + -k "test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]" + +# TODO(AMD): Roofline generation fails with +# File "/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/roofline.py", line 998, in standalone_roofline +# self.empirical_roofline(ret_df=t_df) +# File 
"/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/utils/logger.py", line 66, in wrap_function +# result = function(*args, **kwargs) +# ^^^^^^^^^^^^^^^^^^^^^^^^^ +# File "/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/roofline.py", line 463, in empirical_roofline +# flops_figure.write_image( +# File "/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/.venv/lib/python3.12/site-packages/plotly/basedatatypes.py", line 3895, in write_image +# return pio.write_image(self, *args, **kwargs) +# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +# File "/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/.venv/lib/python3.12/site-packages/plotly/io/_kaleido.py", line 555, in write_image +# path.write_bytes(img_data) +# File "/user-environment/linux-zen3/python-3.12.12-jpkfwhqo6njvbpw7gjcs22qkvxwexnv5/lib/python3.12/pathlib.py", line 1036, in write_bytes +# with self.open(mode='wb') as f: +# ^^^^^^^^^^^^^^^^^^^^ +# File "/user-environment/linux-zen3/python-3.12.12-jpkfwhqo6njvbpw7gjcs22qkvxwexnv5/lib/python3.12/pathlib.py", line 1013, in open +# return io.open(self, mode, buffering, encoding, errors, newline) +# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +# OSError: [Errno 36] File name too long: '/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/workloads/rcu_amd_profiling_solver/MI300A_A1/empirRoof_gpu-0_FP64_map_0_fieldop_0_0_500_map_100_fieldop_0_0_0_514_map_100_fieldop_1_0_0_0_520_map_115_fieldop_0_0_0_516_map_115_fieldop_1_0_0_518_map_13_fieldop_0_0_498_map_31_fieldop_0_0_0_512_map_35_fieldop_0_0_503_map_60_fieldop_0_0_504_map_85_fieldop_0_0_506_map_90_fieldop_0_0_508_map_91_fieldop_0_0_510.pdf' diff --git a/amd_scripts/install_icon4py_venv.sh b/amd_scripts/install_icon4py_venv.sh new file mode 100644 index 0000000000..4c86fc481d --- /dev/null +++ b/amd_scripts/install_icon4py_venv.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +set -e + 
+date
+
+# Go to the root of the icon4py repository to run the installation from there
+ICON4PY_GIT_ROOT=$(git rev-parse --show-toplevel)
+cd $ICON4PY_GIT_ROOT
+
+# Set necessary flags for compilation
+source $ICON4PY_GIT_ROOT/amd_scripts/setup_env.sh
+
+# Install uv locally
+export PATH="$PWD/bin:$PATH"
+if [ ! -x "$PWD/bin/uv" ]; then
+    curl -LsSf https://astral.sh/uv/install.sh | UV_UNMANAGED_INSTALL="$PWD/bin" sh
+else
+    echo "# uv already installed at $PWD/bin/uv"
+fi
+
+# Install icon4py, gt4py, DaCe and other basic dependencies using uv
+uv sync --extra rocm7_0 --python $(which python3.12)
+
+# Activate virtual environment
+source .venv/bin/activate
+
+# Install the requirements for rocprofiler-compute so we can run the profiler from the same environment
+uv pip install -r /user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/requirements.txt
+
+echo "# install done"
+date
diff --git a/amd_scripts/median_rocprof_csv.py b/amd_scripts/median_rocprof_csv.py
new file mode 100644
index 0000000000..146e07b705
--- /dev/null
+++ b/amd_scripts/median_rocprof_csv.py
@@ -0,0 +1,30 @@
+#!/usr/bin/env python3
+import csv
+import sys
+import statistics
+
+if len(sys.argv) < 2:
+    print("Usage: python median_rocprof_csv.py <kernel_trace.csv>", file=sys.stderr)
+    sys.exit(1)
+
+path = sys.argv[1]
+kernels = {}
+
+with open(path, newline="") as f:
+    reader = csv.DictReader(f)
+    for row in reader:
+        name = row["Kernel_Name"]
+        if name.startswith("map"):
+            name = name.split("(")[0]
+            if name not in kernels:
+                kernels[name] = []
+            duration = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
+            kernels[name].append(duration)
+
+if not kernels:
+    print("No kernels starting with 'map' found", file=sys.stderr)
+else:
+    for kernel_name, durations in sorted(kernels.items()):
+        median = statistics.median(durations)
+        stdev = statistics.stdev(durations) if len(durations) > 1 else 0
+        print(f"{kernel_name},{median:.0f},{stdev:.0f}")
diff --git 
a/amd_scripts/print_gt4py_timers.py b/amd_scripts/print_gt4py_timers.py
new file mode 100644
index 0000000000..353c32d6fa
--- /dev/null
+++ b/amd_scripts/print_gt4py_timers.py
@@ -0,0 +1,31 @@
+import json
+import numpy
+import csv
+import sys
+
+if len(sys.argv) < 2:
+    print("Usage: python print_gt4py_timers.py <metrics.json> [--csv]")
+    sys.exit(1)
+
+input_file = sys.argv[1]
+with open(input_file) as f:
+    data = json.load(f)
+
+if len(sys.argv) > 2 and sys.argv[2] == '--csv':
+    with open('output.csv', 'w', newline='') as f:
+        writer = csv.writer(f)
+        writer.writerow(['Function', 'Median', 'Std'])
+        for k, v in data.items():
+            if v.get('metrics').get('compute'):
+                arr = numpy.array(v.get('metrics').get('compute')[1:])
+                if len(arr) > 0:
+                    median = numpy.median(arr)
+                    if not numpy.isnan(median):
+                        writer.writerow([k.split('<')[0], median, arr.std()])
+else:
+    for k, v in data.items():
+        if v.get('metrics').get('compute'):
+            arr = numpy.array(v.get('metrics').get('compute')[1:])
+            if len(arr) > 0:
+                median = numpy.median(arr)
+                if not numpy.isnan(median):
+                    print(f"{k.split('<')[0]}: Median = {median}, Std = {arr.std()}")
diff --git a/amd_scripts/setup_env.sh b/amd_scripts/setup_env.sh
new file mode 100644
index 0000000000..4f5656003d
--- /dev/null
+++ b/amd_scripts/setup_env.sh
@@ -0,0 +1,14 @@
+export CC="$(which gcc)"
+export MPICH_CC="$(which gcc)"
+export CXX="$(which g++)"
+export MPICH_CXX="$(which g++)"
+export HUGETLB_ELFMAP="no"
+export HUGETLB_MORECORE="no"
+export PYTHONOPTIMIZE="2"
+export HCC_AMDGPU_TARGET="gfx942"
+export ROCM_HOME="/user-environment/env/default"
+export HIPCC=$(which hipcc)
+export ROCM_VERSION="7.1.0"
+export LD_LIBRARY_PATH=/user-environment/linux-zen3/rocprofiler-dev-7.1.0-i7wbbbgrx7jjp4o2xroyj5j263dkzplv/lib:$LD_LIBRARY_PATH
+# export LD_LIBRARY_PATH=rocprof-trace-decoder-manylinux-2.28-0.1.6-Linux:$LD_LIBRARY_PATH # TODO(iomaganaris): Add package to uenv properly
+export LD_PRELOAD=/user-environment/env/default/lib/libomp.so:$LD_PRELOAD
diff --git 
a/model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py b/model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py index 3224ce7065..d945d948b7 100644 --- a/model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py +++ b/model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py @@ -9,9 +9,11 @@ from __future__ import annotations import functools +import os from typing import TYPE_CHECKING, Any import gt4py.next as gtx +from gt4py.next.instrumentation import metrics as gtx_metrics import pytest @@ -346,3 +348,6 @@ def test_benchmark_solve_nonhydro( at_first_substep=at_first_substep, at_last_substep=at_last_substep, ) + + if gtx_metrics.sources: + gtx_metrics.dump_json("dycore_gt4py_program_metrics.json") diff --git a/model/common/pyproject.toml b/model/common/pyproject.toml index d794adcb6a..b0220b2eb1 100644 --- a/model/common/pyproject.toml +++ b/model/common/pyproject.toml @@ -42,6 +42,7 @@ version = "0.0.6" all = ["icon4py-common[distributed,io]"] cuda11 = ['cupy-cuda11x>=13.0', 'gt4py[cuda11]'] cuda12 = ['cupy-cuda12x>=13.0', 'gt4py[cuda12]'] +rocm7_0 = ['amd-cupy>=13.0'] # TODO(havogt): add gt4py[rocm7_0] once available distributed = ["ghex>=0.5.0", "mpi4py>=3.1.5"] io = [ # external dependencies diff --git a/model/common/src/icon4py/model/common/model_options.py b/model/common/src/icon4py/model/common/model_options.py index 37ff480746..6269e06a83 100644 --- a/model/common/src/icon4py/model/common/model_options.py +++ b/model/common/src/icon4py/model/common/model_options.py @@ -52,12 +52,15 @@ def get_dace_options( # due to it falling into a less optimized code generation (on santis). if program_name == "compute_rho_theta_pgrad_and_update_vn": backend_descriptor["use_zero_origin"] = True + # TODO(AMD): For now disable problematic `hipMallocAsync` calls on each GT4Py Program call that have high runtime variability. 
+    # Needs to be fixed for realistic simulations due to increased memory footprint of persistent memory.
+    if backend_descriptor["device"] == model_backends.DeviceType.ROCM:
+        optimization_args["gpu_memory_pool"] = False
+        optimization_args["make_persistent"] = True
     if program_name == "graupel_run":
         backend_descriptor["use_zero_origin"] = True
         optimization_args["fuse_tasklets"] = True
         optimization_args["gpu_maxnreg"] = 128
-        optimization_args["gpu_memory_pool"] = False
-        optimization_args["make_persistent"] = True
     if optimization_hooks:
         optimization_args["optimization_hooks"] = optimization_hooks
     if optimization_args:
diff --git a/model/testing/src/icon4py/model/testing/stencil_tests.py b/model/testing/src/icon4py/model/testing/stencil_tests.py
index 027a0ac5b1..a10bdc3325 100644
--- a/model/testing/src/icon4py/model/testing/stencil_tests.py
+++ b/model/testing/src/icon4py/model/testing/stencil_tests.py
@@ -133,9 +133,8 @@ def test_and_benchmark(
     # Get the pool key necessary to find the right metrics key. There should be only one compiled program in _configured_program
     pool_key = next(iter(compiled_programs.keys()))
     # Get the metrics key from the pool key to read the corresponding metrics
-    metrics_key = _configured_program._compiled_programs._metrics_key_from_pool_key(
-        pool_key
-    )
+    compiled_programs_root = _configured_program._compiled_programs.root
+    metrics_key = f"{compiled_programs_root[0]}<{compiled_programs_root[1]}>[{hash(pool_key)}]"
     metrics_data = gtx_metrics.sources
     compute_samples = metrics_data[metrics_key].metrics["compute"].samples
     # exclude warmup iterations, one extra iteration for calibrating pytest-benchmark and one for validation (if executed)
diff --git a/pyproject.toml b/pyproject.toml
index e349356eb7..bffb3b238c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -97,6 +97,7 @@ version = "0.0.6"
 all = ["icon4py[distributed,fortran,io,testing,profiling]"]
 cuda11 = ["icon4py-common[cuda11]"]
 cuda12 = ["icon4py-common[cuda12]"]
+rocm7_0 = ["icon4py-common[rocm7_0]"]
 distributed = ["icon4py-common[distributed]"]
 fortran = ["icon4py-tools>=0.0.6"]
 io = ["icon4py-common[io]"]
@@ -359,11 +360,17 @@ explicit = true
 name = 'gridtools'
 url = 'https://gridtools.github.io/pypi/'
 
+[[tool.uv.index]]
+explicit = true
+name = 'amd'
+url = 'https://pypi.amd.com/rocm-7.0.2/simple'
+
 [tool.uv.sources]
 dace = {index = "gridtools"}
 ghex = {git = "https://github.com/msimberg/GHEX.git", branch = "async-mpi"}
-# gt4py = {git = "https://github.com/GridTools/gt4py", branch = "main"}
+gt4py = {git = "https://github.com/GridTools/gt4py", branch = "amd_profiling_staging"}
 # gt4py = {index = "test.pypi"}
+amd-cupy = {index = "amd" }
 icon4py-atmosphere-advection = {workspace = true}
 icon4py-atmosphere-diffusion = {workspace = true}
 icon4py-atmosphere-dycore = {workspace = true}
diff --git a/uv.lock b/uv.lock
index 4b38bfa69d..fde03389c4 100644
--- a/uv.lock
+++ b/uv.lock
@@ -36,6 +36,20 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/32/34/d4e1c02d3bee589efb5dfa17f88ea08bdb3e3eac12bc475462aec52ed223/alabaster-0.7.16-py3-none-any.whl", hash = "sha256:b46733c07dce03ae4e150330b975c75737fa60f0a7c591b6c8bf4928a28e2c92", size = 13511 },
 ]
 
+[[package]]
+name = "amd-cupy"
+version = "13.5.1"
+source = { registry = "https://pypi.amd.com/rocm-7.0.2/simple" }
+dependencies = [
+    { name = "fastrlock" },
+    { name = "numpy" },
+]
+wheels = [
+    { url = "https://pypi.amd.com/rocm-7.0.2/packages/amd-cupy/amd_cupy-13.5.1-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:eca984c7b8176eecaff0dd84504b322828bedd40c177d736753295e8a4b672de" },
+    { url = "https://pypi.amd.com/rocm-7.0.2/packages/amd-cupy/amd_cupy-13.5.1-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:468ca95416f57d5bbf6663ad8ca69a6ac46b4a34166833f01e5535068fa1b4e8" },
+    { url = "https://pypi.amd.com/rocm-7.0.2/packages/amd-cupy/amd_cupy-13.5.1-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:de3138281e2711e06efaf49a31310d0d4824998e18d43e13e288a0e52ca75ec0" },
+]
+
 [[package]]
 name = "annotated-types"
 version = "0.7.0"
@@ -1388,13 +1402,13 @@ name = "gridtools-cpp"
 version = "2.3.9"
 source = { registry = "https://pypi.org/simple" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/b6/8f/08ae062f2b7714c2753faeeead091fc5aa6344ec919ff63d242b95990573/gridtools_cpp-2.3.9-py3-none-any.whl", hash = "sha256:e4deefd804670e101083df9eb27d6b454e3d39ae903b4f7b043846a181452286", size = 1044245 },
+    { url = "https://files.pythonhosted.org/packages/b6/8f/08ae062f2b7714c2753faeeead091fc5aa6344ec919ff63d242b95990573/gridtools_cpp-2.3.9-py3-none-any.whl", hash = "sha256:e4deefd804670e101083df9eb27d6b454e3d39ae903b4f7b043846a181452286", size = 1044245, upload-time = "2025-04-11T13:55:39.966Z" },
 ]
 
 [[package]]
 name = "gt4py"
-version = "1.1.4"
-source = { registry = "https://pypi.org/simple" }
+version = "1.1.4.post6+d531967e"
+source = { git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging#d531967e28d79cc161cad91f215f6480a8e0d9d7" }
 dependencies = [
     { name = "attrs" },
     { name = "black" },
@@ -1424,10 +1438,6 @@ dependencies = [
     { name = "versioningit" },
     { name = "xxhash" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/37/cc/f8f2c086324d0eb75a0fc0a68635693eab9ed7339ac9021c0cd0ecf88fc8/gt4py-1.1.4.tar.gz", hash = "sha256:f9200185dfea5690385ac6afabf54ef05364ac7276ddc80a470276439ed65028", size = 789822 }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/c2/76/f392e26f8a6304468f499bd93fbb267d1013f66a66b10f1df267b33a3b6a/gt4py-1.1.4-py3-none-any.whl", hash = "sha256:0a2906372b6d73784b486a116dcb13cd7ffc34af4571256daad0081734732519", size = 999302 },
-]
 
 [package.optional-dependencies]
 cuda11 = [
@@ -1553,6 +1563,9 @@ io = [
 profiling = [
     { name = "viztracer" },
 ]
+rocm7-0 = [
+    { name = "icon4py-common", extra = ["rocm7-0"] },
+]
 testing = [
     { name = "icon4py-testing" },
 ]
@@ -1648,13 +1661,14 @@ requires-dist = [
     { name = "icon4py-common", extras = ["cuda12"], marker = "extra == 'cuda12'", editable = "model/common" },
     { name = "icon4py-common", extras = ["distributed"], marker = "extra == 'distributed'", editable = "model/common" },
     { name = "icon4py-common", extras = ["io"], marker = "extra == 'io'", editable = "model/common" },
+    { name = "icon4py-common", extras = ["rocm7-0"], marker = "extra == 'rocm7-0'", editable = "model/common" },
     { name = "icon4py-driver", editable = "model/driver" },
     { name = "icon4py-standalone-driver", editable = "model/standalone_driver" },
     { name = "icon4py-testing", marker = "extra == 'testing'", editable = "model/testing" },
     { name = "icon4py-tools", marker = "extra == 'fortran'", editable = "tools" },
     { name = "viztracer", marker = "extra == 'profiling'", specifier = ">=1.1.0" },
 ]
-provides-extras = ["all", "cuda11", "cuda12", "distributed", "fortran", "io", "profiling", "testing"]
+provides-extras = ["all", "cuda11", "cuda12", "rocm7-0", "distributed", "fortran", "io", "profiling", "testing"]
 
 [package.metadata.requires-dev]
 build = [
@@ -1746,7 +1760,7 @@ dependencies = [
 
 [package.metadata]
 requires-dist = [
-    { name = "gt4py", specifier = "==1.1.4" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-common", editable = "model/common" },
     { name = "packaging", specifier = ">=20.0" },
 ]
@@ -1763,7 +1777,7 @@ dependencies = [
 
 [package.metadata]
 requires-dist = [
-    { name = "gt4py", specifier = "==1.1.4" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-common", editable = "model/common" },
     { name = "packaging", specifier = ">=20.0" },
 ]
@@ -1780,7 +1794,7 @@ dependencies = [
 
 [package.metadata]
 requires-dist = [
-    { name = "gt4py", specifier = "==1.1.4" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-common", editable = "model/common" },
     { name = "packaging", specifier = ">=20.0" },
 ]
@@ -1797,7 +1811,7 @@ dependencies = [
 
 [package.metadata]
 requires-dist = [
-    { name = "gt4py", specifier = "==1.1.4" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-common", editable = "model/common" },
     { name = "packaging", specifier = ">=20.0" },
 ]
@@ -1815,7 +1829,7 @@ dependencies = [
 
 [package.metadata]
 requires-dist = [
-    { name = "gt4py", specifier = "==1.1.4" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-common", extras = ["io"], editable = "model/common" },
     { name = "numpy", specifier = ">=1.23.3" },
     { name = "packaging", specifier = ">=20.0" },
@@ -1871,9 +1885,13 @@ io = [
     { name = "uxarray" },
     { name = "xarray", extra = ["complete"] },
 ]
+rocm7-0 = [
+    { name = "amd-cupy" },
+]
 
 [package.metadata]
 requires-dist = [
+    { name = "amd-cupy", marker = "extra == 'rocm7-0'", specifier = ">=13.0", index = "https://pypi.amd.com/rocm-7.0.2/simple" },
     { name = "cartopy", marker = "extra == 'io'", specifier = ">=0.22.0" },
     { name = "cftime", marker = "extra == 'io'", specifier = ">=1.6.3" },
     { name = "cupy-cuda11x", marker = "extra == 'cuda11'", specifier = ">=13.0" },
@@ -1881,9 +1899,9 @@ requires-dist = [
     { name = "dace", specifier = "==43!2026.2.12", index = "https://gridtools.github.io/pypi/" },
     { name = "datashader", marker = "extra == 'io'", specifier = ">=0.16.1" },
     { name = "ghex", marker = "extra == 'distributed'", git = "https://github.com/msimberg/GHEX.git?branch=async-mpi" },
-    { name = "gt4py", specifier = "==1.1.4" },
-    { name = "gt4py", extras = ["cuda11"], marker = "extra == 'cuda11'" },
-    { name = "gt4py", extras = ["cuda12"], marker = "extra == 'cuda12'" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
+    { name = "gt4py", extras = ["cuda11"], marker = "extra == 'cuda11'", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
+    { name = "gt4py", extras = ["cuda12"], marker = "extra == 'cuda12'", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "holoviews", marker = "extra == 'io'", specifier = ">=1.16.0" },
     { name = "icon4py-common", extras = ["distributed", "io"], marker = "extra == 'all'", editable = "model/common" },
     { name = "mpi4py", marker = "extra == 'distributed'", specifier = ">=3.1.5" },
@@ -1897,7 +1915,7 @@ requires-dist = [
     { name = "uxarray", marker = "extra == 'io'", specifier = "==2024.3.0" },
     { name = "xarray", extras = ["complete"], marker = "extra == 'io'", specifier = ">=2024.3.0" },
 ]
-provides-extras = ["all", "cuda11", "cuda12", "distributed", "io"]
+provides-extras = ["all", "cuda11", "cuda12", "rocm7-0", "distributed", "io"]
 
 [[package]]
 name = "icon4py-driver"
@@ -1919,7 +1937,7 @@ dependencies = [
 requires-dist = [
     { name = "click", specifier = ">=8.0.1" },
     { name = "devtools", specifier = ">=0.12" },
-    { name = "gt4py", specifier = "==1.1.4" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-atmosphere-diffusion", editable = "model/atmosphere/diffusion" },
     { name = "icon4py-atmosphere-dycore", editable = "model/atmosphere/dycore" },
     { name = "icon4py-common", editable = "model/common" },
@@ -1947,7 +1965,7 @@ dependencies = [
 [package.metadata]
 requires-dist = [
     { name = "devtools", specifier = ">=0.12" },
-    { name = "gt4py", specifier = "==1.1.4" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-atmosphere-diffusion", editable = "model/atmosphere/diffusion" },
     { name = "icon4py-atmosphere-dycore", editable = "model/atmosphere/dycore" },
     { name = "icon4py-common", editable = "model/common" },
@@ -1976,7 +1994,7 @@ dependencies = [
 [package.metadata]
 requires-dist = [
     { name = "filelock", specifier = ">=3.18.0" },
-    { name = "gt4py", specifier = "==1.1.4" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-common", extras = ["io"], editable = "model/common" },
     { name = "numpy", specifier = ">=1.23.3" },
     { name = "packaging", specifier = ">=20.0" },
@@ -2025,9 +2043,9 @@ requires-dist = [
     { name = "cupy-cuda11x", marker = "extra == 'cuda11'", specifier = ">=13.0" },
     { name = "cupy-cuda12x", marker = "extra == 'cuda12'", specifier = ">=13.0" },
     { name = "fprettify", specifier = ">=0.3.7" },
-    { name = "gt4py", specifier = "==1.1.4" },
-    { name = "gt4py", extras = ["cuda11"], marker = "extra == 'cuda11'" },
-    { name = "gt4py", extras = ["cuda12"], marker = "extra == 'cuda12'" },
+    { name = "gt4py", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
+    { name = "gt4py", extras = ["cuda11"], marker = "extra == 'cuda11'", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
+    { name = "gt4py", extras = ["cuda12"], marker = "extra == 'cuda12'", git = "https://github.com/GridTools/gt4py?branch=amd_profiling_staging" },
     { name = "icon4py-atmosphere-advection", editable = "model/atmosphere/advection" },
     { name = "icon4py-atmosphere-diffusion", editable = "model/atmosphere/diffusion" },
     { name = "icon4py-atmosphere-dycore", editable = "model/atmosphere/dycore" },