# Notes and scripts for AMD profiling of dycore (#1047)
Draft: iomaganaris wants to merge 51 commits from `amd_profiling` into `main`.
Commits (51):

- `be4cee4` update gt4py version (edopao)
- `6ecff32` switch gt4py branch (edopao)
- `1c9c744` update uv lock (edopao)
- `a1e753f` edit import metrics (edopao)
- `b45b9b1` switch gt4py branch (edopao)
- `517d122` edit import metrics (edopao)
- `672b4f0` edit import metrics (edopao)
- `9b2662d` Merge branch 'main' into update_dace_version (iomaganaris)
- `532c125` Update DaCe version (iomaganaris)
- `991b6b8` Update the gt4py commit (iomaganaris)
- `f194d83` Initial amd notes and scripts (iomaganaris)
- `1eb4708` Pre-compilation fix with_backend (havogt)
- `30fe86c` Fixes to the notes (iomaganaris)
- `4d13d82` Additional comments in the scripts (iomaganaris)
- `81e7a24` Fix gtx_metrics (iomaganaris)
- `47e5e48` Clean up setup script (iomaganaris)
- `cfc5d89` Move scripts in amd_scripts and renamed instructions' file (iomaganaris)
- `adae364` Added quickstart guide (iomaganaris)
- `d7a6aa2` Added goals section (iomaganaris)
- `8ed9403` Added note about scratch directory (iomaganaris)
- `ffc0d51` Use revised `with_compilation_option` naming (tehrengruber)
- `6ded3a9` Merge remote-tracking branch 'origin/update_dace_version_pre_compile_…' (iomaganaris)
- `634ddfe` Cleaned up scripts (iomaganaris)
- `31271fe` Edited notes of instructions (iomaganaris)
- `f450589` Fix GT4PY_BUILD_CACHE_DIR in solver script (iomaganaris)
- `aa13236` Update gt4py branch to fix gtir indeterminism (iomaganaris)
- `1016363` Update branch in pyproject.toml as well (iomaganaris)
- `290c4d0` GT4Py 1.1.4: Pre-compilation fix with_backend (#1048) (havogt)
- `42cd8e4` update gt4py version (havogt)
- `acb1d3b` Merge remote-tracking branch 'upstream/main' into update_dace_version (havogt)
- `f37daab` Use gt4py main branch (iomaganaris)
- `2ae0c61` Updated comment in the benchmark script (iomaganaris)
- `17b41d4` fix typo (havogt)
- `126090e` more typos and fix test (havogt)
- `4c19ce5` Fix print_gt4py_timers script (iomaganaris)
- `e235735` Merge remote-tracking branch 'origin/update_dace_version' into amd_pr… (iomaganaris)
- `7650867` Add rocm7_0 extra (havogt)
- `15cf58d` add missing uv.lock (havogt)
- `def7749` Updated text regarding very slow kernels (iomaganaris)
- `c5b8669` Added profiling of whole dycore with rocprofv3 (iomaganaris)
- `c463434` Mention in the Notes about the TODOs (iomaganaris)
- `cb43ccc` Use gt4py amd_staging_branch that includes fix_indeterministic_get_cl… (iomaganaris)
- `e8f4142` Updated to the introduction notes (iomaganaris)
- `1326d6e` Refactor scripts for setting the benchmarked grid (iomaganaris)
- `e6a5c9f` Updated kernel times in the md file (iomaganaris)
- `e87273f` Commented out problematic profiling command for dycore (iomaganaris)
- `e12a182` Updated results with persistent memory (iomaganaris)
- `f2e3c56` Mention new uenv that enables thread tracing with rocprofv3 (iomaganaris)
- `6029f6c` Merge remote-tracking branch 'origin/main' into amd_profiling (iomaganaris)
- `aa5a195` Fix metrics discovery after new gt4py v1.1.4 release (iomaganaris)
- `0a5d2d2` Update uv.lock with gt4py version (iomaganaris)
# Icon4py performance on MI300

## Intro to icon4py and GT4Py

In the following text we will give an overview of [icon4py](https://github.com/C2SM/icon4py), [GT4Py](https://github.com/GridTools/gt4py) and [DaCe](https://github.com/spcl/dace) and how they interact to compile our Python ICON implementation.

### icon4py

`icon4py` is a Python port of `ICON` implemented using the GT4Py DSL. Currently only certain parts of `ICON` are implemented in `icon4py`, the most important being the `dycore`, which is the `ICON` component that takes most of the execution time. For this reason we think it makes most sense to focus on this component.
The `icon4py` dycore implementation consists of ~20 `GT4Py Programs`, or stencils. Each of these programs comprises multiple GPU (CUDA or HIP) kernels and memory allocations/deallocations, while the full `icon4py` code also performs MPI/NCCL communication. For now we will focus on single-node execution, so no communication is conducted.

### GT4Py

`GT4Py` is a compilation framework that provides a DSL used as the frontend to write the stencil computations. As stated above, in `icon4py` this DSL is embedded in Python code.
Here is an example of a `GT4Py Program` from `icon4py`: [vertically_implicit_solver_at_predictor_step](https://github.com/C2SM/icon4py/blob/e88b14d8be6eed814faf14c5e8a96aca6dfa991e/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/vertically_implicit_dycore_solver.py#L219).
`GT4Py` supports multiple backends: `embedded` (numpy/JAX execution), `GTFN` (GridTools C++ implementation) and `DaCe`. At the moment the most efficient one is `DaCe`, so we will focus on it exclusively. The code from the frontend is lowered from the GT4Py DSL to CUDA/HIP code after numerous transformations, first through the `GT4Py IR (GTIR)` and then through `DaCe Stateful Dataflow Graphs (SDFGs)`. The lowering from `GTIR` to a `DaCe SDFG` is done using the low-level `DaCe` API.

### DaCe

`DaCe` is a programming framework that can take Python code and transform it into an SDFG, a representation to which dataflow optimizations can easily be applied in order to achieve good performance on modern CPUs and GPUs. For more information on what SDFGs look like, see the following [link](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html).
`DaCe` also includes a code generator from SDFG to C++, HIP and CUDA code. The generated HIP code is basically hipified CUDA code, so there are no big differences between the generated code for CUDA and HIP.

## Benchmarking

For the benchmarking we have focused on the `dycore` component of `icon4py`. Below we compare the runtimes of the different `GT4Py Programs` executed in it on an `MI300A` and a `GH200` GPU:

| GT4Py Programs | MI300A Time (s) | GH200 Time (s) | Acceleration of GH200 over MI300A (MI300A time / GH200 time) |
|---|---|---|---|
| compute_diagnostics_from_normal_wind | 0.000268 | 0.000150 | 1.79 |
| compute_advection_in_predictor_vertical_momentum | 0.000195 | 0.000129 | 1.51 |
| compute_advection_in_horizontal_momentum | 0.004871 | 0.000174 | 27.98 |
| compute_perturbed_quantities_and_interpolation | 0.000433 | 0.000255 | 1.70 |
| compute_hydrostatic_correction_term | 0.000034 | 0.000026 | 1.30 |
| compute_rho_theta_pgrad_and_update_vn | 0.105237 | 0.000404 | 260.40 |
| compute_horizontal_velocity_quantities_and_fluxes | 0.000562 | 0.000324 | 1.73 |
| vertically_implicit_solver_at_predictor_step | 0.011691 | 0.000601 | 19.46 |
| compute_advection_in_corrector_vertical_momentum | 0.010325 | 0.000209 | 49.51 |
| compute_interpolation_and_nonhydro_buoy | 0.000253 | 0.000135 | 1.87 |
| apply_divergence_damping_and_update_vn | 0.000208 | 0.000114 | 1.83 |
| vertically_implicit_solver_at_corrector_step | 0.002938 | 0.000592 | 4.96 |

Some of the programs show a dramatic slowdown on the `MI300A`, and for all of them the standard deviation on the `MI300A` is much higher than on the `GH200`. The values above are median runtimes over 100 iterations (excluding the first, slow one), measured with a C++ timer placed as close as possible to the kernel launches.

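The median-over-iterations aggregation described above can be sketched as follows; the timing lists here are illustrative placeholders, not the measured data:

```
# Illustrative sketch of the aggregation behind the table above: median
# runtime per GT4Py Program, dropping the first (slow, compilation-heavy)
# iteration. The sample numbers are made up, not the measured data.
import statistics

def summarize(times_s: list[float]) -> tuple[float, float]:
    """Return (median, stdev) of the runtimes, skipping the first call."""
    steady = times_s[1:]  # the first call includes compilation/warm-up cost
    return statistics.median(steady), statistics.stdev(steady)

mi300a_times = [0.010, 0.00027, 0.00026, 0.00028, 0.00027]
gh200_times = [0.004, 0.00015, 0.00015, 0.00015, 0.00016]

mi300a_median, _ = summarize(mi300a_times)
gh200_median, _ = summarize(gh200_times)
print(f"GH200 speedup over MI300A: {mi300a_median / gh200_median:.2f}")
```
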
While looking at all of them, and especially at the ones that are much slower on the `MI300A`, is useful, we think that starting from a specific `GT4Py Program` and looking at the performance of each kernel it launches is more interesting as a first step.
To that end, we selected one of the `GT4Py Programs` that takes most of the time in a production simulation and whose kernels cover different representative patterns: neighbor reductions, 2D maps and scans.
This is the `vertically_implicit_solver_at_predictor_step` `GT4Py Program`, and here is the comparison of its kernels:

| Name | MI300A Mean Time (μs) | GH200 Mean Time (μs) | Acceleration GH200 over MI300A (MI300A time / GH200 time) |
|---|---|---|---|
| map_100_fieldop_1_0_0_514 | 225.20 | 123.20 | 1.83 |
| map_115_fieldop_1_0_0_518 | 197.40 | 113.04 | 1.75 |
| map_60_fieldop_0_0_504 | 142.10 | 86.66 | 1.64 |
| map_85_fieldop_0_0_506 | 80.45 | 81.28 | 0.99 |
| map_0_fieldop_0_0_500 | 63.02 | 31.68 | 1.99 |
| map_31_fieldop_0_0_0_512 | 54.46 | 28.56 | 1.91 |
| map_90_fieldop_0_0_508 | 25.57 | 18.62 | 1.37 |
| map_91_fieldop_0_0_510 | 7.99 | 3.49 | 2.29 |
| map_100_fieldop_0_0_0_0_520 | 5.59 | 5.07 | 1.10 |
| map_13_fieldop_0_0_498 | 5.32 | 3.70 | 1.44 |
| map_115_fieldop_0_0_0_516 | 4.99 | 5.28 | 0.95 |
| map_35_fieldop_0_0_503 | 3.62 | 1.87 | 1.93 |

The runtimes of the individual kernels were collected using `nsys` (on the GH200) and `rocprofv3` (on the MI300A).

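As a sketch of how such per-kernel means can be derived from a trace, the snippet below groups kernel launch durations by name. The CSV column names (`Kernel_Name`, `Duration_ns`) and the sample rows are hypothetical; real `rocprofv3`/`nsys` exports use their own schemas, which vary by tool version and output format:

```
# Hedged sketch: compute mean kernel time (in microseconds) from a kernel
# trace exported as CSV. Column names and rows are assumed for illustration,
# not actual rocprofv3/nsys output.
import csv
import io
from collections import defaultdict

trace_csv = """Kernel_Name,Duration_ns
map_0_fieldop_0_0_500,63000
map_0_fieldop_0_0_500,63040
map_91_fieldop_0_0_510,8000
map_91_fieldop_0_0_510,7980
"""

durations_ns = defaultdict(list)
for row in csv.DictReader(io.StringIO(trace_csv)):
    durations_ns[row["Kernel_Name"]].append(int(row["Duration_ns"]))

# Mean time per kernel in microseconds, slowest first, as in the table above.
mean_us = {name: sum(ns) / len(ns) / 1e3 for name, ns in durations_ns.items()}
for name, mean in sorted(mean_us.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {mean:10.2f}")
```
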
The benchmarks were run on `Santis` (`GH200` GPU) and `Beverin` (`MI300A` GPU) using the following uenv images:
- GH200: `icon/25.2:v3` (CUDA 12.6)
- MI300A: `build::prgenv-gnu/25.12:2288359995` (ROCm 7.1.0)

To reproduce the benchmark results on `Beverin` you can follow the instructions below:

```
# Pull the correct `uenv` image. *!* NECESSARY ONLY ONCE *!*
uenv image pull build::prgenv-gnu/25.12:2288359995

# Start the uenv and mount the ROCm 7.1.0 environment. *!* This needs to be executed every time, before running anything *!*
uenv start --view default prgenv-gnu/25.12:2288359995

# Run the whole `dycore` granule and gather the runtimes of the `GT4Py Programs`
sbatch benchmark_dycore.sh
# The script above generates a JSON file with the names of the `GT4Py Programs` and their runtimes. The first run is always slow, so we skip it in our analysis.
# With the following Python script you can parse the JSON file and print the runtimes in a readable form:
# python read_gt4py_timers.py dycore_gt4py_program_metrics.json  # passing --csv will also save them to a CSV file

# Run the `vertically_implicit_solver_at_predictor_step` GT4Py program standalone. Notice the `GT4Py Timer Report` table printed by the first `pytest` invocation. The timers reported in this table are as close as possible to the kernel launches of the GT4Py program.
# The following script benchmarks the solver, runs `rocprofv3` to collect a trace, and runs the `rocprof-compute` tool for all its kernels.
sbatch benchmark_solver.sh
```

## Notes

- To understand the code, apart from the analysis with the profilers, there are the following sources:
  1. Look at the generated HIP code for the `GT4Py Program` `vertically_implicit_solver_at_predictor_step` in `amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_<HASH>/src/cuda/vertically_implicit_solver_at_predictor_step.cpp`. The code is generated automatically by DaCe and is a bit too verbose. It would be good to have some feedback on whether the generated code is in a good form for the HIP compiler to optimize.
  2. Look at the `icon4py` frontend code for the `vertically_implicit_solver_at_predictor_step` [here](https://github.com/C2SM/icon4py/blob/e88b14d8be6eed814faf14c5e8a96aca6dfa991e/model/atmosphere/dycore/src/icon4py/model/atmosphere/dycore/stencils/vertically_implicit_dycore_solver.py#L219).
  3. Look at the SDFG generated by DaCe. This can give a nice overview of the computations and the generated kernels. [The DaCe documentation](https://spcldace.readthedocs.io/en/latest/sdfg/ir.html) can help you understand what is expressed in the SDFG. The generated SDFG is saved in `amd_profiling_solver/.gt4py_cache/vertically_implicit_solver_at_predictor_step_<HASH>/program.sdfg`. To view the SDFG there is a VSCode plugin (`DaCe IOE`), or you can download it locally and open it in https://spcl.github.io/dace-webclient/.

- In the `amd_profiling_solver/.gt4py_cache` directory you may see several `vertically_implicit_solver_at_predictor_step_<HASH>` directories. There are currently issues with caching the compiled programs, so running the profilers might take longer than necessary and cause problems. We should look into this together to figure out a solution.

- Installing the AMD HIP/ROCm packages for our uenv with Spack required various changes, which are done [here](https://github.com/eth-cscs/alps-uenv/pull/273).
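
Related to the caching note above, picking the most recently compiled cache entry (the Python analogue of the `ls -td ... | head -1` line used in the solver script) can be sketched like this; the directories are created in a temporary location purely for illustration, and the hashes are fake:

```
# Sketch: find the newest `vertically_implicit_solver_at_predictor_step_<HASH>`
# entry in a GT4Py build cache by modification time. The cache layout follows
# the notes above; the hash suffixes below are made up.
import pathlib
import tempfile
import time

def latest_build_dir(cache_root: pathlib.Path) -> pathlib.Path:
    """Return the most recently modified solver cache directory."""
    entries = cache_root.glob("vertically_implicit_solver_at_predictor_step_*")
    return max(entries, key=lambda p: p.stat().st_mtime)

cache = pathlib.Path(tempfile.mkdtemp())
for fake_hash in ("aaa111", "bbb222"):
    (cache / f"vertically_implicit_solver_at_predictor_step_{fake_hash}").mkdir()
    time.sleep(0.05)  # ensure distinct modification times

print(latest_build_dir(cache).name)
```
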
---

```
#!/bin/bash
#SBATCH --job-name=dycore_granule_profile
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=mi300

source setup_env.sh
source .venv/bin/activate

export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1"
export GT4PY_BUILD_CACHE_LIFETIME=persistent
export GT4PY_BUILD_CACHE_DIR=amd_profiling_granule
export GT4PY_DYCORE_ENABLE_METRICS="1"
export GT4PY_ADD_GPU_TRACE_MARKERS="1"
export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage"

pytest -sv \
    -m continuous_benchmarking \
    -p no:tach \
    --benchmark-only \
    --benchmark-warmup=on \
    --benchmark-warmup-iterations=30 \
    --backend=dace_gpu \
    --grid=icon_benchmark_regional \
    --benchmark-time-unit=ms \
    --benchmark-min-rounds 100 \
    model/atmosphere/dycore/tests/dycore/integration_tests/test_benchmark_solve_nonhydro.py::test_benchmark_solve_nonhydro[True-False]

python read_gt4py_timers.py dycore_gt4py_program_metrics.json
```

---

```
#!/bin/bash
#SBATCH --job-name=solver_benchmark
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=mi300

source setup_env.sh
source .venv/bin/activate

export GT4PY_UNSTRUCTURED_HORIZONTAL_HAS_UNIT_STRIDE="1"
export GT4PY_BUILD_CACHE_LIFETIME=persistent
export GT4PY_BUILD_CACHE_DIR=amd_profiling_solver
export ORIGINAL_GT4PY_BUILD_CACHE_DIR=$GT4PY_BUILD_CACHE_DIR
export GT4PY_COLLECT_METRICS_LEVEL=10
export GT4PY_ADD_GPU_TRACE_MARKERS="1"
export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=3
export ICON4PY_STENCIL_TEST_ITERATIONS=10
export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=100
export HIPFLAGS="-std=c++17 -fPIC -O3 -march=native -Wno-unused-parameter -save-temps -Rpass-analysis=kernel-resource-usage"

# Run the benchmark and collect the runtime of the whole GT4Py program (see `GT4Py Timer Report` in the output)
# The compiled GT4Py programs will be cached in the directory specified by `GT4PY_BUILD_CACHE_DIR` to be reused for running the `rocprof-compute` later
pytest -sv \
    -m continuous_benchmarking \
    -p no:tach \
    --backend=dace_gpu \
    --grid=icon_benchmark_regional \
    model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \
    -k "test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]"

# Run the benchmark and collect its trace
# TODO(AMD/CSCS): Figure out why reusing the cached compiled stencils doesn't work under rocprofv3 and the GT4Py programs get recompiled every time we rerun the profiler
# TODO(AMD): Generating `rocpd` output fails with segfaults
export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=0
export ICON4PY_STENCIL_TEST_ITERATIONS=1
export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=10
export GT4PY_BUILD_CACHE_DIR=${GT4PY_BUILD_CACHE_DIR}_rocprofv3  # Separate cache directory for the rocprofv3 run to avoid clashes with kernel names
rocprofv3 --kernel-trace on --hip-trace on --marker-trace on --memory-copy-trace on --memory-allocation-trace on --output-format pftrace -o rocprofv3_${GT4PY_BUILD_CACHE_DIR} -- \
    $(which python3.12) -m pytest -sv \
    -m continuous_benchmarking \
    -p no:tach \
    --backend=dace_gpu \
    --grid=icon_benchmark_regional \
    model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \
    -k "test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]"

# Get the kernel names of the GT4Py program so that we can filter them with rocprof-compute
LAST_COMPILED_DIRECTORY=$(realpath $(ls -td ${ORIGINAL_GT4PY_BUILD_CACHE_DIR}/.gt4py_cache/*/ | head -1))
echo "# Last compiled GT4Py directory: $LAST_COMPILED_DIRECTORY"
LAST_COMPILED_KERNEL_NAMES=$(grep -r -e "__global__ void.*map.*(" ${LAST_COMPILED_DIRECTORY}/src/cuda -o | sed 's/.*\s\([a-zA-Z_][a-zA-Z0-9_]*\)(.*/\1/')
echo "# Last compiled GT4Py kernel names:"
echo "$LAST_COMPILED_KERNEL_NAMES"
ROCPROF_COMPUTE_KERNEL_NAME_FILTER="-k $LAST_COMPILED_KERNEL_NAMES"

# Run rocprof-compute filtering the kernels of interest
# TODO(AMD/CSCS): Figure out why reusing the cached compiled stencils doesn't work under rocprofv3 and the GT4Py programs get recompiled every time we rerun the profiler
# This is problematic when gathering the data for the rocprof-compute analysis as different compilations may result in different kernel names
export ICON4PY_STENCIL_TEST_WARMUP_ROUNDS=0
export ICON4PY_STENCIL_TEST_ITERATIONS=1
export ICON4PY_STENCIL_TEST_BENCHMARK_ROUNDS=1
export GT4PY_BUILD_CACHE_DIR=${ORIGINAL_GT4PY_BUILD_CACHE_DIR}  # Reuse the compiled stencils of the first run
rocprof-compute profile --name rcu_${GT4PY_BUILD_CACHE_DIR} ${ROCPROF_COMPUTE_KERNEL_NAME_FILTER} --format-rocprof-output rocpd --kernel-names -R FP64 -- \
    $(which python3.12) -m pytest -sv \
    -m continuous_benchmarking \
    -p no:tach \
    --backend=dace_gpu \
    --grid=icon_benchmark_regional \
    model/atmosphere/dycore/tests/dycore/stencil_tests/test_vertically_implicit_dycore_solver_at_predictor_step.py \
    -k "test_TestVerticallyImplicitSolverAtPredictorStep[compile_time_domain-at_first_substep[False]__is_iau_active[False]__divdamp_type[32]]"

# TODO(AMD): Roofline generation fails with
#   File "/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/roofline.py", line 998, in standalone_roofline
#     self.empirical_roofline(ret_df=t_df)
#   File "/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/utils/logger.py", line 66, in wrap_function
#     result = function(*args, **kwargs)
#              ^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/user-environment/linux-zen3/rocprofiler-compute-7.1.0-rjjjgkz67w66bp46jw7bvlfyduzr6vhv/libexec/rocprofiler-compute/roofline.py", line 463, in empirical_roofline
#     flops_figure.write_image(
#   File "/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/.venv/lib/python3.12/site-packages/plotly/basedatatypes.py", line 3895, in write_image
#     return pio.write_image(self, *args, **kwargs)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   File "/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/.venv/lib/python3.12/site-packages/plotly/io/_kaleido.py", line 555, in write_image
#     path.write_bytes(img_data)
#   File "/user-environment/linux-zen3/python-3.12.12-jpkfwhqo6njvbpw7gjcs22qkvxwexnv5/lib/python3.12/pathlib.py", line 1036, in write_bytes
#     with self.open(mode='wb') as f:
#          ^^^^^^^^^^^^^^^^^^^^
#   File "/user-environment/linux-zen3/python-3.12.12-jpkfwhqo6njvbpw7gjcs22qkvxwexnv5/lib/python3.12/pathlib.py", line 1013, in open
#     return io.open(self, mode, buffering, encoding, errors, newline)
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#   OSError: [Errno 36] File name too long: '/capstor/scratch/cscs/ioannmag/HPCAIAdvisory/icon4py/workloads/rcu_amd_profiling_solver/MI300A_A1/empirRoof_gpu-0_FP64_map_0_fieldop_0_0_500_map_100_fieldop_0_0_0_514_map_100_fieldop_1_0_0_0_520_map_115_fieldop_0_0_0_516_map_115_fieldop_1_0_0_518_map_13_fieldop_0_0_498_map_31_fieldop_0_0_0_512_map_35_fieldop_0_0_503_map_60_fieldop_0_0_504_map_85_fieldop_0_0_506_map_90_fieldop_0_0_508_map_91_fieldop_0_0_510.pdf'
```
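
For reference, the kernel-name extraction done with `grep`/`sed` in the script above can be expressed in Python as follows; the HIP snippet is an invented example shaped like the DaCe-generated code, not real output:

```
# Extract generated kernel names from DaCe-emitted HIP source, the Python
# analogue of the grep/sed pipeline in the solver script. The source text
# below is a made-up example, not actual generated code.
import re

hip_source = """
__global__ void map_0_fieldop_0_0_500(double *a, const double *b) { }
__global__ void map_115_fieldop_1_0_0_518(double *out) { }
"""

kernel_names = re.findall(r"__global__ void\s+([A-Za-z_]\w*)\s*\(", hip_source)
print(kernel_names)
```
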