pytorch README.md path to rocm updated #1

Open
sfinktah wants to merge 1 commit into scottt:gfx1151 from sfinktah:sfink-extbuild-rocmpath

sfinktah commented Jun 25, 2025

The path

export CMAKE_PREFIX_PATH="$(realpath ../../build/dist/rocm)"

Seems now to be

export CMAKE_PREFIX_PATH="$(realpath ../../build/base/rocm-cmake/dist/share/rocm)"
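
Since the install layout seems to move between revisions, a small check like the one below may help decide which value to export. This is only a sketch: the two candidate paths are the ones quoted in this comment, and the helper name `pick_rocm_prefix` is made up for illustration.

```python
from pathlib import Path

# Candidate locations relative to external-builds/pytorch, newest first.
# Both come from this thread; neither is guaranteed to exist in your checkout.
CANDIDATES = [
    "../../build/base/rocm-cmake/dist/share/rocm",
    "../../build/dist/rocm",
]


def pick_rocm_prefix(base_dir, candidates=CANDIDATES):
    """Return the first candidate that exists as a directory (resolved), or None."""
    base = Path(base_dir)
    for rel in candidates:
        path = (base / rel).resolve()
        if path.is_dir():
            return path
    return None
```

Calling `pick_rocm_prefix(Path.cwd())` from `external-builds/pytorch` should return the directory to use for `CMAKE_PREFIX_PATH`, or `None` if neither layout is present.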

Also, some questions:

I'm trying to replicate the Windows builds for gfx110x, etc. by @jammm, but it seems that upstream ROCm is too much of a moving target for the hipBLAS patches. I also wasn't sure what the recommended branch was, so I tried checking out the exact commit the releases were made from, and then copying the .tar.gz sources over the top.

I've returned to building from this "main" (gfx1151) branch, as that seems to be fine, though it only has external-build files for pytorch 2.6.

The other question I have relates to a number of bash commands/scripts that appear in the build process. Are you actually using bash (cygwin, msys2, or something else)? I have no issue with doing likewise; it's just that I'm never sure where to draw the line: e.g. use cygwin's python or Windows python? cygwin's cmake or Windows cmake? It seems that eventually something is going to get upset about the pathing being incompatible.

@sfinktah

Author

Some more notes about compiling with Windows. These may have been addressed in another branch already, but I am including them here so I don't lose them.

Step 0: Prep venv

It is highly recommended to use a virtual environment unless you are working in a throw-away
container/CI environment.

python -m venv .venv
source .venv/bin/activate

cmd.exe

python.exe -m venv venv
call venv\Scripts\activate
python.exe -m pip install --upgrade pip

Step 1: Preparing sources

# Checks out the most recent stable release branch of PyTorch, hipifies and
# applies patches.
./ptbuild checkout

cmd.exe

python ptbuild checkout
Set-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem -Name LongPathsEnabled -Value 1

Then launch a new shell. The Set-ItemProperty line is a PowerShell command that enables long paths; run it before the checkout, otherwise you will hit the error "While running ptbuild: error: unable to create file library/src/tensor_operation_instance/gpu/gemm_multiply_multiply/device_gemm_multiply_multiply_xdl_f8_f8_bf16/device_gemm_multiply_multiply_xdl_f8_f8_bf16_mk_nk_mn_comp_default_instance.cpp: Filename too long".
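
To confirm the registry change took effect before retrying the checkout, something like the following could be used. It is a minimal sketch that reads the same `LongPathsEnabled` value via Python's standard `winreg` module; the function name `long_paths_enabled` is hypothetical, and on non-Windows systems it simply returns `None`.

```python
import sys


def long_paths_enabled():
    """Return True/False on Windows, or None when the value cannot be read."""
    if sys.platform != "win32":
        return None  # the setting only exists on Windows
    import winreg

    key_path = r"SYSTEM\CurrentControlSet\Control\FileSystem"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "LongPathsEnabled")
    except OSError:
        # Key or value missing, or access denied.
        return None
    return value == 1
```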

Step 2: Install Deps

Python deps:

pip install -r src/requirements.txt
pip install mkl-static mkl-include

Step 3: Setup and Build

export CMAKE_PREFIX_PATH="$(realpath ../../build/base/rocm-cmake/dist/share/rocm)"
(cd src && USE_KINETO=OFF python setup.py develop)

cmd.exe

for %F in ("..\..\build\base\rocm-cmake\dist\share\rocm") do set CMAKE_PREFIX_PATH=%~fF
set USE_KINETO=OFF
:: required to avoid errors from cmake projects like protobuf that dislike newer version of cmake
set CMAKE_POLICY_VERSION_MINIMUM=3.5
cd src
python setup.py develop
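
The same setup can be expressed once for both shells. The sketch below only collects the three variables used in this step into an environment mapping; the helper names `pytorch_build_env` and `run_build` are illustrative and not part of ptbuild.

```python
import os
import subprocess
import sys


def pytorch_build_env(rocm_prefix, base_env=None):
    """Return a copy of the environment with this step's build variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Resolve relative to the current directory, i.e. before changing into src.
    env["CMAKE_PREFIX_PATH"] = os.path.abspath(rocm_prefix)
    env["USE_KINETO"] = "OFF"
    # Works around CMake projects (e.g. protobuf) that reject newer CMake versions.
    env["CMAKE_POLICY_VERSION_MINIMUM"] = "3.5"
    return env


def run_build(src_dir, env):
    """Run `python setup.py develop` inside src_dir with the prepared environment."""
    subprocess.run(
        [sys.executable, "setup.py", "develop"], cwd=src_dir, env=env, check=True
    )
```

From `external-builds/pytorch`, this would be invoked as `run_build("src", pytorch_build_env("../../build/base/rocm-cmake/dist/share/rocm"))`.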

Note: /cygdrive/r/rocm-TheRock/external-builds/pytorch/src/torch/csrc/jit/ir/ir.cpp needed

#define USE_ROCM

added, which probably indicates that this is not the correct branch to be using, as I'm sure you would have addressed the issue at the source already.

sfinktah commented Jun 25, 2025

Author

... for completeness, here is torchaudio and torchvision. Please note, actual wheel didn't work, so ...yeah.

And then...

Building the wheel... (will be saved to external-builds/pytorch/src/dist)

cmd.exe

pip install build wheel
python -m build --wheel

Building torchaudio and vision

Anyway, you'll need to get rid of all traces of HIP and suchlike. I suggest:

cmd.exe

set HIP_PATH=
set HIP_PATH_62=
set ROCM_HOME=
set USE_ROCM=
set USE_CUDA=
set USE_PYTORCH_QNNPACK=0

Then, assuming you have HIP installed, either remove it from your Path, or (easier), rename it on disk.
Just renaming C:\Program Files\AMD\ROCm\6.2 to C:\Program Files\AMD\ROCm\6.2 - Temporarily Disabled will do it.
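
As an alternative to renaming the directory, the build could be launched with a filtered environment. This is a rough sketch, not something tested against these builds: the variable list is the one above, while the helper name `scrubbed_env` and the `\AMD\ROCm\` substring used to filter `PATH` are assumptions.

```python
import os

# Variables cleared in the cmd.exe snippet above.
UNSET = ("HIP_PATH", "HIP_PATH_62", "ROCM_HOME", "USE_ROCM", "USE_CUDA")


def scrubbed_env(base_env=None):
    """Return a copy of the environment with HIP/ROCm hints removed."""
    env = dict(os.environ if base_env is None else base_env)
    for name in UNSET:
        env.pop(name, None)
    env["USE_PYTORCH_QNNPACK"] = "0"
    # Assumption: an installed HIP SDK shows up on PATH under ...\AMD\ROCm\...
    parts = env.get("PATH", "").split(os.pathsep)
    kept = [p for p in parts if "\\amd\\rocm\\" not in p.lower()]
    env["PATH"] = os.pathsep.join(kept)
    return env
```

Passing the result as `env=` to `subprocess.run` leaves the parent shell untouched, so nothing needs to be renamed back afterwards.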

You need to check out the branches (tags) that match the pytorch you just built. To double-check that it is installed and working:

cmd.exe

python -c "import torch; print(f'v{torch.__version__.split(\"+\")[0]}')"

If this is... 2.6.0a0 (which it is for me, because I'm building from the wrong
branch), then just copy what I do. Otherwise, figure it out. vision should be
v0.22.0 for pytorch 2.7.0, v0.21.0 for pytorch 2.6.0.
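
The version pairing can be written down as a small lookup. In this sketch the vision tags are the two quoted above; the audio tags assume torchaudio tags follow the torch version (v2.6.0 is the tag checked out below), and the function name `companion_tags` is made up.

```python
import re

# torch major.minor -> tags to check out. Vision pairs are from this comment;
# audio tags are an assumption that torchaudio tags mirror the torch version.
COMPANION_TAGS = {
    "2.6": {"vision": "v0.21.0", "audio": "v2.6.0"},
    "2.7": {"vision": "v0.22.0", "audio": "v2.7.0"},
}


def companion_tags(torch_version):
    """Map a torch version such as '2.6.0a0+git8382190' to matching tags."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if not match:
        raise ValueError(f"unrecognised torch version: {torch_version!r}")
    key = f"{match.group(1)}.{match.group(2)}"
    if key not in COMPANION_TAGS:
        raise ValueError(f"no known tags for torch {key}")
    return COMPANION_TAGS[key]
```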

cmd.exe

set PYTHONPATH=R:\rocm-TheRock\external-builds\pytorch\src
cd ..\..\..\external-builds
git clone --recursive https://github.com/pytorch/audio.git torchaudio
git clone --recursive https://github.com/pytorch/vision.git torchvision

:: torchaudio
cd torchaudio
git checkout v2.6.0
pip install -r requirements.txt
python setup.py bdist_wheel

:: vision
cd ..\torchvision
git checkout v0.21.0
set DISTUTILS_USE_SDK=1
python setup.py bdist_wheel

You should now have the following wheels. I'm leaving my absolute path in (again) just for clarity.

R:\rocm-TheRock\external-builds\pytorch\src\dist\torch-2.6.0a0+git8382190-cp311-cp311-win_amd64.whl
R:\rocm-TheRock\external-builds\torchaudio\dist\torchaudio-2.6.0a0+d883142-cp311-cp311-win_amd64.whl
R:\rocm-TheRock\external-builds\torchvision\dist\torchvision-0.21.0+7af6987-cp311-cp311-win_amd64.whl

Now to check if it all works

cmd.exe

:: deactivate our development venv
%VIRTUAL_ENV%\scripts\deactivate
:: go to where jammm's wheels are installed
cd c:\zluda\comfy-rock
c:
:: (a) don't laugh at my path
:: (b) yes, you really can change directories like that, it's just super weird

:: activate our production venv
venv\Scripts\activate
pip uninstall torch torchvision torchaudio -y --quiet
pip install ^
    R:\rocm-TheRock\external-builds\pytorch\src\dist\torch-2.6.0a0+git8382190-cp311-cp311-win_amd64.whl ^
    R:\rocm-TheRock\external-builds\torchaudio\dist\torchaudio-2.6.0a0+d883142-cp311-cp311-win_amd64.whl ^
    R:\rocm-TheRock\external-builds\torchvision\dist\torchvision-0.21.0+7af6987-cp311-cp311-win_amd64.whl

Epic fail

cmd.exe

comfy --normalvram
  File "C:\zluda\comfy-rock\venv\Lib\site-packages\torch\cuda\__init__.py", line 310, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Help!!!

I manually nopped a few checks so comfyui would run, but something is NQR (not quite right). dev.type == 'hip', btw.
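
For anyone hitting the same assertion, it may help to collect what the installed wheel reports about itself before patching anything. The sketch below only reads attributes that PyTorch exposes (`torch.version.hip`, `torch.version.cuda`, `torch.cuda.is_available()`); the function name `describe_torch_build` is hypothetical, and what the values mean for this particular wheel is not something this thread establishes.

```python
def describe_torch_build():
    """Summarise how the installed torch wheel identifies itself."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # Set to a version string on ROCm builds, None otherwise.
        "hip": getattr(torch.version, "hip", None),
        # Set to a version string on CUDA builds, None otherwise.
        "cuda": getattr(torch.version, "cuda", None),
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```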

sfinktah commented Jun 25, 2025

Author

update

I cherrypicked some scripts, and am trying again with those.

git restore --source 5d61dbb --worktree --staged \
  build_pytorch_all.sh \
  build_pytorch_audio.sh \
  build_pytorch_torch.sh \
  build_pytorch_vision.sh \
  checkout_and_build_all.sh \
  checkout_pytorch_all.sh \
  env_init.sh \
  pytorch_audio_repo.py \
  pytorch_torch_repo.py \
  pytorch_vision_repo.py \
  repo_management.py
  
python pytorch_torch_repo.py checkout

Seems to give me 2.7.0 install scripts, but the latest patches.

jammm commented Jun 25, 2025

I would highly recommend you build from TheRock main branch. These forks are not really maintained and you won't get much support from here.

Follow the instructions to build pytorch from there.
torchvision should build out of the box with FORCE_CUDA=1 and setting your CUDA_HOME and ROCM_HOME to the rocm folder that you built. torchaudio won't build out of the box on Windows and needs some work. You could try this branch, https://github.com/jammm/audio/tree/jam/v2.7.0_windows but it's another fork so be prepared to tinker around a bit. Otherwise I would wait for those changes to be upstreamed.
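
The torchvision suggestion amounts to three environment variables, the opposite of clearing `ROCM_HOME` as done earlier in this thread. A minimal sketch, with a hypothetical helper name `torchvision_build_env` and the ROCm directory left for the reader to supply:

```python
import os


def torchvision_build_env(rocm_dir, base_env=None):
    """Return an environment that points the torchvision build at a ROCm tree."""
    env = dict(os.environ if base_env is None else base_env)
    rocm_dir = os.path.abspath(rocm_dir)
    env["FORCE_CUDA"] = "1"
    # Both variables point at the same ROCm folder, as suggested above.
    env["CUDA_HOME"] = rocm_dir
    env["ROCM_HOME"] = rocm_dir
    return env
```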

sfinktah commented Jun 25, 2025

Author

Hi @jammm ... I'll check it out. I feel I got close with torch 2.7.0 using 5d61dbb#diff-625c093a3a6309b1e5adae73696da1c7c693ffffc65812f3d57924ee07ff641b but apparently I didn't cherrypick all the patches and ended up without aotriton compiled in.

I am guessing this fork was all about the gfx1151?

lshqqytiger has apparently built some stand-alone wheels for older AMDs, so maybe he will sort it all out and I can go back to being lazy. I would rather like gfx1030 and gfx1100 support in one wheel. :)

jammm commented Jun 25, 2025

yeah mainly for gfx1151. Though technically it could support other archs.

@sfinktah

Author

Totally, and it supported the gfx1100 great. Its only failings were that it didn't have LAPACK for some CPU tensor stuff used for automatic masking in Wan2GP (easily fixed), and that it got really sulky about large Conv3Ds (which I believe is something of a traditional AMD thing involving lengths or widths above 512, and easily solved by using VAE Tiled Decoding).

That and CPU Text Encoding being really slow without Triton to compile CPU code. However having just read ROCm#409 I can see how you have all been working super hard, and how much everyone has contributed. It's also great to see AMD and non-AMD (and ex-AMD) people getting together to make community development possible.

btw, I also find boost weird... but there are wonderful things in there. Though those mostly end up being mainlined into the C++ standard or extracted into things like fmt. Thankfully Microsoft vcpkg takes care of boost when I need it, and I no longer have to keep 37 versions.

Again, excellent work. I released guides on /r/comfyui for using the wheels on ComfyUI and Wan2GP, but there are only a handful of AMD regulars in there.

@sfinktah

Author

@jammm

We (the collective community) did some more work on this, and we have had a working 6.5 for gfx1100 and gfx1030 for a while now, with flash_attention, sage_attention, and triton. The sordid history of that development is at patientx/ComfyUI-Zluda#170 (comment) and the resultant automated script is at https://github.com/user-attachments/files/21155122/patientx-native-rocm-3.zip

Not sure if any of that would be of interest to you or scottt.

I also took some time to expand ADLX's Pybind demo into a more conventional python module https://pypi.org/project/ADLXPybind/ that can build itself from an sdist (if required) but has wheels now.

That was a necessary step to create https://pypi.org/project/pynvml-amd-windows/ which is a drop-in replacement (actually, it hijacks the pynvml module name) for pynvml insofar as pynvml is used in Crystool performance monitor.

As I say, I have no idea if any of this is of interest or use to you, but I figure we owe you and scottt for your work and for inspiring the rest of us to try a little harder to contribute. Hopefully some of it is of some use to you.

Obviously we also owe "other" scott and the rest of the "TheRock" crew an incalculable debt, and those at AMD who are working with the community on the ROCm project, but I'm not sure they would have any use for our small efforts.

And if you would be open to it, I might want to have a chat about whether it would be practical for me to attempt to port nunchaku (inference engine for 4-bit neural networks quantized with SVDQuant) to HIP.

Ginxchan commented Sep 6, 2025

oh man, thanks a lot for the v6.5.0rc-pytorch-gfx110x wheels, it has been quite a decent speed up but most importantly fixed most of my vae decode issues and can fully use the vram on my 7900xt.

sfinktah commented Sep 8, 2025

Author

@Ginxchan If you have a 7900xt (which I assume is basically the same as the 7900xtx), then you'll probably (read: definitely) get faster results with ZLUDA + Triton + sageattention, and with less memory use.

Though this wheel is very handy for nodes that don't work under ZLUDA (well, I've only found one, which was a stem-maker for songs, but I'm sure there's more).

Note: you can also install Triton and sageattention for this native wheel, but it will still not be as efficient. No idea why.

scottt pushed a commit that referenced this pull request Sep 8, 2025
Running these recently added Python unit tests on CI will help encourage
good development practices (see also
ROCm#750). I just noticed older tests
already running here:
https://github.com/ROCm/TheRock/blob/13ef7021af1f183e9344ec177ccb79c16426385e/.github/workflows/build_linux_packages.yml#L120-L123

Sample logs from
https://github.com/ROCm/TheRock/actions/runs/15339038605/job/43221678538#step:12:12:
```
Run ctest --test-dir build --output-on-failure
Internal ctest changing into directory: /__w/TheRock/TheRock/build
Test project /__w/TheRock/TheRock/build
      Start  1: build_tools_fileset_tool_test
 1/25 Test  #1: build_tools_fileset_tool_test .........................   Passed    0.30 sec
      Start  2: build_tools_artifacts_test
 2/25 Test  #2: build_tools_artifacts_test ............................   Passed    0.06 sec
      Start  3: therock-validate-shared-lib-librocm-openblas.so
 3/25 Test  #3: therock-validate-shared-lib-librocm-openblas.so .......   Passed    0.04 sec
      Start  4: therock-validate-shared-lib-libamd.so
 4/25 Test  #4: therock-validate-shared-lib-libamd.so .................   Passed    0.03 sec
...
```

We'll probably want to run the Python unit tests much earlier in the
build, but this is better than not running them anywhere. We could also
run these via pytest instead of ctest.
scottt pushed a commit that referenced this pull request Sep 8, 2025
…s gtest folder (ROCm#1398)

This reverts commit 35444a3.

See discussion at
ROCm#1248 (comment).

We suspect this is causing flaky build failures on Windows gfx1151 like
https://github.com/ROCm/TheRock/actions/runs/17471481134/job/49620818900?pr=1349#step:11:36751.

```
[MIOpen] [894/920] Building CXX object test/gtest/CMakeFiles/miopen_gtest.dir/smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp.obj
[MIOpen] FAILED: test/gtest/CMakeFiles/miopen_gtest.dir/smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp.obj 
[MIOpen] ccache B:\build\core\clr\dist\lib\llvm\bin\clang++.exe -DBOOST_ALL_NO_LIB=1 -DBOOST_ATOMIC_NO_LIB -DBOOST_FILESYSTEM_NO_LIB -DBOOST_SYSTEM_NO_LIB -DHIP_COMPILER_FLAGS=" -x hip   -D__HIP_PLATFORM_AMD__=1  -DUSE_PROF_API=1 C:/2E1C510A-F3EC-4287-AB5A-59025DAF1B15/build/core/clr/dist/lib/llvm/lib/clang/20/lib/windows/clang_rt.builtins-x86_64.lib --hip-link  C:/2E1C510A-F3EC-4287-AB5A-59025DAF1B15/build/core/clr/dist/lib/llvm/lib/clang/20/lib/windows/clang_rt.builtins-x86_64.lib  -fno-offload-uniform-block " -DMIOPEN_BETA_API=1 -DMIOPEN_BUILD_TESTING -DMIOPEN_TEST_DRIVER_MODE=1 -DNOMINMAX -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -IB:/build/math-libs/BLAS/hipBLAS/stage/include -IB:/build/math-libs/BLAS/hipBLAS-common/stage/include -IB:/build/math-libs/rocRAND/stage/include -IC:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest/.. -IC:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest/../../src/kernels -IC:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/include -IB:/build/ml-libs/MIOpen/build/include -IC:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/include -isystem B:/build/base/half/stage/include -isystem B:/build/third-party/frugally-deep/dist/include -isystem B:/build/third-party/FunctionalPlus/dist/include -isystem B:/build/third-party/eigen/dist/include/eigen3 -isystem B:/build/third-party/nlohmann-json/dist/include -isystem B:/build/math-libs/BLAS/rocBLAS/dist/include -isystem B:/build/core/clr/dist/include -isystem B:/build/third-party/googletest/dist/include -isystem B:/build/compiler/amd-comgr/dist/include -isystem B:/build/third-party/boost/cmake_project/dist/include/boost-1_87 -isystem B:/build/third-party/sysdeps/windows/sqlite3/build/dist/lib/rocm_sysdeps/include -isystem B:/build/third-party/sysdeps/windows/bzip2/build/dist/lib/rocm_sysdeps/include -DWIN32 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_WARNINGS -DNOMINMAX -fms-extensions -fms-compatibility 
-D_ENABLE_EXTENDED_ALIGNED_STORAGE  -Wno-documentation-unknown-command -Wno-documentation-pedantic -Wno-unused-command-line-argument -Wno-explicit-specialization-storage-class -Wno-ignored-attributes -Wno-unknown-attributes -Wno-duplicate-decl-specifier --hip-path=B:/build/core/clr/dist --hip-device-lib-path=B:/build/core/clr/dist/lib/llvm/amdgcn/bitcode -O3 -DNDEBUG -std=c++20 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt   -U__HCC__ -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-ignored-qualifiers -Wno-sign-compare -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-extra-semi-stmt -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-option-ignored -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unused-result -Wno-unsafe-buffer-usage -Wno-deprecated-declarations -Wno-shadow-uncaptured-local -Wno-global-constructors -Wno-reserved-identifier -Wno-zero-as-null-pointer-constant -Wno-ignored-attributes -Wno-deprecated -Wno-incompatible-pointer-types -Wno-old-style-cast -Wno-unknown-attributes -Wno-microsoft-cpp-macro -Wno-microsoft-enum-value -Wno-language-extension-token -Wno-c++11-narrowing -Wno-float-equal -Wno-redundant-parens -Wno-format-nonliteral -Wno-unused-template -Wno-comma -Wno-suggest-destructor-override -Wno-switch-enum -Wno-shift-sign-overflow -Wno-suggest-override -Wno-inconsistent-missing-destructor-override -Wno-cast-function-type -Wno-nonportable-system-include-path -Wno-documentation -Wno-deprecated-builtins -Wno-enum-constexpr-conversion -Wno-unused-value -Wno-unused-parameter -Wno-missing-noreturn 
-Wno-tautological-constant-out-of-range-compare -Wno-c++20-extensions -Wno-unique-object-duplication -Wno-switch-default -Wno-nontrivial-memcall -fms-extensions -fms-compatibility -Wno-undef -U__LP64__ -x hip --offload-arch=gfx1151 -MD -MT test/gtest/CMakeFiles/miopen_gtest.dir/smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp.obj -MF test\gtest\CMakeFiles\miopen_gtest.dir\smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp.obj.d -o test/gtest/CMakeFiles/miopen_gtest.dir/smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp.obj -c C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest/smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp
[MIOpen] PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
[MIOpen] Stack dump:
[MIOpen] 0.	Program arguments: C:\\2E1C510A-F3EC-4287-AB5A-59025DAF1B15\\build\\core\\clr\\dist\\lib\\llvm\\bin\\clang++.exe -cc1 -triple x86_64-pc-windows-msvc19.44.35215 -aux-triple amdgcn-amd-amdhsa -emit-obj -mincremental-linker-compatible -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp -mrelocation-model pic -pic-level 2 -mframe-pointer=none -relaxed-aliasing -fmath-errno -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -target-cpu x86-64 -tune-cpu generic -fdebug-compilation-dir=B:\\build\\ml-libs\\MIOpen\\build -fcoverage-compilation-dir=B:\\build\\ml-libs\\MIOpen\\build -resource-dir C:\\2E1C510A-F3EC-4287-AB5A-59025DAF1B15\\build\\core\\clr\\dist\\lib\\llvm\\lib\\clang\\20 -dependency-file test\\gtest\\CMakeFiles\\miopen_gtest.dir\\smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp.obj.d -MT test/gtest/CMakeFiles/miopen_gtest.dir/smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp.obj -sys-header-deps -internal-isystem C:\\2E1C510A-F3EC-4287-AB5A-59025DAF1B15\\build\\core\\clr\\dist\\lib\\llvm\\lib\\clang\\20\\include\\cuda_wrappers -idirafter B:/build/core/clr/dist\\include -include __clang_hip_runtime_wrapper.h -isystem B:/build/base/half/stage/include -isystem B:/build/third-party/frugally-deep/dist/include -isystem B:/build/third-party/FunctionalPlus/dist/include -isystem B:/build/third-party/eigen/dist/include/eigen3 -isystem B:/build/third-party/nlohmann-json/dist/include -isystem B:/build/math-libs/BLAS/rocBLAS/dist/include -isystem B:/build/core/clr/dist/include -isystem B:/build/third-party/googletest/dist/include -isystem B:/build/compiler/amd-comgr/dist/include -isystem B:/build/third-party/boost/cmake_project/dist/include/boost-1_87 -isystem B:/build/third-party/sysdeps/windows/sqlite3/build/dist/lib/rocm_sysdeps/include -isystem 
B:/build/third-party/sysdeps/windows/bzip2/build/dist/lib/rocm_sysdeps/include -D BOOST_ALL_NO_LIB=1 -D BOOST_ATOMIC_NO_LIB -D BOOST_FILESYSTEM_NO_LIB -D BOOST_SYSTEM_NO_LIB -D "HIP_COMPILER_FLAGS= -x hip   -D__HIP_PLATFORM_AMD__=1  -DUSE_PROF_API=1 C:/2E1C510A-F3EC-4287-AB5A-59025DAF1B15/build/core/clr/dist/lib/llvm/lib/clang/20/lib/windows/clang_rt.builtins-x86_64.lib --hip-link  C:/2E1C510A-F3EC-4287-AB5A-59025DAF1B15/build/core/clr/dist/lib/llvm/lib/clang/20/lib/windows/clang_rt.builtins-x86_64.lib  -fno-offload-uniform-block " -D MIOPEN_BETA_API=1 -D MIOPEN_BUILD_TESTING -D MIOPEN_TEST_DRIVER_MODE=1 -D NOMINMAX -D USE_PROF_API=1 -D __HIP_PLATFORM_AMD__=1 -I B:/build/math-libs/BLAS/hipBLAS/stage/include -I B:/build/math-libs/BLAS/hipBLAS-common/stage/include -I B:/build/math-libs/rocRAND/stage/include -I C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest/.. -I C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest/../../src/kernels -I C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/src/include -I B:/build/ml-libs/MIOpen/build/include -I C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/include -D WIN32 -D WIN32_LEAN_AND_MEAN -D _CRT_SECURE_NO_WARNINGS -D NOMINMAX -D _ENABLE_EXTENDED_ALIGNED_STORAGE -D NDEBUG -D _DLL -D _MT -U __HCC__ -U __LP64__ -internal-isystem C:\\2E1C510A-F3EC-4287-AB5A-59025DAF1B15\\build\\core\\clr\\dist\\lib\\llvm\\lib\\clang\\20\\include -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.44.35207\\include" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.44.35207\\ATLMFC\\include" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Auxiliary\\VS\\include" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\include\\10.0.26100.0\\ucrt" -internal-isystem "C:\\Program Files (x86)\\Windows 
Kits\\10\\\\include\\10.0.26100.0\\\\um" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\shared" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\winrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\cppwinrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\NETFXSDK\\4.8\\include\\um" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.44.35207\\include" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.44.35207\\ATLMFC\\include" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Auxiliary\\VS\\include" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\include\\10.0.26100.0\\ucrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\um" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\shared" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\winrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\cppwinrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\NETFXSDK\\4.8\\include\\um" -internal-isystem C:\\2E1C510A-F3EC-4287-AB5A-59025DAF1B15\\build\\core\\clr\\dist\\lib\\llvm\\lib\\clang\\20\\include -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.44.35207\\include" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.44.35207\\ATLMFC\\include" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Auxiliary\\VS\\include" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\include\\10.0.26100.0\\ucrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\um" 
-internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\shared" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\winrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\cppwinrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\NETFXSDK\\4.8\\include\\um" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.44.35207\\include" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.44.35207\\ATLMFC\\include" -internal-isystem "C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Auxiliary\\VS\\include" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\include\\10.0.26100.0\\ucrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\um" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\shared" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\winrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\10\\\\include\\10.0.26100.0\\\\cppwinrt" -internal-isystem "C:\\Program Files (x86)\\Windows Kits\\NETFXSDK\\4.8\\include\\um" -O3 -Wno-documentation-unknown-command -Wno-documentation-pedantic -Wno-unused-command-line-argument -Wno-explicit-specialization-storage-class -Wno-ignored-attributes -Wno-unknown-attributes -Wno-duplicate-decl-specifier -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-ignored-qualifiers -Wno-sign-compare -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-extra-semi-stmt -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes 
-Wno-nested-anon-types -Wno-option-ignored -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unused-result -Wno-unsafe-buffer-usage -Wno-deprecated-declarations -Wno-shadow-uncaptured-local -Wno-global-constructors -Wno-reserved-identifier -Wno-zero-as-null-pointer-constant -Wno-ignored-attributes -Wno-deprecated -Wno-incompatible-pointer-types -Wno-old-style-cast -Wno-unknown-attributes -Wno-microsoft-cpp-macro -Wno-microsoft-enum-value -Wno-language-extension-token -Wno-c++11-narrowing -Wno-float-equal -Wno-redundant-parens -Wno-format-nonliteral -Wno-unused-template -Wno-comma -Wno-suggest-destructor-override -Wno-switch-enum -Wno-shift-sign-overflow -Wno-suggest-override -Wno-inconsistent-missing-destructor-override -Wno-cast-function-type -Wno-nonportable-system-include-path -Wno-documentation -Wno-deprecated-builtins -Wno-enum-constexpr-conversion -Wno-unused-value -Wno-unused-parameter -Wno-missing-noreturn -Wno-tautological-constant-out-of-range-compare -Wno-c++20-extensions -Wno-unique-object-duplication -Wno-switch-default -Wno-nontrivial-memcall -Wno-undef -std=c++20 -ferror-limit 19 -fhip-new-launch-api -fno-use-cxa-atexit -fms-extensions -fms-compatibility -fms-compatibility-version=19.44.35215 -fno-implicit-modules -fskip-odr-check-in-gmf -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp --dependent-lib=msvcrt -fcuda-include-gpubinary C:\\Users\\ContainerAdministrator\\AppData\\Local\\Temp\\smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16-809b30.hipfb -cuid=bf2425f0600af3e8 -fcuda-allow-variadic-functions -faddrsig -o test/gtest/CMakeFiles/miopen_gtest.dir/smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp.obj -x hip C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest/smoke_solver_ConvAsmImplicitGemmGTCDynamicXdlopsNHWC_bf16.cpp
[MIOpen] 1.	<eof> parser at end of file
[MIOpen] 2.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\gtest_common.hpp:279:6: instantiating function definition 'invoke_with_params<conv2d_driver, GPU_Conv2dTuning_BFP16, void (&)(const std::basic_string<char> &)>'
[MIOpen] 3.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\../driver.hpp:1356:6: instantiating function definition 'test_drive<conv2d_driver>'
[MIOpen] 4.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\../driver.hpp:1337:6: instantiating function definition 'test_drive_impl<conv2d_driver<double>>'
[MIOpen] 5.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\../driver.hpp:1233:6: instantiating function definition 'test_drive_impl_1<conv2d_driver<double>>'
[MIOpen] 6.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\../driver.hpp:938:10: instantiating function definition 'test_driver::base_run<conv2d_driver<double>>'
[MIOpen] 7.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\..\conv_common.hpp:1962:10: instantiating function definition 'conv_driver<double>::run'
[MIOpen] 8.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\../driver.hpp:910:10: instantiating function definition 'test_driver::verify<verify_backward_weights_conv<ConvApi::Find_1_0, double>>'
[MIOpen] 9.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\../driver.hpp:798:10: instantiating function definition 'test_driver::verify_impl<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\driver.hpp:913:13), verify_backward_weights_conv<ConvApi::Find_1_0, double> &>'
[MIOpen] 10.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\../driver.hpp:746:10: instantiating function definition 'test_driver::run_cpu<verify_backward_weights_conv<ConvApi::Find_1_0, double>>'
[MIOpen] 11.	C:/home/runner/_work/TheRock/TheRock/rocm-libraries/projects/miopen/test/gtest\..\ford.hpp:56:6: instantiating function definition 'then<tensor<double>, (lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\driver.hpp:768:46)>'
[MIOpen] 12.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\future:1362:81: instantiating function definition 'std::async<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)>'
[MIOpen] 13.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\future:1350:41: instantiating function definition 'std::_Get_associated_state<tensor<double>, std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)>>'
[MIOpen] 14.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\future:597:5: instantiating function definition 'std::_Deferred_async_state<tensor<double>>::_Deferred_async_state<std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)>>'
[MIOpen] 15.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\type_traits:53:16: instantiating variable definition 'std::conjunction_v<std::negation<std::is_same<std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)>, std::function<tensor<double> ()>>>, std::_Is_invocable_r<tensor<double>, std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)> &>>'
[MIOpen] 16.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\type_traits:45:8: instantiating class definition 'std::conjunction<std::negation<std::is_same<std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)>, std::function<tensor<double> ()>>>, std::_Is_invocable_r<tensor<double>, std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)> &>>'
[MIOpen] 17.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\type_traits:35:8: instantiating class definition 'std::_Conjunction<true, std::negation<std::is_same<std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)>, std::function<tensor<double> ()>>>, std::_Is_invocable_r<tensor<double>, std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)> &>>'
[MIOpen] 18.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\type_traits:1827:8: instantiating class definition 'std::_Is_invocable_r<tensor<double>, std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)> &>'
[MIOpen] 19.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\future:1341:20: instantiating function definition 'std::_Fake_no_copy_callable_adapter<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)>::operator()'
[MIOpen] 20.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\future:1314:16: instantiating function definition 'std::_Invoke_stored<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23)>'
[MIOpen] 21.	C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.44.35207\include\future:1308:16: instantiating function definition 'std::_Invoke_stored_explicit<(lambda at C:\home\runner\_work\TheRock\TheRock\rocm-libraries\projects\miopen\test\gtest\..\ford.hpp:59:23), 0ULL>'
[MIOpen] Exception Code: 0xC0000005
[MIOpen]   #0 0x00007ff64a1882be (C:\2E1C510A-F3EC-4287-AB5A-59025DAF1B15\build\core\clr\dist\lib\llvm\bin\clang++.exe+0x15082be)
[MIOpen]   #1 0x00007ff64bb9af20 (C:\2E1C510A-F3EC-4287-AB5A-59025DAF1B15\build\core\clr\dist\lib\llvm\bin\clang++.exe+0x2f1af20)
```

It also increased MIOpen test times substantially:

Before: https://github.com/ROCm/TheRock/actions/runs/17447546026
* Linux mi325 50m
* Linux mi355 1h7m

After: https://github.com/ROCm/TheRock/actions/runs/17458318068
* Linux mi325 1h30m
* Linux mi355 1h54m (very close to a 2 hour timeout)
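
For a rough sense of scale, the slowdown and the remaining headroom can be worked out from the durations listed above. This is a minimal sketch: the helper names are hypothetical, and it assumes the 2 hour timeout applies to both runners, although the comment only mentions it for mi355.

```python
import re

TIMEOUT_MIN = 120  # the 2 hour timeout mentioned above (assumed to apply to both runners)


def to_minutes(text):
    """Convert a duration such as '1h7m' or '50m' to whole minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def summarize(before, after):
    """Return the absolute and relative increase plus headroom to the timeout."""
    b, a = to_minutes(before), to_minutes(after)
    return {
        "delta_min": a - b,
        "increase_pct": round(100 * (a - b) / b),
        "headroom_min": TIMEOUT_MIN - a,
    }


# Durations copied from the before/after lists above.
runs = {"mi325": ("50m", "1h30m"), "mi355": ("1h7m", "1h54m")}

for runner, (before, after) in runs.items():
    print(runner, summarize(before, after))
```

With these numbers, mi325 grows by 40 minutes (about 80%) and mi355 by 47 minutes (about 70%), which leaves mi355 only 6 minutes short of the timeout.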

Horus-p commented Nov 26, 2025


DivLOGs.txt
For what it's worth: after the machine froze a few times when running `python -c "import torch; torch.tensor([1.0]).cuda(); print('Did it work?')"`, I took a look at the system log on Ubuntu 24.10. It seems that:

* GPU detected as amdgpu 0000:0a:00.0 (RDNA4 - gfx_v12_0)
* VRAM: 16304M recognized correctly
* All GPU IP blocks loaded
* Kernel mode setting (KFD) initialized
* Ring buffers tested and passed
* GPU fully initialized: Initialized amdgpu 3.61.0

The issue happens AFTER successful initialization, when actual compute workloads start.
Critical line:

`amdgpu 0000:0a:00.0: amdgpu: PCIE atomic ops is not supported`
This suggests that the Thunderbolt link does not support PCIe atomic operations, which ROCm compute relies on heavily. That would explain why basic GPU initialization works fine while actual compute operations fail.
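
To check a saved kernel log for this message, something like the following could be used. It is only a sketch: the regular expression is modelled on the single log line quoted above, the function name is hypothetical, and the second line of the sample text is made up for illustration.

```python
import re

# Modelled on the kernel log line quoted above:
#   amdgpu 0000:0a:00.0: amdgpu: PCIE atomic ops is not supported
ATOMICS_RE = re.compile(
    r"amdgpu (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"amdgpu: PCIE atomic ops is not supported"
)


def devices_without_pcie_atomics(log_text):
    """Return the sorted PCI addresses for which amdgpu reported missing PCIe atomics."""
    return sorted({m.group("bdf") for m in ATOMICS_RE.finditer(log_text)})


# Illustrative input: the first line is from the log above, the second is a placeholder.
sample = (
    "amdgpu 0000:0a:00.0: amdgpu: PCIE atomic ops is not supported\n"
    "some unrelated kernel message\n"
)

print(devices_without_pcie_atomics(sample))  # prints ['0000:0a:00.0']
```

In practice the input would be the text of `dmesg` or `journalctl -k` saved to a file. An empty result only means the message was not logged, not that atomics are confirmed to work.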

This is after installing. By the way, it never worked before either, including with the wheel from here:

https://download.pytorch.org/whl/nightly/rocm7.1/torch-2.10.0.dev20251118%2Brocm7.1-cp311-cp311-manylinux_2_28_x86_64.whl

More digging:

Tell GitHub:
"ROCm misidentifies RX 9060 XT (RDNA4/gfx1200) as RDNA3/gfx1100, leading to architecture mismatch and system freezes. Also missing OpenCL compiler headers."

You can follow the story at ROCm/ROCm#5657.
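
To see which gfx target PyTorch itself reports, a check along these lines might help. It is a sketch under assumptions: the helper names are hypothetical, `gcnArchName` is only present on ROCm builds of PyTorch, and on an affected machine even querying the device could hang, so comparing against `rocminfo` output may be the safer route.

```python
def base_arch(arch_name):
    """Strip feature suffixes, e.g. 'gfx90a:sramecc+:xnack-' becomes 'gfx90a'."""
    return arch_name.split(":", 1)[0].strip().lower()


def arch_matches(reported, expected):
    """Compare two gfx target names while ignoring feature suffixes and case."""
    return base_arch(reported) == base_arch(expected)


def reported_arch(device_index=0):
    """Ask a ROCm build of PyTorch which gfx target it sees (needs torch and a GPU)."""
    import torch  # imported lazily so the helpers above work without PyTorch

    return torch.cuda.get_device_properties(device_index).gcnArchName


# The mismatch described above: the card is gfx1200, but gfx1100 is reported.
print(arch_matches("gfx1100", "gfx1200"))  # prints False
print(arch_matches("gfx1200", "gfx1200"))  # prints True
```

On the machine in question, `arch_matches(reported_arch(), "gfx1200")` returning `False` would be consistent with the misidentification described above.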
