
CUBLAS_STATUS_NOT_SUPPORTED when run pytorch==2.5.1+cu118 #53

Open
fbotp opened this issue Dec 12, 2024 · 4 comments
fbotp commented Dec 12, 2024

My graphics card is an AMD Radeon RX 6750 GRE 10GB (gfx1031), with AMD graphics driver 24.12.1 installed. The system is Windows 11 LTSC. I want to run deep learning code on this machine using the PyTorch library, so I did the following:

  1. Installed HIP SDK 6.1.2 and added %HIP_PATH%\bin to PATH.
  2. From ROCmLibs I downloaded rocm.gfx1031.for.hip.sdk.6.1.2.7z and replaced the library folder under %HIP_PATH%\bin\rocblas with the files from the archive (rocblas.dll went in the same location), then restarted the computer.
  3. Installed Miniforge 3 and created a Python environment with mamba create -n ai python=3.10. Installed torch with pip install torch --index-url https://download.pytorch.org/whl/cu118.
  4. Downloaded ROCm-6-ZLUDA, extracted it, and added the zluda directory to the PATH environment variable.
  5. Replaced cublas64_11.dll, cusparse64_11.dll, and nvrtc64_112_0.dll in miniforge3/envs/ai/Lib/site-packages/torch/lib with cublas.dll, cusparse.dll, and nvrtc.dll from zluda, respectively.
  6. Ran python -c "import torch; print(torch.cuda.is_available())", which printed True.
  7. The following content has been added at the beginning of the deep learning code:
torch.backends.cudnn.enabled = False
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)
  8. I loaded a BERT model with transformers and started training, but it fails. The error message is as follows:
Traceback (most recent call last):
File "C:\Users\rouxiaobei\Downloads\address-291b83f2838c047820c9456a3e7250f601b60328\address-291b83f2838c047820c9456a3e7250f601b60328\main.py", line 362, in <module>
main()
File "C:\Users\rouxiaobei\Downloads\address-291b83f2838c047820c9456a3e7250f601b60328\address-291b83f2838c047820c9456a3e7250f601b60328\main.py", line 293, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "C:\Users\rouxiaobei\Downloads\address-291b83f2838c047820c9456a3e7250f601b60328\address-291b83f2838c047820c9456a3e7250f601b60328\main.py", line 91, in train
outputs = model(**inputs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\rouxiaobei\Downloads\address-291b83f2838c047820c9456a3e7250f601b60328\address-291b83f2838c047820c9456a3e7250f601b60328\models.py", line 16, in forward
outputs = self.bert(input_ids, attention_mask, token_type_ids)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\transformers\models\bert\modeling_bert.py", line 1142, in forward
encoder_outputs = self.encoder(
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\transformers\models\bert\modeling_bert.py", line 695, in forward
layer_outputs = layer_module(
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\transformers\models\bert\modeling_bert.py", line 585, in forward
self_attention_outputs = self.attention(
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\transformers\models\bert\modeling_bert.py", line 515, in forward
self_outputs = self.self(
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\transformers\models\bert\modeling_bert.py", line 395, in forward
query_layer = self.transpose_for_scores(self.query(hidden_states))
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\miniforge3\envs\temp\lib\site-packages\torch\nn\modules\linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`
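
For anyone retracing steps 4 and 5, they amount to copying ZLUDA's DLLs into torch/lib under the CUDA file names that torch expects. A rough sketch of that step (the helper name and argument-style paths are illustrative, not part of ZLUDA; the file-name mapping is taken from the steps above):

```python
import shutil
from pathlib import Path

# Mapping from ZLUDA's DLL names to the CUDA names torch loads
# (taken from steps 4-5 above).
DLL_MAP = {
    "cublas.dll": "cublas64_11.dll",
    "cusparse.dll": "cusparse64_11.dll",
    "nvrtc.dll": "nvrtc64_112_0.dll",
}

def replace_zluda_dlls(zluda_dir, torch_lib_dir):
    """Hypothetical helper: copy each ZLUDA DLL over its CUDA counterpart."""
    zluda_dir, torch_lib_dir = Path(zluda_dir), Path(torch_lib_dir)
    copied = []
    for src_name, dst_name in DLL_MAP.items():
        src = zluda_dir / src_name
        if src.exists():
            # copy2 preserves timestamps; destination keeps the CUDA name
            shutil.copy2(src, torch_lib_dir / dst_name)
            copied.append(dst_name)
    return copied
```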

I want to know whether I made a mistake somewhere. This has troubled me for days and I have not been able to solve it. I am not sure whether it is related to the Windows 11 LTSC build I am using. When inspecting the DLLs with Dependencies, I noticed that ext-ms-win-oobe-query-l1-1-0.dll is reported missing, and I don't know whether that is why the code fails.

Sorry to bother you. Any suggestions would be greatly appreciated.
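
A note on the Dependencies observation: ext-ms-win-* stubs are often reported as missing even on healthy systems and are usually harmless false positives. A quicker sanity check is to try loading the replaced DLLs directly with ctypes; a minimal sketch (the helper name is made up):

```python
import ctypes

def try_load_dlls(paths):
    """Hypothetical check: attempt to load each shared library,
    mapping path -> True/False for whether it loaded."""
    results = {}
    for p in paths:
        try:
            ctypes.CDLL(str(p))  # raises OSError if the library can't load
            results[str(p)] = True
        except OSError:
            results[str(p)] = False
    return results
```

For example, passing the full paths of cublas64_11.dll and friends from torch/lib would reveal whether the renamed ZLUDA DLLs resolve at all, independent of torch.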

@lshqqytiger
Owner

torch>=2.4 is broken. The official release started to require cuBLASLt, which is not available in ZLUDA right now. It is available only in the Linux build, because we cannot build hipBLASLt on Windows.
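
A workaround in the meantime is to pin torch below 2.4. A small guard that checks a version string before running CUDA-dependent code is sketched below (the 2.4 cutoff comes from this comment; the function itself is hypothetical):

```python
def torch_version_ok_for_zluda(version_string):
    """Return True if this torch version predates the cuBLASLt
    requirement (introduced in 2.4, per the comment above)."""
    base = version_string.split("+")[0]       # drop local tag, e.g. "+cu118"
    major, minor = (int(p) for p in base.split(".")[:2])
    return (major, minor) < (2, 4)
```

So torch_version_ok_for_zluda("2.3.1+cu118") is True, while "2.5.1+cu118" is rejected.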


fbotp commented Dec 12, 2024

OK, after downgrading torch to version 2.3.1 it runs normally, though not as fast as expected. The same code is about 12.5% slower than on Ubuntu with ROCm installed, and there is a "1Torch was not compiled with flash attention" UserWarning, but it does not affect execution. I look forward to this project improving in the future.
Thank you again for your reply!

@lshqqytiger lshqqytiger self-assigned this Jan 3, 2025

Exodeadh commented Jan 9, 2025

Hey :) I see that in these latest releases you are working very hard and are adding hipBLASLt too.
Does this mean that, when the time comes, we'll be able to update PyTorch to 2.5.1+cu118?

Thank you very much for your work. Really.


lshqqytiger commented Jan 9, 2025

cuBLASLt is now on the dev branch. After #61 is merged, I'll upload a nightly build that includes cublasLt.dll. For now, you can download v3.8.5 with cuBLASLt or build from the dev branch. However, because AMD hasn't officially released hipBLASLt for Windows yet, you have to build it yourself or download an unofficial build from the Internet.
Additionally, although torch 2.5.1 now works, its performance is worse than 2.3.1 on powerful cards (gfx1100) because the rocBLAS Tensile libraries are better optimized than BLASLt.
