cuDNN, cuFFT, and cuBLAS Errors #62075
Hi @Ke293-x2Ek-Qe-7-aE-B , starting from TF 2.14, TensorFlow provides a CUDA package that installs all of the cuDNN, cuFFT, and cuBLAS libraries. You can use `tensorflow[and-cuda]`. Please try this command and let us know if it helps. Thank you!
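For reference, a sketch of that install (assuming pip on Linux; the quoting guards against shell glob expansion of the brackets):

```shell
# Installs TensorFlow plus the NVIDIA CUDA wheels (cuDNN, cuFFT, cuBLAS, ...)
# as pip dependencies -- no separate system-wide CUDA toolkit required.
python3 -m pip install 'tensorflow[and-cuda]'
```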
@SuryanarayanaY I did not know that it now comes bundled with cuDNN. I did install TensorFlow with the `[and-cuda]` extra, but I also installed the CUDA toolkit and cuDNN separately. I will try installing just the CUDA toolkit and then `tensorflow[and-cuda]`.
@SuryanarayanaY I tried several times, reinstalling Ubuntu, but it still doesn't work.
I also have the same issue, and it doesn't seem to be due to the CUDA environment, as I rebuilt CUDA and cuDNN to make them suit tf-2.14.0. This is the log output I found:
@AthiemoneZero Because it still does output a GPU device at the bottom of the log, I am training on GPU, just without cuDNN. It will be slower, but it is better than nothing or training on CPU.
Yeah. But I just found that when I downgrade to version 2.13.0, the registration errors no longer appear. It looks like this:
Although I haven't figured out how to solve the NUMA node error, I found some clues in another issue (I ran all of the above in WSL Ubuntu). This bug does not seem to be significant, according to an explanation on the NVIDIA forums. So I guess the registration errors may have something to do with the latest version, while the NUMA errors may be caused by the OS environment. I hope this information helps.
@AthiemoneZero I tried downgrading as well, but it didn't work for me. The NUMA errors occur (as stated in the error message) because the kernel provided by Microsoft for WSL2 is not built with NUMA support. I tried cloning the repo (here) and building my own kernel from source with NUMA support, but that didn't work, so I am just ignoring those errors for now.
@Ke293-x2Ek-Qe-7-aE-B I rebuilt everything in an independent conda environment for TF. My steps were to create a TF env with …
@AthiemoneZero Thanks for the instructions. I'll try them and see if it works on my system. I have been using …
@Ke293-x2Ek-Qe-7-aE-B I didn't execute …
But I did double-check the versions of CUDA and cuDNN. For that, I even downgraded them again and again.
@AthiemoneZero Usually, I would install the CUDA toolkit according to these instructions (here), then install cuDNN according to these instructions (here). I installed CUDA toolkit version 11.8 and cuDNN version 8.7, because they are the latest supported by TensorFlow according to its support table here. I guess using `[and-cuda]` installs all of that for you.
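A sketch of that kind of isolated setup (assumptions: conda-forge packages named `cudatoolkit` and `cudnn`, and the 11.8/8.7 pins from the compatibility table quoted above; adjust for your TF version):

```shell
# Create an isolated env and pin the CUDA/cuDNN versions that the
# TF tested-build table lists for TF 2.14 (CUDA 11.8, cuDNN 8.7).
conda create -n tf python=3.10
conda activate tf
conda install -c conda-forge cudatoolkit=11.8 cudnn=8.7
python3 -m pip install 'tensorflow==2.14.*'
```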
@Ke293-x2Ek-Qe-7-aE-B Apologies for my misunderstanding. I installed the CUDA toolkit the same way you described above before going directly to debugging TF GPU. I made sure my GPU and CUDA could perform well, since I had already run another CUDA task smoothly, just without TF. My concern is that some dependencies of TF have to be pre-installed in a conda env, and this might be handled by `[and-cuda]` (my naive guess).
@AthiemoneZero I always install CUDA toolkit and cuDNN globally for the whole system, and then install TensorFlow in a miniconda environment. This doesn't work anymore with the newest versions of TensorFlow, so I'll try your instructions. It does make sense to install everything in a conda env, I just hadn't thought of that since my other method had worked in the past. Thanks for sharing what you did to make it work.
@Ke293-x2Ek-Qe-7-aE-B You're welcome. BTW, I also followed the instructions to configure the development environment, including suitable versions of bazel and clang-16, before all my digging into the conda env.
@AthiemoneZero Thanks, but it didn't work.
Hello, I'm experiencing the same issue, even though I meticulously followed all the instructions for setting up CUDA 11.8 and cuDNN 8.7. The error messages I'm encountering are as follows: Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered. I've tried this with different versions of Python. Surprisingly, when I used Python 3.11, TensorFlow 2.13 was installed without these errors. However, when I used Python 3.10 or 3.9, I ended up with TensorFlow 2.14 and the aforementioned errors. I've come across information suggesting that I may not need to manually install CUDA and cuDNN, as `[and-cuda]` should handle the installation of these components automatically. Could someone please guide me on the correct approach to resolving this issue? I've tried various methods, but unfortunately none of them have yielded a working solution. P.S. I'm using conda in WSL 2 on Windows 11.
I am having the same issue as @FaisalAlj above, on Windows 10 with the same versions of CUDA and cuDNN. The package …
Edit 1: …
Edit 2: …
@belitskiy @ddunl Is there any update on the bug fix? |
I see the same warnings after installing tensorflow with …
I wasted some hours trying to figure this out, but later, after I gave up, I trained a Keras model and saw this among the later prints:
So does that mean it actually works, and we just get a confusing message at init?
Yes, that indicates it worked, @YanivZeg. But this is not 100% guaranteed; it depends on other users' system configs. Until the TF team fixes this, I am afraid we are all collectively going to keep wasting time. Sorry you wasted precious hours. TensorFlow team, any response?
I was running into this issue within the TensorFlow image tensorflow/tensorflow:2.18.0-gpu-jupyter.
Please help. I am using WSL Ubuntu. I installed CUDA 12 and cuDNN 12.6, and I created a virtual environment in my home directory. Everything works perfectly, but 3 warnings pop up:
• unable to register cuDNN factory
Can anyone help?
Sadly, the only version that doesn't show those warnings is …
Could you please explain what CUDA factory registration actually is?
I have no idea what those warnings mean; there are other issues reported on this GitHub repo that explain it, but I haven't read them. I do know that even when those warnings appear, the GPU is recognized and you can train models with it. I don't know whether other functionality is affected, or whether they indicate performance issues.
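If the warnings are indeed harmless, one way to hide them is TensorFlow's C++ log-level environment variable. A minimal sketch (assumption: whether level "3" hides these particular messages depends on the TF release, since they have been logged at different severities over time):

```python
import os

# TF_CPP_MIN_LOG_LEVEL controls TensorFlow's native (C++) logging:
# 0 = all messages, 1 = drop INFO, 2 = drop WARNING, 3 = drop ERROR.
# It must be set BEFORE `import tensorflow` to take effect.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

print(os.environ["TF_CPP_MIN_LOG_LEVEL"])  # → 3
```

Python-level warnings from TensorFlow are a separate channel and are not affected by this variable.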
I ran the classification problem (garment ID) in parallel on an internal machine installation (TensorFlow 2.18) and Colab (TensorFlow 2.17). I received warnings in 2.18 but not in 2.17. Otherwise, the results are essentially the same; in many cases, the prediction results from the internal 2.18 runs are significantly better.
I would appreciate this being fixed :(
What version are you using?
After testing, I concluded that the warnings were just warnings. I have ignored them, with performance on par with or better than Colab. Suboptimal use of GPU RAM is a potential problem, but so far, 12 GB of GPU RAM has provided plenty of headroom.
The message disappeared in tf-nightly 2.20.0.dev20250305. It can be installed with the following command: …
To install it together with PyTorch, use the command …
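For reference, a sketch of the nightly install (assumption: the plain `tf-nightly` wheel; pinning the exact dev build quoted above is optional, and the PyTorch-compatible variant from the comment is not reproduced here):

```shell
# Nightly TensorFlow build in which the registration messages are gone.
python3 -m pip install tf-nightly
```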
tensorflow 2.19, installed with poetry as …, still reports the same error, but the GPU is used and the code works.
Install this: pip install tf-nightly. This works absolutely fine with no error flags!
Thanks @lakshyaverma2414, I will be waiting for the 2.20 release version. I need a stable version.
… registration logs from INFO to VLOG(1), fully silencing them during normal usage. Upstream already reduced these from ERROR to INFO, but they still create unnecessary log noise when the XLA and GPU backends initialize. Since the duplicate registration is safe and expected, this change preserves visibility only for debugging sessions. Co-inspired by ChatGPT during a deep dive into TensorFlow's logging system. Fixes: tensorflow#62075
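A rough Python analogy of the change described (assumption: this is an illustration using the standard `logging` module, not the actual C++/XLA code; `DEBUG` stands in for `VLOG(1)`):

```python
import io
import logging

# Capture log output so the effect is visible.
stream = io.StringIO()
logger = logging.getLogger("plugin_registry")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)  # default verbosity

# Before the change: the duplicate-registration notice is emitted at INFO,
# so every import prints it.
logger.info("Unable to register cuDNN factory: already registered")

# After the change: the same notice at DEBUG (the rough analogue of
# VLOG(1)) is suppressed unless verbose logging is explicitly enabled.
logger.debug("Unable to register cuDNN factory: already registered")

print(stream.getvalue().count("Unable to register"))  # → 1
```

The message content is unchanged; only its severity drops below the default threshold, which is why it stays available for debugging sessions.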
tensorflow==2.15.0 → cuda-12.* (cuda-12.1, cuda-12.2)
Hi, I'm installing … I have a simple snippet which I run inside an image to test that TensorFlow works properly:
If I use version …; if I use version …
Can anyone tell me if this is now expected, or how to fix it?
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
source
TensorFlow version
GIT_VERSION:v2.14.0-rc1-21-g4dacf3f368e VERSION:2.14.0
Custom code
No
OS platform and distribution
WSL2 Linux Ubuntu 22
Mobile device
No response
Python version
3.10, but I can try different versions
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
CUDA version: 11.8, cuDNN version: 8.7
GPU model and memory
NVIDIA Geforce GTX 1660 Ti, 8GB Memory
Current behavior?
When I run the GPU test from the TensorFlow install instructions, I get several errors and warnings.
I don't care about the NUMA stuff, but the first 3 errors say that TensorFlow was not able to load cuDNN. I would really like to be able to use it to speed up training some RNNs and FFNNs. My GPU does appear in the list of physical devices, so I can still train, just not as fast as with cuDNN.
Standalone code to reproduce the issue
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Relevant log output