
Found no NVIDIA driver on your system #92

Closed
TobTobXX opened this issue Apr 8, 2024 · 12 comments
TobTobXX commented Apr 8, 2024

I'm trying to run this on a Linux server with an RTX 3060 12GB.

The server runs on NixOS and has the NVIDIA driver configured:

# ...
	nixpkgs.config = {
		allowUnfree = true;
		cudaSupport = true;
	};
	services.xserver.videoDrivers = [ "nvidia" ];
	hardware.nvidia = {
		nvidiaSettings = false;

		# Optionally, you may need to select the appropriate driver version for your specific GPU.
		package = config.boot.kernelPackages.nvidiaPackages.beta;
	};
# ...

And it seems to work:

[root@server:~]# nvidia-smi
Mon Apr  8 20:36:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:06:00.0 Off |                  N/A |
| 34%   41C    P0              34W / 170W |      1MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, when I run InvokeAI, it always chooses the CPU. And if I explicitly configure cuda or cuda:1 (what's the difference?), I get this error:

...
  File "/nix/store/f3iw0nk6bcx51mzzz6bqw6r0hvvfxyb7-python3.11-torch-2.0.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

What should I do?

MatthewCroughan (Member) commented:

@TobTobXX Use the NixOS module in the Flake and report back. Consider donating via GitHub Sponsors if you want documentation; it's one of the goals.

TobTobXX (Author) commented Apr 8, 2024

Nah, it still doesn't work:

{ pkgs, ... }:

{
	imports = [
		(builtins.getFlake "github:nixified-ai/flake").nixosModules.invokeai-nvidia
	];
	nixpkgs.config = {
		allowUnfree = true;
		cudaSupport = true;
	};
	nix.settings.trusted-substituters = ["https://ai.cachix.org"];
	nix.settings.trusted-public-keys = ["ai.cachix.org-1:N9dzRK+alWwoKXQlnn0H6aUx0lU/mspIoz8hMvGvbbc="];
	services.invokeai = {
		enable = true;
		settings = {
			host = "[::]";
			port = 9090;
		};
	};
}

I'll try to investigate further, but if you have any pointers, I'd be glad.

While I would like to contribute, I'm not in a situation to do so financially. However, I could very well work on expanding the documentation for you.

MatthewCroughan (Member) commented:

@TobTobXX If you're doing a lot of nixos-rebuild switches, make sure to reboot the system when messing with kernel modules. I'm not 100% sure, but it could also be that your driver is too new to interact with this codebase. This is where a VM with GPU passthrough could resolve the impurity and incompatibility, something I'd also like to provide as part of the flake. I see you're using CUDA 12, but I built this codebase with CUDA 11.

TobTobXX (Author) commented Apr 9, 2024

Ok, so I did some more tests and I think the problem is most likely the mismatch between the driver's CUDA version and torch's CUDA version.

Torch appears to be compiled with CUDA 11.8, as you hinted:

[root@server:~]# nix develop github:nixified-ai/flake#invokeai-nvidia

[root@server:~]# python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: NixOS 23.11 (Tapir) (x86_64)
GCC version: (GCC) 12.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.38

Python version: 3.11.6 (main, Oct  2 2023, 13:45:54) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.1.84-x86_64-with-glibc2.38
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 470.223.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

...

However, my driver reports CUDA version 12.3, as seen above. I tried downgrading the driver to version 470 (and rebooting, of course), but then I get CUDA version 11.4, which yields the same error. (Which driver version do you use?)

Is there a way to upgrade torch instead?

TobTobXX (Author) commented Apr 9, 2024

Apparently you really can't run pytorch with mismatched CUDA versions, even if the driver's version is higher: https://stackoverflow.com/a/76726156

MatthewCroughan (Member) commented:

That's really great to have found out, thank you for the research.

Perhaps we can set this up in the nixosModule, to either:

  • force the CUDA driver to match the one our torch is using
  • overlay such that torch is recompiled against your system's CUDA

Providing a GPU passthrough VM script or module is also possible, but then you have to run a VM.
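The second bullet could look roughly like the following overlay sketch (untested; it assumes the nixpkgs revision pinned in the flake lets torch be overridden with a different cudaPackages set, and that a cudaPackages_12 attribute exists there):

```nix
# Hypothetical sketch: rebuild torch against the system's CUDA 12
# packages instead of the CUDA 11 ones the flake was built with.
# Expect a very long local rebuild unless a binary cache provides it.
nixpkgs.overlays = [
  (final: prev: {
    python3 = prev.python3.override {
      packageOverrides = pyfinal: pyprev: {
        torch = pyprev.torch.override {
          cudaPackages = final.cudaPackages_12;
        };
      };
    };
  })
];
```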

TobTobXX (Author) commented Apr 9, 2024

(I'm new to nix, so correct me on anything I get wrong)

Option A (changing the CUDA driver):

Option B (changing torch):

  • Pro: Better cooperation with the rest of the system
  • Con: Pytorch takes aaaageees to compile (that's why I've landed here)

Aside from the build time, Option B appears to be the better option?

InvokeAI runs with pytorch==2.0.1 (see the log above). Is that pinned anywhere? I tried searching this repo and the InvokeAI repo, but didn't find any version information. The latest version would be 2.2.2.

  • pytorch 2.0.1 is only compatible with CUDA 11.7 and 11.8 (ref)
  • pytorch 2.2.2 is compatible with CUDA 11.8 and 12.1 (ref)
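The two compatibility notes above can be encoded as a tiny lookup table (illustrative only — it contains just the versions mentioned in this thread and is not an official PyTorch API):

```python
# Which CUDA toolkit versions each PyTorch release ships prebuilt
# binaries for, per the references above (partial matrix).
TORCH_CUDA_BUILDS = {
    "2.0.1": {"11.7", "11.8"},
    "2.2.2": {"11.8", "12.1"},
}

def has_cuda_build(torch_version: str, cuda_version: str) -> bool:
    """True if a prebuilt torch binary exists for this CUDA version."""
    return cuda_version in TORCH_CUDA_BUILDS.get(torch_version, set())
```

For example, has_cuda_build("2.0.1", "12.3") is False, which matches the combination that fails in this thread.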

MatthewCroughan (Member) commented:

@TobTobXX A third option is to fix the backwards compatibility in PyTorch if you have the C++/Python skills to do so.

https://docs.nvidia.com/deploy/cuda-compatibility/index.html

Yes, the torch version is specified in Nixpkgs.

user: matthew 🌐 swordfish in ~ took 37s 
❯ nix repl -L
Welcome to Nix 2.20.5. Type :? for help.

nix-repl> :lf github:nixified-ai/flake
Added 24 variables.

nix-repl> inputs.nixpkgs.legacyPackages.x86_64-linux.python3Packages.torch.version
"2.0.1"

TobTobXX (Author) commented Apr 9, 2024

Ooff, you mean backporting pytorch? No, I don't think I'm able to do that.

However, there is yet another option... waiting.

NixOS 24.05 isn't too far off, and in that channel pytorch should be 2.2.1 (currently in unstable).

The 23.11 channel is weird anyway, because its NVIDIA driver and pytorch are essentially incompatible. I think I'll drop a question about that to the CUDA maintainers.

TobTobXX (Author) commented Apr 9, 2024

By the way, which driver do you use? Why doesn't this occur for your GPU?

SomeoneSerge commented:

driver to match the one our torch is using

Pytorch doesn't (directly) link to the driver; instead, it uses the impure runpath (addDriverRunpath in the nixpkgs manual).

"Found no NVIDIA driver on your system"

Must mean libcuda.so simply wasn't found (in /run/opengl-driver/lib or through LD_LIBRARY_PATH). There's also a slight chance that the message is wrong and libcuda.so was found but didn't match the kernel module of the currently running system.

Start by verifying that /run/opengl-driver/lib/libcuda.so exists (e.g. I don't see hardware.opengl.enable in your snippet, so maybe it doesn't). Test whether simpler things like nvidia-smi and nix run -f '<nixpkgs>' --config '{ allowUnfree = true; }' cudaPackages.saxpy work. If errors persist, run the offending commands with the LD_DEBUG=libs environment variable set and publish the logs.
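The lookup order described above can be sketched roughly as follows (a hypothetical helper for illustration — the real search is performed by the dynamic loader, not Python):

```python
import ctypes.util
import os

def find_libcuda():
    """Roughly mimic where a Nix-built binary would find libcuda.so:
    the impure runpath /run/opengl-driver/lib first (populated by
    hardware.opengl.enable on NixOS), then LD_LIBRARY_PATH, then the
    loader's default search path."""
    candidates = ["/run/opengl-driver/lib/libcuda.so"]
    candidates += [
        os.path.join(d, "libcuda.so")
        for d in os.environ.get("LD_LIBRARY_PATH", "").split(":")
        if d
    ]
    for path in candidates:
        if os.path.exists(path):
            return path
    # Fall back to the default search; None here corresponds to the
    # situation where torch reports "Found no NVIDIA driver".
    return ctypes.util.find_library("cuda")
```

On the machine in this thread, the helper would return None until the driver's userspace libraries are exposed under /run/opengl-driver/lib.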

TobTobXX (Author) commented Apr 9, 2024

Thanks a lot for dropping in!

(e.g. I don't see hardware.opengl.enable in your snippet, so maybe it doesn't)

... sigh...
Do you know these times when you would like to smack your past self rather hard?

I'm terribly sorry for wasting all of your time. Thank you a lot.
The generation time just went down from 550s to 13s. You rock!
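For anyone landing on this issue later, the missing option in the original config was:

```nix
# Exposes the driver's userspace libraries (including libcuda.so)
# at /run/opengl-driver/lib, where Nix-built binaries look for them.
hardware.opengl.enable = true;
```

Note that on more recent NixOS releases this option was renamed to hardware.graphics.enable.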

TobTobXX closed this as completed Apr 9, 2024