RX 470/480/570/580/590 support #13
Comments
Come to think of it, maybe adding the flag to the rocm-runtime will help...
@414owen for difficult GPUs, there is an imperative step required beforehand. However, #7 bypasses this entirely on NixOS. It would be good if you could test #7 to see if it solves your problem. I don't have an AMD GPU, and haven't had the time to research a fundamental fix for this situation by modifying/patching the ROCm libraries; it's certainly possible, though.
Thanks @MatthewCroughan, I'll be able to test this out properly tomorrow afternoon :)
It looks like NixOS already has the
After following #2 (comment), I'm getting segfaults after loading a model with
Tail of
I'll see if I can dive a bit deeper tomorrow.
According to this:
The only references to
I suppose there's a reason we're using those prebuilt packages instead of nixpkgs'
Okay, a few updates. Setting:
in the
Maybe it would be worth adding an
Back to AMD: The
I've checked out
Because apparently
After that, I have to figure out why
And I'm not entirely sure where the other torch is coming from... Maybe it's because
I'm struggling to use
After switching, I get:
Does anyone know why? If I do
@414owen Nix implements isolation by patching all the references in programs to explicit store paths.
I was thinking of providing an
It's really a big problem upstream, and I'm aware of all these edge cases. I appreciate you looking into it; now you know about the dirty inner workings just like we do.
Oh no, I wasn't expecting the
In my working branch, I've removed the
That's what I want to get working, because it's a small step from there to enabling
I'm pretty sure I've tried that, and it segfaults. My expectation was that these issues would eventually be fixed upstream and that we could use a later Nixpkgs and everything would just work.
If you've tested some difference in torchvision and it works correctly, please tell me what revision of nixpkgs you're claiming works, because I can potentially test it then. nixified.ai is currently based on
As mentioned, I don't own an AMD GPU to solve this issue on. Have you got a Discord/Matrix you can reach me on, so I can try solutions with you?
I'm happy to have a call in a bit (I'm not free for the next hour or so), but the place I'm stuck at isn't really AMD-specific. If you can figure out why this diff breaks with dependency errors, then I think getting these AMD cards working will be a tractable problem. I've made the change to the Nvidia version for you :) I'll send you an email about a possible call, too.
I've done some experimenting and I've found the file in which PyTorch hardcodes the
@deftdawg The point of this repository is to avoid using Podman/Docker or container technologies, or running unreproducible steps like the ones you just posted. When something is fixed in this repository, it is fixed for everyone at the same time, and when something is broken, it is broken for everyone. In your example, step 5 is not going to work for everyone depending on the time of day that they run it at, and step 4 is going to produce different results every single time you perform the task. Step 2 depends on the host kernel version. And suggesting people pull a 38GB Ubuntu container is the antithesis of what this repository and project is about. Please don't give unreproducible instructions to people.
@MatthewCroughan deleted per your wishes. All the best in your efforts to fix this for AMD. As for me, I need something that works today, not something that may work in a perfectly reproducible way at some point in the future. To each his own.
@deftdawg As I said, when it works, it will continue to work forever and will not stop working for spooky reasons; that's what Nix does. If you want something that works today, please contribute Nix code. I don't own an AMD GPU, and per reports Nixified.Ai does work on the majority of AMD GPUs, just not your specific one at the moment. Let this fire put itself out, and either wait or figure out the cause. I'm pretty sure this is going to magically fix itself when we upgrade Nixpkgs at some point.
I have the same issue with a segfault / core dump at the same point, without any other error messages. Although for some reason, the first time I run it after a reboot, there's this line as well before the core dump:
Indeed "fixes" that, but as @414owen said, it's CPU-only. On my Ryzen 3600X my first test image took 643 seconds, so that's not really viable for me though :D
There seem to be newer versions of e.g. ROCm than the ones in the flake. I tried to update the inputs myself, but that seems to cause some problems with finding the right nixpkgs revision with the right versions of pytorch... I don't have time right now to get deeper into it, but I might try again some time.
I feel like I'm soo close (but I'm probably still soo far away...). I got it to compile with the nixpkgs version of ROCm + torch (torch took forever to build, though), and when I enter a develop shell and ask
I found this issue, which also talks about naive_conv.cpp, but they actually get a useful compilation error message: ROCm/ROCm#1889.
@yboettcher nice work, thanks! I'll definitely have to play around with this on my 580. In the meantime: it seems like the compiler needs a flag set for older GPUs: https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/hip_rtc.html#hiprtc-specific-options Do you think that flag could be added via
Just grepping through the pytorch sources (for nvrtcCompile), it does not look like the options passed to the runtime compiler can be changed. Then again, I'm not familiar with how torch handles all of this. |
I made a small test program to see how to get hiprtc to compile stuff for my RX 480, and it indeed worked with "--gpu-architecture=gfx803", as you suggested. Since I see no way of injecting this argument into an already-built version of pytorch, or via CMake options, I made a patch that just adds that argument every time nvrtcCompileProgram is called (a sketch of this kind of hiprtc check follows below).
I'm still getting the same error, just with a bit more info than before (so there is some effect).
I'll try to dig a bit more later.
Edit:
Edit2:
Looks like I did manage to convince it to build for gfx803, but it fails for another reason. In fact, this appears to be an issue with how MIOpen/ROCm is packaged (same error as in the issue I linked in #13 (comment)).
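Not the actual test program or patch from the comments above, just a minimal sketch of the kind of hiprtc check described there, assuming ROCm's hiprtc headers are installed and hipcc is used to build; the kernel source, file names, and build command are made up for illustration.

```cpp
// Rough sketch only: ask hiprtc to compile a trivial kernel for gfx803, the
// ISA used by Polaris cards such as the RX 470/480/570/580/590.
// Assumed build command: hipcc -o hiprtc_gfx803_test hiprtc_gfx803_test.cpp
// (some ROCm releases may also need -lhiprtc).
#include <hip/hiprtc.h>

#include <cstdio>
#include <vector>

static const char* kKernelSrc = R"(
extern "C" __global__ void dummy(float* out) { out[0] = 1.0f; }
)";

int main() {
  hiprtcProgram prog;
  if (hiprtcCreateProgram(&prog, kKernelSrc, "dummy.cu", 0, nullptr, nullptr) !=
      HIPRTC_SUCCESS) {
    std::fprintf(stderr, "hiprtcCreateProgram failed\n");
    return 1;
  }

  // The flag under discussion: explicitly target the Polaris ISA.
  const char* opts[] = {"--gpu-architecture=gfx803"};
  hiprtcResult res = hiprtcCompileProgram(prog, 1, opts);

  // Dump the compile log so any failure is visible.
  size_t logSize = 0;
  hiprtcGetProgramLogSize(prog, &logSize);
  if (logSize > 1) {
    std::vector<char> log(logSize);
    hiprtcGetProgramLog(prog, log.data());
    std::fprintf(stderr, "compile log:\n%s\n", log.data());
  }

  std::printf("hiprtcCompileProgram: %s\n",
              res == HIPRTC_SUCCESS ? "success" : "failure");
  hiprtcDestroyProgram(&prog);
  return res == HIPRTC_SUCCESS ? 0 : 1;
}
```

If something like this compiles cleanly with the flag while PyTorch still fails, that is consistent with the option list being hardcoded inside PyTorch's runtime-compilation wrapper, which is what the patch described above works around.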
👋 Hi. I'm a NixOS user with a Radeon RX 590.

I get the error:

"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"

This GPU isn't exactly top-of-the-line, but people have managed to run Stable Diffusion et al. on it. The process, documented here, and a bit here, seems to involve building ROCm with the ROC_ENABLE_PRE_VEGA flag.

Then again, according to this issue, other OSes have patched the ROCm packages, so maybe this is an issue for nixpkgs.

Any tips/insights welcome. Does anyone else have this issue?
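For anyone else landing here: before digging into PyTorch, it can be useful to check what ISA the HIP runtime itself reports for the card, independently of torch. The following is only a rough diagnostic sketch, not part of nixified.ai; it assumes hipcc is available and that the installed ROCm exposes the gcnArchName field of hipDeviceProp_t.

```cpp
// Rough diagnostic sketch: list the GPUs the HIP runtime can see and the ISA
// (gcnArchName) it reports for each. A Polaris card (RX 470-590) should show
// up as gfx803; hipErrorNoBinaryForGpu means none of the loaded code objects
// were built for that ISA.
// Assumed build command: hipcc -o list_hip_devices list_hip_devices.cpp
#include <hip/hip_runtime.h>

#include <cstdio>

int main() {
  int count = 0;
  hipError_t err = hipGetDeviceCount(&count);
  if (err != hipSuccess) {
    std::fprintf(stderr, "hipGetDeviceCount failed: %s\n", hipGetErrorString(err));
    return 1;
  }
  std::printf("HIP devices: %d\n", count);
  for (int i = 0; i < count; ++i) {
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, i) != hipSuccess) {
      continue;
    }
    std::printf("  [%d] %s (arch: %s)\n", i, prop.name, prop.gcnArchName);
  }
  return 0;
}
```

If the device shows up as gfx803 here but PyTorch still throws hipErrorNoBinaryForGpu, the missing piece is code objects built for gfx803 in the ROCm/MIOpen/PyTorch binaries, which is what the comments above are chasing.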