MultiGPU support #184

Open: wants to merge 2 commits into base: main

MrReclusive

This adds device selection to several nodes so that those with multiple GPUs can set nodes to run on specific CUDA device(s).

@kijai
Owner

kijai commented Dec 23, 2024

Thank you for this, but I'm still hesitant to merge it as I can't test it myself, and I don't know what happens for non-CUDA users when the device selection is populated like this. Also, it should not be a required input, as that will force everyone to remake the node in old workflows. I would rather have it as either a separate node or, maybe best, an optional device selection input on the nodes, so it won't have any effect for everyone not using it, which is still the vast majority of users.
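
One way to address the non-CUDA concern is to build the device list so that it degrades to an empty list instead of failing. This is only a minimal sketch: `list_cuda_devices` is a hypothetical helper name, not something taken from this PR, and it assumes the only goal is to enumerate CUDA device names.

```python
def list_cuda_devices():
    """Return selectable CUDA device names, or [] when CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [f"cuda:{i}" for i in range(torch.cuda.device_count())]


devices = list_cuda_devices()
print(devices)
```

On a machine without CUDA the list is empty, so a node could skip offering the selection instead of raising while the graph loads.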

@MrReclusive
Author

I know I need to update this to work with the video enhancer and with what was changed today (I haven't looked yet).
I've been thinking about the issue with non-CUDA users myself, and about those who don't have multiple GPUs.

Right now, with 1 GPU it only lists 1 device, which is not a big deal.
But I have been thinking of another way to do this so it doesn't affect anyone not using multiple GPUs or not running CUDA.
I was leaning towards the optional input, plus a standalone GPU selection node that would connect to it.
Each node would then just wrap device = mm.get_torch_device() in an if, so that line still runs when no input is provided.
Does that sound okay?

@kijai
Owner

kijai commented Dec 24, 2024

Yeah, exactly what I was thinking. It's the most non-invasive way to add it that I can think of. However the given node currently chooses the device shouldn't change, and when the optional input is given it would just override it.
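
The override-with-fallback behaviour agreed on here can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `resolve_device` and `fake_default` are hypothetical names, devices are plain strings for simplicity, and inside ComfyUI the default callable would be the node's existing `mm.get_torch_device` call.

```python
from typing import Callable, Optional


def resolve_device(override: Optional[str], default_device: Callable[[], str]) -> str:
    """Use the optional override when connected, else the node's existing default."""
    if override is None:
        return default_device()
    return override


# Stand-in for mm.get_torch_device so the sketch runs outside ComfyUI.
def fake_default() -> str:
    return "cuda:0"


print(resolve_device(None, fake_default))      # nothing connected: default path
print(resolve_device("cuda:1", fake_default))  # selector connected: override wins
```

Because the default is only called when no override is given, existing workflows keep exactly the behaviour they had before.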

@MrReclusive
Author

Alright, will do. I'll have it ready in a day or so.
Thanks!

@zazoum-art

I downloaded your repo, erased kijai's, and opened your Civitai workflow, but queueing gives an error that the cuda: value doesn't belong to the input's list of options. If I use Fix, Fix v2, or reload, it returns to kijai's normal nodes. I have left the custom nodes git folder with its original name.

I have a 4090 and a 3060 and can't test it with your civit workflow for the above reasons.

@zazoum-art

zazoum-art commented Dec 24, 2024

@MrReclusive In the Manager there is a set of multi-GPU nodes by pollockjj. Should I use that? The nodes detect my GPUs normally, but I have to re-create your workflows. Maybe put a README notice in your repo with a bold "FOR TESTING ONLY" and post safe steps on how to test it.

EDIT: OK, I found it. I downloaded the zip and didn't notice the missing files!
Gonna test the HECK out of it!!!
This is very important, especially before the 5090 and i2v by Tencent.

@kendrick90

I have a 4090 and quadro M6000 I can test with when the next update with the separate optional node is ready. If it's helpful let me know.

@Subarasheese

Subarasheese commented Dec 28, 2024

I have a dual 3090 setup and I am getting this:

Exception Message: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

Will this only work on newer Ada arch cards? Thank you
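
The rule quoted in the exception can be written down as a small check. This is a sketch that only mirrors the wording of the error message; `supports_scaled_mm` is a hypothetical helper, not a PyTorch function. An RTX 3090 (Ampere) reports compute capability 8.6 and an RTX 4090 (Ada) reports 8.9.

```python
def supports_scaled_mm(major: int, minor: int) -> bool:
    """Mirror the rule in the error text: capability 8.9, or 9.0 and above."""
    return (major, minor) == (8, 9) or (major, minor) >= (9, 0)


print(supports_scaled_mm(8, 6))  # RTX 3090 (Ampere)
print(supports_scaled_mm(8, 9))  # RTX 4090 (Ada)
print(supports_scaled_mm(9, 0))  # first capability covered by ">= 9.0"
```

With PyTorch installed, `torch.cuda.get_device_capability(index)` returns the `(major, minor)` pair for a given card, so the same check can be run per device.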

@MrReclusive
Author

MrReclusive commented Dec 31, 2024

Hey everyone, I have the new setup coded as optional so it doesn't interfere with existing setups. I'm just going through other nodes I don't use to see if they benefit from it too, and testing what can and can't be moved (expecting same-device errors).
I have been taking some time off for the holidays, so I haven't been at my computer much.

I can't test anything besides the 4xxx series of cards, as I gave away all my 3xxx series cards last year, but nothing I've changed in the code should affect the ability to use other cards if they already work in this video wrapper.
All it's doing is telling it what device to use.

As for the pollockjj multigpu, I looked at it. It's based on neuratech-ai's multigpu code, which is what I based this on as well.

I'll have the new fork and new template up tomorrow; I just need to remove all my other custom nodes from the template I use. (I've been digging way into custom prompting on this.)

image

@MrReclusive MrReclusive reopened this Jan 1, 2025
@MrReclusive
Author

Fully up to date.
All device selection is optional with defaults, so it is ignored if the CUDA device selector isn't connected.
Text encoder example:
image
Full example that I'll be uploading to Civitai:
Screenshot 2024-12-31 232121
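
For readers without the screenshots, the selector-plus-optional-input layout described above might look roughly like this in ComfyUI's node convention. Everything here is a hypothetical sketch and not the PR's code: the class names, the `DEVICE` socket type, and the hard-coded device list are placeholders, and the encode step only returns a string so the example runs outside ComfyUI.

```python
class DeviceSelectorSketch:
    """Hypothetical standalone node that outputs a device name."""

    @classmethod
    def INPUT_TYPES(cls):
        # A list as the first tuple element is shown as a dropdown.
        # Placeholder values; a real node would enumerate devices.
        return {"required": {"device": (["cuda:0", "cuda:1"],)}}

    RETURN_TYPES = ("DEVICE",)  # hypothetical custom socket type
    FUNCTION = "select"
    CATEGORY = "sketch"

    def select(self, device):
        return (device,)


class TextEncodeSketch:
    """Hypothetical consumer node; the device socket is optional."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {"prompt": ("STRING", {"default": ""})},
            # Inputs under "optional" may be left unconnected.
            "optional": {"device": ("DEVICE",)},
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "encode"
    CATEGORY = "sketch"

    def encode(self, prompt, device=None):
        # device stays None when the selector is not connected.
        used = device if device is not None else "default"
        return (f"{prompt} @ {used}",)


(picked,) = DeviceSelectorSketch().select("cuda:1")
print(TextEncodeSketch().encode("a cat"))
print(TextEncodeSketch().encode("a cat", device=picked))
```

Keeping the socket under "optional" is what lets old workflows load unchanged, since no existing node gains a required input.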

@MrReclusive
Author

I am also working on an image splitter that accounts for the required frame counts when splitting. The purpose is to render the initial video with a high frame count at low resolution, then split it into 2/3/4 separate batches so you can upscale. If anyone is interested in that, I'll be posting it soon.
I thought about doing it at the latent level, but that first latent confuses me; this 4*X + 1 is becoming the bane of my existence.
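
The 4*X + 1 constraint on frame counts can be illustrated with a few helpers. This is only a sketch of one possible splitting policy, not the splitter being described: the function names are hypothetical, batches are given equal length, and leftover frames are simply dropped.

```python
from typing import List


def is_valid_frame_count(n: int) -> bool:
    """True when n has the form 4*X + 1 (1, 5, 9, ...)."""
    return n >= 1 and (n - 1) % 4 == 0


def snap_down(n: int) -> int:
    """Largest count of the form 4*X + 1 that does not exceed n."""
    if n < 1:
        raise ValueError("need at least one frame")
    return 4 * ((n - 1) // 4) + 1


def split_frames(total: int, parts: int) -> List[int]:
    """Equal batch lengths, each 4*X + 1; frames that do not fit are dropped."""
    if parts < 1 or total < parts:
        raise ValueError("cannot split into that many parts")
    return [snap_down(total // parts)] * parts


print(split_frames(97, 2))
print(split_frames(129, 3))
```

A real splitter might prefer to overlap batches or pad the last one instead of dropping frames; that choice is outside this sketch.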

Updated with new sampler stuff.
@zazoum-art

@MrReclusive
Your civitai workflow works like a charm with 4090 + 3060 (clip+LLM fit in 3060). Can you explain this latest change [81d87dc]? Is it stable?

@zazoum-art

zazoum-art commented Jan 6, 2025

I gave it a try with o1 and o1 pro mode on the Pro plan to combine inference in a sampler (both GPUs working in parallel, NOT sequentially). I have no idea what I am doing; o1 explains it to me and makes suggestions. From what I understood, it "breaks the latents"? I just do the debugging. It's a living hell, just as o1 told me it would be.
I posted the WHOLE of kijai's wrapper to it, one file at a time. You can only upload images to o1, so yes, I did that. Is it OK license-wise, @kijai?
I am taking my specific setup as a constant.

@MrReclusive
Author

My last change was an attempt to make this compatible again. I injected what they changed on the main build, but that didn't work; it didn't like that. It is completely stable, and I'm running on it now. It just reports conflicts with the main branch here, which prevents pulling.
I will need to reset my fork again and put all my code back in.

@MrReclusive
Author

Slightly confused. Are you trying to run 2 GPUs on 1 sampler? Without proper NVLink or the model being coded specifically for it, I don't think that will ever work. Even if it did, constantly transferring data across PCIe lanes so the cards can work together would be a huge performance hit, probably too much of a hit to make it viable.
This setup only allows putting different parts on different GPUs, so encoding/decoding runs on a separate GPU from the sampler and model, mainly to prevent constant unloading/loading of models when you have a multi-GPU setup.

@MrReclusive
Author

To clarify,
it added the new DPM scheduler to the sampler (amazing, by the way: so much clearer, much better contrast)
and the TeaCache (I haven't had much luck with that yet. Yeah, it can make the flow match scheduler faster, but at a cost to quality, and it doesn't work with DPM, or at least it didn't last night. It looks like they made more changes today.)

@zazoum-art

What you are describing is exactly what o1 pro mode warned me about. But it was just a thought of mine: now that on 30 January we'll have 2 5090s in a setup, both connected the same way over PCIe, maybe we could split the latents (into "layers", as o1 calls them) and run inference WITHOUT NVLink, passing data back and forth.
Just an amateurish thought.

You should make branches in the pull repo so that we know which branch works and which is experimental. Then we copy and paste as needed and we are done.

@MrReclusive
Author

Oddly enough, I'm wanting to go the other way and pick up a few refurbished 3090s, since they had NVLink, which allows memory pooling. So if Python and torch can see it right, 2 would give you a 48GB card and 4 would give me a 96GB card. Considering the 5090s are, I'm sure, going to be extremely expensive, I think for a while I'd rather buy a few 3090s at $1100 each, given that a single 5090 is probably going to be close to $2500.

@zazoum-art

zazoum-art commented Jan 7, 2025

Why go so big? For resolution? You can spend 300 euros on Topaz and produce the recommended resolution. You don't need to make something like 8K out of Comfy.
I don't fully understand.


@MrReclusive
Author

But to continue on the idea of doing this without NVLink and pooling cards on a single sampler: internally the 4090 runs at 1000GB/s to do what it does. PCIe Gen 5's total bandwidth over x16 is 128GB/s, and considering it would need to constantly transfer data, that's only 64GB/s each way. On top of that, 99% of consumer motherboards and/or processors are not capable of running 2 x16 slots at full speed. As soon as you add a second card, the most you will get is x8, so you're down to 32GB/s, and at this point you're probably maxing out your CPU and using 128GB of RAM.

The only way this becomes viable is on server hardware or Threadripper, but if you can afford all that, buy a couple of RTX A6000s, a 48GB card with NVLink, so with 2 you have 96GB and it's a lot cheaper than buying an H100. Or even go A4500, a 20GB card with NVLink.
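
The bandwidth arithmetic in this comment can be checked in a few lines. The figures are the round numbers quoted in the comment, not measurements, and the helper name is hypothetical.

```python
ON_CARD_GBPS = 1000.0         # on-card figure quoted for the 4090
PCIE5_X16_TOTAL_GBPS = 128.0  # quoted PCIe Gen 5 x16 total, both directions


def per_direction(total_gbps: float, lanes: int, full_lanes: int = 16) -> float:
    """One-way bandwidth when only `lanes` of the slot's lanes are active."""
    return total_gbps / 2 * lanes / full_lanes


x16_one_way = per_direction(PCIE5_X16_TOTAL_GBPS, 16)
x8_one_way = per_direction(PCIE5_X16_TOTAL_GBPS, 8)
slowdown = ON_CARD_GBPS / x8_one_way

print(x16_one_way, x8_one_way, slowdown)
```

Under these numbers, a one-way x8 link is roughly 31 times slower than the on-card figure, which is the gap the comment is pointing at.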

@MrReclusive
Author

I have Topaz; upscaling is crap compared to doing this natively. The problem is balancing a good starting resolution and length. Yeah, I can get an amazing 2 second clip that can be upscaled as much as I want. Take a look at Civitai: people are only doing this in vertical because getting good scaling in horizontal is problematic. Yeah, I can run VFI and get it to 4 seconds, but now it's all in slow motion. The need comes from me not doing this as just something to play with, or to make 2 to 4 second "fun" clips; I'm trying to actually use this for something more.
Right now I am running 4 to 5 second clips at 768x320 (2.39:1) without VFI, and it's really the limit of the 4090. I can upscale and it looks okay if the faces are close to the camera, but if they're not, the faces are just distorted, and that's how they come out. It's hard to upscale garbage, just like how, until Flux, people far from the camera always looked like garbage.

And as for the price of the 5090, let me know if you can even find a 4090 at MSRP ;) It never actually made it to its MSRP of $1599; the 2 I have I paid $1850 for nearly 2 years ago.

@zazoum-art

zazoum-art commented Jan 7, 2025

Question about your pull: does your sampler have multi-GPU (DP/DDP) support for processing too? Your description of the 3090 NVLink approach got me thinking.

Just for info: I am chatting with GPT-o1 at the same time so it can explain things to me. Personally, I have no idea about this stuff :P

@MrReclusive
Author

I am not aware of any sampler that supports that, and no this one doesn't either.

5 participants