MultiGPU support #184
base: main
Conversation
Thank you for this, but I'm still hesitant to merge it as I can't test it myself, and I don't know what happens for non-CUDA users when trying to populate device selection like this. Also, it should not be a required input, as that would force everyone to remake the node in old workflows. I would rather have it as either a separate node, or, probably best, an optional device selection input on the nodes, so it has no effect for everyone not using it, which is still the vast majority of users.
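For anyone following along, this is roughly what "populating device selection" tends to look like, with a fallback for the non-CUDA case raised above. A hypothetical sketch, not code from this PR:

```python
import torch

def get_device_list():
    # Hypothetical helper (not from this PR): enumerate selectable devices
    # for a combo input. Degrades gracefully when CUDA isn't available,
    # which is the concern for non-CUDA users.
    devices = ["default"]
    if torch.cuda.is_available():
        devices += [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    return devices
```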
I know I need to update this to work with the video enhancer and with what was changed today (haven't looked yet). Right now, with 1 GPU it just lists 1 device, not a big deal.
Yeah, exactly what I was thinking; it's the most non-invasive way to add it I can think of. Whatever way the given node currently chooses the device shouldn't change, and when the optional input is given, it would simply override it.
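A minimal sketch of that optional-input pattern, reusing the hypothetical `get_device_list()` from the snippet above (the class and its fields are illustrative, not the wrapper's actual nodes):

```python
import torch

class SomeVideoNode:
    # Hypothetical node sketch: "device" lives under "optional", so old
    # workflows that never set it keep working unchanged.
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {"model": ("MODEL",)},
            "optional": {"device": (get_device_list(),)},
        }

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "process"

    def process(self, model, device="default"):
        # Keep whatever device the node would normally pick; only override
        # when the optional input was wired up with a concrete selection.
        if device != "default":
            # assumes the underlying model object supports .to()
            model = model.to(torch.device(device))
        return (model,)
```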
Alright, will do. I'll have it ready in a day or so.
I downloaded your repo, erased kijai's, and opened your Civitai workflow, but queueing gives an error that cuda: doesn't belong in the input's option group. If I use Fix or Fix v2, or reload, it returns to kijai's normal nodes. I have left the custom node's git folder at its original name. I have a 4090 and a 3060 and can't test your Civitai workflow for the above reasons.
@MrReclusive There is a multi-GPU node pack by pollockjj in the Manager. Should I use that? The nodes normally detect my GPUs, but I have to re-create your workflows. Maybe put a notice README in your repo with a bold "FOR TESTING ONLY" and post safe steps on how to test it. EDIT: OK, I found it; I downloaded the zip and didn't notice the missing files!
I have a 4090 and a Quadro M6000 I can test with when the next update with the separate optional node is ready. If that's helpful, let me know.
I have a dual 3090 setup and I am getting this:
Will this only work on newer Ada architecture cards? Thank you
Hey everyone, I have the new setup coded as optional so it doesn't interfere with the existing setup. I'm just going through other nodes I don't use to see if they benefit from it too, and testing what can and can't be moved (expecting same-device errors). I can't test anything besides the 4xxx series of cards, as I gave away all my 3xxx series cards last year, but nothing in the code I've changed should affect the ability to use other cards if they already work in this video wrapper. As for pollockjj's multi-GPU pack: I looked at it, and it's based on neuratech-ai's multi-GPU code, which is what I based this on as well. I'll have the new fork and new template up tomorrow; I just need to remove all my other custom nodes from the template I use (I've been digging deep into custom prompting on this).
I am also working on an image splitter that accounts for the required frame count when splitting: render the initial video with a high frame count at low resolution, then split it into 2/3/4 separate batches so you can upscale each one. If anyone is interested, I'll be posting that soon; a rough sketch of the idea is below.
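A sketch of what "accounting for the required frames" could mean, assuming the model wants batches of 4k+1 frames (my assumption; the exact constraint isn't stated here). Names and edge handling are simplified:

```python
import torch

def split_frames(frames: torch.Tensor, parts: int) -> list[torch.Tensor]:
    # Hypothetical splitter: divide an (N, H, W, C) image batch into
    # `parts` chunks whose lengths each satisfy a 4k+1 frame-count
    # constraint. Leftover tail frames are simply dropped in this
    # simplified version; a real splitter might overlap chunks instead.
    n = frames.shape[0]
    base = n // parts
    chunks, start = [], 0
    for _ in range(parts):
        length = base - ((base - 1) % 4)  # largest 4k+1 <= base
        chunks.append(frames[start:start + length])
        start += length
    return chunks
```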
Updated with new sampler stuff.
@MrReclusive
I gave it a try with o1 and o1-pro-mode on the Pro plan to combine inference in one sampler (both GPUs working in parallel, NOT sequentially). I have no idea what I am doing; o1 explains it to me and makes suggestions. From what I understood, it "breaks the latents"? I just do the debugging. It's a living hell, just as o1 told me it would be.
My last change was to try to make this compatible again. I injected what they changed in the main build; that didn't work, it didn't like that. It is completely stable, and I'm running on it now; it just reports conflicts with the main branch here that prevent pulling.
Slightly confused: are you trying to run 2 GPUs on 1 sampler? Without proper NVLink, or the model being coded specifically for it, I don't think that will ever work. Even if it did, constantly transferring data across PCIe lanes to let the cards work together would be a huge performance hit, probably too much of a hit to make it viable.
To clarify,
What you are describing is exactly what o1 pro mode warned me about. But it was just a thought of mine: now that on January 30 we'll have two 5090s in a setup, both connected the same way over PCIe, maybe we could split the latents (into "layers", as o1 calls them) and run inference WITHOUT NVLink, with data going back and forth.
You should make branches in the PR repo, so that we know which branch works and which is experimental; we copy-paste as needed and we are done.
Oddly enough, I'm wanting to go the other way and pick up a few refurbished 3090s, since they had NVLink, which allows memory pooling. If Python and torch can see it right, 2 would give you a 48GB card and 4 would give me a 96GB card. Considering the 5090s are surely going to be extremely expensive, I think for a while I'd rather buy a few 3090s at $1100 each, given a single 5090 will probably be close to $2500.
Why go so big? For resolution? You can spend 300 euros on Topaz and produce the recommended res. You don't need to make, like, 8K out of Comfy.
But to continue on the idea of doing this without NVLink, pooling cards on a single sampler: internally, the 4090 runs at about 1000GB/s to do what it does. PCIe gen 5's total bandwidth over x16 is 128GB/s, and considering it would need to constantly transfer data, that's only 64GB/s each way. On top of that, 99% of consumer motherboards and/or processors are not capable of running 2 x16 slots at full speed; as soon as you add a second card, the most you will get is x8, so you're down to 32GB/s, and at that point you're probably maxing out your CPU and using 128GB of RAM. The only way this becomes viable is on server hardware or a Threadripper, but if you can afford all that, buy a couple of RTX A6000s: a 48GB card with NVLink, so with 2 you have 96GB, and it's a lot cheaper than buying an H100. Or even go A4500, a 20GB card with NVLink.
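The same napkin math in one place (figures as quoted in the comment above, not measurements):

```python
# Rough bandwidth gap between on-card memory and a shared PCIe link
gpu_internal_gbps = 1000                    # ~4090 on-card bandwidth, GB/s
pcie5_x16_total = 128                       # GB/s, both directions combined
per_direction_x16 = pcie5_x16_total / 2     # 64 GB/s each way
per_direction_x8 = per_direction_x16 / 2    # 32 GB/s once the slot drops to x8

print(f"on-card vs PCIe x8 each-way: {gpu_internal_gbps / per_direction_x8:.0f}x gap")
```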
I have Topaz; upscaling is crap compared to doing this natively. The problem is balancing a good starting resolution and length. Yeah, I can get an amazing 2-second clip that can be upscaled as much as I want. Take a look at Civitai: people are only doing this in vertical because getting good scaling in horizontal is problematic. Yeah, I can run VFI and get it to 4 seconds, but now it's all in slow motion. The need comes from me not treating this as just something to play with or to make 2-to-4-second "fun" clips; I'm trying to actually use this for something more. And as for the price of the 5090, let me know if you can even find a 4090 at MSRP ;) it never actually made it down to its MSRP of $1599; the 2 I have, I paid $1850 each for nearly 2 years ago.
Question about your pull: does your sampler have multi-GPU (DP/DDP) support for processing too? You got me thinking with the 3090 NVLink approach you described. Just for info: at the same time, I am chatting with GPT-o1 to have it explain things to me. Personally, I have no idea about this stuff :P
I am not aware of any sampler that supports that, and no, this one doesn't either.
This just adds device selection to multiple nodes for those with multiple GPUs, so we can set nodes to run on specific CUDA device(s).
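As a toy illustration of what per-node device selection enables (stand-in nn.Linear models, not this repo's code), note the explicit hop when tensors cross cards:

```python
import torch
import torch.nn as nn

# Place two pipeline stages on different cards when a second GPU exists.
text_encoder = nn.Linear(8, 8).to("cuda:1" if torch.cuda.device_count() > 1 else "cpu")
transformer = nn.Linear(8, 8).to("cuda:0" if torch.cuda.is_available() else "cpu")

# Data moving between stages must be transferred between devices explicitly:
emb = text_encoder(torch.zeros(1, 8, device=text_encoder.weight.device))
out = transformer(emb.to(transformer.weight.device))
```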