MultiGPU support #184
base: main
Conversation
Thank you for this, but I'm still hesitant to merge it as I can't test it myself, and I don't know what happens for non-CUDA users when trying to populate device selection like this. Also, it should not be a required input, as that would force everyone to remake the node in old workflows. I would rather have it as either a separate node, or, probably best, an optional device selection input on the nodes, so it has no effect for everyone not using it, which is still the vast majority of users.
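For anyone following along, this is roughly what "populating device selection" tends to look like, with a fallback for the non-CUDA case raised above. A hypothetical sketch, not code from this PR:

```python
import torch

def get_device_list():
    # Hypothetical helper (not from this PR): enumerate selectable devices
    # for a combo input. Degrades gracefully when CUDA isn't available,
    # which is the concern for non-CUDA users.
    devices = ["default"]
    if torch.cuda.is_available():
        devices += [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    return devices
```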
I know I need to update this to work with the video enhancer and with what was changed today (haven't looked yet). Right now, with 1 GPU it just lists 1 device, not a big deal.
Yeah, exactly what I was thinking; it's the most non-invasive way to add it I can think of. Whatever way the given node currently chooses the device shouldn't change, and when the optional input is given, it would simply override it.
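A minimal sketch of that optional-input pattern, reusing the hypothetical `get_device_list()` from the snippet above (the class and its fields are illustrative, not the wrapper's actual nodes):

```python
import torch

class SomeVideoNode:
    # Hypothetical node sketch: "device" lives under "optional", so old
    # workflows that never set it keep working unchanged.
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {"model": ("MODEL",)},
            "optional": {"device": (get_device_list(),)},
        }

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "process"

    def process(self, model, device="default"):
        # Keep whatever device the node would normally pick; only override
        # when the optional input was wired up with a concrete selection.
        if device != "default":
            # assumes the underlying model object supports .to()
            model = model.to(torch.device(device))
        return (model,)
```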
Alright, will do. I'll have it ready in a day or so.
I downloaded your repo, erased kijai's, and opened your Civitai workflow, but queueing gives an error that cuda: doesn't belong in the input's option group. If I use Fix or Fix v2, or reload, it returns to kijai's normal nodes. I have left the custom node's git folder at its original name. I have a 4090 and a 3060 and can't test your Civitai workflow for the above reasons.
@MrReclusive There is a multi-GPU node pack by pollockjj in the Manager. Should I use that? The nodes normally detect my GPUs, but I have to re-create your workflows. Maybe put a notice README in your repo with a bold "FOR TESTING ONLY" and post safe steps on how to test it. EDIT: OK, I found it; I downloaded the zip and didn't notice the missing files!
I have a 4090 and a Quadro M6000 I can test with when the next update with the separate optional node is ready. If that's helpful, let me know.
I have a dual 3090 setup and I am getting this:
Will this only work on newer Ada architecture cards? Thank you
Hey everyone, I have the new setup coded as optional so it doesn't interfere with the existing setup. I'm just going through other nodes I don't use to see if they benefit from it too, and testing what can and can't be moved (expecting same-device errors). I can't test anything besides the 4xxx series of cards, as I gave away all my 3xxx series cards last year, but nothing in the code I've changed should affect the ability to use other cards if they already work in this video wrapper. As for pollockjj's multi-GPU pack: I looked at it, and it's based on neuratech-ai's multi-GPU code, which is what I based this on as well. I'll have the new fork and new template up tomorrow; I just need to remove all my other custom nodes from the template I use (I've been digging deep into custom prompting on this).
I am also working on an image splitter that accounts for the required frame count when splitting: render the initial video with a high frame count at low resolution, then split it into 2/3/4 separate batches so you can upscale each one. If anyone is interested, I'll be posting that soon; a rough sketch of the idea is below.
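A sketch of what "accounting for the required frames" could mean, assuming the model wants batches of 4k+1 frames (my assumption; the exact constraint isn't stated here). Names and edge handling are simplified:

```python
import torch

def split_frames(frames: torch.Tensor, parts: int) -> list[torch.Tensor]:
    # Hypothetical splitter: divide an (N, H, W, C) image batch into
    # `parts` chunks whose lengths each satisfy a 4k+1 frame-count
    # constraint. Leftover tail frames are simply dropped in this
    # simplified version; a real splitter might overlap chunks instead.
    n = frames.shape[0]
    base = n // parts
    chunks, start = [], 0
    for _ in range(parts):
        length = base - ((base - 1) % 4)  # largest 4k+1 <= base
        chunks.append(frames[start:start + length])
        start += length
    return chunks
```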
Updated with new sampler stuff.
@MrReclusive
I gave it a try with o1 and o1-pro-mode on the Pro plan to combine inference in one sampler (both GPUs working in parallel, NOT sequentially). I have no idea what I am doing; o1 explains it to me and makes suggestions. From what I understood, it "breaks the latents"? I just do the debugging. It's a living hell, just as o1 told me it would be.
My last change was to try to make this compatible again. I injected what they changed in the main build; that didn't work, it didn't like that. It is completely stable, and I'm running on it now; it just reports conflicts with the main branch here that prevent pulling.
Slightly confused: are you trying to run 2 GPUs on 1 sampler? Without proper NVLink, or the model being coded specifically for it, I don't think that will ever work. Even if it did, constantly transferring data across PCIe lanes to let the cards work together would be a huge performance hit, probably too much of a hit to make it viable.
To clarify,
What you are describing is exactly what o1 pro mode warned me about. But it was just a thought of mine: now that on January 30 we'll have two 5090s in a setup, both connected the same way over PCIe, maybe we could split the latents (into "layers", as o1 calls them) and run inference WITHOUT NVLink, with data going back and forth.
You should make branches in the PR repo, so that we know which branch works and which is experimental; we copy-paste as needed and we are done.
Oddly enough, I'm wanting to go the other way and pick up a few refurbished 3090s, since they had NVLink, which allows memory pooling. If Python and torch can see it right, 2 would give you a 48GB card and 4 would give me a 96GB card. Considering the 5090s are surely going to be extremely expensive, I think for a while I'd rather buy a few 3090s at $1100 each, given a single 5090 will probably be close to $2500.
Why go so big? For resolution? You can spend 300 euros on Topaz and produce the recommended res. You don't need to make, like, 8K out of Comfy.
But to continue on the idea of doing this without NVLink, pooling cards on a single sampler: internally, the 4090 runs at about 1000GB/s to do what it does. PCIe gen 5's total bandwidth over x16 is 128GB/s, and considering it would need to constantly transfer data, that's only 64GB/s each way. On top of that, 99% of consumer motherboards and/or processors are not capable of running 2 x16 slots at full speed; as soon as you add a second card, the most you will get is x8, so you're down to 32GB/s, and at that point you're probably maxing out your CPU and using 128GB of RAM. The only way this becomes viable is on server hardware or a Threadripper, but if you can afford all that, buy a couple of RTX A6000s: a 48GB card with NVLink, so with 2 you have 96GB, and it's a lot cheaper than buying an H100. Or even go A4500, a 20GB card with NVLink.
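The same napkin math in one place (figures as quoted in the comment above, not measurements):

```python
# Rough bandwidth gap between on-card memory and a shared PCIe link
gpu_internal_gbps = 1000                    # ~4090 on-card bandwidth, GB/s
pcie5_x16_total = 128                       # GB/s, both directions combined
per_direction_x16 = pcie5_x16_total / 2     # 64 GB/s each way
per_direction_x8 = per_direction_x16 / 2    # 32 GB/s once the slot drops to x8

print(f"on-card vs PCIe x8 each-way: {gpu_internal_gbps / per_direction_x8:.0f}x gap")
```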
I have Topaz; upscaling is crap compared to doing this natively. The problem is balancing a good starting resolution and length. Yeah, I can get an amazing 2-second clip that can be upscaled as much as I want. Take a look at Civitai: people are only doing this in vertical because getting good scaling in horizontal is problematic. Yeah, I can run VFI and get it to 4 seconds, but now it's all in slow motion. The need comes from me not treating this as just something to play with or to make 2-to-4-second "fun" clips; I'm trying to actually use this for something more. And as for the price of the 5090, let me know if you can even find a 4090 at MSRP ;) it never actually made it down to its MSRP of $1599; the 2 I have, I paid $1850 each for nearly 2 years ago.
Question about your pull: does your sampler have multi-GPU (DP/DDP) support for processing too? You got me thinking with the 3090 NVLink approach you described. Just for info: at the same time, I am chatting with GPT-o1 to have it explain things to me. Personally, I have no idea about this stuff :P
I am not aware of any sampler that supports that, and no, this one doesn't either.
This just adds device selection to multiple nodes for those with multiple GPUs, so we can set nodes to run on specific CUDA device(s).
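As a toy illustration of what per-node device selection enables (stand-in nn.Linear models, not this repo's code), note the explicit hop when tensors cross cards:

```python
import torch
import torch.nn as nn

# Place two pipeline stages on different cards when a second GPU exists.
text_encoder = nn.Linear(8, 8).to("cuda:1" if torch.cuda.device_count() > 1 else "cpu")
transformer = nn.Linear(8, 8).to("cuda:0" if torch.cuda.is_available() else "cpu")

# Data moving between stages must be transferred between devices explicitly:
emb = text_encoder(torch.zeros(1, 8, device=text_encoder.weight.device))
out = transformer(emb.to(transformer.weight.device))
```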