Improve AMD performance. #10302
I honestly have no idea why this improves things but it does.
|
It's related to MIOpen: ROCm/TheRock#1542 I'm not 100% positive, but I think |
|
When using flash-attention (built from the ROCm repo, main_perf branch) this commit slows inference by a lot on my 7800XT:
Maybe this new behavior can be disabled in case flash-attention is used? |
|
This causes an OOM crash on my Linux 6700XT setup when VAE decoding starts. |
|
Please revert this commit. We have done extensive testing over a period of months, and I can assure you that your assertion is incorrect. Disabling
Please see the many highly detailed graphs showing timing attached to the PR below, where we briefly toyed with the identical patch that you have just implemented. We have developed many solutions for dealing with slow VAE decodes, the latest being this custom extension that dynamically toggles off cudnn during VAE encode and decoding only when using an AMD gpu. https://github.com/sfinktah/ovum-cudnn-wrapper I am also the developer of a timing node, https://github.com/sfinktah/comfy-ovum which allows me to compare the performance of every element of a workflow with and without cudnn (cudnn off is red, cudnn on is green).
That example (which I plucked at random) shows how different nodes react either positively or negatively to having cudnn disabled. I can prepare detailed graphs showing timing and memory usage for any workflow you care to nominate, on any platform you care to nominate, if that proves necessary. A more useful AMD helper would optionally replicate the functionality of my aforementioned cudnn-wrapper within your python core code. |
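The "toggle cuDNN off only around VAE work" idea described above can be sketched as a small context manager. This is a minimal sketch, not the code of ovum-cudnn-wrapper: `cudnn_disabled` is a hypothetical helper, and `fake_backend` is a stand-in object so the snippet runs without PyTorch. With PyTorch installed you would presumably pass `torch.backends.cudnn`, whose `enabled` attribute is the flag discussed in this thread.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def cudnn_disabled(backend):
    """Temporarily set backend.enabled = False, restoring the old value on exit."""
    previous = backend.enabled
    backend.enabled = False
    try:
        yield backend
    finally:
        # Restore even if the wrapped work raises (e.g. an OOM during VAE decode).
        backend.enabled = previous


# Stand-in for torch.backends.cudnn so the sketch runs without PyTorch.
fake_backend = SimpleNamespace(enabled=True)

with cudnn_disabled(fake_backend):
    state_inside = fake_backend.enabled   # flag as seen while the VAE work would run
state_after = fake_backend.enabled        # flag after leaving the block
```

Restoring the previous value, rather than forcing `True`, keeps the helper safe to nest and leaves a globally disabled flag untouched.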
|
RX7900 XT user here, just adding on to the pile that I'm also experiencing issues after updating to this commit. Gens that would normally take 30~ seconds to finish are taking 43~ seconds. (It was also causing OOM on vae decode and went to tiled) Reverting the file completely to what it was 3 weeks ago brought back expected performance. |
|
I think it heavily depends on which wheels you are using and which OS you're on. Like if you're on Linux, you might be using TheRock or the official PyTorch ones. On Windows, you have a lot of people using different methods like ZLUDA, the older Scott wheels or TheRock wheels, or people might be using WSL with various Linux wheels, etc etc. In my testing, I found using the following as my run.bat script to work most reliably (Windows 11 and TheRock nightly wheels with a 7900xt): You may or may not need --fp32-vae depending on the model you're working with, or you might get all black decodes. You can also use a node to switch the VAE like Kijai's VAE loader node. As far as I know though, you have to make sure that set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 is present and that --use-pytorch-cross-attention is there as well. I found setting MIOpen to fast there to work best, though I'm not 100% positive of the impact it might have on things like VRAM usage. The log level is just there to hush any potential console spam. Using the above config: So it seems like torch.backends.cudnn.enabled = False is causing it to consume a lot more VRAM than before. This might be what's causing issues for some. Oh and this is with: |
|
This makes AMD experience on par with Nvidia and is a massive user experience improvement so I'm not reverting this. I also don't see any slowdowns or memory usage increase for SDXL or any other models I have tried on my RDNA 3 and RDNA 4 setups. |
|
I noticed upscale image with model performance regressed with this patch: ~5s/it -> ~32s/it which is quite significant when upscaling videos. I actually noticed a similar regression when I tried rocm 7, which is why I'm still using 6.4. |
You probably ran out of VRAM, which might be related to my earlier post. It's also possible that something went from being fp16 to bf16 or fp32, which would make it take up more VRAM. |
|
It seems related to Now:
|
There should be a flag to re-enable it though. |
Within the ZLUDA userbase, we have noted resizing and upscaling are the two things that really slow down without cuDNN. Though as I just said to comfyanonymous, I haven't tested that in a native ROCm environment. At least with ZLUDA, one can reason that it has something to do with the cudnn emulation -- its similar effect on native AMD systems is a little harder to explain (assuming you are running native AMD, of course). |
|
I have tested SDXL (standard 1024x1024 workflow), lumina 2.0 (neta yume 3.5 workflow) and flux-dev (standard workflow). I tried the wan models but on an MI300X where setting cudnn to False also improved the first run and didn't seem to slow down anything. All my tests were on nightly pytorch rocm 7.0 from the pytorch website on Linux. For windows I tried the nightly AMD wheel for the strix halo (you can find the way to install it in the comfyui readme). |
|
I can try re-testing with latest rocm/pytorch. I do use |
|
I did some tests using latest https://rocm.nightlies.amd.com/v2/gfx110X-dgpu pytorch. I think this can help explain the difference of opinion in this thread. It seems newer pytorch+rocm versions have a significant regression in performance that is somewhat mitigated by disabling cudnn. However, for earlier pytorch+rocm this mitigation has a negative effect as the performance is much better with default cudnn. Note: I used a smaller test video so the numbers are slightly different than those I mentioned earlier. pytorch version: 2.10.0a0+rocm7.10.0a20251015
pytorch version: 2.9.0.dev20250827+rocm6.4
So it may make sense to disable cudnn only for newer pytorch/rocm versions and to ask upstream to find the root cause of the regression so cudnn doesn't need to be disabled anywhere. |
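The suggestion to disable cudnn only on newer pytorch/rocm builds could be sketched roughly as below. This is an illustration under assumptions, not something ComfyUI does: `rocm_version` and `should_disable_cudnn` are hypothetical names, and the `(7, 0)` cutoff is a placeholder, since the thread does not establish where the regression starts. The two version strings are the ones quoted in the tests above.

```python
import re
from typing import Optional, Tuple


def rocm_version(torch_version: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) of the ROCm tag from a PyTorch version string."""
    match = re.search(r"\+rocm(\d+)\.(\d+)", torch_version)
    if match is None:
        return None  # not a ROCm build, or an unrecognised version format
    return int(match.group(1)), int(match.group(2))


def should_disable_cudnn(torch_version: str, min_rocm=(7, 0)) -> bool:
    """True only for ROCm builds at or above the (assumed) cutoff."""
    version = rocm_version(torch_version)
    return version is not None and version >= min_rocm


new_build = should_disable_cudnn("2.10.0a0+rocm7.10.0a20251015")
old_build = should_disable_cudnn("2.9.0.dev20250827+rocm6.4")
```

In a real launcher the string would come from `torch.__version__`; builds whose version string carries no `+rocm` tag simply fall through to "leave cudnn alone".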
If you're using Windows, that's probably due to the older versions not having AOTriton enabled. It only recently got merged into things for Windows by default. Before, you had to manually build for it with specific args. MIOpen is a part of that mix, as far as I know. |
Totally aside from the discussion about cudnn, for which I bow to @alexheretic, I'd be genuinely interested in comparing the performance of WAN2.2 between the two extremes of nightly rocm on Linux (you), and pytorch 2.7/hip 6.2/windows/zluda (me). Provided you have an RDNA 3 card that is roughly equal to my 7900 XTX. Or perhaps someone else reading this can oblige me? Also, if you have achieved good RDNA 4 performance on any platform, there are always people raising issues re: ZLUDA about gfx1200 and gfx1201, even gfx1151 (strix). As of about 2 months ago, their reports of Linux support were not inspiring. Since people only complain when things don't work, it would be nice to know the current state of play wrt Linux ROCm. |
I'm using Linux. |
That's interesting to know. Honestly I have avoided trying any of the ROCm 7 based PyTorch builds for Windows because there's no MIOpen.dll in AMD's ROCm SDK, ergo no Triton and no TorchInductor. That holds true for 6.4 unfortunately, unless there have been recent developments. |
If you care to search the ComfyUI repo for
Or you can just deploy a single node at the start of your workflow so that cudnn is in a known state. There is also an AMD/NVIDIA switching node (haven't really tested that one) if you want to get conditional about it.
|
|
If only the VAE encode/decode is the problem, maybe it would be best to run the encode/decode stuff |
|
@comfyanonymous we've heard from a number of people who have pointed out that this patch is only a net benefit for specific versions of rocm on specific platforms. will you not consider
|
|
On my end, I can confirm this does eliminate SDXL's really slow first run for a given image resolution (30-60 mins of mostly VAE, depending on image size); see also #5759 and ROCm issue 5754. Fixing that is a big win for first impressions. But I can sympathize with the desire for a more-targeted fix, if it's affecting other setups and use cases. I see bf16-vae is also automatically enabling for me now, which is handy; I was using the launch flag before. I'm on a Ryzen AI HX 370 (Strix Point) on Windows, with ROCm 6.4.4 PyTorch from AMD repos, following these instructions. Image-generation time has become shorter for large pics, even though VAE tiling kicks in at a smaller image size. Previously, if I assigned 24GB RAM (out of 32 total) to the GPU, I could often do 1600x1280 and 1920x1088 (similar areas) without tiling, and a 30-step gen took about 300s (5 min). Now it always resorts to tiling for those sizes, but the time is down to about 220s (<4 min). Interestingly, if I change the CPU/GPU RAM split to 16/16 instead, I can do 1600x1280 without tiling, and it shaves off a bit more time. However, attempting 1920x1280 with a 16/16 split hits the RAM ceiling and starts freezing Windows, whereas an 8/24 split is able to safely switch to tiled VAE and finish in about 300s. Before the update, 8/24 could do 1920x1280 without tiling, but it took something like 360-380s, if I recall correctly. 1024x1024 remained pretty much the same speed before/after the update, at about 1 min for 20 steps. I notice one of the other commits for this release improved the memory estimation for VAE on AMD by adding a 2.73x multiplier when AMD is detected. Does that correspond to the increased VAE memory demands from this change? |
No, the VAE memory estimation has always been off on AMD.
With the cudnn flag enabled, the VAE and the upscale models make my computer unusable for a few minutes the first time I run them at a specific resolution. Until they fix that, it's staying off. |
Well, it's like... your repo... but could you add a logger output line so that people know it's happening, and perhaps an override? This is what I added to zluda.py in the end, you can default it the other way around obviously.
# This needs to be up here, so it can disable cudnn before anything can even think about using it
torch.backends.cudnn.enabled = os.environ.get("TORCH_BACKENDS_CUDNN_ENABLED", "1").strip().lower() not in {"0", "off", "false", "disable", "disabled", "no"}
if torch.backends.cudnn.enabled:
    print(" :: Enabled cuDNN")
else:
    print(" :: Disabled cuDNN")
I can adapt and submit that as a PR if you are willing? Just lmk what to name the environment variable. |
There is nothing to fix here. It's how MIOpen works on AMD. Reference: https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/find-and-immediate.html Short version: AMD has 2 card types, pro and consumer. On all cards the default behaviour is to search for the optimal compute solution to any given matrix problem the first time it is encountered - that takes both time and VRAM. After that it'll be stored in a local database and looked up so it's fast. The problem is that the MIOpen database is per-version, that is once the version is bumped the old database is ignored and a new one is started. And each new major ROCm release usually has a new MIOpen version. There is a database of optimal compute solutions for pro cards but not consumer ones (maybe it'll change someday but it hasn't for years) so Radeons need to run a lot of computations to figure out what is what - though even for pro cards the AMD database does not have 100% coverage for all possible matrix sizes and data types. Rather than breaking the whole app you can set MIOPEN_FIND_MODE="FAST" (or 2) on AMD to just skip the solution search on each new problem. On pro cards you won't even feel it much since these are pretty well profiled by AMD, so any long-term performance loss from suboptimal solution picks is minimal. On Radeons that search is quite time consuming (esp. for training and bigger matrix sizes in SDXL, since nobody has the VRAM to run WAN in 720p mode locally on Radeons and these problems tend to be smaller and faster to find) but it usually results in 5% or so reduced compute times long term once an optimal solution is found. TLDR: Use MIOPEN_FIND_MODE or just accept that (sometimes massive) slowdown on each first time for a few % better performance long term. It's not a ComfyUI issue, it's how MIOpen works. And consumer cards suffer the most since there is no global database from AMD, but also tend to find better solutions than the default ones so it's usually worth it. |
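For anyone who would rather set MIOPEN_FIND_MODE from a Python launcher than from a shell script, a minimal sketch follows. `apply_amd_env_defaults` is a hypothetical helper, not part of ComfyUI or ROCm. It is demonstrated on plain dicts so it runs anywhere; a real launcher would pass `os.environ`, and doing so before `import torch` is the conservative choice, on the assumption that MIOpen reads the variable when it is first used.

```python
def apply_amd_env_defaults(env):
    """Fill in launch-time settings without overriding values the user already set."""
    # MIOPEN_FIND_MODE=FAST (equivalently 2) skips the exhaustive solution search.
    env.setdefault("MIOPEN_FIND_MODE", "FAST")
    return env


# Demonstrated on plain dicts; a launcher would pass os.environ instead.
fresh = apply_amd_env_defaults({})
custom = apply_amd_env_defaults({"MIOPEN_FIND_MODE": "NORMAL"})
```

Using `setdefault` means a value exported in the user's shell still wins, which matters given how differently the setups in this thread behave.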
Yeah I think that's the most sensible solution. Reenabling cudnn gave my simple sdxl workflow a boost from 1.15it/s to 1.6... (6800xt) |
|
Looks like comfyanon added a log message for v0.3.66, at least. |
|
Well, that's something.
try:
    arch = torch.cuda.get_device_properties(get_torch_device()).gcnArchName
    if not (any((a in arch) for a in AMD_RDNA2_AND_OLDER_ARCH)):
        # Default stays "cudnn off" on newer AMD cards, but an env var can turn it back on
        torch.backends.cudnn.enabled = os.environ.get("TORCH_AMD_CUDNN_ENABLED", "0").strip().lower() not in {
            "0", "off", "false", "disable", "disabled", "no"}
        if not torch.backends.cudnn.enabled:
            logging.info(
                "ComfyUI has set torch.backends.cudnn.enabled to False for better AMD performance. Set environment var TORCH_AMD_CUDNN_ENABLED=1 to enable it again.")
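The decision made by the snippet above can be pulled out into a pure function, which makes the override easy to check without a GPU. This is a sketch under assumptions: `env_flag` and `cudnn_enabled_for` are hypothetical names, and `OLDER_ARCH_EXAMPLE` is only an illustrative stand-in for `AMD_RDNA2_AND_OLDER_ARCH`, whose real contents are not shown in this thread.

```python
# Illustrative stand-in for AMD_RDNA2_AND_OLDER_ARCH (real contents not shown here).
OLDER_ARCH_EXAMPLE = ["gfx1030", "gfx1031"]

FALSE_VALUES = {"0", "off", "false", "disable", "disabled", "no"}


def env_flag(env, name, default):
    """Parse an on/off environment value the same way as the snippet above."""
    return env.get(name, default).strip().lower() not in FALSE_VALUES


def cudnn_enabled_for(arch, env, older_arch=OLDER_ARCH_EXAMPLE):
    """Return the value torch.backends.cudnn.enabled would end up with."""
    if any(a in arch for a in older_arch):
        return True  # older cards are left at PyTorch's default (enabled)
    return env_flag(env, "TORCH_AMD_CUDNN_ENABLED", "0")


default_rdna3 = cudnn_enabled_for("gfx1100", {})
override_rdna3 = cudnn_enabled_for("gfx1100", {"TORCH_AMD_CUDNN_ENABLED": "1"})
default_rdna2 = cudnn_enabled_for("gfx1030", {})
```

Passing the environment in as a mapping (rather than reading `os.environ` inside the function) is what lets the three cases be exercised side by side.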
|
I've proposed #10448 which has the benefit of being easy to override. However, this requires some testing to verify that setting |
|
MIOPEN_FIND_MODE=FAST only "fixes" the very first run on new MIOpen version install. Nothing else. Sadly it does not fix the VAE issue in ComfyUI - I've run some tests with WanWrapper: System: Kubuntu 24.04 + ROCm 7.0.2 + AMD DKMS driver + Pytorch 2.8 from ROCm
EDIT: Small explanation, (xx/40) means that xx blocks were RAM swapped to fit the project into VRAM. This affects the inference times somewhat. This used WanWrapper so all custom nodes, it's not a specific issue with Comfy VAE node. But it is tied to VAE as the inference times seem unaffected by torch.backends.cudnn.enabled state. The difference is rather massive, long VAE times can cause the run to be 2x longer - but mostly for bigger resolutions. For 608x480 and 480x480 VAE seems still reasonable even with cudnn enabled. So it could be ROCm issue or Pytorch issue, but I would guess it has to do with mixed precision used in VAE? I've tried VAE in both BF16 and FP16 (cast from FP32 model) and there doesn't seem to be much difference. TLDR: On RDNA3/Linux at least torch.backends.cudnn.enabled must be False if you want reasonable VAE times. Seems like a bug deeper in the libraries rather than Comfy issue. |
|
@Only8Bits we (the ZLUDA community) use a cuDNN-disabling wrapper or node pair to disable cuDNN during VAE decoding. It's distributed with the install procedure, but I replicated a stand-alone version that works quite handily. See #10302 (comment) We have also had MIOPEN_FIND_MODE=2 as part of the standard launch scripts for a while, and you are very correct in what you say: VAE decoding is slower with cuDNN enabled. I can't offer any opinions on the "why" of it, though I have often thought Flux/Chroma's fp32 VAE works surprisingly fast (about 1.5 seconds). We do have a node that automatically disables cuDNN only for VAE operations (and only on AMD), but it's new and doesn't always work. Wrapping other nodes is a tricky business, and could probably be much better done from the Python side. ovum-cudnn-wrapper. It does however look pretty and comes preset with the smarts to work out what nodes it needs to wrap.
It's currently pending its tick of wonderfulness in the ComfyUI repo, I can only assume because it modifies other nodes. |
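What "wrapping other nodes from the Python side" might look like can be sketched as follows. This is not the ovum-cudnn-wrapper implementation: `wrap_node_without_cudnn` is a hypothetical helper, and `fake_cudnn` and `FakeVAEDecode` are stand-ins so the snippet runs without PyTorch or ComfyUI. The only ComfyUI convention relied on is that a node class names its entry-point method in a `FUNCTION` class attribute.

```python
import functools
from types import SimpleNamespace


def wrap_node_without_cudnn(node_cls, backend):
    """Patch node_cls so its entry point runs with backend.enabled = False."""
    method_name = node_cls.FUNCTION          # ComfyUI nodes name their entry point here
    original = getattr(node_cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        previous = backend.enabled
        backend.enabled = False
        try:
            return original(self, *args, **kwargs)
        finally:
            backend.enabled = previous       # restore even if decoding fails

    setattr(node_cls, method_name, wrapper)
    return node_cls


# Stand-ins so the sketch runs without PyTorch or ComfyUI.
fake_cudnn = SimpleNamespace(enabled=True)


class FakeVAEDecode:
    FUNCTION = "decode"

    def decode(self, latent):
        # Report the flag the "decode" saw, to show the toggle took effect.
        return (latent, fake_cudnn.enabled)


wrap_node_without_cudnn(FakeVAEDecode, fake_cudnn)
result = FakeVAEDecode().decode("latent")
```

A real version would also need to decide which node classes to patch and to skip the patch on non-AMD devices; both are left out here.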
|
I do tend to use tiled VAE everywhere, including for wan encodes as untiled VAE perf can indeed be bad. My tiled vae perf seems ok with cudnn on. If I have time I'll try to test cudnn-off's effect on untiled VAE. |
|
@alexheretic I have a lovely node timer that will mark things red or green based on whether they were running with cuDNN on or off, as long as you are using the included cuDNN toggler anyway. The difference is probably about 50% speed increase, if it exists. And oh yes, TILED VAE always... it's on my list to build an automatic "convert to tiled vae" doover, because sometimes I forget and don't notice that it's taking 217 seconds to do VAE decoding. Talking of which, I need to go fix a workflow!
What nobody has brought up in this conversation is the other cudnn field, |
|
VAE benches for wan & sdxl
System info: Using #7764, #10238 on 426cde3
env vars
pytorch version: 2.9.0+rocm6.4
With cudnn disabled wan untiled encode/decode hits oom and falls back to tiled. However, this is slower
On sdxl performance is about the same too, except for untiled decode. Here cudnn-on is slow ~27s
Conclusion:
Results: wan untiled cudnn off (Note: Hits oom fast and falls back to tiled; fallback tiled is a bit slower than 256 tiled), wan tiled 256 cudnn off, wan tiled 256 cudnn on, sdxl 1280x1832 vae cudnn off, sdxl 1280x1832 vae cudnn on
pytorch version: 2.10.0a0+rocm7.10.0a2025101
For wan it's a similar story to rocm6.4. However, for sdxl encode performance is generally
Conclusion:
Results: wan untiled cudnn off (Note: Hits oom fast and falls back to tiled; fallback tiled is a bit slower than 256 tiled), wan tiled 256 cudnn off, wan tiled 256 cudnn on, sdxl 1280x1832 vae cudnn off, sdxl 1280x1832 vae cudnn on
An interesting difference with cudnn off is that untiled VAE tends to OOM faster
log |
|
Interesting results, thanks! For comparison, I did some SDXL VAE testing on my APU, all with bf16 vae and 1280x1600 dimensions: Untiled VAE decode, cudnn enabled: ~100s, ~10GB VRAM Untiled VAE decode, cudnn disabled: ~5s, ~19GB VRAM So yeah, enabling cudnn halves the RAM for untiled VAE, but takes MUCH longer to run. And that's not even the extra-long first run for this resolution, which was more like 30 minutes. Tiled VAE uses similar time and RAM with/without cudnn, but still has the slow-first-run issue with cudnn enabled. (IIRC it's not as bad as fullsize though, presumably since the tile size stays the same for different image dimensions, except maybe at the picture edges.) My system is a Ryzen AI 9 HX 370 with a Radeon 890M iGPU (gfx1150). 32GB RAM with 16GB assigned to the GPU (and another 8GB shareable. Oddly, assigning 24GB makes it fall back to tiled decoding for the cudnn-disabled case). |
This sounds like maybe the key issue and reason for disabling cudnn. I didn't reproduce it in my setup though. For me the downside of disabling cudnn is #10447, so I was hoping for a better solution than this. As earlier suggested by others, maybe change so cudnn is disabled only during vae? And/or maybe add some arg/env var to control this. |
* Fix lowvram issue with hunyuan image vae. (comfyanonymous#9794) * add StabilityAudio API nodes (comfyanonymous#9749) * ComfyUI version v0.3.58 * add new ByteDanceSeedream (4.0) node (comfyanonymous#9802) * Update template to 0.1.78 (comfyanonymous#9806) * Update template to 0.1.77 * Update template to 0.1.78 * ComfyUI version 0.3.59 * Support hunyuan image distilled model. (comfyanonymous#9807) * Update template to 0.1.81 (comfyanonymous#9811) * Fast preview for hunyuan image. (comfyanonymous#9814) * Implement hunyuan image refiner model. (comfyanonymous#9817) * Add Output to V3 Combo type to match what is possible with V1 (comfyanonymous#9813) * Bump frontend to 1.26.11 (comfyanonymous#9809) * Add noise augmentation to hunyuan image refiner. (comfyanonymous#9831) This was missing and should help with colors being blown out. * Fix hunyuan refiner blownout colors at noise aug less than 0.25 (comfyanonymous#9832) * Set default hunyuan refiner shift to 4.0 (comfyanonymous#9833) * add kling-v2-1 model to the KlingStartEndFrame node (comfyanonymous#9630) * convert Minimax API nodes to the V3 schema (comfyanonymous#9693) * convert WanCameraEmbedding node to V3 schema (comfyanonymous#9714) * convert Cosmos nodes to V3 schema (comfyanonymous#9721) * convert nodes_cond.py to V3 schema (comfyanonymous#9719) * convert CFG nodes to V3 schema (comfyanonymous#9717) * convert Canny node to V3 schema (comfyanonymous#9743) * convert Moonvalley API nodes to the V3 schema (comfyanonymous#9698) * Better way of doing the generator for the hunyuan image noise aug. 
(comfyanonymous#9834) * Enable Runtime Selection of Attention Functions (comfyanonymous#9639) * Looking into a @wrap_attn decorator to look for 'optimized_attention_override' entry in transformer_options * Created logging code for this branch so that it can be used to track down all the code paths where transformer_options would need to be added * Fix memory usage issue with inspect * Made WAN attention receive transformer_options, test node added to wan to test out attention override later * Added **kwargs to all attention functions so transformer_options could potentially be passed through * Make sure wrap_attn doesn't make itself recurse infinitely, attempt to load SageAttention and FlashAttention if not enabled so that they can be marked as available or not, create registry for available attention * Turn off attention logging for now, make AttentionOverrideTestNode have a dropdown with available attention (this is a test node only) * Make flux work with optimized_attention_override * Add logs to verify optimized_attention_override is passed all the way into attention function * Make Qwen work with optimized_attention_override * Made hidream work with optimized_attention_override * Made wan patches_replace work with optimized_attention_override * Made SD3 work with optimized_attention_override * Made HunyuanVideo work with optimized_attention_override * Made Mochi work with optimized_attention_override * Made LTX work with optimized_attention_override * Made StableAudio work with optimized_attention_override * Made optimized_attention_override work with ACE Step * Made Hunyuan3D work with optimized_attention_override * Make CosmosPredict2 work with optimized_attention_override * Made CosmosVideo work with optimized_attention_override * Made Omnigen 2 work with optimized_attention_override * Made StableCascade work with optimized_attention_override * Made AuraFlow work with optimized_attention_override * Made Lumina work with optimized_attention_override * Made 
Chroma work with optimized_attention_override * Made SVD work with optimized_attention_override * Fix WanI2VCrossAttention so that it expects to receive transformer_options * Fixed Wan2.1 Fun Camera transformer_options passthrough * Fixed WAN 2.1 VACE transformer_options passthrough * Add optimized to get_attention_function * Disable attention logs for now * Remove attention logging code * Remove _register_core_attention_functions, as we wouldn't want someone to call that, just in case * Satisfy ruff * Remove AttentionOverrideTest node, that's something to cook up for later * Hunyuan refiner vae now works with tiled. (comfyanonymous#9836) * Support wav2vec base models (comfyanonymous#9637) * Support wav2vec base models * trim trailing whitespace * Do interpolation after * Cleanup. (comfyanonymous#9838) * Remove single quote pattern to avoid wrong matches (comfyanonymous#9842) * Add support for Chroma Radiance (comfyanonymous#9682) * Initial Chroma Radiance support * Minor Chroma Radiance cleanups * Update Radiance nodes to ensure latents/images are on the intermediate device * Fix Chroma Radiance memory estimation. * Increase Chroma Radiance memory usage factor * Increase Chroma Radiance memory usage factor once again * Ensure images are multiples of 16 for Chroma Radiance Add batch dimension and fix channels when necessary in ChromaRadianceImageToLatent node * Tile Chroma Radiance NeRF to reduce memory consumption, update memory usage factor * Update Radiance to support conv nerf final head type. * Allow setting NeRF embedder dtype for Radiance Bump Radiance nerf tile size to 32 Support EasyCache/LazyCache on Radiance (maybe) * Add ChromaRadianceStubVAE node * Crop Radiance image inputs to multiples of 16 instead of erroring to be in line with existing VAE behavior * Convert Chroma Radiance nodes to V3 schema. * Add ChromaRadianceOptions node and backend support. Cleanups/refactoring to reduce code duplication with Chroma. 
* Fix overriding the NeRF embedder dtype for Chroma Radiance * Minor Chroma Radiance cleanups * Move Chroma Radiance to its own directory in ldm Minor code cleanups and tooltip improvements * Fix Chroma Radiance embedder dtype overriding * Remove Radiance dynamic nerf_embedder dtype override feature * Unbork Radiance NeRF embedder init * Remove Chroma Radiance image conversion and stub VAE nodes Add a chroma_radiance option to the VAELoader builtin node which uses comfy.sd.PixelspaceConversionVAE Add a PixelspaceConversionVAE to comfy.sd for converting BHWC 0..1 <-> BCHW -1..1 * Changes to the previous radiance commit. (comfyanonymous#9851) * Make ModuleNotFoundError ImportError instead (comfyanonymous#9850) * Add that hunyuan image is supported to readme. (comfyanonymous#9857) * Support the omnigen2 umo lora. (comfyanonymous#9886) * Fix depending on asserts to raise an exception in BatchedBrownianTree and Flash attn module (comfyanonymous#9884) Correctly handle the case where w0 is passed by kwargs in BatchedBrownianTree * Add encoder part of whisper large v3 as an audio encoder model. (comfyanonymous#9894) Not useful yet but some models use it. * Reduce Peak WAN inference VRAM usage (comfyanonymous#9898) * flux: Do the xq and xk ropes one at a time This was doing independendent interleaved tensor math on the q and k tensors, leading to the holding of more than the minimum intermediates in VRAM. On a bad day, it would VRAM OOM on xk intermediates. Do everything q and then everything k, so torch can garbage collect all of qs intermediates before k allocates its intermediates. This reduces peak VRAM usage for some WAN2.2 inferences (at least). * wan: Optimize qkv intermediates on attention As commented. The former logic computed independent pieces of QKV in parallel which help more inference intermediates in VRAM spiking VRAM usage. Fully roping Q and garbage collecting the intermediates before touching K reduces the peak inference VRAM usage. 
* Support the HuMo model. (comfyanonymous#9903) * Support the HuMo 17B model. (comfyanonymous#9912) * Enable fp8 ops by default on gfx1200 (comfyanonymous#9926) * make kernel of same type as image to avoid mismatch issues (comfyanonymous#9932) * Do padding of audio embed in model for humo for more flexibility. (comfyanonymous#9935) * Bump frontend to 1.26.13 (comfyanonymous#9933) * Basic WIP support for the wan animate model. (comfyanonymous#9939) * api_nodes: reduce default timeout from 7 days to 2 hours (comfyanonymous#9918) * fix(seedream4): add flag to ignore error on partial success (comfyanonymous#9952) * Update WanAnimateToVideo to more easily extend videos. (comfyanonymous#9959) * Add inputs for character replacement to the WanAnimateToVideo node. (comfyanonymous#9960) * [Reviving comfyanonymous#5709] Add strength input to Differential Diffusion (comfyanonymous#9957) * Update nodes_differential_diffusion.py * Update nodes_differential_diffusion.py * Make strength optional to avoid validation errors when loading old workflows, adjust step --------- Co-authored-by: ThereforeGames <[email protected]> * Fix LoRA Trainer bugs with FP8 models. (comfyanonymous#9854) * Fix adapter weight init * Fix fp8 model training * Avoid inference tensor * Lower wan memory estimation value a bit. (comfyanonymous#9964) Previous pr reduced the peak memory requirement. * Set some wan nodes as no longer experimental. (comfyanonymous#9976) * Support for qwen edit plus model. Use the new TextEncodeQwenImageEditPlus. (comfyanonymous#9986) * add offset param (comfyanonymous#9977) * Fix bug with WanAnimateToVideo node. (comfyanonymous#9988) * Fix bug with WanAnimateToVideo. 
(comfyanonymous#9990) * update template to 0.1.86 (comfyanonymous#9998) * update template to 0.1.84 * update template to 0.1.85 * Update template to 0.1.86 * feat(api-nodes): add wan t2i, t2v, i2v nodes (comfyanonymous#9996) * ComfyUI version 0.3.60 * Rodin3D - add [Rodin3D Gen-2 generate] api-node (comfyanonymous#9994) * update Rodin api node * update rodin3d gen2 api node * fix images limited bug * Add new audio nodes (comfyanonymous#9908) * Add new audio nodes - TrimAudioDuration - SplitAudioChannels - AudioConcat - AudioMerge - AudioAdjustVolume * Update nodes_audio.py * Add EmptyAudio -node * Change duration to Float (allows sub seconds) * Fix issue with .view() in HuMo. (comfyanonymous#10014) * Fix memory leak by properly detaching model finalizer (comfyanonymous#9979) When unloading models in load_models_gpu(), the model finalizer was not being explicitly detached, leading to a memory leak. This caused linear memory consumption increase over time as models are repeatedly loaded and unloaded. This change prevents orphaned finalizer references from accumulating in memory during model switching operations. * Make LatentCompositeMasked work with basic video latents. (comfyanonymous#10023) * Fix the failing unit test. (comfyanonymous#10037) * Add @Kosinkadink as code owner (comfyanonymous#10041) Updated CODEOWNERS to include @Kosinkadink as a code owner. * convert nodes_rebatch.py to V3 schema (comfyanonymous#9945) * convert nodes_fresca.py to V3 schema (comfyanonymous#9951) * convert nodes_sdupscale.py to V3 schema (comfyanonymous#9943) * convert nodes_tcfg.py to V3 schema (comfyanonymous#9942) * convert nodes_sag.py to V3 schema (comfyanonymous#9940) * convert nodes_post_processing to V3 schema (comfyanonymous#9491) * convert CLIPTextEncodeSDXL nodes to V3 schema (comfyanonymous#9716) * Don't add template to qwen2.5vl when template is in prompt. (comfyanonymous#10043) Make the hunyuan image refiner template_end 36. 
* Add 'input_cond' and 'input_uncond' to the args dictionary passed into sampler_cfg_function (comfyanonymous#10044) * Update template to 0.1.88 (comfyanonymous#10046) * Add workflow templates version tracking to system_stats (comfyanonymous#9089) Adds installed and required workflow templates version information to the /system_stats endpoint, allowing the frontend to detect and notify users when their templates package is outdated. - Add get_installed_templates_version() and get_required_templates_version() methods to FrontendManager - Include templates version info in system_stats response - Add comprehensive unit tests for the new functionality * convert nodes_hidream.py to V3 schema (comfyanonymous#9946) * convert nodes_bfl.py to V3 schema (comfyanonymous#10033) * convert nodes_luma.py to V3 schema (comfyanonymous#10030) * convert nodes_pixart.py to V3 schema (comfyanonymous#10019) * convert nodes_photomaker.py to V3 schema (comfyanonymous#10017) * convert nodes_qwen.py to V3 schema (comfyanonymous#10049) * Reduce Peak WAN inference VRAM usage - part II (comfyanonymous#10062) * flux: math: Use _addcmul to avoid expensive VRAM intermediate The rope process can be the VRAM peak and this intermediate for the addition result before releasing the original can OOM. addcmul_ it. * wan: Delete the self attention before cross attention This saves VRAM when the cross attention and FFN are in play as the VRAM peak. * Improvements to the stable release workflow. (comfyanonymous#10065) * Fix typo in release workflow. (comfyanonymous#10066) * convert nodes_lotus.py to V3 schema (comfyanonymous#10057) * convert nodes_lumina2.py to V3 schema (comfyanonymous#10058) * convert nodes_hypertile.py to V3 schema (comfyanonymous#10061) * feat: ComfyUI can be run on the specified Ascend NPU (comfyanonymous#9663) * feature: Set the Ascend NPU to use a single one * Enable the `--cuda-device` parameter to support both CUDA and Ascend NPUs simultaneously. 
* Make the code just set the ASCEND_RT_VISIBLE_DEVICES environment variable without any other edits to master branch
  ---------
  Co-authored-by: Jedrzej Kosinski <[email protected]>
* Fix stable workflow creating multiple draft releases. (comfyanonymous#10067)
* Update command to install latest nightly pytorch. (comfyanonymous#10085)
* [Rodin3d api nodes] Updated the name of the save file path (changed from timestamp to UUID). (comfyanonymous#10011)
* Update savepath name from time to uuid
* delete lib
* Update template to 0.1.91 (comfyanonymous#10096)
* add WanImageToImageApi node (comfyanonymous#10094)
* convert nodes_mochi.py to V3 schema (comfyanonymous#10069)
* convert nodes_perpneg.py to V3 schema (comfyanonymous#10081)
* dont cache new locale entry points (comfyanonymous#10101)
* convert nodes_mahiro.py to V3 schema (comfyanonymous#10070)
* Add action to create cached deps with manually specified torch. (comfyanonymous#10102)
* Make the final release test optional in the stable release action. (comfyanonymous#10103)
* Different base files for different release. (comfyanonymous#10104)
* Different base files for nvidia and amd portables. (comfyanonymous#10105)
* Add a way to have different names for stable nvidia portables. (comfyanonymous#10106)
* Add action to do the full stable release. (comfyanonymous#10107)
* Make stable release workflow callable. (comfyanonymous#10108)
* Add basic readme for AMD portable. (comfyanonymous#10109)
* ComfyUI version 0.3.61
* Workflow permission fix. (comfyanonymous#10110)
* Add new portable links to readme. (comfyanonymous#10112)
* fix(Rodin3D-Gen2): missing "task_uuid" parameter (comfyanonymous#10128)
* enable Seedance Pro model in the FirstLastFrame node (comfyanonymous#10120)
* ComfyUI version 0.3.62.
* Bump frontend to 1.27.7 (comfyanonymous#10133)
* convert nodes_audio_encoder.py to V3 schema (comfyanonymous#10123)
* convert nodes_gits.py to V3 schema (comfyanonymous#9949)
* convert nodes_differential_diffusion.py to V3 schema (comfyanonymous#10056)
* convert nodes_optimalsteps.py to V3 schema (comfyanonymous#10074)
* convert nodes_pag.py to V3 schema (comfyanonymous#10080)
* convert nodes_lt.py to V3 schema (comfyanonymous#10084)
* convert nodes_ip2p.py to V3 schema (comfyanonymous#10097)
* Support the new hunyuan vae. (comfyanonymous#10150)
* feat: Add Epsilon Scaling node for exposure bias correction (comfyanonymous#10132)
* sd: fix VAE tiled fallback VRAM leak (comfyanonymous#10139)
  When the VAE catches this VRAM OOM, it launches the fallback logic straight from the exception context.
  Python, however, refs the entire call stack that caused the exception, including any local variables, for the sake of exception reporting and debugging. In the case of tensors, this can hold on to references to GBs of VRAM and prevent that VRAM from being freed.
  So dump the except context completely before going back to the VAE via the tiler by getting out of the except block with nothing but a flag.
  This greatly increases the reliability of the tiler fallback, especially on low VRAM cards, as with the bug, if the leak randomly leaked more than the headroom needed for a single tile, the tiler fallback would OOM and fail the flow.
* WAN: Fix cache VRAM leak on error (comfyanonymous#10141)
  If this suffers an exception (such as a VRAM OOM) it will leave the encode() and decode() methods, which skips the cleanup of the WAN feature cache. The comfy node cache then ultimately keeps a reference to this object, which in turn references large tensors from the failed execution.
  The feature cache is currently set up as a class variable on the encoder/decoder; however, the encode and decode functions always clear it on both entry and exit of normal execution.
  It's likely the design intent is for this to be usable as a streaming encoder where the input comes in batches; however, the functions as they are today don't support that.
  So simplify by bringing the cache back to a local variable, so that if it does VRAM OOM the cache itself is properly garbage collected when the encode()/decode() functions disappear from the stack.
* Add a .bat to the AMD portable to disable smart memory. (comfyanonymous#10153)
* convert nodes_morphology.py to V3 schema (comfyanonymous#10159)
* fix(api-nodes): made logging path to be smaller (comfyanonymous#10156)
* Turn on TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL by default. (comfyanonymous#10168)
* update example_node to use V3 schema (comfyanonymous#9723)
* feat(linter, api-nodes): add pylint for comfy_api_nodes folder (comfyanonymous#10157)
* feat(api-nodes): add kling-2-5-turbo to txt2video and img2video nodes (comfyanonymous#10155)
* fix(api-nodes): reimport of base64 in Gemini node (comfyanonymous#10181)
* fix(api-nodes): bad indentation in Recraft API node function (comfyanonymous#10175)
* convert nodes_torch_compile.py to V3 schema (comfyanonymous#10173)
* convert nodes_eps.py to V3 schema (comfyanonymous#10172)
* convert nodes_pixverse.py to V3 schema (comfyanonymous#10177)
* convert nodes_tomesd.py to V3 schema (comfyanonymous#10180)
* convert nodes_edit_model.py to V3 schema (comfyanonymous#10147)
* Fix type annotation syntax in MotionEncoder_tc __init__ (comfyanonymous#10186)
  ## Summary
  Fixed incorrect type hint syntax in `MotionEncoder_tc.__init__()` parameter list.
  ## Changes
  - Line 647: Changed `num_heads=int` to `num_heads: int`
  - This corrects the parameter annotation from a default value assignment to proper type hint syntax
  ## Details
  The parameter was using assignment syntax (`=`) instead of type annotation syntax (`:`), which would incorrectly set the default value to the `int` class itself rather than annotating the expected type.
* Update amd nightly command in readme.
(comfyanonymous#10189)
* Add instructions to install nightly AMD pytorch for windows. (comfyanonymous#10190)
* Add instructions to install nightly AMD pytorch for windows.
* Update README.md
* fix(api-nodes): enable 2 more pylint rules, removed non needed code (comfyanonymous#10192)
* convert nodes_rodin.py to V3 schema (comfyanonymous#10195)
* convert nodes_stable3d.py to V3 schema (comfyanonymous#10204)
* Remove soundfile dependency. No more torchaudio load or save. (comfyanonymous#10210)
* fix(api-nodes): disable "std" mode for Kling2.5-turbo (comfyanonymous#10212)
* Remove useless code. (comfyanonymous#10223)
* Update template to 0.1.93 (comfyanonymous#10235)
* Update template to 0.1.92
* Update template to 0.1.93
* ComfyUI version 0.3.63
* fix(api-nodes): enable more pylint rules (comfyanonymous#10213)
* fix(api-nodes): allow negative_prompt PixVerse to be multiline (comfyanonymous#10196)
* convert nodes_pika.py to V3 schema (comfyanonymous#10216)
* convert nodes_kling.py to V3 schema (comfyanonymous#10236)
* Implement gemma 3 as a text encoder. (comfyanonymous#10241)
  Not useful yet.
* fix(ReCraft-API-node): allow custom multipart parser to return FormData (comfyanonymous#10244)
* feat(api-nodes): add Sora2 API node (comfyanonymous#10249)
* Temp fix for LTXV custom nodes. (comfyanonymous#10251)
* Bump frontend to 1.27.10 (comfyanonymous#10252)
* update template to 0.1.94 (comfyanonymous#10253)
* ComfyUI version 0.3.64
* feat(V3-io): allow Enum classes for Combo options (comfyanonymous#10237)
* Refactor model sampling sigmas code. (comfyanonymous#10250)
* Mvly/node update (comfyanonymous#10042)
* updated V2V node to allow for control image input
  exposing steps in v2v
  fixing guidance_scale as input parameter
  TODO: allow for motion_intensity as input param.
* refactor: comment out unsupported resolution and adjust default values in video nodes
* set control_after_generate
* adding new defaults
* fixes
* changed control_after_generate back to True
* changed control_after_generate back to False
  ---------
  Co-authored-by: thorsten <[email protected]>
* feat(api-nodes, pylint): use lazy formatting in logging functions (comfyanonymous#10248)
* convert nodes_model_downscale.py to V3 schema (comfyanonymous#10199)
* convert nodes_lora_extract.py to V3 schema (comfyanonymous#10182)
* convert nodes_compositing.py to V3 schema (comfyanonymous#10174)
* convert nodes_latent.py to V3 schema (comfyanonymous#10160)
* More surgical fix for comfyanonymous#10267 (comfyanonymous#10276)
* fix(v3,api-nodes): V3 schema typing; corrected Pika API nodes (comfyanonymous#10265)
* convert nodes_sd3.py and nodes_slg.py to V3 schema (comfyanonymous#10162)
* Fix bug with applying loras on fp8 scaled without fp8 ops. (comfyanonymous#10279)
* convert nodes_flux to V3 schema (comfyanonymous#10122)
* convert nodes_upscale_model.py to V3 schema (comfyanonymous#10149)
* Fix save audio nodes saving mono audio as stereo. (comfyanonymous#10289)
* feat(GeminiImage-ApiNode): add aspect_ratio and release version of model (comfyanonymous#10255)
* feat(api-nodes): add price extractor feature; small fixes to Kling & Pika nodes (comfyanonymous#10284)
* Update template to 0.1.95 (comfyanonymous#10294)
* Implement the mmaudio VAE. (comfyanonymous#10300)
* Improve AMD performance. (comfyanonymous#10302)
  I honestly have no idea why this improves things but it does.
* Update node docs to 0.3.0 (comfyanonymous#10318)
* update extra models paths example (comfyanonymous#10316)
* Update the extra_model_paths.yaml.example (comfyanonymous#10319)
* Always set diffusion model to eval() mode.
(comfyanonymous#10331)
* add indent=4 kwarg to json.dumps() (comfyanonymous#10307)
* WAN2.2: Fix cache VRAM leak on error (comfyanonymous#10308)
  Same change pattern as 7e8dd27 applied to WAN2.2
  If this suffers an exception (such as a VRAM OOM) it will leave the encode() and decode() methods, which skips the cleanup of the WAN feature cache. The comfy node cache then ultimately keeps a reference to this object, which in turn references large tensors from the failed execution.
  The feature cache is currently set up as a class variable on the encoder/decoder; however, the encode and decode functions always clear it on both entry and exit of normal execution.
  It's likely the design intent is for this to be usable as a streaming encoder where the input comes in batches; however, the functions as they are today don't support that.
  So simplify by bringing the cache back to a local variable, so that if it does VRAM OOM the cache itself is properly garbage collected when the encode()/decode() functions disappear from the stack.
* convert nodes_hunyuan.py to V3 schema (comfyanonymous#10136)
* Enable RDNA4 pytorch attention on ROCm 7.0 and up. (comfyanonymous#10332)
* Fix loading old stable diffusion ckpt files on newer numpy. (comfyanonymous#10333)
* Better memory estimation for the SD/Flux VAE on AMD. (comfyanonymous#10334)
* ComfyUI version 0.3.65
* Faster workflow cancelling. (comfyanonymous#10301)
* Python 3.14 instructions. (comfyanonymous#10337)
* api-nodes: fixed dynamic pricing format; import comfy_io directly (comfyanonymous#10336)
* Bump frontend to 1.28.6 (comfyanonymous#10345)
* gfx942 doesn't support fp8 operations.
(comfyanonymous#10348)
* Add TemporalScoreRescaling node (comfyanonymous#10351)
* Add TemporalScoreRescaling node
* Mention image generation in tsr_k's tooltip
* feat(api-nodes): add Veo3.1 model (comfyanonymous#10357)
* Latest pytorch stable is cu130 (comfyanonymous#10361)
* Fix order of inputs nested merge_nested_dicts (comfyanonymous#10362)
* refactor: Replace manual patches merging with merge_nested_dicts (comfyanonymous#10360)
* Bump frontend to 1.28.7 (comfyanonymous#10364)
* feat: deprecated API alert (comfyanonymous#10366)
* fix(api-nodes): remove "veo2" model from Veo3 node (comfyanonymous#10372)
* Workaround for nvidia issue where VAE uses 3x more memory on torch 2.9 (comfyanonymous#10373)
* workaround also works on cudnn 91200 (comfyanonymous#10375)
* Do batch_slice in EasyCache's apply_cache_diff (comfyanonymous#10376)
* execution: fold in dependency aware caching / Fix --cache-none with loops/lazy etc (comfyanonymous#10368)
* execution: fold in dependency aware caching
  This makes --cache-none compatible with lazy and expanded subgraphs.
  Currently the --cache-none option is powered by the DependencyAwareCache. The cache attempts to maintain a parallel copy of the execution list data structure; however, it is only set up once at the start of execution and does not get meaningful updates to the execution list.
  This causes multiple problems when --cache-none is used with lazy and expanded subgraphs, as the DAC does not accurately update its copy of the execution data structure.
  DAC attempts to handle subgraphs (ensure_subcache); however, this does not accurately connect to nodes outside the subgraph. The current semantics of DAC are to free a node ASAP after the dependent nodes are executed.
  This means that if a subgraph refs such a node it will be requeued and re-executed by the execution_list, but DAC won't see it in its to-free lists anymore and will leak memory.
  Rather than try to cover all the cases where the execution list changes from inside the cache, move the whole problem to the executor, which maintains an always up-to-date copy of the wanted data structure.
  The executor now has a fast-moving run-local cache of its own. Each _to node has its own mini cache, and the cache is unconditionally primed at the time of add_strong_link.
  add_strong_link is called for all of static workflows, lazy links and expanded subgraphs, so it's the singular source of truth for output dependencies.
  In the case of a cache-hit, the executor cache will hold the non-none value (it will respect updates if they happen somehow as well).
  In the case of a cache-miss, the executor caches a None and will wait for a notification to update the value when the node completes.
  When a node completes execution, it simply releases its mini-cache and in turn its strong refs on its direct ancestor outputs, allowing for ASAP freeing (same as the DependencyAwareCache but a little more automatic).
  This now allows for re-implementation of --cache-none with no cache at all. The dependency aware cache was also observing the dependency semantics for the objects and UI cache, which is not accurate (this entire logic was always outputs specific).
  This also prepares for more complex caching strategies (such as RAM pressure based caching), where a cache can implement any freeing strategy completely independently of the dependency-awareness requirement.
* main: re-implement --cache-none as no cache at all
  The execution list now tracks the dependency aware caching more correctly than the DependencyAwareCache. Change it to a cache that does nothing.
* test_execution: add --cache-none to the test suite
  --cache-none is now expected to work universally. Run it through the full unit test suite. Propagate the server parameterization for whether or not the server is capable of caching, so that the minority of tests that specifically check for cache hits can if else.
  Hard assert NOT caching in the else to give some coverage of --cache-none expected behaviour to not actually cache.
* convert nodes_controlnet.py to V3 schema (comfyanonymous#10202)
* Update Python 3.14 installation instructions (comfyanonymous#10385)
  Removed mention of installing pytorch nightly for Python 3.14.
* Disable torch compiler for cast_bias_weight function (comfyanonymous#10384)
* Disable torch compiler for cast_bias_weight function
* Fix torch compile.
* Turn off cuda malloc by default when --fast autotune is turned on. (comfyanonymous#10393)
* Fix batch size above 1 giving bad output in chroma radiance. (comfyanonymous#10394)
* Speed up chroma radiance. (comfyanonymous#10395)
* Pytorch is stupid. (comfyanonymous#10398)
* Deprecation warning on unused files (comfyanonymous#10387)
* only warn for unused files
* include internal extensions
* Update template to 0.2.1 (comfyanonymous#10413)
* Update template to 0.1.97
* Update template to 0.2.1
* Log message for cudnn disable on AMD. (comfyanonymous#10418)
* Revert "execution: fold in dependency aware caching / Fix --cache-none with l…" (comfyanonymous#10422)
  This reverts commit b1467da.
* ComfyUI version v0.3.66
* Only disable cudnn on newer AMD GPUs. (comfyanonymous#10437)
* Add custom node published subgraphs endpoint (comfyanonymous#10438)
* Add get_subgraphs_dir to ComfyExtension and PUBLISHED_SUBGRAPH_DIRS to nodes.py
* Created initial endpoints, although the returned paths are a bit off currently
* Fix path and actually return real data
* Sanitize returned /api/global_subgraphs entries
* Remove leftover function from early prototyping
* Remove added whitespace
* Add None check for sanitize_entry
* execution: fold in dependency aware caching / Fix --cache-none with loops/lazy etc (Resubmit) (comfyanonymous#10440)
* execution: fold in dependency aware caching
  This makes --cache-none compatible with lazy and expanded subgraphs.
  Currently the --cache-none option is powered by the DependencyAwareCache.
  The cache attempts to maintain a parallel copy of the execution list data structure; however, it is only set up once at the start of execution and does not get meaningful updates to the execution list.
  This causes multiple problems when --cache-none is used with lazy and expanded subgraphs, as the DAC does not accurately update its copy of the execution data structure.
  DAC attempts to handle subgraphs (ensure_subcache); however, this does not accurately connect to nodes outside the subgraph. The current semantics of DAC are to free a node ASAP after the dependent nodes are executed.
  This means that if a subgraph refs such a node it will be requeued and re-executed by the execution_list, but DAC won't see it in its to-free lists anymore and will leak memory.
  Rather than try to cover all the cases where the execution list changes from inside the cache, move the whole problem to the executor, which maintains an always up-to-date copy of the wanted data structure.
  The executor now has a fast-moving run-local cache of its own. Each _to node has its own mini cache, and the cache is unconditionally primed at the time of add_strong_link.
  add_strong_link is called for all of static workflows, lazy links and expanded subgraphs, so it's the singular source of truth for output dependencies.
  In the case of a cache-hit, the executor cache will hold the non-none value (it will respect updates if they happen somehow as well).
  In the case of a cache-miss, the executor caches a None and will wait for a notification to update the value when the node completes.
  When a node completes execution, it simply releases its mini-cache and in turn its strong refs on its direct ancestor outputs, allowing for ASAP freeing (same as the DependencyAwareCache but a little more automatic).
  This now allows for re-implementation of --cache-none with no cache at all.
  The dependency aware cache was also observing the dependency semantics for the objects and UI cache, which is not accurate (this entire logic was always outputs specific).
  This also prepares for more complex caching strategies (such as RAM pressure based caching), where a cache can implement any freeing strategy completely independently of the dependency-awareness requirement.
* main: re-implement --cache-none as no cache at all
  The execution list now tracks the dependency aware caching more correctly than the DependencyAwareCache. Change it to a cache that does nothing.
* test_execution: add --cache-none to the test suite
  --cache-none is now expected to work universally. Run it through the full unit test suite. Propagate the server parameterization for whether or not the server is capable of caching, so that the minority of tests that specifically check for cache hits can if else. Hard assert NOT caching in the else to give some coverage of --cache-none expected behaviour to not actually cache.
* Small readme improvement. (comfyanonymous#10442)
* WIP way to support multi multi dimensional latents.
(comfyanonymous#10456)
* Update template to 0.2.2 (comfyanonymous#10461)
  Fix template typo issue
* feat(api-nodes): network client v2: async ops, cancellation, downloads, refactor (comfyanonymous#10390)
* feat(api-nodes): implement new API client for V3 nodes
* feat(api-nodes): implement new API client for V3 nodes
* feat(api-nodes): implement new API client for V3 nodes
* converted WAN nodes to use new client; polishing
* fix(auth): do not leak authentication for the absolute urls
* convert BFL API nodes to use new API client; remove deprecated BFL nodes
* converted Google Veo nodes
* fix(Veo3.1 model): take into account "generate_audio" parameter
* convert Tripo API nodes to V3 schema (comfyanonymous#10469)
* Remove useless function (comfyanonymous#10472)
* convert Gemini API nodes to V3 schema (comfyanonymous#10476)
* Add warning for torch-directml usage (comfyanonymous#10482)
  Added a warning message about the state of torch-directml.
* Fix mistake. (comfyanonymous#10484)
* fix(api-nodes): random issues on Windows by capturing general OSError for retries (comfyanonymous#10486)
* Bump portable deps workflow to torch cu130 python 3.13.9 (comfyanonymous#10493)
* Add a bat to run comfyui portable without api nodes. (comfyanonymous#10504)
* Update template to 0.2.3 (comfyanonymous#10503)
* feat(api-nodes): add LTXV API nodes (comfyanonymous#10496)
* Update template to 0.2.4 (comfyanonymous#10505)
* frontend bump to 1.28.8 (comfyanonymous#10506)
* ComfyUI version v0.3.67
* Bump stable portable to cu130 python 3.13.9 (comfyanonymous#10508)
* Remove comfy api key from queue api. (comfyanonymous#10502)
* Tell users to update nvidia drivers if problem with portable. (comfyanonymous#10510)
* Tell users to update their nvidia drivers if portable doesn't start. (comfyanonymous#10518)
* Mixed Precision Quantization System (comfyanonymous#10498)
* Implement mixed precision operations with a registry design and metadata for quant spec in checkpoint.
* Updated design using Tensor Subclasses
* Fix FP8 MM
* An actually functional POC
* Remove CK reference and ensure correct compute dtype
* Update unit tests
* ruff lint
* Implement mixed precision operations with a registry design and metadata for quant spec in checkpoint.
* Updated design using Tensor Subclasses
* Fix FP8 MM
* An actually functional POC
* Remove CK reference and ensure correct compute dtype
* Update unit tests
* ruff lint
* Fix missing keys
* Rename quant dtype parameter
* Rename quant dtype parameter
* Fix unittests for CPU build
* execution: Allow subgraph nodes to execute multiple times (comfyanonymous#10499)
  In the case of --cache-none, lazy and subgraph execution can cause anything to be run multiple times per workflow. If the rerun node is itself a subgraph generator, this will crash for two reasons.
  First, pending_subgraph_results[] does not clean up entries after their use. So when a pending_subgraph_result is consumed, remove it from the list so that if the corresponding node is fully re-executed the lookup misses and it falls through to execute the node as it should.
  Secondly, there is an explicit enforcement against dups in the addition of subgraph nodes as ephemerals to the dynprompt. Remove this enforcement as the use case is now valid.
* convert nodes_recraft.py to V3 schema (comfyanonymous#10507)
* Speed up offloading using pinned memory. (comfyanonymous#10526)
  To enable this feature use: --fast pinned_memory
* Fix issue.
(comfyanonymous#10527)
---------
Co-authored-by: comfyanonymous <[email protected]>
Co-authored-by: Alexander Piskun <[email protected]>
Co-authored-by: comfyanonymous <[email protected]>
Co-authored-by: ComfyUI Wiki <[email protected]>
Co-authored-by: Jedrzej Kosinski <[email protected]>
Co-authored-by: Benjamin Lu <[email protected]>
Co-authored-by: Jukka Seppänen <[email protected]>
Co-authored-by: Kimbing Ng <[email protected]>
Co-authored-by: blepping <[email protected]>
Co-authored-by: rattus128 <[email protected]>
Co-authored-by: DELUXA <[email protected]>
Co-authored-by: Jodh Singh <[email protected]>
Co-authored-by: Christian Byrne <[email protected]>
Co-authored-by: ThereforeGames <[email protected]>
Co-authored-by: Kohaku-Blueleaf <[email protected]>
Co-authored-by: Changrz <[email protected]>
Co-authored-by: Guy Niv <[email protected]>
Co-authored-by: Yoland Yan <[email protected]>
Co-authored-by: Rui Wang (王瑞) <[email protected]>
Co-authored-by: AustinMroz <[email protected]>
Co-authored-by: Koratahiu <[email protected]>
Co-authored-by: Finn-Hecker <[email protected]>
Co-authored-by: filtered <[email protected]>
Co-authored-by: thorsten <[email protected]>
Co-authored-by: Daniel Harte <[email protected]>
Co-authored-by: Arjan Singh <[email protected]>
Co-authored-by: chaObserv <[email protected]>
Co-authored-by: Faych <[email protected]>
Co-authored-by: Rizumu Ayaka <[email protected]>
Co-authored-by: contentis <[email protected]>
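
For readers following the finalizer fix in comfyanonymous#9979 listed above, here is a minimal sketch of what "explicitly detaching" a finalizer looks like with the standard library's `weakref.finalize`. It is not ComfyUI's code: `LoadedModel`, `unload`, and `cleanup_log` are hypothetical names used only for illustration.

```python
import weakref


class LoadedModel:
    """Hypothetical stand-in for a model wrapper that owns a finalizer."""

    def __init__(self, name, cleanup_log):
        self.name = name
        # weakref.finalize keeps the callback and its arguments alive
        # until the finalizer either runs or is detached.
        self.finalizer = weakref.finalize(self, cleanup_log.append, name)


def unload(model):
    """Explicitly detach the finalizer when the model is unloaded."""
    model.finalizer.detach()


cleanup_log = []
model = LoadedModel("unet", cleanup_log)
assert model.finalizer.alive

unload(model)
# Once detached, the finalizer is dead and its callback will never run.
assert not model.finalizer.alive

del model
assert cleanup_log == []
```

Whether ComfyUI's leak came from the callback, its arguments, or something else is not stated in the commit message, so treat this only as an illustration of the `detach()` call itself.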

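
The "leave the except block with nothing but a flag" pattern described in comfyanonymous#10139 above can be sketched as follows. This is an illustration under assumptions, not the actual VAE code: `decode_full` and `decode_tiled` are hypothetical callables, and `MemoryError` stands in for whatever OOM exception the real code catches.

```python
def decode_with_fallback(decode_full, decode_tiled, samples):
    """Try a full decode; on an out-of-memory error, retry with tiles.

    decode_full and decode_tiled are hypothetical callables, and
    MemoryError is a stand-in for the real OOM exception type.
    """
    use_tiled = False
    try:
        return decode_full(samples)
    except MemoryError:
        # Only record a flag here. While the except block is active, the
        # exception's traceback still references the frames (and local
        # variables) of the failed call, so no retry work is done inside it.
        use_tiled = True

    # The except block has been left, so the exception and its traceback
    # are no longer held by this frame when the fallback runs.
    if use_tiled:
        return decode_tiled(samples)
```

Running the fallback inside the `except` block would behave the same functionally; the difference is only in what stays referenced while the retry executes.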

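
The WAN cache change in comfyanonymous#10141 and comfyanonymous#10308 above moves the feature cache from a class variable to a local variable. A toy comparison, with hypothetical class names and a `RuntimeError` standing in for a VRAM OOM, might look like this:

```python
class ClassCacheDecoder:
    """Hypothetical decoder whose feature cache lives on the class."""

    feat_cache = None

    def decode(self, chunks):
        ClassCacheDecoder.feat_cache = []  # cleared on entry
        for chunk in chunks:
            if chunk is None:
                raise RuntimeError("simulated OOM")
            ClassCacheDecoder.feat_cache.append(chunk)
        result = list(ClassCacheDecoder.feat_cache)
        ClassCacheDecoder.feat_cache = None  # cleared on normal exit only
        return result


class LocalCacheDecoder:
    """Hypothetical decoder whose feature cache is a local variable."""

    def decode(self, chunks):
        feat_cache = []  # lives only as long as this call's frame
        for chunk in chunks:
            if chunk is None:
                raise RuntimeError("simulated OOM")
            feat_cache.append(chunk)
        return list(feat_cache)


try:
    ClassCacheDecoder().decode(["a", None])
except RuntimeError:
    pass

# The class-level cache survived the failed call and still holds data.
assert ClassCacheDecoder.feat_cache == ["a"]
```

With the local variant there is no attribute left behind after a failure; the list becomes collectable once the exception and its traceback are gone.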

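
The difference between `num_heads=int` and `num_heads: int` fixed in comfyanonymous#10186 above is easy to check with `inspect`. The two functions below are hypothetical stand-ins for the `__init__` parameter in question:

```python
import inspect


def before_fix(num_heads=int):
    """Assignment syntax: the default VALUE is the int class itself."""
    return num_heads


def after_fix(num_heads: int):
    """Annotation syntax: no default, the expected TYPE is int."""
    return num_heads


# Calling without an argument silently returns the int class.
assert before_fix() is int

params = inspect.signature(after_fix).parameters
assert params["num_heads"].default is inspect.Parameter.empty
assert params["num_heads"].annotation is int
```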

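
The first half of the subgraph fix in comfyanonymous#10499 above (remove a pending result when it is consumed so a later re-execution misses the lookup) can be sketched like this. The sketch assumes the pending results are a dictionary keyed by node id, which the commit message does not confirm; `consume_pending` is a hypothetical helper.

```python
def consume_pending(pending_subgraph_results, node_id):
    """Return the pending result for node_id and remove it in one step.

    Assumes a dict keyed by node id. Returns None when nothing is pending,
    so the caller can fall through and execute the node normally.
    """
    return pending_subgraph_results.pop(node_id, None)


pending = {"node-7": ["subgraph output"]}

# First execution: the pending result is consumed and removed.
assert consume_pending(pending, "node-7") == ["subgraph output"]

# Re-execution of the same node: the lookup misses, so the node would
# be executed from scratch instead of reusing a stale entry.
assert consume_pending(pending, "node-7") is None
assert pending == {}
```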



I honestly have no idea why this improves things but it does.