
Conversation

@comfyanonymous
Owner

I honestly have no idea why this improves things but it does.

@comfyanonymous comfyanonymous merged commit a125cd8 into master Oct 12, 2025
12 checks passed
@comfyanonymous comfyanonymous deleted the comfyanonymous-patch-2 branch October 12, 2025 04:28
@RandomGitUser321
Contributor

It's related to MIOpen: ROCm/TheRock#1542

I'm not 100% positive, but I think torch.backends.cudnn.enabled = False just disables it or something along those lines. Last I remember, I think they are looking into some of the issues that cause things like slow VAE encode/decodes, so this bandaid fix might only be needed temporarily.
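
For anyone who wants to poke at what the patch actually toggles, it boils down to this one PyTorch flag; a minimal sketch for experimenting, not ComfyUI's actual startup code:

    import torch

    # On ROCm builds, the cuDNN backend flag is what gates MIOpen-backed convolutions
    # (per the MIOpen issue linked above), so flipping it is effectively what this PR does.
    print(torch.backends.cudnn.is_available(), torch.backends.cudnn.enabled)
    torch.backends.cudnn.enabled = False  # route convolutions away from cuDNN/MIOpen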

toxicwind pushed a commit to toxicwind/ComfyUI that referenced this pull request Oct 12, 2025
I honestly have no idea why this improves things but it does.
@daniandtheweb

daniandtheweb commented Oct 12, 2025

When using flash-attention (built from the ROCm repo, main_perf branch) this commit slows inference by a lot on my 7800XT:

  • Previously: flash-attn sdxl 1024x1024 ~2 it/s
  • Now: flash-attn sdxl 1024x1024 ~1.5 it/s

Maybe this new behavior can be disabled in case flash-attention is used?

@ThGrSoRu

This causes an OOM crash on my Linux 6700XT setup when VAE decoding starts.
I've just tested it by undoing only this commit, after which image generation was successful again.

@sfinktah

sfinktah commented Oct 13, 2025

@comfyanonymous

Please reverse this commit. We have done extensive testing over a period of months, and I can assure you that your assertion is incorrect.

Disabling cudnn only accelerates VAE decoding and VAE encoding, and has detrimental side effects on other operations, including slowing down initial ZLUDA compiles by a literal order of magnitude (we're talking almost an hour here). Though I realise ZLUDA will not be affected by this commit, the detrimental effects of blanket-disabling cudnn are equally applicable to native AMD pytorch installations.

image

Please see the many highly detailed graphs showing timing attached to the PR below, where we briefly toyed with the identical patch that you have just implemented.

patientx#272

We have developed many solutions for dealing with slow VAE decodes, the latest being this custom extension that dynamically toggles off cudnn during VAE encoding and decoding, and only when using an AMD GPU.

https://github.com/sfinktah/ovum-cudnn-wrapper

I am also the developer of a timing node, https://github.com/sfinktah/comfy-ovum which allows me to compare the performance of every element of a workflow with and without cudnn (cudnn off is red, cudnn on is green).

image

That example (which I plucked at random) shows how different nodes react either positively or negatively to having cudnn disabled.

I can prepare detailed graphs showing timing and memory usage for any workflow you care to nominate, on any platform you care to nominate, if that proves necessary.

A more useful AMD helper would optionally replicate the functionality of my aforementioned cudnn-wrapper within your python core code.

@ExtraCubedPotato

RX7900 XT user here, just adding on to the pile that I'm also experiencing issues after updating to this commit. Gens that would normally take ~30 seconds to finish are taking ~43 seconds. (It was also causing OOM on VAE decode and falling back to tiled.)

Reverting the file completely to what it was 3 weeks ago brought back expected performance.

@RandomGitUser321
Contributor

RandomGitUser321 commented Oct 13, 2025

I think it heavily depends on which wheels you are using and which OS you're on. Like if you're on Linux, you might be using TheRock or the official PyTorch ones. On Windows, you have a lot of people using different methods like ZLUDA, the older Scott wheels or TheRock wheels, or people might be using WSL with various Linux wheels, etc etc.

In my testing, I found the following run.bat script to work most reliably (Windows 11 and TheRock nightly wheels with a 7900xt):

set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
set MIOPEN_FIND_MODE=FAST
set MIOPEN_LOG_LEVEL=3
.\python_embeded\python.exe -s ComfyUI/main.py --disable-smart-memory --use-pytorch-cross-attention
pause

You may or may not need --fp32-vae depending on the model you're working with, or you might get all-black decodes. You can also use a node to switch the VAE, like Kijai's VAE loader node. As far as I know, though, you have to make sure set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 is enabled and that --use-pytorch-cross-attention is there as well. I found setting MIOpen to FAST there to work best, though I'm not 100% positive of the impact it might have on things like VRAM usage. The log level is just there to hush any potential console spam.

Using the above config:
With torch.backends.cudnn.enabled = False: 17.6GB VRAM usage while diffusing = 84s.
Commenting it out: 14.7GB VRAM usage while diffusing = 83s.
The images are slightly different, due to the way these kinds of optimizations work.

So it seems like torch.backends.cudnn.enabled = False is causing it to consume a lot more VRAM than before. This might be what's causing issues for some.

Oh and this is with:

pytorch version: 2.10.0a0+rocm7.10.0a20251013
AMD arch: gfx1100
ROCm version: (7, 1)

@comfyanonymous
Owner Author

This makes the AMD experience on par with Nvidia and is a massive user experience improvement, so I'm not reverting this. I also don't see any slowdowns or memory usage increases for SDXL or any other models I have tried on my RDNA 3 and RDNA 4 setups.

@alexheretic
Contributor

I noticed upscale image with model performance regressed with this patch: ~5s/it -> ~32s/it, which is quite significant when upscaling videos.

I actually noticed a similar regression when I tried rocm 7, which is why I'm still using 6.4.

Total VRAM 16368 MB, total RAM 64217 MB                                                                                                                                   
pytorch version: 2.9.0.dev20250827+rocm6.4                                                                                                                                
AMD arch: gfx1100                                                                                                                                                         
ROCm version: (6, 4)                                                                                                                                                      
Set vram state to: NORMAL_VRAM                                                                                                                                            
Device: cuda:0 AMD Radeon RX 7900 GRE : native                                                                                                                            
Using Flash Attention                                                                                                                                                     
Python version: 3.12.11 (main, Jun  4 2025, 10:32:37) [GCC 15.1.1 20250425]                                                                                               
ComfyUI version: 0.3.65                                                                                                                                                   
ComfyUI frontend version: 1.27.10

@RandomGitUser321
Contributor

~5s/it -> ~32s/it which is quite significant when upscaling videos.

You probably ran out of VRAM, which might be related to my earlier post. It's also possible that something went from being fp16 to bf16 or fp32, which would make it take up more VRAM.

gmaOCR pushed a commit to gmaOCR/ComfyUI that referenced this pull request Oct 14, 2025
I honestly have no idea why this improves things but it does.
@alexheretic
Contributor

It seems related to torch.backends.cudnn.enabled, though I'm not sure what effect this has exactly. Previously I had upscale image with model perf ~5s/it.

Now:

  • torch.backends.cudnn.enabled = False -> ~32s/it
  • torch.backends.cudnn.enabled = True -> ~5s/it

@xzuyn

xzuyn commented Oct 14, 2025

This makes the AMD experience on par with Nvidia and is a massive user experience improvement, so I'm not reverting this.

There should be a flag to re-enable it though.

@sfinktah

sfinktah commented Oct 15, 2025

@comfyanonymous

This makes the AMD experience on par with Nvidia and is a massive user experience improvement, so I'm not reverting this. I also don't see any slowdowns or memory usage increases for SDXL or any other models I have tried on my RDNA 3 and RDNA 4 setups.

I trust you have nothing against some healthy peer review? Would you care to send me your WAN 2.2 workflow, GPU specs (for the RDNA 3), and the time taken? WAN 2.2 seems the most relevant model these days.

Did any of your tests include an "Upscale with Model" with ESRGAN as the model, or (even better) one of these:
image

I'm not sure that one (FILM VFI -- film_net_fp32.pt) will even actually complete without cuDNN, but if you have issues, the RIFE version of that node is more forgiving about such things.

That said, those timings and observations were made with ZLUDA, and I would want to do more testing on a native ROCm system before committing myself. And since your workflows appear to be the stick by which we must measure these things, I will await those before commencing.

@sfinktah

It seems related to torch.backends.cudnn.enabled, though I'm not sure what effect this has exactly. Previously I had upscale image with model perf ~5s/it.

Now:

  • torch.backends.cudnn.enabled = False -> ~32s/it
  • torch.backends.cudnn.enabled = True -> ~5s/it

Within the ZLUDA userbase, we have noted that resizing and upscaling are the two things that really slow down without cuDNN. Though as I just said to comfyanonymous, I haven't tested that in a native ROCm environment. At least with ZLUDA one can reason that it has something to do with the cudnn emulation; the similar effect on native AMD systems is a little harder to explain (assuming you are running native AMD, of course).

@comfyanonymous
Owner Author

I have tested SDXL (standard 1024x1024 workflow), lumina 2.0 (neta yume 3.5 workflow) and flux-dev (standard workflow).

I tried the wan models but on an MI300X where setting cudnn to False also improved the first run and didn't seem to slow down anything.

All my tests were on nightly pytorch rocm 7.0 from the pytorch website on Linux.

For windows I tried the nightly AMD wheel for the strix halo (you can find the way to install it in the comfyui readme).

adlerfaulkner pushed a commit to LucaLabsInc/ComfyUI that referenced this pull request Oct 16, 2025
I honestly have no idea why this improves things but it does.
@alexheretic
Contributor

alexheretic commented Oct 16, 2025

I can try re-testing with the latest rocm/pytorch. I do use MIOPEN_FIND_MODE=FAST; it was suggested earlier in the thread that disabling cudnn disables MIOpen. If so, it's possibly a better default to set MIOPEN_FIND_MODE rather than disabling cudnn. See #10302 (comment)

@alexheretic
Contributor

alexheretic commented Oct 16, 2025

I did some tests using latest https://rocm.nightlies.amd.com/v2/gfx110X-dgpu pytorch. I think this can help explain the difference of opinion in this thread.

It seems newer pytorch+rocm versions have a significant regression in performance that is somewhat mitigated by disabling cudnn. However, for earlier pytorch+rocm this mitigation has a negative effect as the performance is much better with default cudnn.

Note: I used a smaller test video so the numbers are slightly different than those I mentioned earlier.

pytorch version: 2.10.0a0+rocm7.10.0a20251015

  • cudnn default: ImageUpscaleWithModel 81.14s/it
  • cudnn = False: ImageUpscaleWithModel 12.64s/it

pytorch version: 2.9.0.dev20250827+rocm6.4

  • cudnn default: ImageUpscaleWithModel 1.84s/it
  • cudnn = False: ImageUpscaleWithModel 11.95s/it

So it may make sense to disable cudnn only for newer pytorch/rocm versions and to ask upstream to find the root cause of the regression so cudnn doesn't need to be disabled anywhere.
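
As a rough sketch of what that could look like (the threshold and the use of torch.version.hip are my assumptions, not tested ComfyUI logic):

    import torch

    def maybe_disable_cudnn_for_new_rocm(threshold=(7, 0)):
        # torch.version.hip is a version string like "6.4...." on ROCm builds, None on CUDA builds
        hip = getattr(torch.version, "hip", None)
        if hip is None:
            return  # not a ROCm build, leave cuDNN alone
        if tuple(int(p) for p in hip.split(".")[:2]) >= threshold:
            torch.backends.cudnn.enabled = False

    maybe_disable_cudnn_for_new_rocm()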

@RandomGitUser321
Contributor

It seems newer pytorch+rocm versions have a significant regression in performance that is somewhat mitigated by disabling cudnn. However, for earlier pytorch+rocm this mitigation has a negative effect as the performance is much better with default cudnn.

If you're using Windows, that's probably due to the older versions not having AOTriton enabled. It only recently got enabled by default on Windows; before, you had to build with specific args for it. MIOpen is part of that mix, as far as I know.

@sfinktah

I have tested SDXL (standard 1024x1024 workflow), lumina 2.0 (neta yume 3.5 workflow) and flux-dev (standard workflow).

I tried the wan models but on an MI300X where setting cudnn to False also improved the first run and didn't seem to slow down anything.

All my tests were on nightly pytorch rocm 7.0 from the pytorch website on Linux.

For windows I tried the nightly AMD wheel for the strix halo (you can find the way to install it in the comfyui readme).

Totally aside from the discussion about cudnn, for which I bow to @alexheretic, I'd be genuinely interested in comparing the performance of WAN 2.2 between the two extremes of nightly rocm on Linux (you) and pytorch 2.7/hip 6.2/Windows/ZLUDA (me), provided you have an RDNA 3 card that is roughly equal to my 7900 XTX. Or perhaps someone else reading this can oblige me?

Also, if you have achieved good RDNA 4 performance on any platform: there are always people raising ZLUDA issues about gfx1200 and gfx1201, even gfx1151 (Strix), and as of about 2 months ago their reports of Linux support were not inspiring. Since people only complain when things don't work, it would be nice to know the current state of play wrt Linux ROCm.

@alexheretic
Contributor

If you're using Windows

I'm using Linux.

@sfinktah

It seems newer pytorch+rocm versions have a significant regression in performance that is somewhat mitigated by disabling cudnn. However, for earlier pytorch+rocm this mitigation has a negative effect as the performance is much better with default cudnn.

If you're using Windows, that's probably due to the older versions not having AOTriton enabled. It only recently got merged into things for Windows by default. Before, you had to manually build for it with specific args. MIOpen is a part of that mix, as far as I know.

That's interesting to know. Honestly I have avoided trying any of the ROCm 7-based pytorch builds for Windows because there's no MIOpen.dll in AMD's ROCm SDK, ergo no Triton and no torch inductor. That holds true for 6.4 unfortunately, unless there have been recent developments.

@sfinktah

This makes the AMD experience on par with Nvidia and is a massive user experience improvement, so I'm not reverting this.

There should be a flag to re-enable it though.

If you care to search the ComfyUI repo for ovum or grab it from https://github.com/sfinktah/comfy-ovum, there is a node specifically for enabling and disabling cudnn. It has the added bonus of being tied into my Timer node, which will record performance in red or green for cudnn disabled and enabled respectively.

image

Or you can just deploy a single node at the start of your workflow so that cudnn is in a known state. There is also an AMD/NVIDIA switching node (haven't really tested that one) if you want to get conditional about it.

image

@xzuyn

xzuyn commented Oct 16, 2025

If only the VAE encode/decode is the problem, maybe it would be best to wrap the encode/decode in with torch.backends.cudnn.flags(enabled=False): instead of completely disabling it?
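
For what it's worth, a minimal sketch of that idea; vae and decode here are stand-ins rather than ComfyUI's actual internals:

    import torch

    def decode_without_cudnn(vae, latent):
        # cuDNN/MIOpen stays enabled globally and is only switched off for the
        # duration of this call, then restored when the context manager exits.
        with torch.backends.cudnn.flags(enabled=False):
            return vae.decode(latent)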

@sfinktah

sfinktah commented Oct 18, 2025

@comfyanonymous We've heard from a number of people who have pointed out that this patch is only a net benefit for specific versions of rocm on specific platforms. Will you not consider:

  1. altering the patch to specifically target those cases; or
  2. printing a notification to the console that cudnn has been disabled and that this behavior can be altered by an env var; or
  3. anything that doesn't silently make a substantial number of AMD users even more frustrated with owning an AMD GPU?

@lostdisc

On my end, I can confirm this does eliminate SDXL's really slow first run for a given image resolution (30-60 mins of mostly VAE, depending on image size); see also #5759 and ROCm issue 5754. Fixing that is a big win for first impressions. But I can sympathize with the desire for a more-targeted fix, if it's affecting other setups and use cases.

I see bf16-vae is also automatically enabling for me now, which is handy; I was using the launch flag before.

I'm on a Ryzen AI HX 370 (Strix Point) on Windows, with ROCm 6.4.4 PyTorch from AMD repos, following these instructions.

Image-generation time has become shorter for large pics, even though VAE tiling kicks in at a smaller image size. Previously, if I assigned 24GB RAM (out of 32 total) to the GPU, I could often do 1600x1280 and 1920x1088 (similar areas) without tiling, and a 30-step gen took about 300s (5 min). Now it always resorts to tiling for those sizes, but the time is down to about 220s (<4 min).

Interestingly, if I change the CPU/GPU RAM split to 16/16 instead, I can do 1600x1280 without tiling, and it shaves off a bit more time. However, attempting 1920x1280 with a 16/16 split hits the RAM ceiling and starts freezing Windows, whereas an 8/24 split is able to safely switch to tiled VAE and finish in about 300s. Before the update, 8/24 could do 1920x1280 without tiling, but it took something like 360-380s, if I recall correctly.

1024x1024 remained pretty much the same speed before/after the update, at about 1 min for 20 steps.

I notice one of the other commits for this release improved the memory estimation for VAE on AMD by adding a 2.73x multiplier when AMD is detected. Does that correspond to the increased VAE memory demands from this change?

@comfyanonymous
Owner Author

I notice one of the other commits for this release (#10334) improved the memory estimation for VAE on AMD by adding a 2.73x multiplier when AMD is detected. Does that correspond to the increased VAE memory demands from this change?

No, the VAE memory estimation has always been off on AMD.

@comfyanonymous We've heard from a number of people who have pointed out that this patch is only a net benefit for specific versions of rocm on specific platforms. Will you not consider:

With the cudnn flag enabled, the VAE and the upscale models make my computer unusable for a few minutes the first time I run them with a specific resolution; until they fix that, it's staying off.

@sfinktah

sfinktah commented Oct 19, 2025

@comfyanonymous

With the cudnn flag enabled, the VAE and the upscale models make my computer unusable for a few minutes the first time I run them with a specific resolution; until they fix that, it's staying off.

Well, it's like... your repo... but could you add a logger output line so that people know it's happening, and perhaps an override? This is what I added to zluda.py in the end, you can default it the other way around obviously.

    # This needs to be up here, so it can disable cudnn before anything can even think about using it
    torch.backends.cudnn.enabled = os.environ.get("TORCH_BACKENDS_CUDNN_ENABLED", "1").strip().lower() not in {"0", "off", "false", "disable", "disabled", "no"}
    if torch.backends.cudnn.enabled:
        print("  ::  Enabled cuDNN")
    else:
        print("  ::  Disabled cuDNN")

I can adapt and submit that as a PR if you are willing? Just lmk what to name the environment variable.

@Only8Bits

With the cudnn flag enabled, the VAE and the upscale models make my computer unusable for a few minutes the first time I run them with a specific resolution; until they fix that, it's staying off.

There is nothing to fix here. It's how MIOpen works on AMD. Reference: https://rocm.docs.amd.com/projects/MIOpen/en/latest/how-to/find-and-immediate.html

Short version: AMD has two card types, pro and consumer. On all cards the default behaviour is to search for the optimal compute solution to any given matrix problem the first time it is encountered; that takes both time and VRAM. After that it's stored in a local database and looked up, so it's fast. The problem is that the MIOpen database is per-version: once the version is bumped, the old database is ignored and a new one is started, and each new major ROCm release usually has a new MIOpen version. There is a database of optimal compute solutions for pro cards but not consumer ones (maybe it'll change someday, but it hasn't for years), so Radeons need to run a lot of computations to figure out what is what. Even for pro cards, the AMD database does not have 100% coverage of all possible matrix sizes and data types.

Rather than breaking the whole app, you can set MIOPEN_FIND_MODE="FAST" (or 2) on AMD to just skip the solution search on each new problem. On pro cards you won't even feel it much since these are pretty well profiled by AMD, so any long-term performance loss from suboptimal solution picks is minimal. On Radeons that search is quite time consuming (especially for training and the bigger matrix sizes in SDXL; nobody has the VRAM to run WAN in 720p mode locally on Radeons, and those problems tend to be smaller and faster to find), but it usually results in 5% or so reduced compute times long term once an optimal solution is found.

TLDR: Use MIOPEN_FIND_MODE or just accept that (sometimes massive) slowdown on each first run in exchange for a few % better performance long term. It's not a ComfyUI issue, it's how MIOpen works. Consumer cards suffer the most since there is no global database from AMD, but they also tend to find better solutions than the default ones, so it's usually worth it.
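
For anyone launching from Python rather than a script, the same knob can be set before the first convolution runs (a sketch, assuming MIOpen hasn't been initialised yet in the process):

    import os

    # Respect a value the user already exported; "FAST" (a.k.a. 2) skips the
    # per-shape solution search described above at a small long-term perf cost.
    os.environ.setdefault("MIOPEN_FIND_MODE", "FAST")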

@derfasthirnlosenick

@comfyanonymous

With the cudnn flag enabled, the VAE and the upscale models make my computer unusable for a few minutes the first time I run them with a specific resolution; until they fix that, it's staying off.

Well, it's like... your repo... but could you add a logger output line so that people know it's happening, and perhaps an override? This is what I added to zluda.py in the end, you can default it the other way around obviously.

    # This needs to be up here, so it can disable cudnn before anything can even think about using it
    torch.backends.cudnn.enabled = os.environ.get("TORCH_BACKENDS_CUDNN_ENABLED", "1").strip().lower() not in {"0", "off", "false", "disable", "disabled", "no"}
    if torch.backends.cudnn.enabled:
        print("  ::  Enabled cuDNN")
    else:
        print("  ::  Disabled cuDNN")

I can adapt and submit that as a PR if you are willing? Just lmk what to name the environment variable.

Yeah, I think that's the most sensible solution. Re-enabling cudnn gave my simple SDXL workflow a boost from 1.15 it/s to 1.6... (6800 XT)

@lostdisc

Looks like comfyanon added a log message for v0.3.66, at least.

@sfinktah

sfinktah commented Oct 23, 2025

Well, that's something. Unfortunately it seems I can't make a PR, as I already have an actively developed ZLUDA fork of ComfyUI and GitHub won't let me make another fork. I have submitted a PR under another account: #10463

try:
    arch = torch.cuda.get_device_properties(get_torch_device()).gcnArchName
    if not (any((a in arch) for a in AMD_RDNA2_AND_OLDER_ARCH)):
        torch.backends.cudnn.enabled = os.environ.get("TORCH_AMD_CUDNN_ENABLED", "0").strip().lower() not in {
            "0", "off", "false", "disable", "disabled", "no"}
        if not torch.backends.cudnn.enabled:
            logging.info(
                "ComfyUI has set torch.backends.cudnn.enabled to False for better AMD performance. Set environment var TORCH_AMD_CUDDNN_ENABLED=1 to enable it again.")

@alexheretic
Contributor

I've proposed #10448 which has the benefit of being easy to override. However, this requires some testing to verify that setting MIOPEN_FIND_MODE=FAST is a good enough alternative. On my Linux 7900gre rocm6.4 + flash-attn setup, at least, it is.

comfy-ovum pushed a commit to comfy-ovum/ComfyUI that referenced this pull request Oct 24, 2025
@Only8Bits

Only8Bits commented Oct 24, 2025

MIOPEN_FIND_MODE=FAST only "fixes" the very first run on a new MIOpen version install, nothing else. Sadly it does not fix the VAE issue in ComfyUI. I've run some tests with WanWrapper:

System: Kubuntu 24.04 + ROCm 7.0.2 + AMD DKMS driver + Pytorch 2.8 from ROCm
Hardware: 7900XT + 5800X3D, 32G RAM

  • --cache-none, Pytorch SDPA used

    • model Wan2.2-I2V-A14B-Q8_0 GGUF (10/40), steps 10+10, Triton JIT, len 81, 832x480
      • 19m:44s (118.47s/it) + 19m:58s (119.83s/it), total 1h:08m:33s
    • model Wan2.2-I2V-A14B-Q8_0 GGUF (25/40), steps 10+10, Triton JIT, len 81, 960x720
      • 1h:04m:49s (388.91s/it) + 1h:05m:24s (392.42s/it), total 3h:03m:51s
  • --cache-none, SageAttention used

    • model I2V-A14B-Q6_K GGUF + LoRA, steps 12+12, JIT, len 81, 608x480
      • 10m:38s (53.20s/it) + 10m:19s (51.61s/it), total 22m:41s [VAE=10.80s+16.43s]
    • model I2V-A14B-Q6_K GGUF (10/40) + LoRA, steps 12+12, JIT, len 81, 832x480
      • 17m:41s (88.45s/it) + 17m:44s (88.73s/it), total 1h:05m:05s [VAE=651.28s+1048.36s]
  • --cache-none, SageAttention with torch.backends.cudnn.enabled = False

    • model I2V-A14B-Q6_K GGUF (8/40) + LoRA, steps 12+12, JIT, len 81, 832x480
      • 18m:50s (94.18s/it) + 17m:53s (89.49s/it), total 38m:22s [VAE=8.74s+13.81s]

EDIT: Small explanation, (xx/40) means that xx blocks were RAM swapped to fit the project into VRAM. This affects the inference times somewhat.

This used WanWrapper, so all custom nodes; it's not an issue specific to the Comfy VAE node. But it is tied to VAE, as the inference times seem unaffected by the torch.backends.cudnn.enabled state. The difference is rather massive: long VAE times can make the run 2x longer, but mostly for bigger resolutions. For 608x480 and 480x480, VAE still seems reasonable even with cudnn enabled.

So it could be a ROCm issue or a Pytorch issue, but I would guess it has to do with the mixed precision used in the VAE? I've tried the VAE in both BF16 and FP16 (cast from the FP32 model) and there doesn't seem to be much difference.

TLDR: On RDNA3/Linux at least, torch.backends.cudnn.enabled must be False if you want reasonable VAE times. Seems like a bug deeper in the libraries rather than a Comfy issue.

@sfinktah

@Only8Bits we (the ZLUDA community) use a cuDNN-disabling wrapper or node pair to disable cuDNN during VAE decoding. It's distributed with the install procedure, but I replicated a stand-alone version that works quite handily. See #10302 (comment)

We have also had MIOPEN_FIND_MODE=2 as part of the standard launch scripts for a while, and you are very correct in what you say: VAE decoding is slower with cuDNN enabled. I can't offer any opinions on the "why" of it, though I have often thought Flux/Chroma's fp32 VAE works surprisingly fast (about 1.5 seconds).

We do have a node that automatically disables cuDNN only for VAE operations (and only on AMD), but it's new and doesn't always work. Wrapping other nodes is a tricky business and could probably be done much better from the Python side: ovum-cudnn-wrapper. It does, however, look pretty and comes preset with the smarts to work out which nodes it needs to wrap.

image

It's currently pending its tick of wonderfulness in the ComfyUI repo, I can only assume because it modifies other nodes.

@alexheretic
Contributor

I do tend to use tiled VAE everywhere, including for wan encodes as untiled VAE perf can indeed be bad. My tiled vae perf seems ok with cudnn on. If I have time I'll try to test cudnn-off's effect on untiled VAE.

@sfinktah

sfinktah commented Oct 25, 2025

@alexheretic I have a lovely node timer that will mark things red or green based on whether they were run with cuDNN on or off, as long as you are using the included cuDNN toggler anyway. The difference is probably about a 50% speed increase, if it exists. comfy-ovum in the registry, the Timer node. Incredibly handy, as it keeps a history of the last 100 runs, lets you add notes, and has a JSON RESTful API for retrieving that data to use in fancy graphs.

And oh yes, TILED VAE always... it's on my list to build an automatic "convert to tiled VAE" do-over, because sometimes I forget and don't notice that it's taking 217 seconds to do VAE decoding. Talking of which, I need to go fix a workflow!

image

What nobody has brought up in this conversation is the other cudnn field, cudnn.benchmark (set to False by default, so not generally used by AMD or NVIDIA). I guess it does something similar to the MIOPEN flag, so possibly it has no effect on AMD.
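
If anyone wants to test that knob, it's just the stock PyTorch flag, nothing ComfyUI-specific:

    import torch

    # benchmark mode lets cuDNN time several algorithms per new input shape and cache
    # the winner, conceptually similar to MIOpen's find step; it is off by default.
    torch.backends.cudnn.benchmark = True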

@alexheretic
Contributor

VAE benches for wan & sdxl

System info

Using #7764, #10238 on 426cde3

Total VRAM 16368 MB, total RAM 64217 MB
pytorch version: 2.9.0+rocm6.4
AMD arch: gfx1100
ROCm version: (6, 4)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 GRE : native
Using Flash Attention
Python version: 3.13.7 (main, Aug 15 2025, 12:34:02) [GCC 15.2.1 20250813]
ComfyUI version: 0.3.66
ComfyUI frontend version: 1.28.7

env vars

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE

pytorch version: 2.9.0+rocm6.4

With cudnn disabled, wan untiled encode/decode hits OOM and falls back to tiled. However, this is slower
than explicitly using 256 tiles (e.g. using #10238 for encodes + VAEDecodeTiled).
256-tile performance is about the same on/off.

On sdxl, performance is about the same too, except for untiled decode. Here cudnn-on is slow (~27s)
and cudnn-off OOMs and falls back to tiled. However, in either case it is faster to explicitly use
tiled decode with the default 512 tile size (~3s, on or off).

Conclusion:

  • Wan: cudnn can be left enabled, users should use tiling vae ✔️
  • sdxl: cudnn can be left enabled, users should use tiling vae decode ✔️
Results

wan untiled cudnn off

Note: Hits oom fast and falls back to tiled (note: fallback tiled is a bit slower than 256 tiled).

Warning: Ran out of memory when regular VAE encoding, retrying with tiled VAE encoding.
Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.

wan tiled 256 cudnn off

[WanImageToVideo]: 16.43s
[WanImageToVideo]: 22.69s

[VAEDecodeTiled]: 25.92s
[VAEDecodeTiled]: 39.56s

wan tiled 256 cudnn on

[WanImageToVideo]: 16.37s
[WanImageToVideo]: 20.11s

[VAEDecodeTiled]: 28.21s
[VAEDecodeTiled]: 35.36s

sdxl 1280x1832 vae cudnn off

[VAEEncode]: 0.45s
[VAEEncode]: 0.45s
[VAEEncode]: 0.45s

[VAEEncodeTiled]: 1.96s
[VAEEncodeTiled]: 1.62s
[VAEEncodeTiled]: 1.66s

Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
[VAEDecode]: 5.41s
[VAEDecode]: 5.48s
[VAEDecode]: 5.51s

[VAEDecodeTiled]: 3.21s
[VAEDecodeTiled]: 3.26s
[VAEDecodeTiled]: 3.24s

sdxl 1280x1832 vae cudnn on

[VAEEncode]: 0.61s
[VAEEncode]: 0.45s
[VAEEncode]: 0.45s

[VAEEncodeTiled]: 2.05s
[VAEEncodeTiled]: 1.55s
[VAEEncodeTiled]: 1.56s

[VAEDecode]: 26.84s
[VAEDecode]: 26.49s
[VAEDecode]: 26.67s

[VAEDecodeTiled]: 3.41s
[VAEDecodeTiled]: 3.14s
[VAEDecodeTiled]: 3.15s

pytorch version: 2.10.0a0+rocm7.10.0a2025101

For wan it's a similar story to rocm6.4. However, for sdxl encode performance is generally
worse with cudnn-on.

Conclusion:

  • Wan: cudnn can be left enabled, users should use tiling vae ✔️
  • sdxl: cudnn should be disabled
Results

wan untiled cudnn off

Note: Hits oom fast and falls back to tiled (note: fallback tiled is a bit slower than 256 tiled).

Warning: Ran out of memory when regular VAE encoding, retrying with tiled VAE encoding.
Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.

wan tiled 256 cudnn off

[WanImageToVideo]: 16.31s
[WanImageToVideo]: 19.36s

[VAEDecodeTiled]: 25.67s
[VAEDecodeTiled]: 38.41s

wan tiled 256 cudnn on

[WanImageToVideo]: 17.26s
[WanImageToVideo]: 20.49s

[VAEDecodeTiled]: 34.65s
[VAEDecodeTiled]: 34.42s

sdxl 1280x1832 vae cudnn off

[VAEEncode]: 0.67s
[VAEEncode]: 0.53s
[VAEEncode]: 0.59s

[VAEEncodeTiled]: 1.92s
[VAEEncodeTiled]: 1.52s
[VAEEncodeTiled]: 1.54s

Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
[VAEDecode]: 5.30s
[VAEDecode]: 5.34s
[VAEDecode]: 5.41s

[VAEDecodeTiled]: 3.18s
[VAEDecodeTiled]: 3.17s
[VAEDecodeTiled]: 3.22s

sdxl 1280x1832 vae cudnn on

[VAEEncode]: 10.21s
[VAEEncode]: 9.84s
[VAEEncode]: 9.77s

[VAEEncodeTiled]: 22.60s
[VAEEncodeTiled]: 22.68s
[VAEEncodeTiled]: 22.76s

[VAEDecode]: 26.76s
[VAEDecode]: 26.61s
[VAEDecode]: 26.59s

[VAEDecodeTiled]: 3.19s
[VAEDecodeTiled]: 3.15s
[VAEDecodeTiled]: 3.14s

An interesting difference with cudnn off is that untiled VAE tends to OOM quickly (logging Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.) and automatically retries the tiled version, whereas with cudnn on it doesn't OOM, it just takes a long time. That doesn't seem like a good reason to disable cudnn to me, though. Instead, perhaps RDNA3 should default to tiled VAE generally instead of forcing users to always configure it in their workflows.

@lostdisc

lostdisc commented Oct 25, 2025

Interesting results, thanks! For comparison, I did some SDXL VAE testing on my APU, all with bf16 vae and 1280x1600 dimensions:

Untiled VAE decode, cudnn enabled: ~100s, ~10GB VRAM
Tiled VAE decode, cudnn enabled: ~15s, ~3GB VRAM

Untiled VAE decode, cudnn disabled: ~5s, ~19GB VRAM
Tiled VAE decode, cudnn disabled: ~15s, ~3GB VRAM

So yeah, enabling cudnn halves the VRAM for untiled VAE, but takes MUCH longer to run. And that's not even counting the extra-long first run for this resolution, which was more like 30 minutes.

Tiled VAE uses similar time and RAM with/without cudnn, but still has the slow-first-run issue with cudnn enabled. (IIRC it's not as bad as fullsize though, presumably since the tile size stays the same for different image dimensions, except maybe at the picture edges.)

My system is a Ryzen AI 9 HX 370 with a Radeon 890M iGPU (gfx1150), 32GB RAM with 16GB assigned to the GPU (and another 8GB shareable). Oddly, assigning 24GB makes it fall back to tiled decoding in the cudnn-disabled case.
ComfyUI 0.3.66, Python 3.12.11, ROCm 6.4.4, and Pytorch 2.8.0 on Windows 11.

@alexheretic
Contributor

the extra-long first run for this resolution, which was more like 30 minutes

This sounds like maybe the key issue and reason for disabling cudnn. I didn't reproduce it in my setup though. For me the downside of disabling cudnn is #10447, so I was hoping for a better solution than this.

As suggested earlier by others, maybe change it so cudnn is disabled only during VAE? And/or maybe add an arg/env var to control this.

Xiuzhenpeng added a commit to Xiuzhenpeng/ComfyUI-Docker that referenced this pull request Oct 29, 2025
* Fix lowvram issue with hunyuan image vae. (comfyanonymous#9794)

* add StabilityAudio API nodes (comfyanonymous#9749)

* ComfyUI version v0.3.58

* add new ByteDanceSeedream (4.0) node (comfyanonymous#9802)

* Update template to 0.1.78 (comfyanonymous#9806)

* Update template to 0.1.77

* Update template to 0.1.78

* ComfyUI version 0.3.59

* Support hunyuan image distilled model. (comfyanonymous#9807)

* Update template to 0.1.81 (comfyanonymous#9811)

* Fast preview for hunyuan image. (comfyanonymous#9814)

* Implement hunyuan image refiner model. (comfyanonymous#9817)

* Add Output to V3 Combo type to match what is possible with V1 (comfyanonymous#9813)

* Bump frontend to 1.26.11 (comfyanonymous#9809)

* Add noise augmentation to hunyuan image refiner. (comfyanonymous#9831)

This was missing and should help with colors being blown out.

* Fix hunyuan refiner blownout colors at noise aug less than 0.25 (comfyanonymous#9832)

* Set default hunyuan refiner shift to 4.0 (comfyanonymous#9833)

* add kling-v2-1 model to the KlingStartEndFrame node (comfyanonymous#9630)

* convert Minimax API nodes to the V3 schema (comfyanonymous#9693)

* convert WanCameraEmbedding node to V3 schema (comfyanonymous#9714)

* convert Cosmos nodes to V3 schema (comfyanonymous#9721)

* convert nodes_cond.py to V3 schema (comfyanonymous#9719)

* convert CFG nodes to V3 schema (comfyanonymous#9717)

* convert Canny node to V3 schema (comfyanonymous#9743)

* convert Moonvalley API nodes to the V3 schema (comfyanonymous#9698)

* Better way of doing the generator for the hunyuan image noise aug. (comfyanonymous#9834)

* Enable Runtime Selection of Attention Functions (comfyanonymous#9639)

* Looking into a @wrap_attn decorator to look for 'optimized_attention_override' entry in transformer_options

* Created logging code for this branch so that it can be used to track down all the code paths where transformer_options would need to be added

* Fix memory usage issue with inspect

* Made WAN attention receive transformer_options, test node added to wan to test out attention override later

* Added **kwargs to all attention functions so transformer_options could potentially be passed through

* Make sure wrap_attn doesn't make itself recurse infinitely, attempt to load SageAttention and FlashAttention if not enabled so that they can be marked as available or not, create registry for available attention

* Turn off attention logging for now, make AttentionOverrideTestNode have a dropdown with available attention (this is a test node only)

* Make flux work with optimized_attention_override

* Add logs to verify optimized_attention_override is passed all the way into attention function

* Make Qwen work with optimized_attention_override

* Made hidream work with optimized_attention_override

* Made wan patches_replace work with optimized_attention_override

* Made SD3 work with optimized_attention_override

* Made HunyuanVideo work with optimized_attention_override

* Made Mochi work with optimized_attention_override

* Made LTX work with optimized_attention_override

* Made StableAudio work with optimized_attention_override

* Made optimized_attention_override work with ACE Step

* Made Hunyuan3D work with optimized_attention_override

* Make CosmosPredict2 work with optimized_attention_override

* Made CosmosVideo work with optimized_attention_override

* Made Omnigen 2 work with optimized_attention_override

* Made StableCascade work with optimized_attention_override

* Made AuraFlow work with optimized_attention_override

* Made Lumina work with optimized_attention_override

* Made Chroma work with optimized_attention_override

* Made SVD work with optimized_attention_override

* Fix WanI2VCrossAttention so that it expects to receive transformer_options

* Fixed Wan2.1 Fun Camera transformer_options passthrough

* Fixed WAN 2.1 VACE transformer_options passthrough

* Add optimized to get_attention_function

* Disable attention logs for now

* Remove attention logging code

* Remove _register_core_attention_functions, as we wouldn't want someone to call that, just in case

* Satisfy ruff

* Remove AttentionOverrideTest node, that's something to cook up for later

* Hunyuan refiner vae now works with tiled. (comfyanonymous#9836)

* Support wav2vec base models (comfyanonymous#9637)

* Support wav2vec base models

* trim trailing whitespace

* Do interpolation after

* Cleanup. (comfyanonymous#9838)

* Remove single quote pattern to avoid wrong matches (comfyanonymous#9842)

* Add support for Chroma Radiance (comfyanonymous#9682)

* Initial Chroma Radiance support

* Minor Chroma Radiance cleanups

* Update Radiance nodes to ensure latents/images are on the intermediate device

* Fix Chroma Radiance memory estimation.

* Increase Chroma Radiance memory usage factor

* Increase Chroma Radiance memory usage factor once again

* Ensure images are multiples of 16 for Chroma Radiance
Add batch dimension and fix channels when necessary in ChromaRadianceImageToLatent node

* Tile Chroma Radiance NeRF to reduce memory consumption, update memory usage factor

* Update Radiance to support conv nerf final head type.

* Allow setting NeRF embedder dtype for Radiance
Bump Radiance nerf tile size to 32
Support EasyCache/LazyCache on Radiance (maybe)

* Add ChromaRadianceStubVAE node

* Crop Radiance image inputs to multiples of 16 instead of erroring to be in line with existing VAE behavior

* Convert Chroma Radiance nodes to V3 schema.

* Add ChromaRadianceOptions node and backend support.
Cleanups/refactoring to reduce code duplication with Chroma.

* Fix overriding the NeRF embedder dtype for Chroma Radiance

* Minor Chroma Radiance cleanups

* Move Chroma Radiance to its own directory in ldm
Minor code cleanups and tooltip improvements

* Fix Chroma Radiance embedder dtype overriding

* Remove Radiance dynamic nerf_embedder dtype override feature

* Unbork Radiance NeRF embedder init

* Remove Chroma Radiance image conversion and stub VAE nodes
Add a chroma_radiance option to the VAELoader builtin node which uses comfy.sd.PixelspaceConversionVAE
Add a PixelspaceConversionVAE to comfy.sd for converting BHWC 0..1 <-> BCHW -1..1

* Changes to the previous radiance commit. (comfyanonymous#9851)

* Make ModuleNotFoundError ImportError instead (comfyanonymous#9850)

* Add that hunyuan image is supported to readme. (comfyanonymous#9857)

* Support the omnigen2 umo lora. (comfyanonymous#9886)

* Fix depending on asserts to raise an exception in BatchedBrownianTree and Flash attn module (comfyanonymous#9884)

Correctly handle the case where w0 is passed by kwargs in BatchedBrownianTree

* Add encoder part of whisper large v3 as an audio encoder model. (comfyanonymous#9894)

Not useful yet but some models use it.

* Reduce Peak WAN inference VRAM usage (comfyanonymous#9898)

* flux: Do the xq and xk ropes one at a time

This was doing independendent interleaved tensor math on the q and k
tensors, leading to the holding of more than the minimum intermediates
in VRAM. On a bad day, it would VRAM OOM on xk intermediates.

Do everything q and then everything k, so torch can garbage collect
all of qs intermediates before k allocates its intermediates.

This reduces peak VRAM usage for some WAN2.2 inferences (at least).

* wan: Optimize qkv intermediates on attention

As commented. The former logic computed independent pieces of QKV in
parallel which help more inference intermediates in VRAM spiking
VRAM usage. Fully roping Q and garbage collecting the intermediates
before touching K reduces the peak inference VRAM usage.

* Support the HuMo model. (comfyanonymous#9903)

* Support the HuMo 17B model. (comfyanonymous#9912)

* Enable fp8 ops by default on gfx1200 (comfyanonymous#9926)

* make kernel of same type as image to avoid mismatch issues (comfyanonymous#9932)

* Do padding of audio embed in model for humo for more flexibility. (comfyanonymous#9935)

* Bump frontend to 1.26.13 (comfyanonymous#9933)

* Basic WIP support for the wan animate model. (comfyanonymous#9939)

* api_nodes: reduce default timeout from 7 days to 2 hours (comfyanonymous#9918)

* fix(seedream4): add flag to ignore error on partial success (comfyanonymous#9952)

* Update WanAnimateToVideo to more easily extend videos. (comfyanonymous#9959)

* Add inputs for character replacement to the WanAnimateToVideo node. (comfyanonymous#9960)

* [Reviving comfyanonymous#5709] Add strength input to Differential Diffusion (comfyanonymous#9957)

* Update nodes_differential_diffusion.py

* Update nodes_differential_diffusion.py

* Make strength optional to avoid validation errors when loading old workflows, adjust step

---------

Co-authored-by: ThereforeGames <[email protected]>

* Fix LoRA Trainer bugs with FP8 models. (comfyanonymous#9854)

* Fix adapter weight init

* Fix fp8 model training

* Avoid inference tensor

* Lower wan memory estimation value a bit. (comfyanonymous#9964)

Previous pr reduced the peak memory requirement.

* Set some wan nodes as no longer experimental. (comfyanonymous#9976)

* Support for qwen edit plus model. Use the new TextEncodeQwenImageEditPlus. (comfyanonymous#9986)

* add offset param (comfyanonymous#9977)

* Fix bug with WanAnimateToVideo node. (comfyanonymous#9988)

* Fix bug with WanAnimateToVideo. (comfyanonymous#9990)

* update template to 0.1.86 (comfyanonymous#9998)

* update template to 0.1.84

* update template to 0.1.85

* Update template to 0.1.86

* feat(api-nodes): add wan t2i, t2v, i2v nodes (comfyanonymous#9996)

* ComfyUI version 0.3.60

* Rodin3D - add [Rodin3D Gen-2 generate] api-node (comfyanonymous#9994)

* update Rodin api node

* update rodin3d gen2 api node

* fix images limited bug

* Add new audio nodes (comfyanonymous#9908)

* Add new audio nodes

- TrimAudioDuration
- SplitAudioChannels
- AudioConcat
- AudioMerge
- AudioAdjustVolume

* Update nodes_audio.py

* Add EmptyAudio -node

* Change duration to Float (allows sub seconds)

* Fix issue with .view() in HuMo. (comfyanonymous#10014)

* Fix memory leak by properly detaching model finalizer (comfyanonymous#9979)

When unloading models in load_models_gpu(), the model finalizer was not
being explicitly detached, leading to a memory leak. This caused
linear memory consumption increase over time as models are repeatedly
loaded and unloaded.

This change prevents orphaned finalizer references from accumulating in
memory during model switching operations.

* Make LatentCompositeMasked work with basic video latents. (comfyanonymous#10023)

* Fix the failing unit test. (comfyanonymous#10037)

* Add @Kosinkadink as code owner (comfyanonymous#10041)

Updated CODEOWNERS to include @Kosinkadink as a code owner.

* convert nodes_rebatch.py to V3 schema (comfyanonymous#9945)

* convert nodes_fresca.py to V3 schema (comfyanonymous#9951)

* convert nodes_sdupscale.py to V3 schema (comfyanonymous#9943)

* convert nodes_tcfg.py to V3 schema (comfyanonymous#9942)

* convert nodes_sag.py to V3 schema (comfyanonymous#9940)

* convert nodes_post_processing to V3 schema (comfyanonymous#9491)

* convert CLIPTextEncodeSDXL nodes to V3 schema (comfyanonymous#9716)

* Don't add template to qwen2.5vl when template is in prompt. (comfyanonymous#10043)

Make the hunyuan image refiner template_end 36.

* Add 'input_cond' and 'input_uncond' to the args dictionary passed into sampler_cfg_function (comfyanonymous#10044)

* Update template to 0.1.88 (comfyanonymous#10046)

* Add workflow templates version tracking to system_stats (comfyanonymous#9089)

Adds installed and required workflow templates version information to the
/system_stats endpoint, allowing the frontend to detect and notify users
when their templates package is outdated.

- Add get_installed_templates_version() and get_required_templates_version()
  methods to FrontendManager
- Include templates version info in system_stats response
- Add comprehensive unit tests for the new functionality

* convert nodes_hidream.py to V3 schema (comfyanonymous#9946)

* convert nodes_bfl.py to V3 schema (comfyanonymous#10033)

* convert nodes_luma.py to V3 schema (comfyanonymous#10030)

* convert nodes_pixart.py to V3 schema (comfyanonymous#10019)

* convert nodes_photomaker.py to V3 schema (comfyanonymous#10017)

* convert nodes_qwen.py to V3 schema (comfyanonymous#10049)

* Reduce Peak WAN inference VRAM usage - part II (comfyanonymous#10062)

* flux: math: Use _addcmul to avoid expensive VRAM intermediate

The rope process can be the VRAM peak and this intermediate
for the addition result before releasing the original can OOM.
addcmul_ it.

* wan: Delete the self attention before cross attention

This saves VRAM when the cross attention and FFN are in play as the
VRAM peak.

* Improvements to the stable release workflow. (comfyanonymous#10065)

* Fix typo in release workflow. (comfyanonymous#10066)

* convert nodes_lotus.py to V3 schema (comfyanonymous#10057)

* convert nodes_lumina2.py to V3 schema (comfyanonymous#10058)

* convert nodes_hypertile.py to V3 schema (comfyanonymous#10061)

* feat: ComfyUI can be run on the specified Ascend NPU (comfyanonymous#9663)

* feature: Set the Ascend NPU to use a single one

* Enable the `--cuda-device` parameter to support both CUDA and Ascend NPUs simultaneously.

* Make the code just set the ASCENT_RT_VISIBLE_DEVICES environment variable without any other edits to master branch

---------

Co-authored-by: Jedrzej Kosinski <[email protected]>

* Fix stable workflow creating multiple draft releases. (comfyanonymous#10067)

* Update command to install latest nighly pytorch. (comfyanonymous#10085)

* [Rodin3d api nodes] Updated the name of the save file path (changed from timestamp to UUID). (comfyanonymous#10011)

* Update savepath name from time to uuid

* delete lib

* Update template to 0.1.91 (comfyanonymous#10096)

* add WanImageToImageApi node (comfyanonymous#10094)

* convert nodes_mochi.py to V3 schema (comfyanonymous#10069)

* convert nodes_perpneg.py to V3 schema (comfyanonymous#10081)

* dont cache new locale entry points (comfyanonymous#10101)

* convert nodes_mahiro.py to V3 schema (comfyanonymous#10070)

* Add action to create cached deps with manually specified torch. (comfyanonymous#10102)

* Make the final release test optional in the stable release action. (comfyanonymous#10103)

* Different base files for different release. (comfyanonymous#10104)

* Different base files for nvidia and amd portables. (comfyanonymous#10105)

* Add a way to have different names for stable nvidia portables. (comfyanonymous#10106)

* Add action to do the full stable release. (comfyanonymous#10107)

* Make stable release workflow callable. (comfyanonymous#10108)

* Add basic readme for AMD portable. (comfyanonymous#10109)

* ComfyUI version 0.3.61

* Workflow permission fix. (comfyanonymous#10110)

* Add new portable links to readme. (comfyanonymous#10112)

* fix(Rodin3D-Gen2): missing "task_uuid" parameter (comfyanonymous#10128)

* enable Seedance Pro model in the FirstLastFrame node (comfyanonymous#10120)

* ComfyUI version 0.3.62.

* Bump frontend to 1.27.7 (comfyanonymous#10133)

* convert nodes_audio_encoder.py to V3 schema (comfyanonymous#10123)

* convert nodes_gits.py to V3 schema (comfyanonymous#9949)

* convert nodes_differential_diffusion.py to V3 schema (comfyanonymous#10056)

* convert nodes_optimalsteps.py to V3 schema (comfyanonymous#10074)

* convert nodes_pag.py to V3 schema (comfyanonymous#10080)

* convert nodes_lt.py to V3 schema (comfyanonymous#10084)

* convert nodes_ip2p.pt to V3 schema (comfyanonymous#10097)

* Support the new hunyuan vae. (comfyanonymous#10150)

* feat: Add Epsilon Scaling node for exposure bias correction (comfyanonymous#10132)

* sd: fix VAE tiled fallback VRAM leak (comfyanonymous#10139)

When the VAE catches this VRAM OOM, it launches the fallback logic
straight from the exception context.

Python however refs the entire call stack that caused the exception
including any local variables for the sake of exception report and
debugging. In the case of tensors, this can hold on the references
to GBs of VRAM and inhibit the VRAM allocated from freeing them.

So dump the except context completely before going back to the VAE
via the tiler by getting out of the except block with nothing but
a flag.

The greately increases the reliability of the tiler fallback,
especially on low VRAM cards, as with the bug, if the leak randomly
leaked more than the headroom needed for a single tile, the tiler
would fallback would OOM and fail the flow.

* WAN: Fix cache VRAM leak on error (comfyanonymous#10141)

If this suffers an exception (such as a VRAM oom) it will leave the
encode() and decode() methods which skips the cleanup of the WAN
feature cache. The comfy node cache then ultimately keeps a reference
this object which is in turn reffing large tensors from the failed
execution.

The feature cache is currently setup at a class variable on the
encoder/decoder however, the encode and decode functions always clear
it on both entry and exit of normal execution.

Its likely the design intent is this is usable as a streaming encoder
where the input comes in batches, however the functions as they are
today don't support that.

So simplify by bringing the cache back to local variable, so that if
it does VRAM OOM the cache itself is properly garbage when the
encode()/decode() functions dissappear from the stack.

* Add a .bat to the AMD portable to disable smart memory. (comfyanonymous#10153)

* convert nodes_morphology.py to V3 schema (comfyanonymous#10159)

* fix(api-nodes): made logging path to be smaller (comfyanonymous#10156)

* Turn on TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL by default. (comfyanonymous#10168)

* update example_node to use V3 schema (comfyanonymous#9723)

* feat(linter, api-nodes): add pylint for comfy_api_nodes folder (comfyanonymous#10157)

* feat(api-nodes): add kling-2-5-turbo to txt2video and img2video nodes (comfyanonymous#10155)

* fix(api-nodes): reimport of base64 in Gemini node (comfyanonymous#10181)

* fix(api-nodes): bad indentation in Recraft API node function (comfyanonymous#10175)

* convert nodes_torch_compile.py to V3 schema (comfyanonymous#10173)

* convert nodes_eps.py to V3 schema (comfyanonymous#10172)

* convert nodes_pixverse.py to V3 schema (comfyanonymous#10177)

* convert nodes_tomesd.py to V3 schema (comfyanonymous#10180)

* convert nodes_edit_model.py to V3 schema (comfyanonymous#10147)

* Fix type annotation syntax in MotionEncoder_tc __init__ (comfyanonymous#10186)

## Summary
Fixed incorrect type hint syntax in `MotionEncoder_tc.__init__()` parameter list.

## Changes
- Line 647: Changed `num_heads=int` to `num_heads: int` 
- This corrects the parameter annotation from a default value assignment to proper type hint syntax

## Details
The parameter was using assignment syntax (`=`) instead of type annotation syntax (`:`), which would incorrectly set the default value to the `int` class itself rather than annotating the expected type.

* Update amd nightly command in readme. (comfyanonymous#10189)

* Add instructions to install nightly AMD pytorch for windows. (comfyanonymous#10190)

* Add instructions to install nightly AMD pytorch for windows.

* Update README.md

* fix(api-nodes): enable 2 more pylint rules, removed non needed code (comfyanonymous#10192)

* convert nodes_rodin.py to V3 schema (comfyanonymous#10195)

* convert nodes_stable3d.py to V3 schema (comfyanonymous#10204)

* Remove soundfile dependency. No more torchaudio load or save. (comfyanonymous#10210)

* fix(api-nodes): disable "std" mode for Kling2.5-turbo (comfyanonymous#10212)

* Remove useless code. (comfyanonymous#10223)

* Update template to 0.1.93 (comfyanonymous#10235)

* Update template to 0.1.92

* Update template to 0.1.93

* ComfyUI version 0.3.63

* fix(api-nodes): enable more pylint rules (comfyanonymous#10213)

* fix(api-nodes): allow negative_prompt PixVerse to be multiline (comfyanonymous#10196)

* convert nodes_pika.py to V3 schema (comfyanonymous#10216)

* convert nodes_kling.py to V3 schema (comfyanonymous#10236)

* Implement gemma 3 as a text encoder. (comfyanonymous#10241)

Not useful yet.

* fix(ReCraft-API-node): allow custom multipart parser to return FormData (comfyanonymous#10244)

* feat(api-nodes): add Sora2 API node (comfyanonymous#10249)

* Temp fix for LTXV custom nodes. (comfyanonymous#10251)

* Bump frontend to 1.27.10 (comfyanonymous#10252)

* update template to 0.1.94 (comfyanonymous#10253)

* ComfyUI version 0.3.64

* feat(V3-io): allow Enum classes for Combo options (comfyanonymous#10237)

* Refactor model sampling sigmas code. (comfyanonymous#10250)

* Mvly/node update (comfyanonymous#10042)

* updated V2V node to allow for control image input
exposing steps in v2v
fixing guidance_scale as input parameter

TODO: allow for motion_intensity as input param.

* refactor: comment out unsupported resolution and adjust default values in video nodes

* set control_after_generate

* adding new defaults

* fixes

* changed control_after_generate back to True

* changed control_after_generate back to False

---------

Co-authored-by: thorsten <[email protected]>

* feat(api-nodes, pylint): use lazy formatting in logging functions (comfyanonymous#10248)

* convert nodes_model_downscale.py to V3 schema (comfyanonymous#10199)

* convert nodes_lora_extract.py to V3 schema (comfyanonymous#10182)

* convert nodes_compositing.py to V3 schema (comfyanonymous#10174)

* convert nodes_latent.py to V3 schema (comfyanonymous#10160)

* More surgical fix for comfyanonymous#10267 (comfyanonymous#10276)

* fix(v3,api-nodes): V3 schema typing; corrected Pika API nodes (comfyanonymous#10265)

* convert nodes_sd3.py and nodes_slg.py to V3 schema (comfyanonymous#10162)

* Fix bug with applying loras on fp8 scaled without fp8 ops. (comfyanonymous#10279)

* convert nodes_flux to V3 schema (comfyanonymous#10122)

* convert nodes_upscale_model.py to V3 schema (comfyanonymous#10149)

* Fix save audio nodes saving mono audio as stereo. (comfyanonymous#10289)

* feat(GeminiImage-ApiNode): add aspect_ratio and release version of model (comfyanonymous#10255)

* feat(api-nodes): add price extractor feature; small fixes to Kling & Pika nodes (comfyanonymous#10284)

* Update template to 0.1.95 (comfyanonymous#10294)

* Implement the mmaudio VAE. (comfyanonymous#10300)

* Improve AMD performance. (comfyanonymous#10302)

I honestly have no idea why this improves things but it does.

* Update node docs to 0.3.0 (comfyanonymous#10318)

* update extra models paths example (comfyanonymous#10316)

* Update the extra_model_paths.yaml.example (comfyanonymous#10319)

* Always set diffusion model to eval() mode. (comfyanonymous#10331)

* add indent=4 kwarg to json.dumps() (comfyanonymous#10307)

* WAN2.2: Fix cache VRAM leak on error (comfyanonymous#10308)

Same change pattern as 7e8dd27
applied to WAN2.2

If this suffers an exception (such as a VRAM OOM) it will leave the
encode() and decode() methods early, which skips the cleanup of the WAN
feature cache. The comfy node cache then ultimately keeps a reference
to this object, which in turn keeps references to large tensors from the
failed execution.

The feature cache is currently set up as a class variable on the
encoder/decoder; however, the encode and decode functions always clear
it on both entry and exit of normal execution.

The likely design intent is that this be usable as a streaming encoder
where the input comes in batches, but the functions as they are
today don't support that.

So simplify by bringing the cache back to a local variable, so that if
a VRAM OOM does occur the cache itself is properly garbage collected when
the encode()/decode() functions disappear from the stack.
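
A hypothetical, stripped-down sketch of the leak pattern described above (names like `feat_cache` are illustrative and do not match the real WAN code):

```python
# Hypothetical sketch: class-level cache vs. local cache on exception.
class DecoderLeaky:
    feat_cache = None                       # class attribute: outlives a failed decode()

    def decode(self, x):
        DecoderLeaky.feat_cache = [x * 2]   # stands in for cached feature tensors
        out = sum(DecoderLeaky.feat_cache)  # if this raised (e.g. a VRAM OOM)...
        DecoderLeaky.feat_cache = None      # ...this cleanup would be skipped
        return out

class DecoderFixed:
    def decode(self, x):
        feat_cache = [x * 2]                # local variable: released when the frame
        return sum(feat_cache)              # unwinds, even if an exception is raised
```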

* convert nodes_hunyuan.py to V3 schema (comfyanonymous#10136)

* Enable RDNA4 pytorch attention on ROCm 7.0 and up. (comfyanonymous#10332)

* Fix loading old stable diffusion ckpt files on newer numpy. (comfyanonymous#10333)

* Better memory estimation for the SD/Flux VAE on AMD. (comfyanonymous#10334)

* ComfyUI version 0.3.65

* Faster workflow cancelling. (comfyanonymous#10301)

* Python 3.14 instructions. (comfyanonymous#10337)

* api-nodes: fixed dynamic pricing format; import comfy_io directly (comfyanonymous#10336)

* Bump frontend to 1.28.6 (comfyanonymous#10345)

* gfx942 doesn't support fp8 operations. (comfyanonymous#10348)

* Add TemporalScoreRescaling node (comfyanonymous#10351)

* Add TemporalScoreRescaling node

* Mention image generation in tsr_k's tooltip

* feat(api-nodes): add Veo3.1 model (comfyanonymous#10357)

* Latest pytorch stable is cu130 (comfyanonymous#10361)

* Fix order of inputs nested merge_nested_dicts (comfyanonymous#10362)

* refactor: Replace manual patches merging with merge_nested_dicts (comfyanonymous#10360)

* Bump frontend to 1.28.7 (comfyanonymous#10364)

* feat: deprecated API alert (comfyanonymous#10366)

* fix(api-nodes): remove "veo2" model from Veo3 node (comfyanonymous#10372)

* Workaround for nvidia issue where VAE uses 3x more memory on torch 2.9 (comfyanonymous#10373)

* workaround also works on cudnn 91200 (comfyanonymous#10375)

* Do batch_slice in EasyCache's apply_cache_diff (comfyanonymous#10376)

* execution: fold in dependency aware caching / Fix --cache-none with loops/lazy etc (comfyanonymous#10368)

* execution: fold in dependency aware caching

This makes --cache-none compatible with lazy and expanded
subgraphs.

Currently the --cache-none option is powered by the
DependencyAwareCache. The cache attempts to maintain a parallel
copy of the execution list data structure, however it is only
set up once at the start of execution and does not get meaningful
updates to the execution list.

This causes multiple problems when --cache-none is used with lazy
and expanded subgraphs, as the DAC does not accurately update its
copy of the execution data structure.

DAC has an attempt to handle subgraphs (ensure_subcache); however,
this does not accurately connect to nodes outside the subgraph.
The current semantics of DAC are to free a node ASAP after the
dependent nodes are executed.

This means that if a subgraph refs such a node it will be re-queued
and re-executed by the execution_list, but DAC won't see it in
its to-free lists anymore and will leak memory.

Rather than try to cover all the cases where the execution list
changes from inside the cache, move the whole problem to the
executor, which maintains an always up-to-date copy of the wanted
data structure.

The executor now has a fast-moving run-local cache of its own.
Each _to node has its own mini cache, and the cache is unconditionally
primed at the time of add_strong_link.

add_strong_link is called for all of static workflows, lazy links
and expanded subgraphs, so it is the singular source of truth for
output dependencies.

In the case of a cache hit, the executor cache will hold the non-None
value (it will respect updates if they happen somehow as well).

In the case of a cache miss, the executor caches a None and will
wait for a notification to update the value when the node completes.

When a node completes execution, it simply releases its mini-cache
and in turn its strong refs on its direct ancestor outputs, allowing
for ASAP freeing (same as the DependencyAwareCache but a little more
automatic).

This now allows for re-implementation of --cache-none with no cache
at all. The dependency aware cache was also observing the dependency
semantics for the objects and UI caches, which is not accurate (this
entire logic was always outputs specific).

This also prepares for more complex caching strategies (such as RAM
pressure based caching), where a cache can implement any freeing
strategy completely independently of the DependencyAwareness
requirement.

* main: re-implement --cache-none as no cache at all

The execution list now tracks dependency aware caching more
correctly than the DependencyAwareCache.

Change it to a cache that does nothing.

* test_execution: add --cache-none to the test suite

--cache-none is now expected to work universally. Run it through the
full unit test suite. Propagate the server parameterization for whether
or not the server is capable of caching, so that the minority of tests
that specifically check for cache hits can branch on it. Hard assert NOT
caching in the else branch to give some coverage of --cache-none's
expected behaviour of not actually caching.
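
A hypothetical sketch of the executor-side mini-cache idea described above; the class and method names are illustrative and do not mirror ComfyUI's execution.py:

```python
# Illustrative only: per-node mini caches primed at add_strong_link time.
class PendingNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.inputs = {}                       # ancestor_id -> output, or None if pending

class Executor:
    def __init__(self):
        self.completed = {}                    # node_id -> output, kept while referenced
        self.waiting = {}                      # ancestor_id -> [PendingNode, ...]

    def add_strong_link(self, ancestor_id, node):
        # Called for static links, lazy links and expanded subgraphs alike,
        # so it is the single place where output dependencies are recorded.
        node.inputs[ancestor_id] = self.completed.get(ancestor_id)
        if node.inputs[ancestor_id] is None:   # cache miss: wait for completion
            self.waiting.setdefault(ancestor_id, []).append(node)

    def node_finished(self, node, output):
        self.completed[node.node_id] = output
        for waiter in self.waiting.pop(node.node_id, []):
            waiter.inputs[node.node_id] = output
        node.inputs.clear()                    # drop strong refs so ancestors free ASAP
```

A real implementation would also evict entries from `completed` once nothing references them; the sketch only shows where the strong references live and when they are released.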

* convert nodes_controlnet.py to V3 schema (comfyanonymous#10202)

* Update Python 3.14 installation instructions (comfyanonymous#10385)

Removed mention of installing pytorch nightly for Python 3.14.

* Disable torch compiler for cast_bias_weight function (comfyanonymous#10384)

* Disable torch compiler for cast_bias_weight function

* Fix torch compile.

* Turn off cuda malloc by default when --fast autotune is turned on. (comfyanonymous#10393)

* Fix batch size above 1 giving bad output in chroma radiance. (comfyanonymous#10394)

* Speed up chroma radiance. (comfyanonymous#10395)

* Pytorch is stupid. (comfyanonymous#10398)

* Deprecation warning on unused files (comfyanonymous#10387)

* only warn for unused files

* include internal extensions

* Update template to 0.2.1 (comfyanonymous#10413)

* Update template to 0.1.97

* Update template to 0.2.1

* Log message for cudnn disable on  AMD. (comfyanonymous#10418)

* Revert "execution: fold in dependency aware caching / Fix --cache-none with l…" (comfyanonymous#10422)

This reverts commit b1467da.

* ComfyUI version v0.3.66

* Only disable cudnn on newer AMD GPUs. (comfyanonymous#10437)

* Add custom node published subgraphs endpoint (comfyanonymous#10438)

* Add get_subgraphs_dir to ComfyExtension and PUBLISHED_SUBGRAPH_DIRS to nodes.py

* Created initial endpoints, although the returned paths are a bit off currently

* Fix path and actually return real data

* Sanitize returned /api/global_subgraphs entries

* Remove leftover function from early prototyping

* Remove added whitespace

* Add None check for sanitize_entry

* execution: fold in dependency aware caching / Fix --cache-none with loops/lazy etc (Resubmit) (comfyanonymous#10440)

* execution: fold in dependency aware caching

This makes --cache-none compatible with lazy and expanded
subgraphs.

Currently the --cache-none option is powered by the
DependencyAwareCache. The cache attempts to maintain a parallel
copy of the execution list data structure, however it is only
set up once at the start of execution and does not get meaningful
updates to the execution list.

This causes multiple problems when --cache-none is used with lazy
and expanded subgraphs, as the DAC does not accurately update its
copy of the execution data structure.

DAC has an attempt to handle subgraphs (ensure_subcache); however,
this does not accurately connect to nodes outside the subgraph.
The current semantics of DAC are to free a node ASAP after the
dependent nodes are executed.

This means that if a subgraph refs such a node it will be re-queued
and re-executed by the execution_list, but DAC won't see it in
its to-free lists anymore and will leak memory.

Rather than try to cover all the cases where the execution list
changes from inside the cache, move the whole problem to the
executor, which maintains an always up-to-date copy of the wanted
data structure.

The executor now has a fast-moving run-local cache of its own.
Each _to node has its own mini cache, and the cache is unconditionally
primed at the time of add_strong_link.

add_strong_link is called for all of static workflows, lazy links
and expanded subgraphs, so it is the singular source of truth for
output dependencies.

In the case of a cache hit, the executor cache will hold the non-None
value (it will respect updates if they happen somehow as well).

In the case of a cache miss, the executor caches a None and will
wait for a notification to update the value when the node completes.

When a node completes execution, it simply releases its mini-cache
and in turn its strong refs on its direct ancestor outputs, allowing
for ASAP freeing (same as the DependencyAwareCache but a little more
automatic).

This now allows for re-implementation of --cache-none with no cache
at all. The dependency aware cache was also observing the dependency
semantics for the objects and UI caches, which is not accurate (this
entire logic was always outputs specific).

This also prepares for more complex caching strategies (such as RAM
pressure based caching), where a cache can implement any freeing
strategy completely independently of the DependencyAwareness
requirement.

* main: re-implement --cache-none as no cache at all

The execution list now tracks dependency aware caching more
correctly than the DependencyAwareCache.

Change it to a cache that does nothing.

* test_execution: add --cache-none to the test suite

--cache-none is now expected to work universally. Run it through the
full unit test suite. Propagate the server parameterization for whether
or not the server is capable of caching, so that the minority of tests
that specifically check for cache hits can branch on it. Hard assert NOT
caching in the else branch to give some coverage of --cache-none's
expected behaviour of not actually caching.
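
Under that assumption, --cache-none can be backed by a cache that simply never stores anything. A hypothetical sketch (the real class in comfy_execution/caching.py may look different):

```python
# Hypothetical null cache for --cache-none; the executor's per-run mini caches
# keep outputs alive only as long as dependent nodes still need them.
class NullCache:
    def get(self, node_id):
        return None          # every lookup is a miss

    def set(self, node_id, value):
        pass                 # nothing is retained

    def clean_unused(self):
        pass                 # nothing to clean
```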

* Small readme improvement. (comfyanonymous#10442)

* WIP way to support multi multi dimensional latents. (comfyanonymous#10456)

* Update template to 0.2.2 (comfyanonymous#10461)

Fix template typo issue

* feat(api-nodes): network client v2: async ops, cancellation, downloads, refactor (comfyanonymous#10390)

* feat(api-nodes): implement new API client for V3 nodes

* feat(api-nodes): implement new API client for V3 nodes

* feat(api-nodes): implement new API client for V3 nodes

* converted WAN nodes to use new client; polishing

* fix(auth): do not leak authentication for absolute urls

* convert BFL API nodes to use new API client; remove deprecated BFL nodes

* converted Google Veo nodes

* fix(Veo3.1 model): take into account "generate_audio" parameter

* convert Tripo API nodes to V3 schema (comfyanonymous#10469)

* Remove useless function (comfyanonymous#10472)

* convert Gemini API nodes to V3 schema (comfyanonymous#10476)

* Add warning for torch-directml usage (comfyanonymous#10482)

Added a warning message about the state of torch-directml.

* Fix mistake. (comfyanonymous#10484)

* fix(api-nodes): random issues on Windows by capturing general OSError for retries (comfyanonymous#10486)

* Bump portable deps workflow to torch cu130 python 3.13.9 (comfyanonymous#10493)

* Add a bat to run comfyui portable without api nodes. (comfyanonymous#10504)

* Update template to 0.2.3 (comfyanonymous#10503)

* feat(api-nodes): add LTXV API nodes (comfyanonymous#10496)

* Update template to 0.2.4 (comfyanonymous#10505)

* frontend bump to 1.28.8 (comfyanonymous#10506)

* ComfyUI version v0.3.67

* Bump stable portable to cu130 python 3.13.9 (comfyanonymous#10508)

* Remove comfy api key from queue api. (comfyanonymous#10502)

* Tell users to update nvidia drivers if problem with portable. (comfyanonymous#10510)

* Tell users to update their nvidia drivers if portable doesn't start. (comfyanonymous#10518)

* Mixed Precision Quantization System (comfyanonymous#10498)

* Implement mixed precision operations with a registry design and metadata for the quant spec in the checkpoint.

* Updated design using Tensor Subclasses

* Fix FP8 MM

* An actually functional POC

* Remove CK reference and ensure correct compute dtype

* Update unit tests

* ruff lint

* Implement mixed precision operations with a registry design and metadata for the quant spec in the checkpoint.

* Updated design using Tensor Subclasses

* Fix FP8 MM

* An actually functional POC

* Remove CK reference and ensure correct compute dtype

* Update unit tests

* ruff lint

* Fix missing keys

* Rename quant dtype parameter

* Rename quant dtype parameter

* Fix unittests for CPU build

* execution: Allow subgraph nodes to execute multiple times (comfyanonymous#10499)

In the case of --cache-none, lazy and subgraph execution can cause
anything to be run multiple times per workflow. If that rerun node is
itself a subgraph generator, this will crash for two reasons.

First, pending_subgraph_results[] does not clean up entries after their use.
So when a pending_subgraph_result is consumed, remove it from the list
so that if the corresponding node is fully re-executed the lookup misses
and execution falls through to run the node again as it should.

Secondly, there is an explicit enforcement against duplicates in the
addition of subgraph nodes as ephemerals to the dynprompt. Remove this
enforcement as the use case is now valid.
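
A hypothetical sketch of the pop-on-consume part of the fix; the real structure lives in execution.py and its names may differ:

```python
# Illustrative only: consuming a pending subgraph result removes its entry,
# so a later full re-execution of the same node misses here and falls
# through to running the node again.
pending_subgraph_results = {}

def take_pending_result(node_id):
    return pending_subgraph_results.pop(node_id, None)
```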

* convert nodes_recraft.py to V3 schema (comfyanonymous#10507)

* Speed up offloading using pinned memory. (comfyanonymous#10526)

To enable this feature use: --fast pinned_memory
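
For context, this is the underlying PyTorch mechanism (illustrative only, not ComfyUI's implementation): page-locked host buffers allow asynchronous device-to-host and host-to-device copies.

```python
import torch

# Requires a CUDA device; sizes are arbitrary.
gpu_tensor = torch.randn(4096, 4096, device="cuda")
cpu_buffer = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True)

cpu_buffer.copy_(gpu_tensor, non_blocking=True)      # async offload to pinned host memory
restored = cpu_buffer.to("cuda", non_blocking=True)  # async reload when needed again
torch.cuda.synchronize()                             # wait for both transfers to finish
```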

* Fix issue. (comfyanonymous#10527)

---------

Co-authored-by: comfyanonymous <[email protected]>
Co-authored-by: Alexander Piskun <[email protected]>
Co-authored-by: comfyanonymous <[email protected]>
Co-authored-by: ComfyUI Wiki <[email protected]>
Co-authored-by: Jedrzej Kosinski <[email protected]>
Co-authored-by: Benjamin Lu <[email protected]>
Co-authored-by: Jukka Seppänen <[email protected]>
Co-authored-by: Kimbing Ng <[email protected]>
Co-authored-by: blepping <[email protected]>
Co-authored-by: rattus128 <[email protected]>
Co-authored-by: DELUXA <[email protected]>
Co-authored-by: Jodh Singh <[email protected]>
Co-authored-by: Christian Byrne <[email protected]>
Co-authored-by: ThereforeGames <[email protected]>
Co-authored-by: Kohaku-Blueleaf <[email protected]>
Co-authored-by: Changrz <[email protected]>
Co-authored-by: Guy Niv <[email protected]>
Co-authored-by: Yoland Yan <[email protected]>
Co-authored-by: Rui Wang (王瑞) <[email protected]>
Co-authored-by: AustinMroz <[email protected]>
Co-authored-by: Koratahiu <[email protected]>
Co-authored-by: Finn-Hecker <[email protected]>
Co-authored-by: filtered <[email protected]>
Co-authored-by: thorsten <[email protected]>
Co-authored-by: Daniel Harte <[email protected]>
Co-authored-by: Arjan Singh <[email protected]>
Co-authored-by: chaObserv <[email protected]>
Co-authored-by: Faych <[email protected]>
Co-authored-by: Rizumu Ayaka <[email protected]>
Co-authored-by: contentis <[email protected]>