
AMD RDNA3 PERFORMANCE FIX SOLUTION FOUND #10460

@Reber01Good

Description


Custom Node Testing

Expected Behavior

Runs fast, no OOM.

Actual Behavior

Runs slow, with OOM.

Steps to Reproduce

torch.backends.cudnn.enabled=False -> BAD
torch.backends.cudnn.enabled=True -> GOOD

Env Vars: MIOPEN_FIND_MODE=2 MIGRAPHX_MLIR_USE_SPECIFIC_OPS="attention" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" FLASH_ATTENTION_TRITON_AMD_AUTOTUNE="FALSE" TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 (TunableOp tuning is faster and better than FA autotune)

Launch: --preview-size 1024 --reserve-vram 0.9 --async-offload --use-flash-attention --fp32-vae --disable-smart-memory
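The A/B toggle above can be reproduced in a few lines of Python. A minimal sketch (assumes a PyTorch ROCm build is installed; on ROCm, torch.backends.cudnn.enabled gates MIOpen, not cuDNN; the helper is a no-op when torch is absent):

```python
# Minimal toggle for the flag described above. On ROCm builds,
# torch.backends.cudnn.enabled controls whether MIOpen is used.

def set_miopen(enabled: bool):
    """Set torch.backends.cudnn.enabled if torch is importable.

    Returns the value actually in effect, or None when torch is unavailable.
    """
    try:
        import torch
    except ImportError:
        return None  # torch not installed; nothing to toggle
    torch.backends.cudnn.enabled = enabled
    return torch.backends.cudnn.enabled

if __name__ == "__main__":
    # True -> GOOD on RDNA3 per this report; False -> BAD (slow, OOM)
    print("MIOpen path enabled:", set_miopen(True))
```

Run a workflow once with set_miopen(True) and once with set_miopen(False) to see the difference on RDNA3.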

System specs and software can be found in the main comment.

Debug Logs

Nothing useful here:

[START] Security scan
[ComfyUI-Manager] Using uv as Python module for pip operations.
Using Python 3.12.11 environment at: .pyenv/versions/3.12.11
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2025-10-24 06:25:50.486
** Platform: Linux
** Python version: 3.12.11 (main, Oct 16 2025, 06:37:22) [GCC 13.3.0]
** Python executable: /home/alex/.pyenv/versions/3.12.11/bin/python
** ComfyUI Path: /home/alex/ComfyUI
** ComfyUI Base Folder Path: /home/alex/ComfyUI
** User directory: /home/alex/ComfyUI/user
** ComfyUI-Manager config path: /home/alex/ComfyUI/user/default/ComfyUI-Manager/config.ini
** Log path: /home/alex/ComfyUI/user/comfyui.log
Using Python 3.12.11 environment at: .pyenv/versions/3.12.11
Using Python 3.12.11 environment at: .pyenv/versions/3.12.11

Prestartup times for custom nodes:
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/rgthree-comfy
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-easy-use
   0.4 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-manager

Checkpoint files will always be loaded safely.
Total VRAM 24560 MB, total RAM 63903 MB
pytorch version: 2.10.0.dev20251017+rocm7.0
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1100
ROCm version: (7, 0)
Set vram state to: NORMAL_VRAM
Disabling smart memory management
Device: cuda:0 Radeon RX 7900 XTX : native
Using async weight offloading with 2 streams
Using Flash Attention
Python version: 3.12.11 (main, Oct 16 2025, 06:37:22) [GCC 13.3.0]
ComfyUI version: 0.3.66
ComfyUI frontend version: 1.30.1
[Prompt Server] web root: /home/alex/.pyenv/versions/3.12.11/lib/python3.12/site-packages/comfyui_frontend_package/static
### Loading: ComfyUI-Manager (V3.37)
[ComfyUI-Manager] network_mode: public
### ComfyUI Revision: 4113 [560b1bdf] *DETACHED | Released on '2025-10-21'
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
------------------------------------------
Comfyroll Studio v1.76 :  175 Nodes Loaded
------------------------------------------
** For changes, please see patch notes at https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes/blob/main/Patch_Notes.md
** For help, please see the wiki at https://github.com/Suzie1/ComfyUI_Comfyroll_CustomNodes/wiki
------------------------------------------
[/home/alex/ComfyUI/custom_nodes/comfyui_controlnet_aux] | INFO -> Using ckpts path: /home/alex/ComfyUI/custom_nodes/comfyui_controlnet_aux/ckpts
[/home/alex/ComfyUI/custom_nodes/comfyui_controlnet_aux] | INFO -> Using symlinks: False
[/home/alex/ComfyUI/custom_nodes/comfyui_controlnet_aux] | INFO -> Using ort providers: ['CUDAExecutionProvider', 'DirectMLExecutionProvider', 'OpenVINOExecutionProvider', 'ROCMExecutionProvider', 'CPUExecutionProvider', 'CoreMLExecutionProvider']
DWPose: Onnxruntime with acceleration providers detected
Device configuration: GPU=cuda, CPU=cpu
[ComfyUI-Easy-Use] server: v1.3.4 Loaded
[ComfyUI-Easy-Use] web root: /home/alex/ComfyUI/custom_nodes/comfyui-easy-use/web_version/v2 Loaded
ComfyUI-GGUF: Allowing full torch compile
### Loading: ComfyUI-Impact-Pack (V8.25.1)
### Loading: ComfyUI-Impact-Subpack (V1.3.5)
[Impact Pack/Subpack] Using folder_paths to determine whitelist path: /home/alex/ComfyUI/user/default/ComfyUI-Impact-Subpack/model-whitelist.txt
[Impact Pack/Subpack] Ensured whitelist directory exists: /home/alex/ComfyUI/user/default/ComfyUI-Impact-Subpack
[Impact Pack/Subpack] Loaded 0 model(s) from whitelist: /home/alex/ComfyUI/user/default/ComfyUI-Impact-Subpack/model-whitelist.txt
[Impact Pack] Wildcards loading done.
[Impact Subpack] ultralytics_bbox: /home/alex/ComfyUI/models/ultralytics/bbox
[Impact Subpack] ultralytics_segm: /home/alex/ComfyUI/models/ultralytics/segm
### Loading: ComfyUI-Inspire-Pack (V1.22.2)
LIGER kernel not found. The option to enable it will be disabled.
[MultiGPU Core Patching] Patching mm.soft_empty_cache for Comprehensive Memory Management (VRAM + CPU + Store Pruning)
[MultiGPU Core Patching] Patching mm.get_torch_device, mm.text_encoder_device, mm.unet_offload_device
[MultiGPU DEBUG] Initial current_device: cuda:0
[MultiGPU DEBUG] Initial current_text_encoder_device: cuda:0
[MultiGPU DEBUG] Initial current_unet_offload_device: cpu
[MultiGPU] Initiating custom_node Registration. . .
-----------------------------------------------
custom_node                   Found     Nodes
-----------------------------------------------
ComfyUI-LTXVideo                  N         0
ComfyUI-Florence2                 Y         2
ComfyUI_bitsandbytes_NF4          N         0
x-flux-comfyui                    N         0
ComfyUI-MMAudio                   N         0
ComfyUI-GGUF                      Y        18
PuLID_ComfyUI                     N         0
ComfyUI-WanVideoWrapper           N         0
-----------------------------------------------
[MultiGPU] Registration complete. Final mappings: CheckpointLoaderAdvancedMultiGPU, CheckpointLoaderAdvancedDisTorch2MultiGPU, UNetLoaderLP, UNETLoaderMultiGPU, VAELoaderMultiGPU, CLIPLoaderMultiGPU, DualCLIPLoaderMultiGPU, TripleCLIPLoaderMultiGPU, QuadrupleCLIPLoaderMultiGPU, CLIPVisionLoaderMultiGPU, CheckpointLoaderSimpleMultiGPU, ControlNetLoaderMultiGPU, DiffusersLoaderMultiGPU, DiffControlNetLoaderMultiGPU, UNETLoaderDisTorch2MultiGPU, VAELoaderDisTorch2MultiGPU, CLIPLoaderDisTorch2MultiGPU, DualCLIPLoaderDisTorch2MultiGPU, TripleCLIPLoaderDisTorch2MultiGPU, QuadrupleCLIPLoaderDisTorch2MultiGPU, CLIPVisionLoaderDisTorch2MultiGPU, CheckpointLoaderSimpleDisTorch2MultiGPU, ControlNetLoaderDisTorch2MultiGPU, DiffusersLoaderDisTorch2MultiGPU, DiffControlNetLoaderDisTorch2MultiGPU, Florence2ModelLoaderMultiGPU, DownloadAndLoadFlorence2ModelMultiGPU, UnetLoaderGGUFDisTorchMultiGPU, UnetLoaderGGUFAdvancedDisTorchMultiGPU, CLIPLoaderGGUFDisTorchMultiGPU, DualCLIPLoaderGGUFDisTorchMultiGPU, TripleCLIPLoaderGGUFDisTorchMultiGPU, QuadrupleCLIPLoaderGGUFDisTorchMultiGPU, UnetLoaderGGUFDisTorch2MultiGPU, UnetLoaderGGUFAdvancedDisTorch2MultiGPU, CLIPLoaderGGUFDisTorch2MultiGPU, DualCLIPLoaderGGUFDisTorch2MultiGPU, TripleCLIPLoaderGGUFDisTorch2MultiGPU, QuadrupleCLIPLoaderGGUFDisTorch2MultiGPU, UnetLoaderGGUFMultiGPU, UnetLoaderGGUFAdvancedMultiGPU, CLIPLoaderGGUFMultiGPU, DualCLIPLoaderGGUFMultiGPU, TripleCLIPLoaderGGUFMultiGPU, QuadrupleCLIPLoaderGGUFMultiGPU
/home/alex/.pyenv/versions/3.12.11/lib/python3.12/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Using Flash Attention
(RES4LYF) Init
(RES4LYF) Importing beta samplers.
FETCH ComfyRegistry Data: 5/101
(RES4LYF) Importing legacy samplers.

[rgthree-comfy] Loaded 48 magnificent nodes. 🎉

[save_image_extended] AVIF is not supported. To add it: pip install pillow pillow-avif-plugin
[save_image_extended] JXL is not supported. To add it: pip install jxlpy
[save_image_extended]                       You will need a valid MSVC env to build the wheel
[save_image_extended] version: 2.64
/home/alex/ComfyUI/custom_nodes/kaytool/api/clean_vram.py:8: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml
WAS Node Suite: OpenCV Python FFMPEG support is enabled
WAS Node Suite Warning: `ffmpeg_bin_path` is not set in `/home/alex/ComfyUI/custom_nodes/was-node-suite-comfyui/was_suite_config.json` config file. Will attempt to use system ffmpeg binaries if available.
WAS Node Suite: Finished. Loaded 220 nodes successfully.

	"Dream big and dare to fail." - Norman Vaughan

no module 'xformers'. Processing without...
no module 'xformers'. Processing without...

Import times for custom nodes:
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/websocket_image_save.py
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/test.py
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/image-resize-comfyui
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/Ttl_ComfyUi_NNLatentUpscale
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI_Cutoff
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/promptoptimizer
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-detail-daemon
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-imageautotone
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-visualarea-nodes
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/exit
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI-Align
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui_ttp_toolset
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-fbcnn
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/euler-smea-dyn-sampler
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-interactive
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-clip-with-break
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/cg-use-everywhere
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/Comfyui-Resolution-Master
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui_steudio
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/stablediffusion-dpmpp_2m_alt-sampler
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI_restart_sampling
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/save-image-extended-comfyui
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/rocm-ninodes
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI-DGLS
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui_densediffusion
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-automaticcfg
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI-GGUF
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI-TiledDiffusion
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI-VFI
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI-Extra-Samplers
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI-mxToolkit
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui_essentials
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/teacache
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/RyuuNoodles
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-custom-scripts
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-frame-interpolation
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-florence2
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-image-saver
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui_ultimatesdupscale
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-multigpu
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/qwen3-vl-nsfw
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI_Mira
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/Comfyui-Superprompt-Unofficial
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-kjnodes
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/gguf
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/SwarmComfyCommon
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/rgthree-comfy
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/basic_data_handling
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui_jc2
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/kaytool
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI-Flowty-LDSR
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui_controlnet_aux
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-inspire-pack
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-sam2
   0.0 seconds: /home/alex/ComfyUI/custom_nodes/SwarmComfyExtra
   0.1 seconds: /home/alex/ComfyUI/custom_nodes/data-analysis
   0.1 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-manager
   0.1 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-impact-pack
   0.1 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-rk-sampler
   0.1 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-impact-subpack
   0.1 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-openvino
   0.1 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-logicutils
   0.1 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-rgt
   0.2 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-videohelpersuite
   0.2 seconds: /home/alex/ComfyUI/custom_nodes/ComfyUI_Comfyroll_CustomNodes
   0.3 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-supir
   0.3 seconds: /home/alex/ComfyUI/custom_nodes/RES4LYF
   0.7 seconds: /home/alex/ComfyUI/custom_nodes/was-node-suite-comfyui
   0.9 seconds: /home/alex/ComfyUI/custom_nodes/bjornulf_custom_nodes
   1.8 seconds: /home/alex/ComfyUI/custom_nodes/comfyui-easy-use

Context impl SQLiteImpl.
Will assume non-transactional DDL.
No target revision found.
Starting server

To see the GUI go to: http://127.0.0.1:8188

Other

I FOUND OUT WHAT IS CAUSING THE ISSUES ON RDNA3: it is torch.backends.cudnn.enabled (i.e. MIOpen).

For an entire year I have been wondering why my 7900 XTX sometimes struggled with the UNet, VAE, upscaling, and TunableOp. Sometimes it worked perfectly and performance was close to an RTX 4090; other times it either hit OOM or ran like an RTX 4080. It turns out the culprit was torch.backends.cudnn.enabled: it needs to be enabled on RDNA3 so that MIOpen is used. The flag flipped seemingly at random with custom nodes, ComfyUI updates, Python updates, etc., which made it nearly impossible to pinpoint.

I found this while reading the changelog for v0.3.65, specifically "Improve AMD performance. by @comfyanonymous in #10302". I edited model_management.py line 335 myself to override the flag to True, and the issue is fixed.

What was mentioned about needing to disable it on RDNA4 applies ONLY to RDNA4. MIOpen may be buggy there, but it is stable and necessary on RDNA3. The patch that defaults the VAE to bf16 on all AMD GPUs is also a regression: it is slower on RDNA3 than on RDNA4. On RDNA3, fp16 and fp32 VAE run at the same speed and VRAM consumption (with torch.backends.cudnn.enabled=True), and fp32 avoids all black images.

This could easily have slipped through and remained hard-coded to False, which would have crippled RDNA3 performance in ComfyUI indefinitely.
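Until a fix lands upstream, the override can live in a throwaway custom node instead of an edited model_management.py. A hedged sketch (the node name and path are hypothetical; it relies on custom nodes being imported after model_management has already set the flag, which matches the startup order in the log above):

```python
# custom_nodes/rdna3_cudnn_fix/__init__.py  (hypothetical node name/path)
# Re-enables torch.backends.cudnn.enabled (MIOpen on ROCm) after
# ComfyUI's model_management has force-disabled it for AMD GPUs.

NODE_CLASS_MAPPINGS = {}  # no UI nodes; this module exists for its side effect

def reenable_miopen():
    """Force the flag back to True; returns the value set, or None without torch."""
    try:
        import torch
    except ImportError:
        return None
    torch.backends.cudnn.enabled = True
    print("[rdna3_cudnn_fix] torch.backends.cudnn.enabled =",
          torch.backends.cudnn.enabled)
    return torch.backends.cudnn.enabled

reenable_miopen()  # runs once, at custom-node import time
```

This keeps the workaround out of ComfyUI's own source tree, so it survives updates.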

Please patch ASAP!


Benchmarks:

SDXL base, default workflow, 1024x1024, 20 steps Euler, VAE decode (not tiled), max fan speed, warm runs only (not the initial run), average of 3 runs:

CLIP speed: 0.066s -> 0.054s
CLIP mem: 1.74 GB -> 2.8 GB (the only regression)
UNet speed: 5.414s -> 4.392s
UNet mem: 5.94 GB -> 5.5 GB
VAE speed: 2.093s -> 0.186s (~11x faster)
VAE mem: 17.35 GB -> 1.74 GB (~10x less)
UNet speed (2048x2048): 26.543s -> 21.831s
UNet mem (2048x2048): 8.76 GB -> 6.92 GB
VAE speed (2048x2048): OOM -> 40.631s
VAE mem (2048x2048): OOM -> 8.47 GB
TunableOp tuning time: 5 min + OOM on VAE -> 20s
TunableOp gain (full workflow time): ~0.3s, scales with heavier workflows

I also quickly tested WAN 2.2: the VAE improvement is just as massive, and the UNet may be slightly faster.
Tiled VAE at 1024x1024 works perfectly for everything.
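For anyone re-running these numbers, the "skip the initial run, average 3" methodology can be sketched with only the standard library (the workload lambda is a placeholder — substitute a real UNet/VAE call, and for GPU work the callable must synchronize before returning):

```python
import time

def bench(fn, runs=3, warmup=1):
    """Average wall-clock seconds over `runs`, after `warmup` untimed calls.

    The warmup calls absorb lazy init and autotuning (e.g. MIOpen find,
    TunableOp tuning). For GPU work, fn should end with
    torch.cuda.synchronize(); otherwise async kernel launches make the
    timings meaningless.
    """
    for _ in range(warmup):
        fn()  # untimed
    total = 0.0
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        total += time.perf_counter() - t0
    return total / runs

if __name__ == "__main__":
    # Placeholder workload; replace with the actual decode/sampling call.
    avg = bench(lambda: sum(range(100_000)))
    print(f"avg: {avg:.6f}s")
```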


System and parameters:
Ubuntu 24.04.3 LTS
CPU AMD R9 7950X3D
RAM 4x16 GB 6000 MT/s CL32
GPU RX 7900 XTX (OC)

ComfyUI v0.3.66
ROCm 7.0.2
Python 3.12.11
PyTorch 2.10.0.dev20251017+rocm7.0
flash_attn from "pip install -U git+https://github.com/FeepingCreature/flash-attention-gfx11@gel-crabs-headdim512 --no-build-isolation" (still faster than the official build)

Env Vars: MIOPEN_FIND_MODE=2 MIGRAPHX_MLIR_USE_SPECIFIC_OPS="attention" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" FLASH_ATTENTION_TRITON_AMD_AUTOTUNE="FALSE" TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 (TunableOp tuning is faster and better than FA autotune)
Launch: --preview-size 1024 --reserve-vram 0.9 --async-offload --use-flash-attention --fp32-vae --disable-smart-memory


Also, some GPU comparisons to show the value gap (prices at time of my purchase):
RTX 4090: 3.61s, 2289€
RX 7900 XTX: 4.39s, 1189€
RTX 4080: 6.46s, 1309€ (only 16 GB; worse value than the 60 class)
The consensus at the time favored the 4080 on ray-traced gaming benchmarks, but for this workload it is clearly the worst value. Glad I dodged that bullet.

Labels: Potential Bug (user is reporting a bug; should be tested)