50% drop in Comfy video encoding on rented GPUs in last 2 weeks... #10203
JeanPierrePoulin started this conversation in General
Hello awesome Comfy devs!
I've been facing a crippling performance issue on every GPU I've rented over the last two weeks, and it has made my work on rented GPUs untenable. (Only my local 4090 is encoding right now!)
---- Here is my detailed question to the amazing Google Gemini with Deep Research ----
I am facing a crippling performance issue with every RunPod instance I create to encode Wan videos in ComfyUI, and it makes renting pods unworkable...
Here is the evidence I have:
My least-bad theory is that:
One piece of countering evidence:
Please review the images I have sent, paying particular attention to the NVIDIA driver version. Please do the deepest possible dive into the ComfyUI / PyTorch / NVIDIA forums to locate issues where NVIDIA GPUs get stuck at P1/P2 power levels... (see the NVML polling sketch after the questions below)
Q1: Is it possible that a recent change in Comfy / PyTorch would somehow result in the GPU not being used at its max power?
Q2: Is it possible that RunPod Community Cloud GPUs are capped at P1/P2 power (to save electricity)?
Q3: What additional data points can I provide that would give you more clues?
Q4: What happened in the last couple of weeks that would cause a 50% drop in performance?
Q5: I need this working ASAP and cannot spend weeks waiting for a fix! Is it possible / recommended to 'roll back' ComfyUI and / or PyTorch so I can get work done today?
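For anyone who wants to check the P-state question on their own pod, here is a minimal NVML polling sketch. It assumes the `nvidia-ml-py` bindings are installed (`pip install nvidia-ml-py`) and that the GPU in question is device index 0; it is a diagnostic sketch, not a fix.

```python
# Minimal NVML polling sketch: watch P-state, SM clock, and power while a job runs.
# Assumes `pip install nvidia-ml-py` and that the pod exposes the GPU at index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000  # mW -> W

try:
    while True:
        pstate = pynvml.nvmlDeviceGetPerformanceState(handle)       # 0 == P0 (max)
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000     # mW -> W
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu     # percent
        print(f"P{pstate}  SM {sm_mhz} MHz  {power_w:.0f}/{limit_w:.0f} W  util {util}%")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

If the enforced limit prints well below the card's stock TDP (450 W on an RTX 4090), that points at host-level power capping rather than a driver heuristic.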
--- Short summary of Gemini's findings ---
This report presents a comprehensive technical investigation into a severe and persistent performance degradation affecting ComfyUI video encoding workloads on RunPod's Community Cloud platform. The primary symptom, a greater than 50% increase in processing time, is correlated with telemetry data showing high-end NVIDIA GPUs (RTX 4090, RTX 5090) operating in suboptimal performance states (P1/P2) despite reporting 100% utilization.
The investigation concludes that the root cause is not an isolated hardware or driver failure but a systemic issue stemming from a complex interaction between the software stack and the underlying cloud environment. The leading hypothesis is that a recent update within the PyTorch ecosystem, a core dependency of ComfyUI, has altered the characteristics of its CUDA workload. This change interacts negatively with the NVIDIA Linux driver's power management heuristics, causing the driver to select a significantly more conservative clock frequency profile within the expected P2 performance state. This effectively throttles the GPU's throughput. The issue is specific to the Linux environment, which explains why identical, up-to-date workflows on Windows systems remain unaffected.
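One way to turn "100% utilization but conservative clocks" into a hard number is a fixed-size matmul probe. The sketch below is an illustration and not part of the report: it assumes PyTorch with CUDA on device 0 and uses an arbitrary 8192x8192 fp16 matmul. Compare the printed TFLOP/s on a suspect pod against a known-good machine such as a local 4090.

```python
# Throughput probe: time a fixed fp16 matmul and report achieved TFLOP/s.
# Assumes a CUDA-capable PyTorch install; matrix size is arbitrary.
import time
import torch

assert torch.cuda.is_available()
dev = torch.device("cuda:0")
n = 8192
a = torch.randn(n, n, device=dev, dtype=torch.float16)
b = torch.randn(n, n, device=dev, dtype=torch.float16)

for _ in range(3):          # warm-up so clocks and caches settle
    _ = a @ b
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    _ = a @ b
torch.cuda.synchronize()
dt = time.perf_counter() - t0

flops = 2 * n**3 * iters    # 2*n^3 FLOPs per n x n matmul
print(f"{flops / dt / 1e12:.1f} TFLOP/s (fp16 matmul)")
```

On identical software stacks, a clock-throttled card will print a proportionally lower figure even while nvidia-smi reports it as fully utilized.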
The problem is likely exacerbated by environmental factors inherent to the RunPod Community Cloud, including the possibility of host-level power capping implemented by individual hardware providers to manage electricity costs.
Immediate remediation is achievable through a targeted software rollback. This report provides explicit, step-by-step instructions for reverting both ComfyUI and its PyTorch dependencies to a last-known-good configuration from before the performance degradation began. Further recommendations include advanced diagnostic procedures to gather more granular data and a long-term strategy for workflow stabilization, including dependency version pinning, the creation of custom Docker images, and a process for vetting cloud instances.
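To support that pinning strategy, here is a small sketch that records the full stack on a machine where encoding is still fast, so the same versions can be pinned on fresh pods. The pip line at the end is a placeholder, since the report does not name a verified last-known-good version:

```python
# Record the software stack on a known-good machine so it can be pinned elsewhere.
# Assumes PyTorch with CUDA and `nvidia-ml-py` are installed.
import torch
import pynvml

print("torch      :", torch.__version__)
print("torch CUDA :", torch.version.cuda)
print("cuDNN      :", torch.backends.cudnn.version())
print("device     :", torch.cuda.get_device_name(0))

pynvml.nvmlInit()
print("driver     :", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()

# Then, on a fresh pod (placeholder version, substitute what this script prints
# on a machine where encoding is still fast):
#   pip install torch==<known-good> --index-url https://download.pytorch.org/whl/cu121
```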
--- Complete 6-page report ---
https://g.co/gemini/share/1f763b7736cd
(Note: my question includes several key images of telemetry data from RunPod)
Q: Has anyone observed a sharp drop in the encoding performance of Linux-based Comfy in the last two weeks?