50% drop in Comfy video encoding on rented GPUs in last 2 weeks... #10203
JeanPierrePoulin started this conversation in General
Hello awesome Comfy devs!
I've been facing a crippling performance issue on every GPU I've rented over the last two weeks, and it has made my work on rented GPUs untenable. (Only my local 4090 is encoding right now!)
---- Here is my detailed question to the amazing Google Gemini with Deep Research ----
I am facing a crippling performance issue with every RunPod instance I create to encode Wan videos in ComfyUI, and it makes renting pods unworkable...
Here is the evidence I have:
My least-bad theory is that:
One piece of countering evidence:
Please review the images I have sent, paying particular attention to the NVIDIA driver version. Please do the deepest possible dive into the ComfyUI / PyTorch / NVIDIA forums to locate issues where NVIDIA GPUs get stuck at P1/P2 power levels... (see the NVML polling sketch after the questions below)
Q1: Is it possible that a recent change in Comfy / PyTorch would somehow result in the GPU not being used at its max power?
Q2: Is it possible that RunPod Community Cloud GPUs are capped at P1/P2 power (to save electricity)?
Q3: What additional data points can I provide that would give you more clues?
Q4: What happened in the last couple of weeks that would cause a 50% drop in performance?
Q5: I need this working ASAP and cannot spend weeks waiting for a fix! Is it possible / recommended to 'roll back' ComfyUI and / or PyTorch so I can get work done today?
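For anyone who wants to check the P-state question on their own pod, here is a minimal NVML polling sketch. It assumes the `nvidia-ml-py` bindings are installed (`pip install nvidia-ml-py`) and that the GPU in question is device index 0; it is a diagnostic sketch, not a fix.

```python
# Minimal NVML polling sketch: watch P-state, SM clock, and power while a job runs.
# Assumes `pip install nvidia-ml-py` and that the pod exposes the GPU at index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000  # mW -> W

try:
    while True:
        pstate = pynvml.nvmlDeviceGetPerformanceState(handle)       # 0 == P0 (max)
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000     # mW -> W
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu     # percent
        print(f"P{pstate}  SM {sm_mhz} MHz  {power_w:.0f}/{limit_w:.0f} W  util {util}%")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

If the enforced limit prints well below the card's stock TDP (450 W on an RTX 4090), that points at host-level power capping rather than a driver heuristic.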
--- Short summary of Gemini's findings ---
This report presents a comprehensive technical investigation into a severe and persistent performance degradation affecting ComfyUI video encoding workloads on RunPod's Community Cloud platform. The primary symptom, a greater than 50% increase in processing time, is correlated with telemetry data showing high-end NVIDIA GPUs (RTX 4090, RTX 5090) operating in suboptimal performance states (P1/P2) despite reporting 100% utilization.
The investigation concludes that the root cause is not an isolated hardware or driver failure but a systemic issue stemming from a complex interaction between the software stack and the underlying cloud environment. The leading hypothesis is that a recent update within the PyTorch ecosystem, a core dependency of ComfyUI, has altered the characteristics of its CUDA workload. This change interacts negatively with the NVIDIA Linux driver's power management heuristics, causing the driver to select a significantly more conservative clock frequency profile within the expected P2 performance state. This effectively throttles the GPU's throughput. The issue is specific to the Linux environment, which explains why identical, up-to-date workflows on Windows systems remain unaffected.
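One way to turn "100% utilization but conservative clocks" into a hard number is a fixed-size matmul probe. The sketch below is an illustration and not part of the report: it assumes PyTorch with CUDA on device 0 and uses an arbitrary 8192x8192 fp16 matmul. Compare the printed TFLOP/s on a suspect pod against a known-good machine such as a local 4090.

```python
# Throughput probe: time a fixed fp16 matmul and report achieved TFLOP/s.
# Assumes a CUDA-capable PyTorch install; matrix size is arbitrary.
import time
import torch

assert torch.cuda.is_available()
dev = torch.device("cuda:0")
n = 8192
a = torch.randn(n, n, device=dev, dtype=torch.float16)
b = torch.randn(n, n, device=dev, dtype=torch.float16)

for _ in range(3):          # warm-up so clocks and caches settle
    _ = a @ b
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    _ = a @ b
torch.cuda.synchronize()
dt = time.perf_counter() - t0

flops = 2 * n**3 * iters    # 2*n^3 FLOPs per n x n matmul
print(f"{flops / dt / 1e12:.1f} TFLOP/s (fp16 matmul)")
```

On identical software stacks, a clock-throttled card will print a proportionally lower figure even while nvidia-smi reports it as fully utilized.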
The problem is likely exacerbated by environmental factors inherent to the RunPod Community Cloud, including the possibility of host-level power capping implemented by individual hardware providers to manage electricity costs.
Immediate remediation is achievable through a targeted software rollback. This report provides explicit, step-by-step instructions for reverting both ComfyUI and its PyTorch dependencies to a last-known-good configuration from before the performance degradation began. Further recommendations include advanced diagnostic procedures to gather more granular data and a long-term strategy for workflow stabilization, including dependency version pinning, the creation of custom Docker images, and a process for vetting cloud instances.
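To support that pinning strategy, here is a small sketch that records the full stack on a machine where encoding is still fast, so the same versions can be pinned on fresh pods. The pip line at the end is a placeholder, since the report does not name a verified last-known-good version:

```python
# Record the software stack on a known-good machine so it can be pinned elsewhere.
# Assumes PyTorch with CUDA and `nvidia-ml-py` are installed.
import torch
import pynvml

print("torch      :", torch.__version__)
print("torch CUDA :", torch.version.cuda)
print("cuDNN      :", torch.backends.cudnn.version())
print("device     :", torch.cuda.get_device_name(0))

pynvml.nvmlInit()
print("driver     :", pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()

# Then, on a fresh pod (placeholder version, substitute what this script prints
# on a machine where encoding is still fast):
#   pip install torch==<known-good> --index-url https://download.pytorch.org/whl/cu121
```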
--- Complete 6-page report ---
https://g.co/gemini/share/1f763b7736cd
(Note: my question includes several key images of telemetry data from RunPod)
Q: Has anyone observed a sharp drop in the encoding performance of Linux-based Comfy in the last two weeks?