Hi!

I'm experimenting with ROCm for LLM inference (llama.cpp HIP backend, running inside an unprivileged LXC container on Linux). Everything works well overall, and my Radeon AI Pro R9700 can enter proper runtime suspend (`runtime_status = suspended`) when no ROCm backend is active.
However, I noticed that when a ROCm compute context is alive and VRAM allocations exist (for example, when an LLM model is loaded into VRAM), the GPU does not enter deep idle:

- Memory controller stays active
- PCIe link remains in a higher power state
- Idle power is significantly higher (≈60–100 W)
- Deep runtime suspend only occurs if the ROCm backend is fully unloaded and VRAM freed
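For reference, here's roughly how I observe this from sysfs (a minimal sketch; the `card0` index and the hwmon instance are assumptions for my setup and vary between systems):

```python
# Minimal sketch: poll the runtime-PM state and power draw of an amdgpu card.
# card0 and the hwmon instance are assumptions for my setup.
import glob
import time

DEV = "/sys/class/drm/card0/device"

def runtime_status() -> str:
    # "active" while a ROCm context holds the GPU, "suspended" in deep idle
    with open(f"{DEV}/power/runtime_status") as f:
        return f.read().strip()

def power_watts() -> float | None:
    # amdgpu reports average power in microwatts via hwmon; not every
    # kernel/firmware combination exposes power1_average
    for path in glob.glob(f"{DEV}/hwmon/hwmon*/power1_average"):
        with open(path) as f:
            return int(f.read()) / 1_000_000
    return None

while True:
    print(f"{runtime_status():>10}  {power_watts()} W")
    time.sleep(5)
```

(`rocm-smi --showpower` shows the same average power reading.)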
This leads to a trade-off:

- Keep the model in VRAM → fast inference start, but high idle power
- Unload the model → low idle power, but slow first-token latency due to reloading multi-GB weights
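Right now the only workaround I see is managing this from userspace: drop the model after an idle timeout and eat the reload cost on the next request. A rough sketch of that idea, assuming the llama-cpp-python bindings (the model path and timeout are placeholders):

```python
# Sketch of an idle-unload wrapper: free VRAM after an idle timeout so the
# GPU can runtime-suspend, then reload on the next request.
# Assumes llama-cpp-python; MODEL_PATH and IDLE_TIMEOUT_S are placeholders.
import time
from llama_cpp import Llama

MODEL_PATH = "/models/model.gguf"
IDLE_TIMEOUT_S = 300  # unload after 5 minutes of idle

llm = None
last_used = 0.0

def generate(prompt: str) -> str:
    global llm, last_used
    if llm is None:
        # reload cost: multi-GB weights back into VRAM -> slow first token
        llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, verbose=False)
    last_used = time.monotonic()
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"]

def maybe_unload() -> None:
    # call periodically; freeing the context lets runtime suspend kick in
    global llm
    if llm is not None and time.monotonic() - last_used > IDLE_TIMEOUT_S:
        llm.close()  # releases VRAM and the HIP context
        llm = None
```

This gets runtime suspend back, but first-token latency after every idle period is dominated by reloading the weights, which is exactly the trade-off above.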
This raises a few questions:
1. Is it currently possible in ROCm/AMDGPU for the GPU to enter deep idle while VRAM allocations remain resident?
In other words: can ROCm park/suspend compute queues while keeping model weights in VRAM, allowing the GPU to drop into a low-power state without unloading everything?
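To isolate this from llama.cpp: on my system a bare VRAM allocation, with no kernels running, is already enough to keep the device out of runtime suspend. A minimal repro sketch, assuming a ROCm build of PyTorch (which uses the `cuda` device name on HIP; the `card0` path is again an assumption for my setup):

```python
# Minimal repro sketch: a single idle VRAM allocation keeps runtime_status
# at "active" on my setup. Assumes a ROCm build of PyTorch.
import time
import torch

buf = torch.empty(1024 ** 3, dtype=torch.uint8, device="cuda")  # 1 GiB of VRAM
for _ in range(12):
    with open("/sys/class/drm/card0/device/power/runtime_status") as f:
        print(f.read().strip())  # stays "active" while the allocation lives
    time.sleep(10)
```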
2. If not, is support for VRAM-resident idle states planned for future ROCm or AMDGPU driver releases?
This would be very useful for LLM inference workloads, which are typically “bursty”:

- short compute burst
- several minutes of idle
- another short burst
- repeat
NVIDIA GPUs handle a similar pattern with context parking + memory self-refresh, allowing VRAM to stay allocated while the GPU enters a deep idle state.
3. Are there architectural or driver limitations on RDNA/ROCm that prevent this today?
Understanding whether this is a hardware, firmware, or runtime limitation would help clarify expectations for future optimizations.
Thanks for your time — I’m happy to share logs or additional information from my setup if useful.