
Conversation


@RealityAnomaly RealityAnomaly commented Aug 22, 2025

This patch adds the logic needed to support runtime power management for PCI devices, which is required for runtime C10 and S0ix residency (see discussion in QubesOS/qubes-issues#6411). It is implemented via a change to pciback that makes managed devices part of the runtime PM stack in the Dom0 kernel, which allows advanced power-saving behaviour such as opportunistically putting both devices and their upstream ports into D3cold. More information on how the logic works is available in the patch notes as well as in the comments throughout the code.

It is at a point of relative stability now, but there are still a few issues to be ironed out, which is why this pull request is starting as a draft. These issues include:

  • On system-wide resume from sleep, USB devices are not automatically reattached to domains. "/usr/lib/qubes/usb-import: line 78: printf: write error: Device or resource busy" is visible in the qubesd logs. Devices can be reattached manually without issue, so this is potentially some sort of race condition or missing check. Update: this is a USB proxy bug, not related to this patch.
  • Some devices log a benign "power state set failed" message with EINVAL. Out of all the devices on my bus, this only happens for the Thunderbolt USB controller, and it does not seem to adversely affect suspend or runtime PM.
  • Intel GPU virtual functions that are not bound to any domain log a bunch of error messages when dom0 resumes from suspend. Update: this was a misconfiguration on my end, not related to this patch.

This patch also requires a change to QEMU to permit reading and writing of the PME registers: https://gist.github.com/RealityAnomaly/d4db40832e45b7540a3863dbf9a9e47c

> [...] keep power applied to the bus, and in most cases this will prevent the CPU from reaching states lower than Package C2.
>
> The vast majority of devices depend on PME (Power Management Events) to wake from D3cold, so Linux will not attempt to put them into deeper [...]
Contributor

From what I have seen, this seems to be the case only for runtime suspend. And that makes sense, if I understand the mechanism correctly. For runtime suspend, wakeup is basically always needed, since the device needs to notify the system that it wants to be woken up; otherwise functionality will be broken: for example, an incoming packet won't be delivered to the system. On the other hand, for system-wide suspend, working wakeup is only needed when that specific device is enabled for system wakeup. For Qubes' use case this normally isn't the case (you wake your machine from suspend with the power button, by opening the lid, or similar, not via, say, wake-on-LAN; USB keyboards might be an exception here).

To be clear: I'm trying to differentiate between runtime and system-wide suspend because the (system-wide) suspend topic is already rather big, and it looks like runtime suspend adds a bunch of further problems (like the one you are addressing here). So especially when, as here, quite non-trivial patching is needed, it would be good to keep an overview of what is needed for what. That doesn't mean we wouldn't like to have working runtime suspend; better power consumption during runtime is definitely welcome.

Author

Yeah. I got to this point by running through some of the steps given by Intel, which advise checking the link power states of PCI devices, without which they will not be able to enter either runtime or system suspend.

My understanding now is that link power states and D-states are separate, and that runtime-managed D-states are not necessary for system-wide suspend. I do think this patch would be especially valuable for battery runtime, as without it you can't even get C6, but it should not be required for system-wide S0ix.

Contributor

Ok, thanks for clarifying.

> I do think this patch would be especially valuable for battery runtime

Yes.

}

-/* PM_OK_BITS specifies the bits that the driver domain is allowed to change.
- * Can't allow driver domain to enable PMEs - they're shared */
Contributor

Somewhere you should describe why you think this old comment is wrong. (I'm not sure.)

Author

I never understood what "shared" referred to here, but I assumed it was because by default dom0 handles all PMEs and they would not be passed through in any way. In any case, this comment is from a 14-year-old commit, and I can't find much additional information on the reasoning.

From everything I've read of the PCI specs, and from looking at existing kernel code, the PME enable bits are set individually on PCI devices, and the events are dispatched as interrupts by the root port (in Dom0). So I'm not really worried that this violates any spec.

I'll add something along the lines of my explanation above to the patch.

Contributor

Not sure either. Maybe it's about conventional PCI (not PCIe)? See the paragraph starting with

> Unfortunately, there is no standard way of handling wakeup signals sent by
> conventional PCI devices on systems that are not ACPI-based, but there is one
> for PCI Express devices. [...]

in Linux's Documentation/power/pci.rst

When it's ready for that, we will see what people on xen-devel think about this.

Author

Sounds good. Yeah, there are three different methods of notification I'm aware of:

(1) The notification is delivered by interrupt from the PCIe root port (or "event collector"). This is the way that's used by the hardware in my machine.
(2) The notification is delivered via ACPI. I'm not familiar with how this works, as AFAIK my machine doesn't use it.
(3) The notification is delivered by setting the PME bit in the configuration registers. Because these registers cannot be read in D3cold, this method only works for PMEs from D3hot or shallower states. This is also the method I use in this patch to pass the PME to the guest.

Contributor

Is it reasonable to assume that there are no conventional PCI devices anymore?

Author

I would think so, especially considering that Qubes isn't targeted at vintage hardware. That said, there might be some niche use cases: a relatively recent Thales HSM I dealt with still only supported conventional PCI.


I think it depends on what you mean by conventional PCI. The physical interface is certainly dead outside some embedded devices. But if I understand the lspci output correctly, there are a couple of "conventional" PCI devices even on my modern laptop (ThinkPad P1 Gen4); by "conventional" I mean they don't have a PCIe capability structure.

Example:

00:1f.4 SMBus: Intel Corporation Tiger Lake-H SMBus Controller (rev 11)
        Subsystem: Lenovo Device 22e4
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at 614d1c8000 (64-bit, non-prefetchable) [size=256]
        Region 4: I/O ports at efa0 [size=32]
        Kernel driver in use: i801_smbus
        Kernel modules: i2c_i801

00:1f.5 Serial bus controller: Intel Corporation Tiger Lake-H SPI Controller (rev 11)
        Subsystem: Lenovo Device 22e4
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Region 0: Memory at a0800000 (32-bit, non-prefetchable) [size=4K]
        Kernel driver in use: intel-spi
        Kernel modules: spi_intel_pci

But these are not devices that would ever be passed through to a VM.

Author

I can confirm I also have such devices. But yeah, it's impossible to pass them through anyway, as they are used by dom0.

@RealityAnomaly RealityAnomaly force-pushed the feat/runtime-suspend branch 6 times, most recently from 4985ab6 to 823c708 on August 30, 2025 at 01:38

RealityAnomaly commented Aug 30, 2025

I've just pushed an updated version of this patch. This version significantly improves stability and ensures the host will only put devices into runtime suspend when the guest requests it (by setting the PME Enable flag). Furthermore, it prevents the guest from directly controlling the PME Enable and PME Status bits, leaving these to the runtime suspend logic in the host. This fixes an issue I'd been experiencing where some devices that enable PME would instantly wake the machine from suspend (system suspend, not runtime suspend).

Note that I have also removed the xen_pcibk_suspend_noirq logic for now, since in my opinion the host should perform the final suspend of devices in all cases, because it may need to set ACPI parameters to disable wakeup events and the like. I was also concerned about this function being inconsistent with the underlying bus implementation in pci-driver.c. This also means that (currently) Thunderbolt root ports will not be suspended even with force set, and you will have to enable the CONFIG_HOTPLUG_PCI kernel config option. IMO, overriding the logic like this doesn't seem like a robust way to do things and may break power management assumptions, but I'm open to discussion on this.


dsg22 commented Aug 30, 2025

> I've just pushed an updated version of this patch. [ ... ]

Cool, thanks. I'm building a new kernel with it now.

Just to confirm, is this gist still the current version of the qemu patch for vmm-xen-stubdom-linux?
https://gist.github.com/RealityAnomaly/d4db40832e45b7540a3863dbf9a9e47c

@RealityAnomaly
Author

>> I've just pushed an updated version of this patch. [ ... ]
>
> Cool, thanks. I'm building a new kernel with it now.
>
> Just to confirm, is this gist still the current version of the qemu patch for vmm-xen-stubdom-linux? https://gist.github.com/RealityAnomaly/d4db40832e45b7540a3863dbf9a9e47c

Yeah, that's still the current version. BTW, I'd recommend pulling the changes again, as I pushed another fix to ensure the power management state is correctly reset on PCI device detach, and I switched from using pm get/put calls to enable/disable, which seems like the better way to do things.

@RealityAnomaly
Author

I've been dealing with another issue lately: Intel GPU VFs break after Package C10 is entered, apparently because something resets the MSI and bus mastering enable bits. It's not Dom0, the stubdom, or the guest, and I'm fairly certain it's not Xen, so that only leaves the hardware.

I can't find much documentation on what exactly the PMC does when it enters C10. I suppose it could be power gating the PCI root ports, and that affects the VFs somehow? But I don't know why it would only affect the VFs...

@RealityAnomaly
Author

> I've been dealing with another issue as of late: Intel GPU VFs break after Package C10 is entered, apparently because something resets the MSI and bus mastering enable bits. It's not Dom0, the stubdom, or the guest, and I'm fairly certain it's not Xen, so that only leaves the hardware.
>
> I can't find much documentation on what exactly the PMC is doing when it enters C10, I suppose it could be power gating the PCI root ports and that affects it somehow? But I don't know why it would only affect the VFs...

I reported this bug upstream: strongtz/i915-sriov-dkms#349

As this driver is third-party, experimental, and not supported by Intel (SR-IOV is not officially supported for the GPU on my platform anyway), I think we can disregard this for now and carry on with the patch, but I'll add a note about it somewhere.

@RealityAnomaly
Author

Following a preliminary review with Andrew at the Qubes Summit: the basic logic of this is solid, but we need to change the way we're sending PMEs.

Ideally, we should implement a new hypercall in Xen that allows us to send the PME through ACPI. Then we can drop the current hacky approach where we write to the PME register and fake the power state as D3hot (in the case of D3cold) so that the guest can read it.
