Conversation

@cxdong commented Dec 9, 2025

pvvmcs part 2: implement the PV interfaces for accessing the vCPU state

@cxdong force-pushed the pkvm-v6.18 branch 2 times, most recently from c1f0cd8 to 0f4b8fe on December 16, 2025 11:12
@cxdong force-pushed the pkvm-v6.18-pvvmcs-part2 branch from b504641 to 8b26289 on December 16, 2025 14:52
@cxdong force-pushed the pkvm-v6.18-pvvmcs-part2 branch 2 times, most recently from df60a09 to 222764d on December 19, 2025 06:28

static void pkvm_update_exception_bitmap(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.guest_state_protected)
Contributor

It seems useful to add a WARN_ON_ONCE to all such guest_state_protected checks (since if this happens, it most likely means a bug in KVM-high, and we'd better detect such bugs?)

Looks like we already have such bugs: I've just tried adding WARN_ON_ONCE to all these guest_state_protected checks (on top of https://github.com/intel-staging/pKVM-IA/tree/pvvmcs-v6.18-wip) and then tried running an Ubuntu Cloud pVM, and it warned in 2 places: in pkvm_get_rflags() called from pkvm_get_if_flag(), and in pkvm_skip_emulated_instruction() called from handle_fastpath_wrmsr().

Contributor Author

> It seems useful to add a WARN_ON_ONCE to all such guest_state_protected checks (since if this happens, it most likely means a bug in KVM-high, and we'd better detect such bugs?)

These functions are the implementations of kvm_x86_ops, which may be used by the KVM x86 layer for both pVMs and npVMs. So I think it is convenient for the KVM x86 layer not to distinguish between VM types, to avoid changing KVM: we can implement the ops differently for npVMs and pVMs. TDX does similarly: the host is not allowed to access a TD's vcpu state, so the corresponding ops are implemented as no-ops (e.g., vt_get_segment/vt_set_segment) rather than through a warning.
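To illustrate the TDX-style pattern described here, a minimal sketch (hypothetical names; pkvm_read_guest_segment stands in for whatever the patch actually calls) could look like:

static void pkvm_get_segment(struct kvm_vcpu *vcpu,
			     struct kvm_segment *var, int seg)
{
	/* Host may not read a pVM's vcpu state: silently return zeros. */
	if (vcpu->arch.guest_state_protected) {
		memset(var, 0, sizeof(*var));
		return;
	}

	/* npVM: fetch the segment from the hypervisor as usual. */
	pkvm_read_guest_segment(vcpu, var, seg);
}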

Contributor

Ok, after more thinking: if KVM abnormally triggers such a PV interface for a pVM, it is not necessarily a bug in KVM-high; it may be caused by the pVM itself, e.g. if it executes an instruction that it shouldn't execute, since that requires decoding & emulation by the host. (Unless we gate such cases with earlier checks in kvm_emulate_XXX() and so on, to prevent calling those PV interfaces even in this case. But that might be too much trouble to implement.)

So yeah, might be fine to keep it as is.

Still, are we sure these 2 observations I mentioned (pkvm_get_if_flag() -> pkvm_get_rflags() and handle_fastpath_wrmsr() -> pkvm_skip_emulated_instruction()) don't indicate any bugs in KVM-high? (I haven't looked into that myself yet.)

Contributor

> Still, are we sure these 2 observations I mentioned (pkvm_get_if_flag() -> pkvm_get_rflags() and handle_fastpath_wrmsr() -> pkvm_skip_emulated_instruction()) don't indicate any bugs in KVM-high? (I haven't looked into that myself yet.)

Ok, looked into that.

  1. get_if_flag() is used to set kvm_run->if_flag, which is used by userspace; in particular, crosvm uses it to check whether interrupts are enabled so it can inject an interrupt (see ready_for_interrupt() in hypervisor/src/kvm/x86_64.rs).

So, since once a protected vCPU has started running, pkvm_get_rflags() no longer returns up-to-date values but always returns a stale value without the X86_EFLAGS_IF bit set, I suppose this may prevent crosvm from injecting interrupts when needed (at least with a split irqchip)? Although I've tried running a pVM with crosvm with --irqchip split and it still seems to work fine (in particular, it receives interrupts from virtio devices)...

  2. handle_fastpath_wrmsr() triggers pkvm_skip_emulated_instruction() when it is called for APIC_BASE_MSR + (APIC_ICR >> 4), i.e. when the guest writes APIC_ICR.

So, the host silently fails to skip the instruction, yet the pVM still works fine (i.e. it does not spin on this trapped wrmsr instruction). Probably because pKVM has already skipped the instruction before forwarding the MSR write handling to the host? If so, this seems harmless.

BTW, could you point me to where in the code pKVM skips the instruction before forwarding MSR vmexit handling to the host? I can't find it in the code.

Contributor Author

> So, since once a protected vCPU has started running, pkvm_get_rflags() no longer returns up-to-date values but always returns a stale value without the X86_EFLAGS_IF bit set, I suppose this may prevent crosvm from injecting interrupts when needed (at least with a split irqchip)? Although I've tried running a pVM with crosvm with --irqchip split and it still seems to work fine (in particular, it receives interrupts from virtio devices)...

This sounds like an issue. I just looked at the crosvm code, and it seems ready_for_interrupt() is used to inject interrupts via the KVM_INTERRUPT ioctl. Maybe crosvm doesn't use the KVM_INTERRUPT ioctl to inject interrupts for virtio devices but uses eventfd instead, so we didn't observe an interrupt issue?

Perhaps we can just return the RFLAGS IF bit to the host via the get_rflags PV interface?
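To make the idea concrete, a minimal sketch, assuming a (hypothetical) get_if_flag PV call that reports only the IF bit, so nothing else about the pVM's RFLAGS is exposed:

static unsigned long pkvm_get_rflags(struct kvm_vcpu *vcpu)
{
	unsigned long rflags = to_vmx(vcpu)->rflags;	/* possibly stale */

	if (vcpu->arch.guest_state_protected) {
		union pkvm_hc_data out;

		/* Only X86_EFLAGS_IF is exposed; all other bits stay stale. */
		if (!pkvm_hypercall_out(get_if_flag, &out))
			rflags = (rflags & ~X86_EFLAGS_IF) |
				 (out.get_if_flag.value ? X86_EFLAGS_IF : 0);
	}

	return rflags;
}

This would keep kvm_run->if_flag (and thus crosvm's ready_for_interrupt()) accurate without leaking the rest of RFLAGS.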

> BTW, could you point me to where in the code pKVM skips the instruction before forwarding MSR vmexit handling to the host? I can't find it in the code.

It is done by update_protected_vcpu_state() in arch/x86/kvm/vmx/vmx.c, after the MSR vmexit is handled by the host, not before, as we need to know the emulation result to determine whether to skip the instruction or inject a #GP. It is not in this series but in pvvmcs part 3.
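Roughly, as a sketch of that ordering (not the actual part 3 code; the helper names are hypothetical):

static void update_protected_vcpu_state(struct kvm_vcpu *vcpu, bool emul_failed)
{
	/*
	 * Runs after the host has handled the MSR vmexit: the emulation
	 * result decides between skipping the instruction and a #GP.
	 */
	if (emul_failed)
		pkvm_queue_exception(vcpu, GP_VECTOR);	/* hypothetical */
	else
		pkvm_skip_instruction(vcpu);		/* hypothetical */
}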

Contributor

And in the case of SplitIrqChip, it seems to be only needed for PIC interrupts:

BTW, just an observation: when I run a VM with noapic on the guest kernel command line (i.e. with the PIC instead of LAPIC + IOAPIC), on real hardware it seems to work fine both as an npVM and as a pVM, but with pKVM inside qemu, the npVM works fine while the pVM hangs during boot. (But it looks like the cause is not this non-working if_flag but something else.)

Contributor

> Perhaps we can just return the RFLAGS IF bit to the host via the get_rflags PV interface?

I thought about that too. But that still means exposing some information about the pVM's internal state to the host.

By the way... I haven't yet looked into the patches implementing the PV interfaces for interrupt injection, but just a thought: I see that whenever we inject a lot of interrupts into a VM, we get quite a lot of interrupt_allowed PV calls to pKVM. (Maybe even not one but several such calls per single IRQ injection?)

So, just wondering: would it be possible to avoid this interrupt_allowed call and instead, for example, somehow ask pKVM (in a single hypercall) to "inject interrupt if allowed", which would check whether an interrupt is allowed and inject it if so? Both to reduce the number of host vmexits and to avoid exposing this info to the host.

(Maybe this is not a smart suggestion. I'm yet to get familiar with how interrupt injection works in pvVMCS.)

Contributor Author

> BTW, just an observation: when I run a VM with noapic on the guest kernel command line (i.e. with the PIC instead of LAPIC + IOAPIC), on real hardware it seems to work fine both as an npVM and as a pVM, but with pKVM inside qemu, the npVM works fine while the pVM hangs during boot. (But it looks like the cause is not this non-working if_flag but something else.)

Tried with pKVM inside qemu, booting a pVM with noapic on the guest kernel command line. With 1/2/4/8 vCPUs I didn't observe a boot issue, but with 16 vCPUs the boot hangs. Before that, I can see the call trace below from the guest kernel:

[    2.784039] virtio_blk virtio0: 16/0/0 default/read/poll queues
[    2.812041] ------------[ cut here ]------------
[    2.816041] WARNING: CPU: 1 PID: 1 at arch/x86/kernel/apic/apic.c:2308 __irq_msi_compose_msg+0xa3/0xb0
[    2.816041] Modules linked in:
[    2.816041] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-pkvm-clang+ #109 PREEMPT(voluntary)
[    2.816041] Hardware name: ChromiumOS crosvm, BIOS 0
[    2.816041] RIP: 0010:__irq_msi_compose_msg+0xa3/0xb0
[    2.816041] Code: f9 ff 7f 00 00 77 23 c1 e9 03 81 e1 e0 0f 00 00 09 c1 89 0e 5d e9 3d aa d9 00 cc 81 f9 00 01 00 00 73 07 5d c3 cc cc cc cc cc <0f> 0b 5d c3 cc cc cc cc cc 0f 1f 40 00 b8 20 7f 43 8e 90 90 90 90
[    2.816041] RSP: 0000:ffffc9000000f440 EFLAGS: 00010046
[    2.816041] RAX: 00000000fee00004 RBX: ffff888006af4c28 RCX: 0000000000000100
[    2.816041] RDX: 0000000000000000 RSI: ffffc9000000f470 RDI: ffff8880068f4d40
[    2.816041] RBP: ffffc9000000f440 R08: 0000000000000000 R09: 0000000000000000
[    2.816041] R10: 0000000000000000 R11: ffffffff81362f50 R12: ffff8880054b9680
[    2.816041] R13: 0000000000000000 R14: ffff8880072e5c00 R15: 0000000000000000
[    2.816041] FS:  0000000000000000(0000) GS:ffff8880f6c72000(0000) knlGS:0000000000000000
[    2.816041] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.816041] CR2: 0000000000000000 CR3: 0000000002e3e001 CR4: 0000000000770ef0
[    2.816041] PKRU: 55555554
[    2.816041] Call Trace:
[    2.816041]  <TASK>
[    2.816041]  x86_vector_msi_compose_msg+0x2d/0x40
[    2.816041]  irq_chip_compose_msi_msg+0x6c/0x80
[    2.816041]  msi_domain_activate+0x4b/0xd0
[    2.816041]  ? __irq_domain_activate_irq+0x62/0xa0
[    2.816041]  __irq_domain_activate_irq+0x62/0xa0
[    2.816041]  irq_domain_activate_irq+0x28/0x60
[    2.816041]  irq_startup+0xd1/0x1a0
[    2.816041]  __setup_irq+0x63b/0x7f0
[    2.816041]  request_threaded_irq+0x138/0x190
[    2.816041]  ? __cfi_vring_interrupt+0x10/0x10
[    2.816041]  vp_find_one_vq_msix+0xfc/0x1b0
[    2.816041]  vp_find_vqs_msix+0x409/0x500
[    2.816041]  vp_find_vqs+0x43/0x2b0
[    2.816041]  vp_modern_find_vqs+0x1a/0x70
[    2.816041]  init_vq+0x2c3/0x3a0
[    2.816041]  ? __cfi_default_calc_sets+0x10/0x10
[    2.816041]  virtblk_probe+0x12e/0xef0
[    2.816041]  ? __this_cpu_preempt_check+0x17/0x30
[    2.816041]  ? vp_modern_get_status+0x1a/0x30
[    2.816041]  virtio_dev_probe+0x370/0x450
[    2.816041]  ? sysfs_do_create_link_sd+0x73/0xd0
[    2.816041]  really_probe+0x13e/0x490
[    2.816041]  __driver_probe_device+0x83/0x150
[    2.816041]  driver_probe_device+0x28/0x1d0
[    2.816041]  __driver_attach+0x132/0x280
[    2.816041]  ? __cfi___driver_attach+0x10/0x10
[    2.816041]  bus_for_each_dev+0x125/0x150
[    2.816041]  driver_attach+0x22/0x30
[    2.816041]  bus_add_driver+0x17a/0x2c0
[    2.816041]  driver_register+0x65/0xf0
[    2.816041]  __register_virtio_driver+0x31/0x40
[    2.816041]  virtio_blk_init+0x59/0xa0
[    2.816041]  ? __cfi_virtio_blk_init+0x10/0x10
[    2.816041]  do_one_initcall+0xf7/0x3b0
[    2.816041]  ? __lock_acquire+0x442/0x6c0
[    2.816041]  ? sched_clock_noinstr+0xd/0x30
[    2.816041]  ? __lock_acquire+0x442/0x6c0
[    2.816041]  ? sched_clock_noinstr+0xd/0x30
[    2.816041]  ? sched_clock_noinstr+0xd/0x30
[    2.816041]  ? __lock_acquire+0x442/0x6c0
[    2.816041]  ? parameq+0x1b/0xa0
[    2.816041]  ? parse_args+0x123/0x3e0
[    2.816041]  do_initcall_level+0xa5/0x135

Is this the same signature you saw?

> So, just wondering: would it be possible to avoid this interrupt_allowed call and instead, for example, somehow ask pKVM (in a single hypercall) to "inject interrupt if allowed", which would check whether an interrupt is allowed and inject it if so? Both to reduce the number of host vmexits and to avoid exposing this info to the host.

Yeah, I also thought about that. But it seems we would have to modify KVM's code. Currently KVM injects an interrupt with the code below in kvm_check_and_inject_events():

          if (kvm_cpu_has_injectable_intr(vcpu)) {
                  r = can_inject ? kvm_x86_call(interrupt_allowed)(vcpu, true) :
                                   -EBUSY;
                  if (r < 0)
                          goto out;
                  if (r) {
                          int irq = kvm_cpu_get_interrupt(vcpu);

                          if (!WARN_ON_ONCE(irq == -1)) {
                                  kvm_queue_interrupt(vcpu, irq, false);
                                  kvm_x86_call(inject_irq)(vcpu, false);
                                  WARN_ON(kvm_x86_call(interrupt_allowed)(vcpu, true) < 0);
                          }
                  }
                  if (kvm_cpu_has_injectable_intr(vcpu))
                          kvm_x86_call(enable_irq_window)(vcpu);
          }

So it expects inject_irq to succeed if interrupt_allowed is true. If we want to combine interrupt_allowed and inject_irq to reduce the number of hypercalls, then inject_irq should return a meaningful result to indicate whether the interrupt was allowed or not. This is doable, but it changes the return type of the inject_irq op, as well as all the places using or implementing this op. Not sure if it is worth making this change.

As this is an optimization and we don't have performance issues for guest interrupts right now, maybe we can leave this as it is until we need to improve it or have some other ideas?
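For reference, the combined op discussed above might look roughly like this, assuming the .inject_irq return-type change (all names here are hypothetical):

static int pkvm_inject_irq_if_allowed(struct kvm_vcpu *vcpu, int irq)
{
	union pkvm_hc_data out;

	/* One PV call both checks interruptibility and injects. */
	if (pkvm_hypercall_out(inject_irq_if_allowed, &out, irq))
		return -EIO;		/* the hypercall itself failed */

	/* pKVM reports whether the injection actually happened. */
	return out.inject_irq.injected ? 0 : -EBUSY;
}

kvm_check_and_inject_events() would then open an IRQ window on -EBUSY instead of calling interrupt_allowed first.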

Contributor

> Is this the same signature you saw?

No, I'm seeing a different symptom: regardless of the number of vCPUs, the pVM just hangs after printing these logs:

[    0.000000] Not enabling interrupt remapping due to skipped IO-APIC setup
[    0.000000] tsc: Unable to calibrate against PIT
[    0.000000] tsc: No reference (HPET/PMTIMER) available
[    0.000000] tsc: Marking TSC unstable due to could not calculate TSC khz

(BTW I'd expect the number of vCPUs shouldn't matter, since APIC is disabled, so the guest doesn't enable secondary CPUs anyway.)

More precisely, I'm observing this issue only with a split irqchip, i.e. with the following combination: --irqchip split in the crosvm command line + noapic in the guest kernel command line.

Actually, I've just noticed that with this combination an npVM also hangs during boot, although much later, after userspace has already started. And the same is actually observed without pKVM as well.

So this seems to be not a pKVM issue but, maybe, an issue in crosvm's split irqchip implementation.

> As this is an optimization and we don't have performance issues for guest interrupts right now, maybe we can leave this as it is until we need to improve it or have some other ideas?

Yeah agree, we can optimize it later if needed.

BTW this seems to be not the only KVM code path responsible for injecting interrupts. With enable_apicv enabled (i.e. in our default target setup), I'm not seeing any inject_irq calls in my vmexit tracing; I'm seeing sync_pir_to_irr calls instead. However, there are still quite a few interrupt_allowed calls and some other calls as well.

For example, here are the numbers of hypercalls during 1 second while running dd if=/dev/urandom bs=4096 count=10000 of=test oflag=direct in an npVM (with pKVM in qemu):

root@debian-virt:/sys/kernel/debug/pkvm# echo 0 >enable_vmexit_trace ; echo 1 >enable_vmexit_trace ; sleep 1; grep Host: vmexit_trace ; echo 0 >enable_vmexit_trace 
Host: CPUID 142 cycles 1481282 each-handler-cycle 10431
Host: VMCALL 38361 cycles 1986806585 each-handler-cycle 51792
Host: dump_vmexit_trace 1 cycles 49826 each-handler-cycle 49826
Host: vcpu_load 392 cycles 1386984 each-handler-cycle 3538
Host: vcpu_put 4 cycles 349642 each-handler-cycle 87410
Host: get_rflags 7734 cycles 173956690 each-handler-cycle 22492
Host: set_interrupt_shadow 7734 cycles 172761664 each-handler-cycle 22337
Host: interrupt_allowed 6600 cycles 192959931 each-handler-cycle 29236
Host: sync_pir_to_irr 5728 cycles 151754856 each-handler-cycle 26493
Host: vcpu_run 10168 cycles 1293586992 each-handler-cycle 127221

Or the same test in a pVM:

root@debian-virt:/sys/kernel/debug/pkvm# echo 0 >enable_vmexit_trace ; echo 1 >enable_vmexit_trace ; sleep 1; grep Host: vmexit_trace ; echo 0 >enable_vmexit_trace 
Host: CPUID 142 cycles 1184352 each-handler-cycle 8340
Host: VMCALL 142907 cycles 9358329498 each-handler-cycle 65485
Host: dump_vmexit_trace 1 cycles 38502 each-handler-cycle 38502
Host: vcpu_load 12772 cycles 44477650 each-handler-cycle 3482
Host: vcpu_put 908 cycles 54130012 each-handler-cycle 59614
Host: interrupt_allowed 81371 cycles 3735101546 each-handler-cycle 45902
Host: cancel_injection 18 cycles 943694 each-handler-cycle 52427
Host: sync_pir_to_irr 30556 cycles 1402745460 each-handler-cycle 45907
Host: vcpu_run 17281 cycles 4120892634 each-handler-cycle 238463

Seems like potentially a lot of room for optimization?

Contributor Author

> BTW this seems to be not the only KVM code path responsible for injecting interrupts. With enable_apicv enabled (i.e. in our default target setup), I'm not seeing any inject_irq calls in my vmexit tracing; I'm seeing sync_pir_to_irr calls instead. However, there are still quite a few interrupt_allowed calls and some other calls as well.

I see. As we have virtual interrupt delivery, kvm_check_and_inject_events() won't inject interrupts in its code path because kvm_cpu_has_injectable_intr() is false. The interrupt_allowed PV interface is used by some other code paths, e.g., kvm_vcpu_has_events -> kvm_arch_interrupt_allowed, to know whether the pending interrupt can be injected, in order to determine whether this vCPU should run. So it seems necessary to have the interrupt_allowed PV interface.

cxdong and others added 23 commits December 29, 2025 10:29
Enable the pKVM VMX host to deliver the SIPI vector to guest VMs. Allowing
the host to do so for pVMs is not secure, so pVMs will eventually use a
secure SIPI mechanism. Once that is ready, the pKVM host will only need
to deliver the SIPI vector to npVMs.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import vmx_vcpu_after_set_cpuid() from KVM to implement the
vcpu_after_set_cpuid operation for pKVM. As nested and PMU are not
supported, the code related to nested and PMU is removed for pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
Initialize KVM cpuid_xstate_sizes, which will be needed by pKVM to
leverage the KVM CPUID code to emulate a guest VM's CPUID.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import fpu_enable_guest_xfd_features() from the kernel FPU code for pKVM
to enable the guest xfd features. __xfd_enable_feature() is implemented
differently from the kernel version, as pKVM doesn't support allocating
memory dynamically; the fpstate memory for the xfd feature must therefore
be allocated and donated by the host via a dedicated PV interface. Thus,
enabling the xfd feature in pKVM amounts to checking that the fpstate
memory size is sufficiently large for enabling xfd.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import kvm_set_cpuid from KVM, with changes to remove unsupported code
paths: the virtual APIC is emulated by the host, the KVM MMU will be
handled by the host, PMU refresh is not needed as the PMU is not
supported, and Hyper-V related code is bypassed as it is also not
supported.

As a pVM's FPU should be isolated from the host, pKVM will enable its
xfd feature if the CPUID allows. For an npVM, enabling the xfd feature
is done by the host rather than by pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
The pKVM hypervisor doesn't support emulating KVM PV features, for
simplicity. Thus remove the KVM PV feature bits from the corresponding
CPUID.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add the vcpu_after_set_cpuid PV interface for the host to send CPUID
entries to the pKVM hypervisor, to be used as the CPUID entries for the
pkvm_vcpu. The CPUID entry memory pages will be donated to prevent the
host from accessing them. If the pkvm_vcpu already has CPUID entries,
the old entries' physical addresses and size will be returned to the
host via pkvm_memcache.

For a pVM, this PV interface allows the host to set CPUID entries before
the pVM has started running, but not once the pVM is already running.

Signed-off-by: Chuanxiao Dong <[email protected]>
Enable the pKVM VMX host to send CPUID entries to the pKVM hypervisor
via the corresponding PV interface. If the PV call succeeds, the
previous CPUID entries, if any, will be reclaimed. As this PV interface
is inaccessible to the host once the guest state is protected, there is
no need to send CPUID entries in that case.

Signed-off-by: Chuanxiao Dong <[email protected]>
Do it before kvm_set_cpuid() for protected VMs. Accordingly, copy more
CPUID helper functions from KVM and fix dependencies on the following
symbols:
  - kvm_pmu_cap
  - kvm_mmu_get_max_tdp_level()
  - xstate_get_guest_group_perm()

Signed-off-by: Kevin Tian <[email protected]>
Currently the CPUID is set by the host and is immutable once the pVM
runs. But certain CPUID leaves define security features required by the
pVM to ensure its security inside. Lacking them may put the pVM at risk.

Adopt a simple policy similar to QEMU's '--cpu host', using the default
pKVM-supported leaves as the base plus a small set that the host is
allowed to control. The allowed set is defined based on the leaves that
crosvm currently touches, and is scrutinized to have no security
implications for the pVM.

This saves us the burden of maintaining a fine-grained, bit-wise complex
policy as TDX does.

Signed-off-by: Kevin Tian <[email protected]>
Add the vcpu_add_fpstate PV interface for the host to add new fpstate
memory if needed to support xfd, as the fpstate memory allocated when
creating the vcpu may not be sufficient. As the pVM's FPU state is
managed by the pKVM hypervisor while the npVM's FPU state is managed by
the host, re-allocating the fpstate is only necessary for the pVM. This
PV interface is not accessible to the host once the pVM has started
running, to ensure the host can't manipulate the pVM's FPU. The old
fpstate memory's physical addresses and size are returned via
pkvm_memcache for the host to reclaim.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add new fpstate memory via the vcpu_add_fpstate PV interface if the xfd
feature is enabled for a pVM. As the pVM's FPU state is managed by the
pKVM hypervisor while the npVM's FPU state is managed by the host,
re-allocating is only necessary for the pVM, and should be done before
adding the new CPUID entries to the pKVM hypervisor.

If the new fpstate memory is added successfully, free the old one.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import vmx_write_tsc_offset/multiplier() from KVM to implement
write_tsc_offset/multiplier operations for pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add write_tsc_offset/multiplier PV interfaces to allow the host to write
the TSC offset and TSC multiplier. These PV interfaces are always
accessible to the host, as the Trusty pVM can use the GSC as its secure
time source instead of the TSC. For the same reason, the host is allowed
to emulate and access the guest TSC_ADJUST and TSC MSRs.

Signed-off-by: Chuanxiao Dong <[email protected]>
Enable the pKVM VMX host to write TSC offset and multiplier via the
corresponding PV interfaces.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import vmx_load_mmu_pgd from KVM to implement load_mmu_pgd operation for
pKVM. The Hyper-V related handling is skipped as this is not supported
by pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add the load_mmu_pgd PV interface for the host to load the guest MMU via
the vendor-specific operation. Currently this PV interface is always
accessible to the host, as the guest MMU for both pVMs and npVMs is
still managed by the host. It will be updated accordingly once pvMMU is
ready to manage the guest MMU inside pKVM to achieve the pVM's memory
isolation.

Signed-off-by: Chuanxiao Dong <[email protected]>
Enable the pKVM VMX host to load the guest MMU via the load_mmu_pgd PV
interface.

Signed-off-by: Chuanxiao Dong <[email protected]>
Nested virtualization is not supported on this platform. To satisfy the
interface requirements, provide stub implementations for nested-related
operations.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import the vmx_setup_mce operation from KVM to implement the setup_mce
operation for pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add the setup_mce PV interface for the host to set up the guest MCE.
This PV interface is always accessible to the host, as the input mcg_cap
from the host is checked and the guest MCE is initialized by pKVM
itself. No guest MCE state will be leaked.

Signed-off-by: Chuanxiao Dong <[email protected]>
Enable the pKVM VMX host to set up the guest MCE via the setup_mce PV
interface.

Signed-off-by: Chuanxiao Dong <[email protected]>
The pKVM hypervisor doesn't support SMM mode emulation. To satisfy the
interface requirement, provide stub implementations for the SMM mode
operations.

Signed-off-by: Chuanxiao Dong <[email protected]>
@cxdong force-pushed the pkvm-v6.18-pvvmcs-part2 branch from 5962b64 to e1ac2ac on December 29, 2025 02:55

/*
 * Make sure the operations before setting vmx->segment_cache.bitmask
 * won't be reordered.
Contributor

Could you explain in a bit more detail why we need this barrier? What does it pair with?

Contributor Author

I would like to make sure the compiler won't reorder these on the same CPU. vmx->segment_cache.bitmask should be set after the segment is cached, e.g., save->selector = var->selector; in pkvm_cache_segment() should happen before vmx->segment_cache.bitmask |= mask;.
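To spell out the intended ordering (a sketch, assuming the cache layout implied by the quoted code; barrier() suffices because only same-CPU compiler reordering is a concern here):

static void pkvm_cache_segment(struct vcpu_vmx *vmx,
			       struct kvm_segment *var, int seg, u32 mask)
{
	struct kvm_segment *save = &vmx->segment_cache.seg[seg];

	save->selector = var->selector;	/* fill the cached value first... */

	barrier();

	/* ...and only then publish it via the bitmask. */
	vmx->segment_cache.bitmask |= mask;
}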

Contributor

Ok, I guess it is related to preemption? i.e. I guess the actual reason is that, as mentioned in the commit message, we don't have a special no_cache version of pkvm_get_cpl(), so kvm_x86_call(get_cpl_no_cache)(vcpu) -> pkvm_get_cpl() -> pkvm_segment_cache_test() may preempt the thread that is executing pkvm_cache_segment() here?

And is that also the reason why regular KVM doesn't need a similar barrier in its vmx_segment_cache_test_set() (which may be preempted by vmx_get_cpl_no_cache(), but vmx_get_cpl_no_cache() doesn't do anything with the segment cache, so that's safe)?

If that's correct, is it worth explaining that in this comment?

Contributor

And actually I'm wondering: why don't we just follow what KVM does, i.e. implement a separate pkvm_get_cpl_no_cache() function which just makes the hypercall to pKVM, without messing with the host-side cache?

Otherwise, are we really sure our code is safe?

Contributor Author

> If that's correct, is it worth explaining that in this comment?

Yes, it is correct. As I read the KVM code, it sets segment_cache.bitmask first and then saves the cached value. E.g., vmx_read_guest_seg_ar() sets the segment_cache.bitmask for SEG_FIELD_AR first, then reads the AR bytes from the VMCS field. So if vmx_read_guest_seg_ar() is preempted after segment_cache.bitmask is set, the cached value is not up-to-date yet. On the pKVM side, we save the cached value first and then set segment_cache.bitmask, so if pkvm_cache_segment() is preempted, there won't be a mismatch between the cache and the bitmask.

> And actually I'm wondering: why don't we just follow what KVM does, i.e. implement a separate pkvm_get_cpl_no_cache() function which just makes the hypercall to pKVM, without messing with the host-side cache?

Yes, we can do that. Actually, there is an optimization for the get_segment PV interface: to reduce the number of get_segment PV calls, we sync some segment registers to the host for an npVM and set the segment_cache.bitmask (https://github.com/intel-staging/pKVM-IA/blame/pvvmcs-v6.18-wip/arch/x86/kvm/vmx/vmx.c#L9843). This way, the no_cache version also doesn't have to use the get_segment PV interface.

	    !pkvm_segment_cache_test(vmx, seg, SEG_FIELD_AR)) {
		union pkvm_hc_data out;

		if (WARN_ON(pkvm_hypercall_out(get_segment, &out, seg)))
Contributor

nit question: any particular reason to do this WARN_ON check for get_segment and set_segment, while not doing it for most other hypercalls?

Looks like the get_segment and set_segment hypercalls may only fail due to pkvm_vcpu_handle_host_hypercall() failing with -EINVAL if no vCPU is loaded, or with -EPERM if this is a pVM and it is already running. The latter should be prevented by the above vcpu->arch.guest_state_protected check. So, do we want to WARN_ON specifically in the case where no vCPU is loaded? But then why just for get_segment and set_segment, and not other PV calls?

Contributor

After a bit more thought: maybe it makes sense to do this WARN_ON check specifically in this case, just to avoid saving a wrong value in the segment cache if the hypercall failed...

Contributor Author (@cxdong, Dec 30, 2025)

> So, do we want to WARN_ON specifically in the case where no vCPU is loaded? But then why just for get_segment and set_segment, and not other PV calls?

No, just because the .get_segment and .set_segment callbacks are of void type, so the caller cannot tell whether the operation succeeded via the return value. Thus, use a WARN_ON. Maybe a KVM_BUG_ON is better?

Contributor

Ok, makes sense.

> Maybe a KVM_BUG_ON is better?

Maybe...

Contributor

Although, some thoughts:

  1. e.g. set_idt/get_idt/set_gdt/get_gdt are also void, but we don't WARN_ON or KVM_BUG_ON there?
  2. Do these warnings always indicate a KVM bug, or could it also be the pVM's fault (e.g. the pVM trying to execute an instruction that requires KVM to read the guest's segment)?
  3. What is TDX's approach? Looks like it doesn't warn in such cases (except for cache_reg, for some reason)?

Contributor Author

> 1. e.g. set_idt/get_idt/set_gdt/get_gdt are also void, but we don't WARN_ON or KVM_BUG_ON there?

Indeed; probably we can have a common rule to handle the return value. Maybe just WARN_ON(ret); in the __pkvm_hypercall macro.
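One possible shape for such a common rule (illustrative; the real __pkvm_hypercall plumbing is not shown):

#define pkvm_hypercall_checked(op, ...)					\
({									\
	long __ret = __pkvm_hypercall(op, ##__VA_ARGS__);		\
									\
	/* Every void caller gets the failure check for free. */	\
	WARN_ONCE(__ret, "pKVM hypercall " #op " failed: %ld\n", __ret);\
	__ret;								\
})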

> 2. Do these warnings always indicate a KVM bug, or could it also be the pVM's fault (e.g. the pVM trying to execute an instruction that requires KVM to read the guest's segment)?

It seems like a KVM bug. If it were the pVM's fault, requiring the host KVM to emulate by reading the guest's segment, the pKVM hypervisor should identify that vmexit reason and handle it properly, e.g., by injecting a #GP or #UD into the pVM rather than returning to the host to emulate.

> 3. What is TDX's approach? Looks like it doesn't warn in such cases (except for cache_reg, for some reason)?

TDX's approach is for TDs only. Just a guess: a TD's vcpu->arch.guest_state_protected may be false if the TD has the debug attribute. In that case, these callbacks may be used when creating or running a TD, so it is not suitable to print a warning; instead, just ignore the write or return a zero value for the read. cache_reg is not a callback which would be used anyway, thus it is worth a warning.

		pkvm_hypercall_in(set_gdt, &data);
	} else {
		pkvm_hypercall_out(get_gdt, &data);
		*dt = data.get_gdt.desc;
Contributor

nit: it seems the code would look simpler if all these 4 cases were handled directly in the 4 functions below, instead of going through this common pkvm_access_idt_gdt() helper?

Contributor Author

It would result in several duplicated lines, e.g., the if (vcpu->arch.guest_state_protected) check. But anyway, using 4 functions rather than this common pkvm_access_idt_gdt() helper is also fine with me.

Contributor

IMHO that wouldn't be any significant duplication (e.g. those few lines would differ between the pkvm_get_*dt() and pkvm_set_*dt() functions, since pkvm_set_*dt() doesn't do the memset), while with this pkvm_access_idt_gdt() it seems at first glance that the logic is more sophisticated than it actually is. :)

Contributor Author

I see. I will use 4 functions rather than the common pkvm_access_idt_gdt(), e.g. along the lines of the sketch below.
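For example, two of the four resulting functions might look like this (a sketch built from the quoted snippet; the exact pkvm_hc_data field layout for set_gdt is illustrative):

static void pkvm_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
	union pkvm_hc_data data;

	memset(dt, 0, sizeof(*dt));
	if (vcpu->arch.guest_state_protected)
		return;

	pkvm_hypercall_out(get_gdt, &data);
	*dt = data.get_gdt.desc;
}

static void pkvm_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
	union pkvm_hc_data data;

	if (vcpu->arch.guest_state_protected)
		return;

	data.set_gdt.desc = *dt;
	pkvm_hypercall_in(set_gdt, &data);
}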

{
	if (!for_injection ||
	    (!kvm_event_needs_reinjection(vcpu) &&
	     !vcpu->arch.exception.pending))
Contributor

Any reason why we need to do these checks before calling kvm_x86_call(interrupt_allowed) / kvm_x86_call(nmi_allowed)? I'd assume such checks should have already been done on the host side?

Is that because of some security concerns? Or maybe the host doesn't have enough information to check this correctly, i.e. the host's values of exception.pending, exception.injected etc. are not accurate? (If so, maybe we should make them accurate?)

Contributor

Ok, after looking at next patches, I see that exceptions (unlike interrupts and NMIs) are managed by pKVM, not by the host, so I guess this makes sense.

Contributor Author

Yeah. Exceptions for pVMs are managed by pKVM. Exceptions for npVMs can also be injected by the host. So in pvVMCS part 3, the exception.pending and exception.injected of an npVM will be synced to the host as well (https://github.com/intel-staging/pKVM-IA/blob/pvvmcs-v6.18-wip/arch/x86/kvm/pkvm/pkvm.c#L1163).

	if (ret)
		return ret;
	if (data == cur_data)
		return 0;

Contributor

I wonder if this might cause us some issues, for multiple reasons:

  1. I can imagine that accessing some MSRs has side effects, and even reading or writing the same value can give a different effect.
  2. There are WO or RO MSRs, so e.g. reading a WO register will return unpredictable data and result in -EPERM, as the read data differs from the data we expect to write.

One example is MSR_IA32_FLUSH_CMD (which BTW I did not encounter being triggered by the host in this context; I'm just giving an example of such an MSR).

I didn't encounter such issues in basic tests, but perhaps on some different SoC we could get into hard-to-debug issues because of this? So perhaps it would be better to explicitly allow it for MSR_IA32_XSS, MSR_STAR, MSR_CSTAR, MSR_LSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP, which are all store/configuration MSRs and don't trigger events/hw commands, and warn otherwise? WDYT?

Contributor

Indeed, a very good point. Thanks for pointing it out.

Yeah, it looks like we should just allow setting those 5 MSRs set by crosvm (MSR_STAR, MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR). I assume it shouldn't be a security concern, since the guest is supposed to set up those MSRs on its own before it starts using the SYSCALL or SWAPGS instructions; in other words, any initial values of those MSRs are untrusted anyway, since they ultimately specify addresses in guest memory, while the entire guest memory contents are initially untrusted anyway.

Probably no need to allow MSR_IA32_XSS. KVM sets it to 0 in kvm_vcpu_reset(), but pKVM does the same anyway, and KVM's kvm_vcpu_reset() doesn't check the return value, so it should be ok if it just silently returns -EPERM and does nothing.
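A sketch of the explicit allow-list being discussed (a hypothetical helper, not patch code), covering the MSRs listed in this thread:

static bool pkvm_msr_passthrough_write_allowed(u32 msr)
{
	switch (msr) {
	case MSR_STAR:
	case MSR_CSTAR:
	case MSR_LSTAR:
	case MSR_SYSCALL_MASK:
	case MSR_KERNEL_GS_BASE:
	case MSR_IA32_SYSENTER_CS:
	case MSR_IA32_SYSENTER_ESP:
	case MSR_IA32_SYSENTER_EIP:
		return true;	/* plain state, no read/write side effects */
	default:
		/* e.g. MSR_IA32_FLUSH_CMD is WO; reading it back is meaningless. */
		WARN_ONCE(1, "unexpected MSR 0x%x in pkvm_set_msr\n", msr);
		return false;
	}
}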

Contributor

Just for the record: all the MSRs that I mentioned (not only the 5 above) I have seen in pkvm_set_msr default handling, either on real hw or when running pKVM in QEMU with some VMs:

  • MSR_IA32_XSS (0x00000da0) [BTW agree with your reasoning that it could be skipped]
  • MSR_STAR (0xc0000081)
  • MSR_LSTAR (0xc0000082)
  • MSR_CSTAR (0xc0000083)
  • MSR_SYSCALL_MASK (0xc0000084)
  • MSR_KERNEL_GS_BASE (0xc0000102)
  • MSR_IA32_SYSENTER_CS (0x174)
  • MSR_IA32_SYSENTER_ESP (0x175)
  • MSR_IA32_SYSENTER_EIP (0x176)

Contributor

Yeah, you are right, crosvm sets those as well:

/// Configure Model specific registers for long (64-bit) mode.
pub fn set_long_mode_msrs(msrs: &mut BTreeMap<u32, u64>) {
    msrs.insert(crate::msr_index::MSR_IA32_SYSENTER_CS, 0x0);
    msrs.insert(crate::msr_index::MSR_IA32_SYSENTER_ESP, 0x0);
    msrs.insert(crate::msr_index::MSR_IA32_SYSENTER_EIP, 0x0);

	kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
#else
	WARN_ONCE(1, "pkvm doesn't support injecting exception to a real mode guest\n");
#endif
Contributor

nit: the commit message says: "As the pKVM hypervisor doesn't support emulating real mode guests..." - maybe it actually supports real mode guests in general, and just doesn't support injecting exceptions into them (as that would require instruction emulation in the pKVM hypervisor)?

(Although I haven't tried testing real mode guests with pKVM myself, so I'm not sure.)

Contributor Author

Yes. pKVM relies on the unrestricted guest feature, so vmx->rmode.vm86_active will never be 1 and the code will never reach this point. Maybe an alternative is just to wrap this if condition in #ifndef __PKVM_HYP__ and update the commit message accordingly.
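For illustration, the alternative being suggested, sketched around the quoted snippet (the surrounding function is elided):

#ifndef __PKVM_HYP__
	/*
	 * Never reached in the pKVM hypervisor build: with unrestricted
	 * guest, vmx->rmode.vm86_active is never 1.
	 */
	if (vmx->rmode.vm86_active) {
		kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
		return;
	}
#endif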
