Conversation

@cxdong commented Dec 9, 2025

pvvmcs part 2: implement the PV interfaces for accessing the vCPU state

@cxdong force-pushed the pkvm-v6.18 branch 2 times, most recently from c1f0cd8 to 0f4b8fe on December 16, 2025 11:12
@cxdong force-pushed the pkvm-v6.18-pvvmcs-part2 branch from b504641 to 8b26289 on December 16, 2025 14:52
@cxdong force-pushed the pkvm-v6.18-pvvmcs-part2 branch 2 times, most recently from df60a09 to 222764d on December 19, 2025 06:28

static void pkvm_update_exception_bitmap(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.guest_state_protected)
Contributor

It seems useful to add a WARN_ON_ONCE to all such guest_state_protected checks (since if this happens, it most likely means a bug in KVM-high, and we'd better detect such bugs?)

Looks like we already have such bugs: I've just tried adding WARN_ON_ONCE to all these guest_state_protected checks (on top of https://github.com/intel-staging/pKVM-IA/tree/pvvmcs-v6.18-wip) and then tried running an Ubuntu Cloud pVM, and it warned in 2 places: in pkvm_get_rflags() called from pkvm_get_if_flag(), and in pkvm_skip_emulated_instruction() called from handle_fastpath_wrmsr().

Contributor Author

> It seems useful to add a WARN_ON_ONCE to all such guest_state_protected checks (since if this happens, it most likely means a bug in KVM-high, and we'd better detect such bugs?)

These functions are the implementations of kvm_x86_ops, which may be used by the KVM x86 layer for both pVMs and npVMs. So I think it is convenient for the KVM x86 layer not to distinguish between VM types, to avoid changing KVM: we can implement the ops differently for npVMs and pVMs. TDX does similarly: the host is not allowed to access a TD's vcpu state, so the corresponding ops are implemented as no-ops (e.g., vt_get_segment/vt_set_segment) rather than through a warning.
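To illustrate the TDX-style pattern described here, a minimal sketch (hypothetical names; pkvm_read_guest_segment stands in for whatever the patch actually calls) could look like:

static void pkvm_get_segment(struct kvm_vcpu *vcpu,
			     struct kvm_segment *var, int seg)
{
	/* Host may not read a pVM's vcpu state: silently return zeros. */
	if (vcpu->arch.guest_state_protected) {
		memset(var, 0, sizeof(*var));
		return;
	}

	/* npVM: fetch the segment from the hypervisor as usual. */
	pkvm_read_guest_segment(vcpu, var, seg);
}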

Contributor

Ok, after more thinking: if KVM abnormally triggers such a PV interface for a pVM, it is not necessarily a bug in KVM-high; it may be caused by the pVM itself, e.g. if it executes an instruction that it shouldn't execute, since that requires decoding & emulation by the host. (Unless we gate such cases with earlier checks in kvm_emulate_XXX() and so on, to prevent calling those PV interfaces even in this case. But that might be too much trouble to implement.)

So yeah, might be fine to keep it as is.

Still, are we sure these 2 observations I mentioned (pkvm_get_if_flag() -> pkvm_get_rflags() and handle_fastpath_wrmsr() -> pkvm_skip_emulated_instruction()) don't indicate any bugs in KVM-high? (I haven't looked into that myself yet.)

Contributor

> Still, are we sure these 2 observations I mentioned (pkvm_get_if_flag() -> pkvm_get_rflags() and handle_fastpath_wrmsr() -> pkvm_skip_emulated_instruction()) don't indicate any bugs in KVM-high? (I haven't looked into that myself yet.)

Ok, looked into that.

  1. get_if_flag() is used to set kvm_run->if_flag, which is used by userspace; in particular, crosvm uses it to check whether interrupts are enabled so it can inject an interrupt (see ready_for_interrupt() in hypervisor/src/kvm/x86_64.rs).

So, since once a protected vCPU has started running, pkvm_get_rflags() no longer returns up-to-date values but always returns a stale value without the X86_EFLAGS_IF bit set, I suppose this may prevent crosvm from injecting interrupts when needed (at least with a split irqchip)? Although I've tried running a pVM with crosvm with --irqchip split and it still seems to work fine (in particular, it receives interrupts from virtio devices)...

  2. handle_fastpath_wrmsr() triggers pkvm_skip_emulated_instruction() when it is called for APIC_BASE_MSR + (APIC_ICR >> 4), i.e. when the guest writes APIC_ICR.

So, the host silently fails to skip the instruction, yet the pVM still works fine (i.e. it does not spin on this trapped wrmsr instruction). Probably because pKVM has already skipped the instruction before forwarding the MSR write handling to the host? If so, this seems harmless.

BTW, could you point me to where in the code pKVM skips the instruction before forwarding MSR vmexit handling to the host? I can't find it in the code.

Contributor Author

> So, since once a protected vCPU has started running, pkvm_get_rflags() no longer returns up-to-date values but always returns a stale value without the X86_EFLAGS_IF bit set, I suppose this may prevent crosvm from injecting interrupts when needed (at least with a split irqchip)? Although I've tried running a pVM with crosvm with --irqchip split and it still seems to work fine (in particular, it receives interrupts from virtio devices)...

This sounds like an issue. I just looked at the crosvm code, and it seems ready_for_interrupt() is used to inject interrupts via the KVM_INTERRUPT ioctl. Maybe crosvm doesn't use the KVM_INTERRUPT ioctl to inject interrupts for virtio devices but uses eventfd instead, so we didn't observe an interrupt issue?

Perhaps we can just return the RFLAGS IF bit to the host via the get_rflags PV interface?
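To make the idea concrete, a minimal sketch, assuming a (hypothetical) get_if_flag PV call that reports only the IF bit, so nothing else about the pVM's RFLAGS is exposed:

static unsigned long pkvm_get_rflags(struct kvm_vcpu *vcpu)
{
	unsigned long rflags = to_vmx(vcpu)->rflags;	/* possibly stale */

	if (vcpu->arch.guest_state_protected) {
		union pkvm_hc_data out;

		/* Only X86_EFLAGS_IF is exposed; all other bits stay stale. */
		if (!pkvm_hypercall_out(get_if_flag, &out))
			rflags = (rflags & ~X86_EFLAGS_IF) |
				 (out.get_if_flag.value ? X86_EFLAGS_IF : 0);
	}

	return rflags;
}

This would keep kvm_run->if_flag (and thus crosvm's ready_for_interrupt()) accurate without leaking the rest of RFLAGS.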

> BTW, could you point me to where in the code pKVM skips the instruction before forwarding MSR vmexit handling to the host? I can't find it in the code.

It is done by update_protected_vcpu_state() in arch/x86/kvm/vmx/vmx.c, after the MSR vmexit is handled by the host, not before, as we need to know the emulation result to determine whether to skip the instruction or inject a #GP. It is not in this series but in pvvmcs part 3.
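Roughly, as a sketch of that ordering (not the actual part 3 code; the helper names are hypothetical):

static void update_protected_vcpu_state(struct kvm_vcpu *vcpu, bool emul_failed)
{
	/*
	 * Runs after the host has handled the MSR vmexit: the emulation
	 * result decides between skipping the instruction and a #GP.
	 */
	if (emul_failed)
		pkvm_queue_exception(vcpu, GP_VECTOR);	/* hypothetical */
	else
		pkvm_skip_instruction(vcpu);		/* hypothetical */
}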

Contributor

And in the case of SplitIrqChip, it seems to be only needed for PIC interrupts:

BTW, just an observation: when I run a VM with noapic on the guest kernel command line (i.e. with the PIC instead of LAPIC + IOAPIC), on real hardware it seems to work fine both as an npVM and as a pVM, but with pKVM inside qemu, the npVM works fine while the pVM hangs during boot. (But it looks like the cause is not this non-working if_flag but something else.)

Contributor

> Perhaps we can just return the RFLAGS IF bit to the host via the get_rflags PV interface?

I thought about that too. But that still means exposing some information about the pVM's internal state to the host.

By the way... I haven't yet looked into the patches implementing the PV interfaces for interrupt injection, but just a thought: I see that whenever we inject a lot of interrupts into a VM, we get quite a lot of interrupt_allowed PV calls to pKVM. (Maybe even not one but several such calls per single IRQ injection?)

So, just wondering: would it be possible to avoid this interrupt_allowed call and instead, for example, somehow ask pKVM (in a single hypercall) to "inject interrupt if allowed", which would check whether an interrupt is allowed and inject it if so? Both to reduce the number of host vmexits and to avoid exposing this info to the host.

(Maybe this is not a smart suggestion. I'm yet to get familiar with how interrupt injection works in pvVMCS.)

Contributor Author

> BTW, just an observation: when I run a VM with noapic on the guest kernel command line (i.e. with the PIC instead of LAPIC + IOAPIC), on real hardware it seems to work fine both as an npVM and as a pVM, but with pKVM inside qemu, the npVM works fine while the pVM hangs during boot. (But it looks like the cause is not this non-working if_flag but something else.)

Tried with pKVM inside qemu, booting a pVM with noapic on the guest kernel command line. With 1/2/4/8 vCPUs I didn't observe a boot issue, but with 16 vCPUs the boot hangs. Before that, I can see the call trace below from the guest kernel:

[    2.784039] virtio_blk virtio0: 16/0/0 default/read/poll queues
[    2.812041] ------------[ cut here ]------------
[    2.816041] WARNING: CPU: 1 PID: 1 at arch/x86/kernel/apic/apic.c:2308 __irq_msi_compose_msg+0xa3/0xb0
[    2.816041] Modules linked in:
[    2.816041] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-pkvm-clang+ #109 PREEMPT(voluntary)
[    2.816041] Hardware name: ChromiumOS crosvm, BIOS 0
[    2.816041] RIP: 0010:__irq_msi_compose_msg+0xa3/0xb0
[    2.816041] Code: f9 ff 7f 00 00 77 23 c1 e9 03 81 e1 e0 0f 00 00 09 c1 89 0e 5d e9 3d aa d9 00 cc 81 f9 00 01 00 00 73 07 5d c3 cc cc cc cc cc <0f> 0b 5d c3 cc cc cc cc cc 0f 1f 40 00 b8 20 7f 43 8e 90 90 90 90
[    2.816041] RSP: 0000:ffffc9000000f440 EFLAGS: 00010046
[    2.816041] RAX: 00000000fee00004 RBX: ffff888006af4c28 RCX: 0000000000000100
[    2.816041] RDX: 0000000000000000 RSI: ffffc9000000f470 RDI: ffff8880068f4d40
[    2.816041] RBP: ffffc9000000f440 R08: 0000000000000000 R09: 0000000000000000
[    2.816041] R10: 0000000000000000 R11: ffffffff81362f50 R12: ffff8880054b9680
[    2.816041] R13: 0000000000000000 R14: ffff8880072e5c00 R15: 0000000000000000
[    2.816041] FS:  0000000000000000(0000) GS:ffff8880f6c72000(0000) knlGS:0000000000000000
[    2.816041] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.816041] CR2: 0000000000000000 CR3: 0000000002e3e001 CR4: 0000000000770ef0
[    2.816041] PKRU: 55555554
[    2.816041] Call Trace:
[    2.816041]  <TASK>
[    2.816041]  x86_vector_msi_compose_msg+0x2d/0x40
[    2.816041]  irq_chip_compose_msi_msg+0x6c/0x80
[    2.816041]  msi_domain_activate+0x4b/0xd0
[    2.816041]  ? __irq_domain_activate_irq+0x62/0xa0
[    2.816041]  __irq_domain_activate_irq+0x62/0xa0
[    2.816041]  irq_domain_activate_irq+0x28/0x60
[    2.816041]  irq_startup+0xd1/0x1a0
[    2.816041]  __setup_irq+0x63b/0x7f0
[    2.816041]  request_threaded_irq+0x138/0x190
[    2.816041]  ? __cfi_vring_interrupt+0x10/0x10
[    2.816041]  vp_find_one_vq_msix+0xfc/0x1b0
[    2.816041]  vp_find_vqs_msix+0x409/0x500
[    2.816041]  vp_find_vqs+0x43/0x2b0
[    2.816041]  vp_modern_find_vqs+0x1a/0x70
[    2.816041]  init_vq+0x2c3/0x3a0
[    2.816041]  ? __cfi_default_calc_sets+0x10/0x10
[    2.816041]  virtblk_probe+0x12e/0xef0
[    2.816041]  ? __this_cpu_preempt_check+0x17/0x30
[    2.816041]  ? vp_modern_get_status+0x1a/0x30
[    2.816041]  virtio_dev_probe+0x370/0x450
[    2.816041]  ? sysfs_do_create_link_sd+0x73/0xd0
[    2.816041]  really_probe+0x13e/0x490
[    2.816041]  __driver_probe_device+0x83/0x150
[    2.816041]  driver_probe_device+0x28/0x1d0
[    2.816041]  __driver_attach+0x132/0x280
[    2.816041]  ? __cfi___driver_attach+0x10/0x10
[    2.816041]  bus_for_each_dev+0x125/0x150
[    2.816041]  driver_attach+0x22/0x30
[    2.816041]  bus_add_driver+0x17a/0x2c0
[    2.816041]  driver_register+0x65/0xf0
[    2.816041]  __register_virtio_driver+0x31/0x40
[    2.816041]  virtio_blk_init+0x59/0xa0
[    2.816041]  ? __cfi_virtio_blk_init+0x10/0x10
[    2.816041]  do_one_initcall+0xf7/0x3b0
[    2.816041]  ? __lock_acquire+0x442/0x6c0
[    2.816041]  ? sched_clock_noinstr+0xd/0x30
[    2.816041]  ? __lock_acquire+0x442/0x6c0
[    2.816041]  ? sched_clock_noinstr+0xd/0x30
[    2.816041]  ? sched_clock_noinstr+0xd/0x30
[    2.816041]  ? __lock_acquire+0x442/0x6c0
[    2.816041]  ? parameq+0x1b/0xa0
[    2.816041]  ? parse_args+0x123/0x3e0
[    2.816041]  do_initcall_level+0xa5/0x135

Is this the same signature you saw?

> So, just wondering: would it be possible to avoid this interrupt_allowed call and instead, for example, somehow ask pKVM (in a single hypercall) to "inject interrupt if allowed", which would check whether an interrupt is allowed and inject it if so? Both to reduce the number of host vmexits and to avoid exposing this info to the host.

Yeah, I also thought about that. But it seems we would have to modify KVM's code. Currently KVM injects an interrupt with the code below in kvm_check_and_inject_events():

          if (kvm_cpu_has_injectable_intr(vcpu)) {
                  r = can_inject ? kvm_x86_call(interrupt_allowed)(vcpu, true) :
                                   -EBUSY;
                  if (r < 0)
                          goto out;
                  if (r) {
                          int irq = kvm_cpu_get_interrupt(vcpu);

                          if (!WARN_ON_ONCE(irq == -1)) {
                                  kvm_queue_interrupt(vcpu, irq, false);
                                  kvm_x86_call(inject_irq)(vcpu, false);
                                  WARN_ON(kvm_x86_call(interrupt_allowed)(vcpu, true) < 0);
                          }
                  }
                  if (kvm_cpu_has_injectable_intr(vcpu))
                          kvm_x86_call(enable_irq_window)(vcpu);
          }

So it expects inject_irq to succeed if interrupt_allowed is true. If we want to combine interrupt_allowed and inject_irq to reduce the number of hypercalls, then inject_irq should return a meaningful result to indicate whether the interrupt was allowed or not. This is doable, but it changes the return type of the inject_irq op, as well as all the places using or implementing this op. Not sure if it is worth making this change.

As this is an optimization and we don't have performance issues for guest interrupts right now, maybe we can leave this as it is until we need to improve it or have some other ideas?
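For reference, the combined op discussed above might look roughly like this, assuming the .inject_irq return-type change (all names here are hypothetical):

static int pkvm_inject_irq_if_allowed(struct kvm_vcpu *vcpu, int irq)
{
	union pkvm_hc_data out;

	/* One PV call both checks interruptibility and injects. */
	if (pkvm_hypercall_out(inject_irq_if_allowed, &out, irq))
		return -EIO;		/* the hypercall itself failed */

	/* pKVM reports whether the injection actually happened. */
	return out.inject_irq.injected ? 0 : -EBUSY;
}

kvm_check_and_inject_events() would then open an IRQ window on -EBUSY instead of calling interrupt_allowed first.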

Contributor

> Is this the same signature you saw?

No, I'm seeing a different symptom: regardless of the number of vCPUs, the pVM just hangs after printing these logs:

[    0.000000] Not enabling interrupt remapping due to skipped IO-APIC setup
[    0.000000] tsc: Unable to calibrate against PIT
[    0.000000] tsc: No reference (HPET/PMTIMER) available
[    0.000000] tsc: Marking TSC unstable due to could not calculate TSC khz

(BTW I'd expect the number of vCPUs shouldn't matter, since APIC is disabled, so the guest doesn't enable secondary CPUs anyway.)

More precisely, I'm observing this issue only with a split irqchip, i.e. with the following combination: --irqchip split in the crosvm command line + noapic in the guest kernel command line.

Actually, I've just noticed that with this combination an npVM also hangs during boot, although much later, after userspace has already started. And the same is actually observed without pKVM as well.

So this seems to be not a pKVM issue but, maybe, an issue in crosvm's split irqchip implementation.

> As this is an optimization and we don't have performance issues for guest interrupts right now, maybe we can leave this as it is until we need to improve it or have some other ideas?

Yeah agree, we can optimize it later if needed.

BTW this seems to be not the only KVM code path responsible for injecting interrupts. With enable_apicv enabled (i.e. in our default target setup), I'm not seeing any inject_irq calls in my vmexit tracing; I'm seeing sync_pir_to_irr calls instead. However, there are still quite a few interrupt_allowed calls and some other calls as well.

For example, here are the numbers of hypercalls during 1 second while running dd if=/dev/urandom bs=4096 count=10000 of=test oflag=direct in an npVM (with pKVM in qemu):

root@debian-virt:/sys/kernel/debug/pkvm# echo 0 >enable_vmexit_trace ; echo 1 >enable_vmexit_trace ; sleep 1; grep Host: vmexit_trace ; echo 0 >enable_vmexit_trace 
Host: CPUID 142 cycles 1481282 each-handler-cycle 10431
Host: VMCALL 38361 cycles 1986806585 each-handler-cycle 51792
Host: dump_vmexit_trace 1 cycles 49826 each-handler-cycle 49826
Host: vcpu_load 392 cycles 1386984 each-handler-cycle 3538
Host: vcpu_put 4 cycles 349642 each-handler-cycle 87410
Host: get_rflags 7734 cycles 173956690 each-handler-cycle 22492
Host: set_interrupt_shadow 7734 cycles 172761664 each-handler-cycle 22337
Host: interrupt_allowed 6600 cycles 192959931 each-handler-cycle 29236
Host: sync_pir_to_irr 5728 cycles 151754856 each-handler-cycle 26493
Host: vcpu_run 10168 cycles 1293586992 each-handler-cycle 127221

Or the same test in a pVM:

root@debian-virt:/sys/kernel/debug/pkvm# echo 0 >enable_vmexit_trace ; echo 1 >enable_vmexit_trace ; sleep 1; grep Host: vmexit_trace ; echo 0 >enable_vmexit_trace 
Host: CPUID 142 cycles 1184352 each-handler-cycle 8340
Host: VMCALL 142907 cycles 9358329498 each-handler-cycle 65485
Host: dump_vmexit_trace 1 cycles 38502 each-handler-cycle 38502
Host: vcpu_load 12772 cycles 44477650 each-handler-cycle 3482
Host: vcpu_put 908 cycles 54130012 each-handler-cycle 59614
Host: interrupt_allowed 81371 cycles 3735101546 each-handler-cycle 45902
Host: cancel_injection 18 cycles 943694 each-handler-cycle 52427
Host: sync_pir_to_irr 30556 cycles 1402745460 each-handler-cycle 45907
Host: vcpu_run 17281 cycles 4120892634 each-handler-cycle 238463

Seems like potentially a lot of room for optimization?

Contributor Author

> BTW this seems to be not the only KVM code path responsible for injecting interrupts. With enable_apicv enabled (i.e. in our default target setup), I'm not seeing any inject_irq calls in my vmexit tracing; I'm seeing sync_pir_to_irr calls instead. However, there are still quite a few interrupt_allowed calls and some other calls as well.

I see. As we have virtual interrupt delivery, kvm_check_and_inject_events() won't inject interrupts in its code path because kvm_cpu_has_injectable_intr() is false. The interrupt_allowed PV interface is used by some other code paths, e.g., kvm_vcpu_has_events -> kvm_arch_interrupt_allowed, to know whether the pending interrupt can be injected, in order to determine whether this vCPU should run. So it seems necessary to have the interrupt_allowed PV interface.

cxdong and others added 23 commits December 29, 2025 10:29
Enable the pKVM VMX host to deliver the SIPI vector to guest VMs. Allowing
the host to do so for pVMs is not secure, so pVMs will eventually use a
secure SIPI mechanism. Once that is ready, the pKVM host will only need
to deliver the SIPI vector to npVMs.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import vmx_vcpu_after_set_cpuid() from KVM to implement the
vcpu_after_set_cpuid operation for pKVM. As nested and PMU are not
supported, the code related to nested and PMU is removed for pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
Initialize KVM cpuid_xstate_sizes, which will be needed by pKVM to
leverage the KVM CPUID code to emulate a guest VM's CPUID.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import fpu_enable_guest_xfd_features() from the kernel FPU code for pKVM
to enable the guest xfd features. __xfd_enable_feature() is implemented
differently from the kernel version, as pKVM doesn't support allocating
memory dynamically; the fpstate memory for the xfd feature must therefore
be allocated and donated by the host via a dedicated PV interface. Thus,
enabling the xfd feature in pKVM amounts to checking that the fpstate
memory size is sufficiently large for enabling xfd.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import kvm_set_cpuid from KVM, with changes to remove unsupported code
paths: the virtual APIC is emulated by the host, the KVM MMU will be
handled by the host, PMU refresh is not needed as the PMU is not
supported, and Hyper-V related code is bypassed as it is also not
supported.

As a pVM's FPU should be isolated from the host, pKVM will enable its
xfd feature if the CPUID allows. For an npVM, enabling the xfd feature
is done by the host rather than by pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
The pKVM hypervisor doesn't support emulating KVM PV features, for
simplicity. Thus remove the KVM PV feature bits from the corresponding
CPUID.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add the vcpu_after_set_cpuid PV interface for the host to send CPUID
entries to the pKVM hypervisor, to be used as the CPUID entries for the
pkvm_vcpu. The CPUID entry memory pages will be donated to prevent the
host from accessing them. If the pkvm_vcpu already has CPUID entries,
the old entries' physical addresses and size will be returned to the
host via pkvm_memcache.

For a pVM, this PV interface allows the host to set CPUID entries before
the pVM has started running, but not once the pVM is already running.

Signed-off-by: Chuanxiao Dong <[email protected]>
Enable the pKVM VMX host to send CPUID entries to the pKVM hypervisor
via the corresponding PV interface. If the PV call succeeds, the
previous CPUID entries, if any, will be reclaimed. As this PV interface
is inaccessible to the host once the guest state is protected, there is
no need to send CPUID entries in that case.

Signed-off-by: Chuanxiao Dong <[email protected]>
Do it before kvm_set_cpuid() for protected VMs. Accordingly, copy more
CPUID helper functions from KVM and fix dependencies on the following
symbols:
  - kvm_pmu_cap
  - kvm_mmu_get_max_tdp_level()
  - xstate_get_guest_group_perm()

Signed-off-by: Kevin Tian <[email protected]>
Currently the CPUID is set by the host and is immutable once the pVM
runs. But certain CPUID leaves define security features required by the
pVM to ensure its security inside. Lacking them may put the pVM at risk.

Adopt a simple policy similar to QEMU's '--cpu host', using the default
pKVM-supported leaves as the base plus a small set that the host is
allowed to control. The allowed set is defined based on the leaves that
crosvm currently touches, and is scrutinized to have no security
implications for the pVM.

This saves us the burden of maintaining a fine-grained, bit-wise complex
policy as TDX does.

Signed-off-by: Kevin Tian <[email protected]>
Add the vcpu_add_fpstate PV interface for the host to add new fpstate
memory if needed to support xfd, as the fpstate memory allocated when
creating the vcpu may not be sufficient. As the pVM's FPU state is
managed by the pKVM hypervisor while the npVM's FPU state is managed by
the host, re-allocating the fpstate is only necessary for the pVM. This
PV interface is not accessible to the host once the pVM has started
running, to ensure the host can't manipulate the pVM's FPU. The old
fpstate memory's physical addresses and size are returned via
pkvm_memcache for the host to reclaim.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add new fpstate memory via the vcpu_add_fpstate PV interface if the xfd
feature is enabled for a pVM. As the pVM's FPU state is managed by the
pKVM hypervisor while the npVM's FPU state is managed by the host,
re-allocating is only necessary for the pVM, and should be done before
adding the new CPUID entries to the pKVM hypervisor.

If the new fpstate memory is added successfully, free the old one.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import vmx_write_tsc_offset/multiplier() from KVM to implement
write_tsc_offset/multiplier operations for pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add write_tsc_offset/multiplier PV interfaces to allow the host to write
the TSC offset and TSC multiplier. These PV interfaces are always
accessible to the host, as the Trusty pVM can use the GSC as its secure
time source instead of the TSC. For the same reason, the host is allowed
to emulate and access the guest TSC_ADJUST and TSC MSRs.

Signed-off-by: Chuanxiao Dong <[email protected]>
Enable the pKVM VMX host to write TSC offset and multiplier via the
corresponding PV interfaces.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import vmx_load_mmu_pgd from KVM to implement load_mmu_pgd operation for
pKVM. The Hyper-V related handling is skipped as this is not supported
by pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add the load_mmu_pgd PV interface for the host to load the guest MMU via
the vendor-specific operation. Currently this PV interface is always
accessible to the host, as the guest MMU for both pVMs and npVMs is
still managed by the host. It will be updated accordingly once pvMMU is
ready to manage the guest MMU inside pKVM to achieve the pVM's memory
isolation.

Signed-off-by: Chuanxiao Dong <[email protected]>
Enable the pKVM VMX host to load the guest MMU via the load_mmu_pgd PV
interface.

Signed-off-by: Chuanxiao Dong <[email protected]>
Nested virtualization is not supported on this platform. To satisfy the
interface requirements, provide stub implementations for nested-related
operations.

Signed-off-by: Chuanxiao Dong <[email protected]>
Import the vmx_setup_mce operation from KVM to implement the setup_mce
operation for pKVM.

Signed-off-by: Chuanxiao Dong <[email protected]>
Add the setup_mce PV interface for the host to set up the guest MCE.
This PV interface is always accessible to the host, as the input mcg_cap
from the host is checked and the guest MCE is initialized by pKVM
itself. No guest MCE state will be leaked.

Signed-off-by: Chuanxiao Dong <[email protected]>
Enable the pKVM VMX host to set up the guest MCE via the setup_mce PV
interface.

Signed-off-by: Chuanxiao Dong <[email protected]>
The pKVM hypervisor doesn't support SMM mode emulation. To satisfy the
interface requirement, provide stub implementations for the SMM mode
operations.

Signed-off-by: Chuanxiao Dong <[email protected]>
@cxdong force-pushed the pkvm-v6.18-pvvmcs-part2 branch from 5962b64 to e1ac2ac on December 29, 2025 02:55

/*
 * Make sure the operations before setting vmx->segment_cache.bitmask
 * won't be reordered.
Contributor

Could you explain in a bit more detail why we need this barrier? What does it pair with?

Contributor Author

I would like to make sure the compiler won't reorder these on the same CPU. vmx->segment_cache.bitmask should be set after the segment is cached, e.g., save->selector = var->selector; in pkvm_cache_segment() should happen before vmx->segment_cache.bitmask |= mask;.
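To spell out the intended ordering (a sketch, assuming the cache layout implied by the quoted code; barrier() suffices because only same-CPU compiler reordering is a concern here):

static void pkvm_cache_segment(struct vcpu_vmx *vmx,
			       struct kvm_segment *var, int seg, u32 mask)
{
	struct kvm_segment *save = &vmx->segment_cache.seg[seg];

	save->selector = var->selector;	/* fill the cached value first... */

	barrier();

	/* ...and only then publish it via the bitmask. */
	vmx->segment_cache.bitmask |= mask;
}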

Contributor

Ok, I guess it is related to preemption? i.e. I guess the actual reason is that, as mentioned in the commit message, we don't have a special no_cache version of pkvm_get_cpl(), so kvm_x86_call(get_cpl_no_cache)(vcpu) -> pkvm_get_cpl() -> pkvm_segment_cache_test() may preempt the thread that is executing pkvm_cache_segment() here?

And is that also the reason why regular KVM doesn't need a similar barrier in its vmx_segment_cache_test_set() (which may be preempted by vmx_get_cpl_no_cache(), but vmx_get_cpl_no_cache() doesn't do anything with the segment cache, so that's safe)?

If that's correct, is it worth explaining that in this comment?

Contributor

And actually I'm wondering: why don't we just follow what KVM does, i.e. implement a separate pkvm_get_cpl_no_cache() function which just makes the hypercall to pKVM, without messing with the host-side cache?

Otherwise, are we really sure our code is safe?

Contributor Author

> If that's correct, is it worth explaining that in this comment?

Yes, it is correct. As I read the KVM code, it sets segment_cache.bitmask first and then saves the cached value. E.g., vmx_read_guest_seg_ar() sets the segment_cache.bitmask for SEG_FIELD_AR first, then reads the AR bytes from the VMCS field. So if vmx_read_guest_seg_ar() is preempted after segment_cache.bitmask is set, the cached value is not up-to-date yet. On the pKVM side, we save the cached value first and then set segment_cache.bitmask, so if pkvm_cache_segment() is preempted, there won't be a mismatch between the cache and the bitmask.

> And actually I'm wondering: why don't we just follow what KVM does, i.e. implement a separate pkvm_get_cpl_no_cache() function which just makes the hypercall to pKVM, without messing with the host-side cache?

Yes, we can do that. Actually, there is an optimization for the get_segment PV interface: to reduce the number of get_segment PV calls, we sync some segment registers to the host for an npVM and set the segment_cache.bitmask (https://github.com/intel-staging/pKVM-IA/blame/pvvmcs-v6.18-wip/arch/x86/kvm/vmx/vmx.c#L9843). This way, the no_cache version also doesn't have to use the get_segment PV interface.

	    !pkvm_segment_cache_test(vmx, seg, SEG_FIELD_AR)) {
		union pkvm_hc_data out;

		if (WARN_ON(pkvm_hypercall_out(get_segment, &out, seg)))
Contributor

nit question: any particular reason to do this WARN_ON check for get_segment and set_segment, while not doing it for most other hypercalls?

Looks like the get_segment and set_segment hypercalls may only fail due to pkvm_vcpu_handle_host_hypercall() failing with -EINVAL if no vCPU is loaded, or with -EPERM if this is a pVM and it is already running. The latter should be prevented by the above vcpu->arch.guest_state_protected check. So, do we want to WARN_ON specifically in the case where no vCPU is loaded? But then why just for get_segment and set_segment, and not other PV calls?

Contributor

After a bit more thought: maybe it makes sense to do this WARN_ON check specifically in this case, just to avoid saving a wrong value in the segment cache if the hypercall failed...

Contributor Author (@cxdong, Dec 30, 2025)

> So, do we want to WARN_ON specifically in the case where no vCPU is loaded? But then why just for get_segment and set_segment, and not other PV calls?

No, just because the .get_segment and .set_segment callbacks are of void type, so the caller cannot tell whether the operation succeeded via the return value. Thus, use a WARN_ON. Maybe a KVM_BUG_ON is better?

Contributor

Ok, makes sense.

> Maybe a KVM_BUG_ON is better?

Maybe...

Contributor

Although, some thoughts:

  1. e.g. set_idt/get_idt/set_gdt/get_gdt are also void, but we don't WARN_ON or KVM_BUG_ON there?
  2. Do these warnings always indicate a KVM bug, or could it also be the pVM's fault (e.g. the pVM trying to execute an instruction that requires KVM to read the guest's segment)?
  3. What is TDX's approach? Looks like it doesn't warn in such cases (except for cache_reg, for some reason)?

Contributor Author

> 1. e.g. set_idt/get_idt/set_gdt/get_gdt are also void, but we don't WARN_ON or KVM_BUG_ON there?

Indeed; probably we can have a common rule to handle the return value. Maybe just WARN_ON(ret); in the __pkvm_hypercall macro.
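One possible shape for such a common rule (illustrative; the real __pkvm_hypercall plumbing is not shown):

#define pkvm_hypercall_checked(op, ...)					\
({									\
	long __ret = __pkvm_hypercall(op, ##__VA_ARGS__);		\
									\
	/* Every void caller gets the failure check for free. */	\
	WARN_ONCE(__ret, "pKVM hypercall " #op " failed: %ld\n", __ret);\
	__ret;								\
})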

> 2. Do these warnings always indicate a KVM bug, or could it also be the pVM's fault (e.g. the pVM trying to execute an instruction that requires KVM to read the guest's segment)?

It seems like a KVM bug. If it were the pVM's fault, requiring the host KVM to emulate by reading the guest's segment, the pKVM hypervisor should identify that vmexit reason and handle it properly, e.g., by injecting a #GP or #UD into the pVM rather than returning to the host to emulate.

> 3. What is TDX's approach? Looks like it doesn't warn in such cases (except for cache_reg, for some reason)?

TDX's approach is for TDs only. Just a guess: a TD's vcpu->arch.guest_state_protected may be false if the TD has the debug attribute. In that case, these callbacks may be used when creating or running a TD, so it is not suitable to print a warning; instead, just ignore the write or return a zero value for the read. cache_reg is not a callback which would be used anyway, thus it is worth a warning.

		pkvm_hypercall_in(set_gdt, &data);
	} else {
		pkvm_hypercall_out(get_gdt, &data);
		*dt = data.get_gdt.desc;
Contributor

nit: it seems the code would look simpler if all these 4 cases were handled directly in the 4 functions below, instead of going through this common pkvm_access_idt_gdt() helper?

Contributor Author

It would result in several duplicated lines, e.g., the if (vcpu->arch.guest_state_protected) check. But anyway, using 4 functions rather than this common pkvm_access_idt_gdt() helper is also fine with me.

Contributor

IMHO that wouldn't be any significant duplication (e.g. those few lines would differ between the pkvm_get_*dt() and pkvm_set_*dt() functions, since pkvm_set_*dt() doesn't do the memset), while with this pkvm_access_idt_gdt() it seems at first glance that the logic is more sophisticated than it actually is. :)

Contributor Author

I see. I will use 4 functions rather than the common pkvm_access_idt_gdt(), e.g. along the lines of the sketch below.
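For example, two of the four resulting functions might look like this (a sketch built from the quoted snippet; the exact pkvm_hc_data field layout for set_gdt is illustrative):

static void pkvm_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
	union pkvm_hc_data data;

	memset(dt, 0, sizeof(*dt));
	if (vcpu->arch.guest_state_protected)
		return;

	pkvm_hypercall_out(get_gdt, &data);
	*dt = data.get_gdt.desc;
}

static void pkvm_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
{
	union pkvm_hc_data data;

	if (vcpu->arch.guest_state_protected)
		return;

	data.set_gdt.desc = *dt;
	pkvm_hypercall_in(set_gdt, &data);
}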

{
	if (!for_injection ||
	    (!kvm_event_needs_reinjection(vcpu) &&
	     !vcpu->arch.exception.pending))
Contributor

Any reason why we need to do these checks before calling kvm_x86_call(interrupt_allowed) / kvm_x86_call(nmi_allowed)? I'd assume such checks should have already been done on the host side?

Is that because of some security concerns? Or maybe the host doesn't have enough information to check this correctly, i.e. the host's values of exception.pending, exception.injected etc. are not accurate? (If so, maybe we should make them accurate?)

Contributor

Ok, after looking at next patches, I see that exceptions (unlike interrupts and NMIs) are managed by pKVM, not by the host, so I guess this makes sense.

Contributor Author

Yeah. Exceptions for pVMs are managed by pKVM. Exceptions for npVMs can also be injected by the host. So in pvVMCS part 3, the exception.pending and exception.injected of an npVM will be synced to the host as well (https://github.com/intel-staging/pKVM-IA/blob/pvvmcs-v6.18-wip/arch/x86/kvm/pkvm/pkvm.c#L1163).

	if (ret)
		return ret;
	if (data == cur_data)
		return 0;

Contributor

I wonder if this might cause us some issues, for multiple reasons:

  1. I can imagine that accessing some MSRs has side effects, and even reading or writing the same value can give a different effect.
  2. There are WO or RO MSRs, so e.g. reading a WO register will return unpredictable data and result in -EPERM, as the read data differs from the data we expect to write.

One example is MSR_IA32_FLUSH_CMD (which BTW I did not encounter being triggered by the host in this context; I'm just giving an example of such an MSR).

I didn't encounter such issues in basic tests, but perhaps on some different SoC we could get into hard-to-debug issues because of this? So perhaps it would be better to explicitly allow it for MSR_IA32_XSS, MSR_STAR, MSR_CSTAR, MSR_LSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP, which are all store/configuration MSRs and don't trigger events/hw commands, and warn otherwise? WDYT?

Contributor

Indeed, a very good point. Thanks for pointing it out.

Yeah, it looks like we should just allow setting those 5 MSRs set by crosvm (MSR_STAR, MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR). I assume it shouldn't be a security concern, since the guest is supposed to set up those MSRs on its own before it starts using the SYSCALL or SWAPGS instructions; in other words, any initial values of those MSRs are untrusted anyway, since they ultimately specify addresses in guest memory, while the entire guest memory contents are initially untrusted anyway.

Probably no need to allow MSR_IA32_XSS. KVM sets it to 0 in kvm_vcpu_reset(), but pKVM does the same anyway, and KVM's kvm_vcpu_reset() doesn't check the return value, so it should be ok if it just silently returns -EPERM and does nothing.
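A sketch of the explicit allow-list being discussed (a hypothetical helper, not patch code), covering the MSRs listed in this thread:

static bool pkvm_msr_passthrough_write_allowed(u32 msr)
{
	switch (msr) {
	case MSR_STAR:
	case MSR_CSTAR:
	case MSR_LSTAR:
	case MSR_SYSCALL_MASK:
	case MSR_KERNEL_GS_BASE:
	case MSR_IA32_SYSENTER_CS:
	case MSR_IA32_SYSENTER_ESP:
	case MSR_IA32_SYSENTER_EIP:
		return true;	/* plain state, no read/write side effects */
	default:
		/* e.g. MSR_IA32_FLUSH_CMD is WO; reading it back is meaningless. */
		WARN_ONCE(1, "unexpected MSR 0x%x in pkvm_set_msr\n", msr);
		return false;
	}
}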

Contributor

Just for the record: all the MSRs that I mentioned (not only the 5 above) I have seen in pkvm_set_msr default handling, either on real hw or when running pKVM in QEMU with some VMs:

  • MSR_IA32_XSS (0x00000da0) [BTW agree with your reasoning that it could be skipped]
  • MSR_STAR (0xc0000081)
  • MSR_LSTAR (0xc0000082)
  • MSR_CSTAR (0xc0000083)
  • MSR_SYSCALL_MASK (0xc0000084)
  • MSR_KERNEL_GS_BASE (0xc0000102)
  • MSR_IA32_SYSENTER_CS (0x174)
  • MSR_IA32_SYSENTER_ESP (0x175)
  • MSR_IA32_SYSENTER_EIP (0x176)

Contributor

Yeah, you are right, crosvm sets those as well:

/// Configure Model specific registers for long (64-bit) mode.
pub fn set_long_mode_msrs(msrs: &mut BTreeMap<u32, u64>) {
    msrs.insert(crate::msr_index::MSR_IA32_SYSENTER_CS, 0x0);
    msrs.insert(crate::msr_index::MSR_IA32_SYSENTER_ESP, 0x0);
    msrs.insert(crate::msr_index::MSR_IA32_SYSENTER_EIP, 0x0);

	kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
#else
	WARN_ONCE(1, "pkvm doesn't support injecting exception to a real mode guest\n");
#endif
Contributor

nit: the commit message says: "As the pKVM hypervisor doesn't support emulating real mode guests..." - maybe it actually supports real mode guests in general, and just doesn't support injecting exceptions into them (as that would require instruction emulation in the pKVM hypervisor)?

(Although I haven't tried testing real mode guests with pKVM myself, so I'm not sure.)

Contributor Author

Yes. pKVM relies on the unrestricted guest feature, so vmx->rmode.vm86_active will never be 1 and the code will never reach this point. Maybe an alternative is just to wrap this if condition in #ifndef __PKVM_HYP__ and update the commit message accordingly.
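For illustration, the alternative being suggested, sketched around the quoted snippet (the surrounding function is elided):

#ifndef __PKVM_HYP__
	/*
	 * Never reached in the pKVM hypervisor build: with unrestricted
	 * guest, vmx->rmode.vm86_active is never 1.
	 */
	if (vmx->rmode.vm86_active) {
		kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
		return;
	}
#endif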
