Commit 9f16d5e

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:

"The biggest change here is eliminating the awful idea that KVM had of
essentially guessing which pfns are refcounted pages.

The reason for the guessing was that KVM needs to map both non-refcounted
pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXEDMAP VMAs
that contain refcounted pages. However, the result was security issues in
the past, and more recently the inability to map VM_IO and VM_PFNMAP
memory that _is_ backed by struct page but is not refcounted. In
particular this broke virtio-gpu blob resources (which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device) with
the amdgpu driver, because amdgpu allocates non-compound higher order
pages and the tail pages could not be mapped into KVM.

This requires adjusting all uses of struct page in the per-architecture
code, to always work on the pfn whenever possible. The large series that
did this, from David Stevens and Sean Christopherson, also substantially
cleaned up the set of functions that provided arch code with the pfn for
a host virtual address. The previous maze of twisty little passages, all
different, is replaced by five functions (__gfn_to_page,
__kvm_faultin_pfn, the non-__ versions of these two, and
kvm_prefetch_pages), saving almost 200 lines of code.

ARM:

- Support for stage-1 permission indirection (FEAT_S1PIE) and permission
  overlays (FEAT_S1POE), including nested virt + the emulated page table
  walker

- Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
  was introduced in PSCIv1.3 as a mechanism to request hibernation,
  similar to the S4 state in ACPI

- Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
  part of it, introduce trivial initialization of the host's MPAM context
  so KVM can use the corresponding traps

- PMU support under nested virtualization, honoring the guest
  hypervisor's trap configuration and event filtering when running a
  nested guest

- Fixes to vgic ITS serialization where stale device/interrupt table
  entries are not zeroed when the mapping is invalidated by the VM

- Avoid emulated MMIO completion if userspace has requested synchronous
  external abort injection

- Various fixes and cleanups affecting pKVM, vCPU initialization, and
  selftests

LoongArch:

- Add iocsr and mmio bus simulation in kernel.

- Add in-kernel interrupt controller emulation.

- Add support for virtualization extensions to the eiointc irqchip.

PPC:

- Drop lingering and utterly obsolete references to PPC970 KVM, which was
  removed 10 years ago.

- Fix incorrect documentation references to non-existing ioctls

RISC-V:

- Accelerate KVM RISC-V when running as a guest

- Perf support to collect KVM guest statistics from host side

s390:

- New selftests: more ucontrol selftests and CPU model sanity checks

- Support for the gen17 CPU model

- List registers supported by KVM_GET/SET_ONE_REG in the documentation

x86:

- Clean up KVM's handling of Accessed and Dirty bits to dedup code,
  improve documentation, harden against unexpected changes.

  Even if the hardware A/D tracking is disabled, it is possible to use
  the hardware-defined A/D bits to track if a PFN is Accessed and/or
  Dirty, and that removes a lot of special cases.

- Elide TLB flushes when aging secondary PTEs, as has been done in x86's
  primary MMU for over 10 years.

- Recover huge pages in-place in the TDP MMU when dirty page logging is
  toggled off, instead of zapping them and waiting until the page is
  re-accessed to create a huge mapping. This reduces vCPU jitter.

- Batch TLB flushes when dirty page logging is toggled off. This reduces
  the time it takes to disable dirty logging by ~3x.

- Remove the shrinker that was (poorly) attempting to reclaim shadow page
  tables in low-memory situations.

- Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.

- Advertise CPUIDs for new instructions in Clearwater Forest

- Quirk KVM's misguided behavior of initializing certain feature MSRs to
  their maximum supported feature set, which can result in KVM creating
  invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero
  value results in the vCPU having invalid state if userspace hides PDCM
  from the guest, which in turn can lead to save/restore failures.

- Fix KVM's handling of non-canonical checks for vCPUs that support LA57
  to better follow the "architecture", in quotes because the actual
  behavior is poorly documented. E.g. most MSR writes and descriptor
  table loads ignore CR4.LA57 and operate purely on whether the CPU
  supports LA57.

- Bypass the register cache when querying CPL from kvm_sched_out(), as
  filling the cache from IRQ context is generally unsafe; harden the
  cache accessors to try to prevent similar issues from occurring in the
  future. The issue that triggered this change was already fixed in 6.12,
  but was still kinda latent.

- Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
  over-advertises SPEC_CTRL when trying to support cross-vendor VMs.

- Minor cleanups

- Switch hugepage recovery thread to use vhost_task.

  These kthreads can consume significant amounts of CPU time on behalf of
  a VM or in response to how the VM behaves (for example how it accesses
  its memory); therefore KVM tried to place the thread in the VM's
  cgroups and charge the CPU time consumed by that work to the VM's
  container.

  However the kthreads did not process SIGSTOP/SIGCONT, and therefore
  cgroups which had KVM instances inside could not complete freezing.

  Fix this by replacing the kthread with a PF_USER_WORKER thread, via the
  vhost_task abstraction. Another 100+ lines removed, with generally
  better behavior too, like having these threads properly parented in the
  process tree.

- Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
  didn't really work; there was really nothing to work around anyway: the
  broken patch was meant to fix nested virtualization, but the
  PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
  erratum.

- Fix 6.12 regression where CONFIG_KVM will be built as a module even if
  asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'.

x86 selftests:

- x86 selftests can now use AVX.

Documentation:

- Use rST internal links

- Reorganize the introduction to the API document

Generic:

- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
  of RCU, so that running a vCPU on a different task doesn't encounter
  long delays due to having to wait for all CPUs to become quiescent.

  In general both reads and writes are rare, but userspace that supports
  confidential computing is introducing the use of "helper" vCPUs that
  may jump from one host processor to another. Those will be very happy
  to trigger a synchronize_rcu(), and the effect on performance is quite
  the disaster"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
  KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
  KVM: x86: add back X86_LOCAL_APIC dependency
  Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
  KVM: x86: switch hugepage recovery thread to vhost_task
  KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
  x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
  Documentation: KVM: fix malformed table
  irqchip/loongson-eiointc: Add virt extension support
  LoongArch: KVM: Add irqfd support
  LoongArch: KVM: Add PCHPIC user mode read and write functions
  LoongArch: KVM: Add PCHPIC read and write functions
  LoongArch: KVM: Add PCHPIC device support
  LoongArch: KVM: Add EIOINTC user mode read and write functions
  LoongArch: KVM: Add EIOINTC read and write functions
  LoongArch: KVM: Add EIOINTC device support
  LoongArch: KVM: Add IPI user mode read and write function
  LoongArch: KVM: Add IPI read and write function
  LoongArch: KVM: Add IPI device support
  LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
  KVM: arm64: Pass on SVE mapping failures
  ...
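
The LA57 item in the x86 notes turns on what "canonical" means at a given virtual address width. The sketch below is only an illustration in Python, not KVM code: an address is treated as canonical when bits 63 down to (width - 1) are all equal, with 48 bits for 4-level paging and 57 bits for 5-level paging. The helper `vaddr_bits_for_msr_write` is a hypothetical name, introduced here to mirror the statement that most MSR writes depend on whether the CPU supports LA57 rather than on CR4.LA57.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def is_canonical(addr: int, vaddr_bits: int) -> bool:
    """Return True if bits 63..(vaddr_bits - 1) of addr are all equal."""
    upper = (addr & MASK64) >> (vaddr_bits - 1)
    all_ones = (1 << (64 - (vaddr_bits - 1))) - 1
    return upper == 0 or upper == all_ones


def vaddr_bits_for_msr_write(cpu_supports_la57: bool) -> int:
    """Hypothetical helper: pick the width from CPU support, not CR4.LA57."""
    return 57 if cpu_supports_la57 else 48


# Bit 47 is set and bits 63..48 are clear: non-canonical at 48 bits,
# canonical at 57 bits.
addr = 0x0000_8000_0000_0000
print(is_canonical(addr, vaddr_bits_for_msr_write(False)))  # False
print(is_canonical(addr, vaddr_bits_for_msr_write(True)))   # True
```

The same address can therefore pass or fail the check depending only on which width is applied, which is why the choice between CR4.LA57 and CPU support matters.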
2 parents 42d9e8b + 9ee62c3 commit 9f16d5e

183 files changed, +9054 -2525 lines changed

Documentation/arch/arm64/cpu-feature-registers.rst

+2
@@ -152,6 +152,8 @@ infrastructure:
      +------------------------------+---------+---------+
      | DIT                          | [51-48] |    y    |
      +------------------------------+---------+---------+
+     | MPAM                         | [43-40] |    n    |
+     +------------------------------+---------+---------+
      | SVE                          | [35-32] |    y    |
      +------------------------------+---------+---------+
      | GIC                          | [27-24] |    n    |
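
Each row in the table above gives a field name, an inclusive bit range, and a y/n flag (read here as whether the field is visible to userspace). As a rough illustration of how such a bit range maps onto a 64-bit register value, here is a small Python sketch. The function name `extract_field` and the sample value are made up for the example and are not taken from the kernel.

```python
def extract_field(reg_value: int, hi: int, lo: int) -> int:
    """Return bits [hi:lo] (inclusive) of a 64-bit register value."""
    width = hi - lo + 1
    return (reg_value >> lo) & ((1 << width) - 1)


# Bit ranges copied from the table rows shown in the hunk.
FIELDS = {"DIT": (51, 48), "MPAM": (43, 40), "SVE": (35, 32), "GIC": (27, 24)}

# Made-up sample value: DIT = 1, MPAM = 1, SVE = 1, GIC = 0.
sample = (0x1 << 48) | (0x1 << 40) | (0x1 << 32)

decoded = {name: extract_field(sample, hi, lo) for name, (hi, lo) in FIELDS.items()}
print(decoded)
```

A field marked `n`, such as the MPAM row added by this hunk, would be expected to read as zero from userspace no matter what the hardware reports.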

Documentation/arch/loongarch/irq-chip-model.rst

+64
@@ -85,6 +85,70 @@ to CPUINTC directly::
      | Devices |
      +---------+
 
+Virtual Extended IRQ model
+==========================
+
+In this model, IPI (Inter-Processor Interrupt) and CPU Local Timer interrupts
+go to CPUINTC directly, CPU UART interrupts go to PCH-PIC, while all other
+device interrupts go to PCH-PIC/PCH-MSI and are gathered by V-EIOINTC (Virtual
+Extended I/O Interrupt Controller), and then go to CPUINTC directly::
+
+     +-----+    +-------------------+     +-------+
+     | IPI |--> | CPUINTC(0-255vcpu)| <-- | Timer |
+     +-----+    +-------------------+     +-------+
+                          ^
+                          |
+                    +-----------+
+                    | V-EIOINTC |
+                    +-----------+
+                     ^         ^
+                     |         |
+               +---------+ +---------+
+               | PCH-PIC | | PCH-MSI |
+               +---------+ +---------+
+                 ^     ^           ^
+                 |     |           |
+          +--------+ +---------+ +---------+
+          | UARTs  | | Devices | | Devices |
+          +--------+ +---------+ +---------+
+
+
+Description
+-----------
+V-EIOINTC (Virtual Extended I/O Interrupt Controller) is an extension of
+EIOINTC; it only works in VM mode, running under the KVM hypervisor. With
+standard EIOINTC, interrupts can be routed to at most four vCPUs, whereas with
+V-EIOINTC interrupts can be routed to up to 256 virtual CPUs.
+
+With standard EIOINTC, the interrupt routing setting has two parts: eight
+bits for CPU selection and four bits for CPU IP (Interrupt Pin) selection.
+For CPU selection there are four bits for EIOINTC node selection and four bits
+for EIOINTC CPU selection. The bitmap method is used for CPU selection and
+CPU IP selection, so an interrupt can only be routed to CPU0 - CPU3 and
+IP0 - IP3 in one EIOINTC node.
+
+V-EIOINTC supports routing to more CPUs and CPU IPs (Interrupt Pins); it adds
+two new registers.
+
+EXTIOI_VIRT_FEATURES
+--------------------
+This register is a read-only register, which indicates the features supported
+by V-EIOINTC. Features EXTIOI_HAS_INT_ENCODE and EXTIOI_HAS_CPU_ENCODE are added.
+
+Feature EXTIOI_HAS_INT_ENCODE is part of standard EIOINTC. If it is 1, it
+indicates that CPU Interrupt Pin selection can use the normal method rather
+than the bitmap method, so interrupts can be routed to IP0 - IP15.
+
+Feature EXTIOI_HAS_CPU_ENCODE is an extension of V-EIOINTC. If it is 1, it
+indicates that CPU selection can use the normal method rather than the bitmap
+method, so interrupts can be routed to CPU0 - CPU255.
+
+EXTIOI_VIRT_CONFIG
+------------------
+This register is a read-write register. For compatibility, interrupt routing
+uses the default method, which is the same as standard EIOINTC. If the bit is
+set to 1, it tells the hardware to use the normal method rather than the
+bitmap method.
+
 Advanced Extended IRQ model
 ===========================
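
The difference between the bitmap method and the normal (encoded) method described in the hunk above comes down to how many targets a field of a given width can name. The Python sketch below illustrates that arithmetic only. It does not model the real register layout, the function names are made up for the example, and it assumes that in encoded mode the full eight-bit CPU field is used as an index, which matches the CPU0 - CPU255 range stated in the text.

```python
def bitmap_targets(field: int, width: int) -> list:
    """Bitmap method: bit i set means target i is selected."""
    return [i for i in range(width) if field & (1 << i)]


def encoded_target(field: int, width: int) -> int:
    """Normal (encoded) method: the field value itself is the target index."""
    return field & ((1 << width) - 1)


def max_target(width: int, encoded: bool) -> int:
    """Highest target index a field of the given width can name."""
    return (1 << width) - 1 if encoded else width - 1


print(bitmap_targets(0b0100, 4))     # [2]
print(encoded_target(0b1111, 4))     # 15
print(max_target(4, encoded=False))  # 3   (IP0 - IP3, CPU0 - CPU3)
print(max_target(4, encoded=True))   # 15  (IP0 - IP15)
print(max_target(8, encoded=True))   # 255 (CPU0 - CPU255)
```

A bitmap can select several targets at once but reaches only as many targets as it has bits, while an encoded field names exactly one target out of 2^width.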

Documentation/translations/zh_CN/arch/loongarch/irq-chip-model.rst

+55
@@ -87,6 +87,61 @@ PCH-LPC/PCH-MSI,然后被EIOINTC统一收集,再直接到达CPUINTC::
      | Devices |
      +---------+
 
+虚拟扩展IRQ模型
+===============
+
+在这种模型里面, IPI(Inter-Processor Interrupt) 和CPU本地时钟中断直接发送到CPUINTC,
+CPU串口 (UARTs) 中断发送到PCH-PIC, 而其他所有设备的中断则分别发送到所连接的PCH-PIC/
+PCH-MSI, 然后V-EIOINTC统一收集,再直接到达CPUINTC::
+
+     +-----+    +-------------------+     +-------+
+     | IPI |--> | CPUINTC(0-255vcpu)| <-- | Timer |
+     +-----+    +-------------------+     +-------+
+                          ^
+                          |
+                    +-----------+
+                    | V-EIOINTC |
+                    +-----------+
+                     ^         ^
+                     |         |
+               +---------+ +---------+
+               | PCH-PIC | | PCH-MSI |
+               +---------+ +---------+
+                 ^     ^           ^
+                 |     |           |
+          +--------+ +---------+ +---------+
+          | UARTs  | | Devices | | Devices |
+          +--------+ +---------+ +---------+
+
+V-EIOINTC 是EIOINTC的扩展, 仅工作在虚拟机模式下, 中断经EIOINTC最多可路由到
+4个虚拟CPU. 但中断经V-EIOINTC最多可路由到256个虚拟CPU.
+
+传统的EIOINTC中断控制器,中断路由分为两个部分:8比特用于控制路由到哪个CPU,
+4比特用于控制路由到特定CPU的哪个中断管脚。控制CPU路由的8比特前4比特用于控制
+路由到哪个EIOINTC节点,后4比特用于控制此节点哪个CPU。中断路由在选择CPU路由
+和CPU中断管脚路由时,使用bitmap编码方式而不是正常编码方式,所以对于一个
+EIOINTC中断控制器节点,中断只能路由到CPU0 - CPU3,中断管脚IP0-IP3。
+
+V-EIOINTC新增了两个寄存器,支持中断路由到更多个CPU和中断管脚。
+
+V-EIOINTC功能寄存器
+-------------------
+功能寄存器是只读寄存器,用于显示V-EIOINTC支持的特性,目前支持两个特性
+EXTIOI_HAS_INT_ENCODE 和 EXTIOI_HAS_CPU_ENCODE。
+
+特性EXTIOI_HAS_INT_ENCODE是传统EIOINTC中断控制器的一个特性,如果此比特为1,
+显示CPU中断管脚路由方式支持正常编码,而不是bitmap编码,所以中断可以路由到
+管脚IP0 - IP15。
+
+特性EXTIOI_HAS_CPU_ENCODE是V-EIOINTC新增特性,如果此比特为1,表示CPU路由
+方式支持正常编码,而不是bitmap编码,所以中断可以路由到CPU0 - CPU255。
+
+V-EIOINTC配置寄存器
+-------------------
+配置寄存器是可读写寄存器,为了兼容性考虑,如果不写此寄存器,中断路由采用
+和传统EIOINTC相同的路由设置。如果对应比特设置为1,表示采用正常路由方式而
+不是bitmap编码的路由方式。
+
 高级扩展IRQ模型
 ===============
