Add GPU memory pressure notification and multi-level watermark eviction by zyt600 · Pull Request #3 · ovg-project/gvm-nvidia-driver-modules

zyt600 · 2026-04-08T11:30:38Z

Summary

Add a kernel-to-userspace GPU memory pressure notification mechanism with multi-level watermark-based eviction control, enabling cooperative memory management between the kernel driver and user-space applications.

Key changes

Multi-level memory limits: Replace the single memory.limit with three cgroup-style controls — memory.limit.high, memory.limit.low, and memory.limit.min
Eviction notice: When GPU memory pressure exceeds the high watermark, the kernel computes a proportional per-process reclaim target and sends an eviction notice to user-space via a new UVM_WAIT_NOTICE ioctl. A force-shrink delayed work fallback ensures reclaim happens even if user-space does not respond in time.
Availability notice: When available GPU memory rises back above the low watermark, the kernel broadcasts an availability notice (the available memory, split equally among listeners) so that waiting processes can opportunistically allocate.
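Bug fixes during development: Fixed a swap charge leak on block_kill, a 64-bit overflow in the proportional reclaim calculation, and double counting of free pages in the available-memory calculation.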

Define UVM_WAIT_EVICTION_NOTICE and its userspace parameter structure to expose eviction target memory through UVM ioctl.


Implement gvm_send_eviction_notice and gvm_wait_eviction_notice to deliver per-GPU eviction signals via spinlock-protected mailbox. Add uuid field to eviction_notice struct and change the ioctl param from IN-reserved to OUT so userspace receives the GPU identity.
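The mailbox idea can be illustrated in userspace. This is a minimal sketch, not the driver code: `struct eviction_mailbox`, `mailbox_send`, and `mailbox_take` are hypothetical names, a C11 `atomic_flag` stands in for the kernel spinlock, and the blocking wait that the real ioctl performs is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical one-slot mailbox; field names are assumptions. */
struct eviction_mailbox {
    atomic_flag lock;        /* stands in for the kernel spinlock */
    bool pending;            /* true while a notice is unread */
    uint64_t target_memory;  /* bytes the process should shrink to */
    uint8_t uuid[16];        /* GPU the notice refers to */
};

static void mailbox_init(struct eviction_mailbox *mb) {
    atomic_flag_clear(&mb->lock);
    mb->pending = false;
    mb->target_memory = 0;
    memset(mb->uuid, 0, sizeof(mb->uuid));
}

static void mailbox_lock(struct eviction_mailbox *mb) {
    while (atomic_flag_test_and_set_explicit(&mb->lock, memory_order_acquire))
        ; /* spin until the holder releases the flag */
}

static void mailbox_unlock(struct eviction_mailbox *mb) {
    atomic_flag_clear_explicit(&mb->lock, memory_order_release);
}

/* Sender side: overwrite the slot with the latest notice. */
static void mailbox_send(struct eviction_mailbox *mb,
                         const uint8_t uuid[16], uint64_t target) {
    assert(uuid != NULL);
    mailbox_lock(mb);
    memcpy(mb->uuid, uuid, sizeof(mb->uuid));
    mb->target_memory = target;
    mb->pending = true;
    mailbox_unlock(mb);
}

/* Receiver side: copy the notice out and clear the slot.
 * Returns false when nothing is pending. */
static bool mailbox_take(struct eviction_mailbox *mb,
                         uint8_t uuid_out[16], uint64_t *target_out) {
    bool had_notice;
    mailbox_lock(mb);
    had_notice = mb->pending;
    if (had_notice) {
        memcpy(uuid_out, mb->uuid, sizeof(mb->uuid));
        *target_out = mb->target_memory;
        mb->pending = false;
    }
    mailbox_unlock(mb);
    return had_notice;
}
```

Returning the UUID alongside the target is what lets a process with several GPUs tell which device the notice is about.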

When GPU memory usage exceeds the high watermark during chunk allocation, notify all processes proportionally to shrink their memory down to the low watermark target.


bytes_to_reclaim * process_current overflows NvU64 when both values are in the ~40GB range, producing a truncated quotient that makes process_target nearly equal to process_current instead of the intended low_watermark (85%) target. Use mul_u64_u64_div_u64() which computes the full 128-bit intermediate product before dividing.
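The overflow is easy to reproduce in userspace. The sketch below assumes the divisor is the total usage across processes (the commit message does not name it) and uses `unsigned __int128`, a GCC/Clang extension, as a stand-in for the kernel's `mul_u64_u64_div_u64()`. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Naive form: the 64-bit product wraps before the division happens. */
static uint64_t share_naive(uint64_t bytes_to_reclaim,
                            uint64_t process_current,
                            uint64_t total_current) {
    assert(total_current != 0);
    return bytes_to_reclaim * process_current / total_current;
}

/* Wide form: keep the full 128-bit product, then divide.
 * unsigned __int128 is a compiler extension, used here only to
 * imitate what mul_u64_u64_div_u64() does in the kernel. */
static uint64_t share_wide(uint64_t bytes_to_reclaim,
                           uint64_t process_current,
                           uint64_t total_current) {
    assert(total_current != 0);
    unsigned __int128 product =
        (unsigned __int128)bytes_to_reclaim * process_current;
    return (uint64_t)(product / total_current);
}
```

With 40 GiB for both factors the product is 1600 * 2^60, which is exactly 100 * 2^64, so the naive version wraps to 0 and the process is asked to reclaim nothing. The wide version returns the intended 20 GiB when the total is 80 GiB.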

numFreePages64k and numFreePages2m describe the same physical memory at different granularities (one 2MB page = 32 x 64KB frames). Adding both caused available_bytes to be ~2x the actual free memory, making the high watermark trigger much later than configured. Use only numFreePages64k for the available memory calculation.
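A small arithmetic sketch of the double counting, assuming both counters cover the same fully free region. The function names are hypothetical; only the two page sizes come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_64K (64ULL * 1024)
#define PAGE_2M   (2ULL * 1024 * 1024)

/* One 2 MB page is made of 32 frames of 64 KB. */
static_assert(PAGE_2M / FRAME_64K == 32, "2MB page must hold 32 x 64KB frames");

/* Buggy form: counts the same physical memory at both granularities. */
static uint64_t available_bytes_double_counted(uint64_t free_64k,
                                               uint64_t free_2m) {
    return free_64k * FRAME_64K + free_2m * PAGE_2M;
}

/* Fixed form: use the 64 KB counter only. */
static uint64_t available_bytes(uint64_t free_64k) {
    return free_64k * FRAME_64K;
}
```

For 1 GiB of completely free memory (16384 frames of 64 KB, which is also 512 pages of 2 MB) the buggy sum reports 2 GiB.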

Reject setting high_watermark below low_watermark. Previously only the low side checked against high, so writing a small high value could create an invalid high < low state. Also fix the range check to use == 0 instead of <= 0 for unsigned int.

When GPU memory exceeds the high watermark, processes are notified to shrink voluntarily. After a configurable grace period, a delayed work item force-shrinks any process that has not complied. Notification frequency is throttled at the global level to avoid redundant work. Both grace_period_ms and notify_throttle_ms are exposed via debugfs with cross-validation (grace < throttle).

Add a new UVM_WAIT_AVAILABILITY_NOTICE ioctl that lets userspace block until GPU memory crosses back above the low watermark after eviction. Unify shrink and availability notification throttling into a single shared atomic_long_t timestamp with cmpxchg for concurrency safety.

- uvm_va_space.h: add availability mailbox struct
- uvm_va_space.c: initialize availability state in va_space_create
- uvm_ioctl.h: define UVM_WAIT_AVAILABILITY_NOTICE ioctl and params
- uvm.c: route the new ioctl to gvm_wait_availability_notice
- gvm_debugfs.h: declare send/notify/wait availability functions
- gvm_debugfs.c: implement availability send/notify/wait, replace per-function static throttle timestamps with a shared atomic_long_t using cmpxchg, remove redundant gvm_eviction_in_progress flag
- uvm_pmm_gpu.c: on chunk free, check if available memory exceeds the low watermark and fire the availability notification
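The shared-timestamp throttle can be sketched with C11 atomics. This is an assumption about the shape of the logic, not the driver code: `throttle_try_acquire` is a hypothetical name, `atomic_compare_exchange_strong` plays the role of the kernel's cmpxchg, and plain `long` timestamps replace jiffies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true for exactly one caller per throttle window.
 * last_notify is shared by the shrink and availability paths. */
static bool throttle_try_acquire(atomic_long *last_notify,
                                 long now, long min_interval) {
    assert(min_interval >= 0);
    long last = atomic_load(last_notify);
    if (now - last < min_interval)
        return false; /* still inside the current window */
    /* If another caller updated the timestamp first, the compare
     * fails and this caller backs off instead of notifying again. */
    return atomic_compare_exchange_strong(last_notify, &last, now);
}
```

Because the check and the update are tied together by the compare-and-swap, two CPUs that observe the same stale timestamp cannot both send a notification.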

current_memory only counted physical GPU memory (memory_current), causing userspace to underestimate pressure and continue allocating. Add memory_swap_current so the reported usage reflects the true footprint.

Mirrors the existing memory.limit.high interface to provide a lower memory protection threshold. Also renames the backing field from memory_limit to memory_limit_high for naming consistency.

Add per-process per-GPU memory.limit.min interface mirroring memory.limit.low. Validate on write that high >= low >= min, returning -EINVAL on violation.
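The ordering check can be sketched as a single validate-then-commit setter. The struct, enum, and function names below are hypothetical; only the invariant high >= low >= min and the -EINVAL return come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct gpu_mem_limits {
    uint64_t high;
    uint64_t low;
    uint64_t min;
};

enum limit_kind { LIMIT_HIGH, LIMIT_LOW, LIMIT_MIN };

/* Apply the write to a copy, check the invariant, then commit.
 * A rejected write leaves the stored limits unchanged. */
static int set_limit(struct gpu_mem_limits *limits,
                     enum limit_kind kind, uint64_t value) {
    assert(limits != NULL);
    struct gpu_mem_limits next = *limits;

    switch (kind) {
    case LIMIT_HIGH: next.high = value; break;
    case LIMIT_LOW:  next.low = value;  break;
    case LIMIT_MIN:  next.min = value;  break;
    default:         return -EINVAL;
    }
    if (!(next.high >= next.low && next.low >= next.min))
        return -EINVAL;

    *limits = next;
    return 0;
}
```

Checking the whole triple on every write avoids the one-sided validation gap that an earlier commit in this series had to fix for the watermarks.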

Two-phase reclaim in gvm_notify_all_processes_to_shrink: first reclaim above low proportionally, then dip into low-to-min range only if needed. Clamp force_shrink target to memory_limit_min. In pick_and_evict_root_chunk, skip chunks belonging to processes already at min (with retry up to root_chunks.count), and rename memory_limit to memory_limit_high. Clarify target_memory/current_memory semantics across the eviction notice chain.
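One way to read the two-phase policy is sketched below. It is an interpretation under stated assumptions, not the code in gvm_notify_all_processes_to_shrink: the struct and function names are hypothetical, each process is assumed to satisfy low >= min, and proportional shares round down, so a few bytes can be left unassigned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct proc_mem {
    uint64_t current; /* bytes in use */
    uint64_t low;     /* memory.limit.low */
    uint64_t min;     /* memory.limit.min */
    uint64_t reclaim; /* output: bytes this process is asked to free */
};

static uint64_t mul_div(uint64_t a, uint64_t b, uint64_t d) {
    assert(d != 0);
    /* unsigned __int128 is a GCC/Clang extension */
    return (uint64_t)((unsigned __int128)a * b / d);
}

static uint64_t above(uint64_t value, uint64_t floor) {
    return value > floor ? value - floor : 0;
}

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Returns the part of `need` that cannot be met without going below min. */
static uint64_t plan_two_phase_reclaim(struct proc_mem *p, size_t n,
                                       uint64_t need) {
    uint64_t total_above_low = 0, total_low_to_min = 0;
    for (size_t i = 0; i < n; i++) {
        p[i].reclaim = 0;
        total_above_low += above(p[i].current, p[i].low);
        total_low_to_min += above(min_u64(p[i].current, p[i].low), p[i].min);
    }

    /* Phase 1: take only from usage above low, proportionally. */
    uint64_t take1 = min_u64(need, total_above_low);
    for (size_t i = 0; i < n && take1 > 0; i++)
        p[i].reclaim += mul_div(above(p[i].current, p[i].low),
                                take1, total_above_low);

    /* Phase 2: only if still short, dip into the low-to-min range. */
    uint64_t take2 = min_u64(need - take1, total_low_to_min);
    for (size_t i = 0; i < n && take2 > 0; i++)
        p[i].reclaim += mul_div(above(min_u64(p[i].current, p[i].low), p[i].min),
                                take2, total_low_to_min);

    return need - take1 - take2;
}
```

For two processes using 100 and 50 units with (low, min) of (60, 20) and (40, 30), a need of 30 is met in phase 1 alone (24 and 6), while a need of 70 also takes 16 and 4 from the low-to-min range. No process is ever pushed below its min.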

Move current_physical_mem == 0 check before limit/reclaim calculations to avoid unnecessary work. Add reclaim_physical_mem == 0 check to skip processes that have nothing to reclaim.

When a block is killed (cudaFree / VA space teardown), pages that had been evicted from GPU to CPU were charged to memory.swap.current but never uncharged, so the swap counter leaked indefinitely. Uncharge swap for all evicted pages before block_destroy_gpu_state clears the evicted mask.
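The ordering requirement can be shown with a toy model. Everything here is hypothetical (a 64-page block, a 4096-byte page, the function names); the point is only that the uncharge has to read the evicted mask before the mask is cleared.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096ULL /* assumed page size for this sketch */

struct block {
    uint64_t evicted_mask; /* bit i set: page i currently lives in CPU memory */
};

static unsigned count_bits(uint64_t mask) {
    unsigned n = 0;
    while (mask) {
        mask &= mask - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Evicting a page charges the swap counter once. */
static void block_evict_page(struct block *b, unsigned page,
                             uint64_t *swap_current) {
    assert(page < 64);
    if (b->evicted_mask & (1ULL << page))
        return; /* already evicted, already charged */
    b->evicted_mask |= 1ULL << page;
    *swap_current += PAGE_BYTES;
}

/* Killing the block must uncharge before the mask is cleared;
 * afterwards the number of evicted pages is no longer known. */
static void block_kill(struct block *b, uint64_t *swap_current) {
    *swap_current -= count_bits(b->evicted_mask) * PAGE_BYTES;
    b->evicted_mask = 0;
}
```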

Switch to a single wait ioctl and shared wait queue so user space can block on one notification channel while preserving typed payloads.
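A single channel with typed payloads usually means a tagged union on the receiving side. The layout below is a guess for illustration only: the enum values, struct fields, and `notice_to_delta` are hypothetical and do not describe the actual ioctl parameter structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum notice_type { NOTICE_EVICTION = 1, NOTICE_AVAILABILITY = 2 };

/* Hypothetical payload returned by the unified wait call. */
struct notice {
    enum notice_type type;
    union {
        struct { uint64_t current_memory, target_memory; } eviction;
        struct { uint64_t available_bytes; } availability;
    } u;
};

/* Turn a notice into a signed byte budget:
 * negative = bytes to free, positive = bytes that may be allocated. */
static int64_t notice_to_delta(const struct notice *n) {
    assert(n != NULL);
    switch (n->type) {
    case NOTICE_EVICTION:
        if (n->u.eviction.current_memory <= n->u.eviction.target_memory)
            return 0; /* already at or below the target */
        return -(int64_t)(n->u.eviction.current_memory -
                          n->u.eviction.target_memory);
    case NOTICE_AVAILABILITY:
        return (int64_t)n->u.availability.available_bytes;
    }
    return 0; /* unknown type: do nothing */
}
```

A userspace loop would block in the wait ioctl, then dispatch on the type field in this way, so one thread can serve both kinds of notice.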

Use an atomic counter (notice_listener_count) incremented/decremented around wait_event_interruptible in gvm_wait_notice to track the number of programs actively waiting for notices. In broadcast_availability, divide available_bytes by this count so each listener is notified with its fair share instead of the full free memory amount.
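The fair-share division can be sketched with a C11 atomic counter. `notice_listener_count` is the name from the commit message; `listener_enter`, `listener_exit`, and `availability_share` are hypothetical helpers, and the real code wraps the counter updates around wait_event_interruptible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_int notice_listener_count; /* zero-initialized */

/* Called just before a listener starts waiting. */
static void listener_enter(void) {
    atomic_fetch_add(&notice_listener_count, 1);
}

/* Called right after the wait returns, whatever the reason. */
static void listener_exit(void) {
    int previous = atomic_fetch_sub(&notice_listener_count, 1);
    assert(previous > 0);
}

/* Each waiting listener is told about an equal slice of free memory. */
static uint64_t availability_share(uint64_t available_bytes) {
    int listeners = atomic_load(&notice_listener_count);
    if (listeners <= 0)
        return 0; /* nobody is waiting, nothing to announce */
    return available_bytes / (uint64_t)listeners;
}
```

Without the division, every listener would be told the full free amount and they could collectively allocate several times what is actually available.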

When GPU memory utilization exceeds the high watermark, evict to the average of high and low watermarks instead of all the way down to the low watermark, reducing unnecessary over-eviction.
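As a worked example of the mid-watermark target, assuming watermarks are percentages of total GPU memory. The 85% low watermark appears in an earlier commit message; the 95% high value used below is only an illustrative assumption, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Target usage after eviction: the average of the two watermarks.
 * Dividing by 200 once avoids rounding the averaged percentage. */
static uint64_t mid_watermark_target(uint64_t total_bytes,
                                     unsigned high_pct, unsigned low_pct) {
    assert(low_pct <= high_pct && high_pct <= 100);
    /* unsigned __int128 is a GCC/Clang extension */
    return (uint64_t)((unsigned __int128)total_bytes *
                      (high_pct + low_pct) / 200);
}
```

On an 80 GiB device with watermarks of 95% and 85%, the target is 72 GiB rather than the 68 GiB that evicting to the low watermark would require.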

The _low and _min variants already have explicit suffixes; align the _high variant to the same naming convention.

zyt600 added 25 commits February 28, 2026 03:30

change memory.limit, add memory.limit.high

5c845c4

low and high watermark

92c5dac

add wait-eviction ioctl params

2395bed


add wait-eviction-notice ioctl handler in uvm.c

436ee6c

add eviction_notice mailbox to uvm_va_space_struct

6ec456a

init eviction_notice in va_space_create, rename wq to wait_queue for consistency

d0337fa

add proportional GPU memory reclaim on high watermark breach

b394751


rename memory_limit symbols to memory_limit_high for consistency with debugfs filename

e6ed753

validate watermark high >= low on write

78fb61f


add current_memory to eviction notice passed to userspace

fd52707

include swap memory in eviction current_memory reported to userspace

cdd53d6


add per-process per-GPU memory.limit.low debugfs control

279b69c


add memory.limit.min debugfs control and enforce high >= low >= min

4684c27


fix eviction notify: early-exit on zero physical mem and zero reclaim

11140c2


unify gvm wait notice path for eviction and availability

0f2f751


change default eviction target from low watermark to mid watermark

5732218


rename get_gpu_memcg_limit to get_gpu_memcg_limit_high for consistency

89d1d82


zyt600 wants to merge 25 commits into ovg-project:main from zyt600:main
