[POC] Add adaptive back-pressure monitor to avoid OOM by res-life · Pull Request #15150 · NVIDIA/cudf-spark

res-life · 2026-06-26T06:03:56Z

What changed

This adds an opt-in executor-level adaptive back-pressure monitor for GPU executors.

The new PressureMonitor samples JVM GC time fraction and GPU device-memory utilization, then maintains an AIMD multiplicative factor. When pressure crosses configured guards, the factor increases; when the executor is healthy, it decays back to 1.0.
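As a rough illustration of the factor update described above, here is a minimal sketch. It is not the PR's code: the class name, the constructor parameters, and the choice of multiplicative growth under pressure with additive decay when healthy are all assumptions made for the example.

```java
/**
 * Hypothetical AIMD-style back-pressure factor.
 * All names and constants are illustrative assumptions, not the PR's API.
 */
final class AimdFactorSketch {
    private final double increaseMultiplier; // applied while a guard is breached
    private final double decayStep;          // subtracted on each healthy sample
    private final double maxFactor;          // upper bound so throttling stays finite
    private double factor = 1.0;             // 1.0 means no back-pressure

    AimdFactorSketch(double increaseMultiplier, double decayStep, double maxFactor) {
        this.increaseMultiplier = increaseMultiplier;
        this.decayStep = decayStep;
        this.maxFactor = maxFactor;
    }

    /** Feeds one sample and returns the updated factor. */
    double update(double gcTimeFraction, double gpuMemUtilization,
                  double gcGuard, double gpuMemGuard) {
        boolean underPressure =
            gcTimeFraction > gcGuard || gpuMemUtilization > gpuMemGuard;
        if (underPressure) {
            // Grow the factor multiplicatively, capped at maxFactor.
            factor = Math.min(maxFactor, factor * increaseMultiplier);
        } else {
            // Decay additively back toward 1.0, never below it.
            factor = Math.max(1.0, factor - decayStep);
        }
        return factor;
    }

    double factor() {
        return factor;
    }
}
```

Because admitted concurrency is roughly proportional to 1/factor, this shape cuts concurrency quickly under pressure and restores it gradually once samples look healthy.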

The factor feeds two existing admission gates:

- GPU task admission via GpuSemaphore.permitsForEstimate, which inflates per-task GPU memory estimates so fewer tasks are admitted concurrently.
- Host-side in-flight limits via PressureMonitor.scaleHostLimit, used by the shuffle writer BytesInFlightLimiter and async IO HostMemoryThrottle.
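The arithmetic behind the two gates can be sketched as follows. The helper names, the byte-based budget model, and the one-byte floor are assumptions for illustration; the actual signatures of GpuSemaphore.permitsForEstimate and PressureMonitor.scaleHostLimit are not shown in this PR description.

```java
/**
 * Hypothetical stand-ins for the two admission gates.
 * Models the GPU gate as a byte budget, which is an assumption.
 */
final class AdmissionGateSketch {
    /** Inflates a per-task GPU memory estimate by the back-pressure factor. */
    static long inflatedEstimate(long estimateBytes, double factor) {
        double f = Math.max(1.0, factor); // a factor below 1.0 is treated as 1.0
        return Math.max(1L, (long) Math.ceil(estimateBytes * f));
    }

    /** Number of tasks whose inflated estimates fit in the budget. */
    static long admittedTasks(long budgetBytes, long estimateBytes, double factor) {
        return budgetBytes / inflatedEstimate(estimateBytes, factor);
    }

    /** Shrinks a host-side in-flight byte limit by the factor, floored at 1 byte. */
    static long scaledHostLimit(long limitBytes, double factor) {
        double f = Math.max(1.0, factor);
        return Math.max(1L, (long) (limitBytes / f));
    }
}
```

For example, with a 16 GiB budget and a 2 GiB per-task estimate, a factor of 1.0 admits 8 tasks and a factor of 2.0 admits 4; a 1024-byte host limit becomes 256 bytes at a factor of 4.0.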

The feature is disabled by default behind internal configs under spark.rapids.adaptive.backpressure.*.

Why

Large GPU executors can inherit CPU-era spark.executor.cores settings that oversubscribe host-side resources. Many concurrent task threads can accumulate host/pinned memory pressure, increase JVM GC, miss heartbeats, lose executors, and trigger recompute cascades.

This change provides a stability layer for over-concurrent executors without requiring per-job hand tuning of executor.cores or concurrentGpuTasks.

Notes

Draft until a public issue is available to link and validation is rerun in an environment with the current private snapshot dependencies.

Introduce PressureMonitor, an executor-level monitor that samples JVM GC time fraction and GPU device-memory (RMM) utilization on a background thread and exposes an AIMD back-pressure factor. GpuSemaphore multiplies its per-task GPU memory estimate by this factor, so GPU task concurrency is reduced when GC time or device memory crosses a guard threshold and recovers when healthy.

This removes the GC-spike -> executor-loss -> recompute cascade adaptively, without statically tuning concurrentGpuTasks / executor.cores. It is a wait-at-acquisition brake (back-pressured tasks block at the semaphore before holding scarce resources); no new exception type is introduced.

Gated behind spark.rapids.adaptive.backpressure.enabled (default off). Includes PressureMonitorSuite covering the AIMD/guard logic and config wiring.

Signed-off-by: Chong Gao <res_life@163.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
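One plausible way to compute the JVM GC time fraction mentioned in this commit message is to difference the cumulative collection time reported by the standard GarbageCollectorMXBean API against wall-clock time. The sketch below assumes that approach; the class name, the clamping to [0, 1], and the sampling structure are illustrative and may differ from what PressureMonitor actually does.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical GC time fraction sampler built on the standard JMX beans. */
final class GcTimeFractionSketch {
    private long lastGcMillis;
    private long lastWallNanos;

    GcTimeFractionSketch() {
        lastGcMillis = totalGcMillis();
        lastWallNanos = System.nanoTime();
    }

    /** Sums cumulative collection time across all collectors. */
    static long totalGcMillis() {
        long total = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** GC time over wall time for one interval, clamped to [0, 1]. */
    static double fraction(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0; // interval too short to measure
        }
        double raw = (double) gcMillisDelta / wallMillisDelta;
        return Math.min(1.0, Math.max(0.0, raw));
    }

    /** Returns the GC time fraction since the previous call, then resets the window. */
    double sample() {
        long gcNow = totalGcMillis();
        long wallNow = System.nanoTime();
        long wallDeltaMillis = (wallNow - lastWallNanos) / 1_000_000L;
        double f = fraction(gcNow - lastGcMillis, wallDeltaMillis);
        lastGcMillis = gcNow;
        lastWallNanos = wallNow;
        return f;
    }
}
```

A background thread would call sample() on a fixed period and feed the result into the factor update. The clamp matters because summed time across several collectors can exceed the wall-clock interval.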

When the back-pressure factor is elevated (GC time or device memory over guard), also scale down the host-side in-flight byte budgets via PressureMonitor.scaleHostLimit:
- BytesInFlightLimiter (multithreaded shuffle writer serialization), and
- HostMemoryThrottle (async write/read TrafficController).

The GPU semaphore lever (Phase 1) cannot see off-GPU host/pinned memory consumed by shuffle serialization and async IO, which is the actual driver of GC-overhead executor loss under an over-subscribed executor.cores. Scaling these existing gates closes that gap: host work blocks at its in-flight gate sooner. Both gates always admit at least one in-flight item, so a shrunk budget throttles without deadlocking. No-op when the feature is disabled.

Adds scaleLimit unit tests.

Signed-off-by: Chong Gao <res_life@163.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
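The "always admit at least one in-flight item" property can be illustrated with a small limiter. This is a sketch under assumptions: the class is hypothetical, it uses a non-blocking tryAcquire for brevity where the real gates block, and it is not the code of BytesInFlightLimiter or HostMemoryThrottle.

```java
/** Hypothetical in-flight byte limiter whose budget shrinks with the factor. */
final class InFlightLimiterSketch {
    private final long baseLimitBytes;
    private long inFlightBytes = 0L;

    InFlightLimiterSketch(long baseLimitBytes) {
        this.baseLimitBytes = baseLimitBytes;
    }

    /** Tries to admit a buffer under the limit scaled by the current factor. */
    synchronized boolean tryAcquire(long bytes, double factor) {
        long limit = Math.max(1L, (long) (baseLimitBytes / Math.max(1.0, factor)));
        // Admit unconditionally when nothing is in flight, even if this item
        // alone exceeds the scaled limit; otherwise a shrunk budget could
        // reject a large item forever.
        if (inFlightBytes == 0L || inFlightBytes + bytes <= limit) {
            inFlightBytes += bytes;
            return true;
        }
        return false;
    }

    /** Returns bytes to the budget once the work completes. */
    synchronized void release(long bytes) {
        inFlightBytes = Math.max(0L, inFlightBytes - bytes);
    }

    synchronized long inFlight() {
        return inFlightBytes;
    }
}
```

With a base limit of 100 bytes and a factor of 4.0, the scaled limit is 25 bytes, yet an 80-byte item is still admitted when the limiter is empty; further items are rejected until it is released.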

revans2

This is interesting. I want to see benchmark numbers both for the cases where it helps and for general workloads. I am concerned that there is no obvious direct connection between blocking more threads waiting on the GPU and reducing CPU-side Java GC overhead. I get that it indirectly helps, and may directly help if the GC is caused by lots of small batches, but if we have lots of small batches/buffers/etc. in our code, then shouldn't we be addressing that instead? I cannot tell if this is a band-aid on a problem, which I suspect it is, or if it is actually addressing the real problem.

Chong Gao and others added 2 commits June 26, 2026 06:02

res-life changed the title ~~Add adaptive back-pressure monitor~~ [POC] Add adaptive back-pressure monitor Jun 26, 2026

res-life requested a review from revans2 June 26, 2026 06:13

res-life changed the title ~~[POC] Add adaptive back-pressure monitor~~ [POC] Add adaptive back-pressure monitor to avoid OOM Jun 26, 2026

revans2 reviewed Jun 26, 2026

res-life wants to merge 2 commits into NVIDIA:main from res-life:adaptive-backpressure-poc
