[POC] Add adaptive back-pressure monitor to avoid OOM #15150

Draft
res-life wants to merge 2 commits into NVIDIA:main from res-life:adaptive-backpressure-poc

@res-life res-life commented Jun 26, 2026

Collaborator

What changed

This adds an opt-in executor-level adaptive back-pressure monitor for GPU executors.

The new PressureMonitor samples the JVM GC time fraction and GPU device-memory utilization, then maintains a multiplicative back-pressure factor using AIMD (additive increase, multiplicative decrease). When pressure crosses the configured guards, the factor increases; when the executor is healthy, it decays back to 1.0.
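The factor update described above could look roughly like the following. This is a minimal sketch, not the PR's PressureMonitor: the class name, the constructor parameters, and the choice of an additive step under pressure with a multiplicative decay toward 1.0 are all assumptions made for illustration.

```java
/** Hypothetical AIMD-style back-pressure factor; names and constants are assumptions. */
final class AimdFactorSketch {
    private final double gcGuard;    // assumed guard on GC time fraction, e.g. 0.2
    private final double memGuard;   // assumed guard on device-memory utilization, e.g. 0.9
    private final double step;       // assumed additive step applied while under pressure
    private final double decay;      // assumed multiplicative decay in (0, 1) while healthy
    private final double maxFactor;  // assumed upper bound on the factor
    private double factor = 1.0;     // 1.0 means "no back-pressure"

    AimdFactorSketch(double gcGuard, double memGuard,
                     double step, double decay, double maxFactor) {
        this.gcGuard = gcGuard;
        this.memGuard = memGuard;
        this.step = step;
        this.decay = decay;
        this.maxFactor = maxFactor;
    }

    /** Feeds one sample and returns the updated factor. */
    synchronized double update(double gcTimeFraction, double deviceMemUtilization) {
        boolean pressured = gcTimeFraction > gcGuard || deviceMemUtilization > memGuard;
        if (pressured) {
            // Additive increase, capped at maxFactor.
            factor = Math.min(maxFactor, factor + step);
        } else {
            // Multiplicative decay, never dropping below 1.0.
            factor = Math.max(1.0, factor * decay);
        }
        return factor;
    }

    synchronized double current() {
        return factor;
    }
}
```

With a step of 0.5 and a decay of 0.5, two pressured samples raise the factor from 1.0 to 2.0, and a single healthy sample brings it back to 1.0.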

The factor feeds two existing admission gates:

  • GPU task admission via GpuSemaphore.permitsForEstimate, which inflates per-task GPU memory estimates so fewer tasks are admitted concurrently.
  • Host-side in-flight limits via PressureMonitor.scaleHostLimit, used by the shuffle writer BytesInFlightLimiter and async IO HostMemoryThrottle.
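To make the two levers concrete, here is a small sketch of how a factor might be applied at each gate. The method names are borrowed from the description, but the signatures, the permit arithmetic, and the floor of one byte are assumptions; the actual GpuSemaphore and PressureMonitor code may differ.

```java
/** Hypothetical gate scaling helpers; signatures and arithmetic are assumptions. */
final class GateScalingSketch {
    private GateScalingSketch() {}

    /**
     * Converts a per-task GPU memory estimate into semaphore permits after
     * inflating it by the back-pressure factor. A larger factor means more
     * permits per task, so fewer tasks fit concurrently.
     */
    static int permitsForEstimate(long estimateBytes, long deviceBytes,
                                  int totalPermits, double factor) {
        double inflated = estimateBytes * Math.max(1.0, factor);
        long permits = (long) Math.ceil(inflated / deviceBytes * totalPermits);
        // Clamp so a task always needs at least 1 permit and never more than the total.
        return (int) Math.max(1L, Math.min((long) totalPermits, permits));
    }

    /** Shrinks a host-side in-flight byte budget by the factor; no-op at factor <= 1.0. */
    static long scaleHostLimit(long limitBytes, double factor) {
        if (factor <= 1.0) {
            return limitBytes;
        }
        return Math.max(1L, (long) (limitBytes / factor));
    }
}
```

Under these assumptions, a 1 GiB estimate on a 16 GiB device with 1000 permits costs 63 permits at factor 1.0 (about 15 concurrent tasks) and 125 permits at factor 2.0 (8 concurrent tasks).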

The feature is disabled by default behind internal configs under spark.rapids.adaptive.backpressure.*.

Why

Large GPU executors can inherit CPU-era spark.executor.cores settings that oversubscribe host-side resources. Many concurrent task threads can build up host and pinned memory pressure, which increases JVM GC time, causes missed heartbeats and executor loss, and triggers recompute cascades.

This change provides a stability layer for over-concurrent executors without requiring per-job hand tuning of executor.cores or concurrentGpuTasks.

Notes

Draft until a public issue is available to link and validation is rerun in an environment with the current private snapshot dependencies.

Chong Gao and others added 2 commits June 26, 2026 06:02
Introduce PressureMonitor, an executor-level monitor that samples JVM GC time
fraction and GPU device-memory (RMM) utilization on a background thread and
exposes an AIMD back-pressure factor. GpuSemaphore multiplies its per-task GPU
memory estimate by this factor, so GPU task concurrency is reduced when GC time
or device memory crosses a guard threshold and recovers when healthy. This
removes the GC-spike -> executor-loss -> recompute cascade adaptively, without
statically tuning concurrentGpuTasks / executor.cores.
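One way to obtain a GC time fraction on the JVM is to difference the cumulative collection time reported by the standard GarbageCollectorMXBean between samples. The sketch below assumes that approach; the class name is hypothetical, and the commit does not say how PressureMonitor actually computes the value.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sampler: fraction of wall-clock time spent in GC since the last sample. */
final class GcFractionSampler {
    private long lastGcMillis = totalGcMillis();
    private long lastWallNanos = System.nanoTime();

    /** Sums cumulative collection time across all collectors, skipping undefined (-1) values. */
    static long totalGcMillis() {
        long total = 0L;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = bean.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Returns the GC time fraction over the interval since the previous call, in [0, 1]. */
    synchronized double sample() {
        long gcNow = totalGcMillis();
        long wallNow = System.nanoTime();
        double wallMillis = (wallNow - lastWallNanos) / 1_000_000.0;
        double gcDeltaMillis = gcNow - lastGcMillis;
        lastGcMillis = gcNow;
        lastWallNanos = wallNow;
        if (wallMillis <= 0.0) {
            return 0.0;
        }
        // Summed collector times can exceed wall time, so clamp the ratio.
        return Math.min(1.0, Math.max(0.0, gcDeltaMillis / wallMillis));
    }
}
```

A background thread, for example one driven by a ScheduledExecutorService, could call sample() at a fixed interval and feed the result into the factor update.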

It is a wait-at-acquisition brake (back-pressured tasks block at the semaphore
before holding scarce resources); no new exception type is introduced. Gated
behind spark.rapids.adaptive.backpressure.enabled (default off). Includes
PressureMonitorSuite covering the AIMD/guard logic and config wiring.

Signed-off-by: Chong Gao <res_life@163.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

When the back-pressure factor is elevated (GC time or device memory over guard),
also scale down the host-side in-flight byte budgets via PressureMonitor.scaleHostLimit:
  - BytesInFlightLimiter (multithreaded shuffle writer serialization), and
  - HostMemoryThrottle (async write/read TrafficController).

The GPU semaphore lever (Phase 1) cannot see off-GPU host/pinned memory consumed by
shuffle serialization and async IO, which is the actual driver of GC-overhead executor
loss under an over-subscribed executor.cores. Scaling these existing gates closes that
gap: host work blocks at its in-flight gate sooner. Both gates always admit at least one
in-flight item, so a shrunk budget throttles without deadlocking. No-op when the feature
is disabled. Adds scaleLimit unit tests.
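The "always admit at least one in-flight item" property can be illustrated with a small byte-budget gate. This is a sketch under assumptions, not the BytesInFlightLimiter or HostMemoryThrottle code: the class, its methods, and the non-blocking tryAcquire (a real gate would block the caller instead of returning false) are hypothetical.

```java
/** Hypothetical in-flight byte gate whose budget shrinks with the back-pressure factor. */
final class InFlightLimiterSketch {
    private final long baseLimitBytes;
    private long inFlightBytes = 0L;
    private double factor = 1.0;

    InFlightLimiterSketch(long baseLimitBytes) {
        this.baseLimitBytes = baseLimitBytes;
    }

    synchronized void setFactor(double newFactor) {
        factor = Math.max(1.0, newFactor);
    }

    /** Effective budget: base limit divided by the factor, never below one byte. */
    synchronized long effectiveLimit() {
        return Math.max(1L, (long) (baseLimitBytes / factor));
    }

    /**
     * Admits the request if it fits in the scaled budget, or if nothing is
     * currently in flight. The second condition guarantees progress even when
     * a single item is larger than the shrunk budget.
     */
    synchronized boolean tryAcquire(long bytes) {
        if (inFlightBytes == 0L || inFlightBytes + bytes <= effectiveLimit()) {
            inFlightBytes += bytes;
            return true;
        }
        return false;
    }

    synchronized void release(long bytes) {
        inFlightBytes = Math.max(0L, inFlightBytes - bytes);
    }
}
```

With a base limit of 100 bytes and a factor of 4.0, the budget drops to 25 bytes, yet an 80-byte item is still admitted when the gate is idle; later requests are refused until it is released.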

Signed-off-by: Chong Gao <res_life@163.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@res-life res-life changed the title from "Add adaptive back-pressure monitor" to "[POC] Add adaptive back-pressure monitor" Jun 26, 2026
@res-life res-life requested a review from revans2 June 26, 2026 06:13
@res-life res-life changed the title from "[POC] Add adaptive back-pressure monitor" to "[POC] Add adaptive back-pressure monitor to avoid OOM" Jun 26, 2026

@revans2 revans2 left a comment

Collaborator

This is interesting. I want to see benchmark numbers both for the cases where it helps and for general workloads. I am concerned that there is no obvious direct connection between blocking more threads waiting on the GPU and reducing CPU-side Java GC overhead. I get that it indirectly helps, and it may directly help if the GC is caused by lots of small batches, but if we have lots of small batches, buffers, etc. in our code, shouldn't we be addressing that instead? I cannot tell whether this is a band-aid on a problem, which I suspect it is, or whether it actually addresses the real problem.

3 participants