[POC] Add adaptive back-pressure monitor to avoid OOM #15150
Draft
res-life wants to merge 2 commits into
Introduce PressureMonitor, an executor-level monitor that samples JVM GC time fraction and GPU device-memory (RMM) utilization on a background thread and exposes an AIMD back-pressure factor. GpuSemaphore multiplies its per-task GPU memory estimate by this factor, so GPU task concurrency is reduced when GC time or device memory crosses a guard threshold and recovers when healthy.

This removes the GC-spike -> executor-loss -> recompute cascade adaptively, without statically tuning concurrentGpuTasks / executor.cores. It is a wait-at-acquisition brake (back-pressured tasks block at the semaphore before holding scarce resources); no new exception type is introduced.

Gated behind spark.rapids.adaptive.backpressure.enabled (default off). Includes PressureMonitorSuite covering the AIMD/guard logic and config wiring.

Signed-off-by: Chong Gao <res_life@163.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
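To make the factor mechanics in this commit message concrete, here is a minimal Java sketch of one plausible reading of the AIMD update and the estimate scaling. Everything in it is an assumption for illustration: the class name BackPressureFactor, the guard values, the growth and recovery constants, and the cap are hypothetical and are not taken from the PR, whose actual update rule is not shown on this page.

```java
// Hypothetical sketch, not the PR's code: names and constants are illustrative.
final class BackPressureFactor {
    private final double gcGuard;       // e.g. 0.10 = 10% of wall time spent in GC
    private final double memGuard;      // e.g. 0.90 = 90% of device memory in use
    private final double growth;        // multiplicative step applied under pressure
    private final double recoveryStep;  // additive step removed while healthy
    private final double maxFactor;     // upper bound so the brake cannot grow forever
    private double factor = 1.0;        // 1.0 means "no back-pressure"

    BackPressureFactor(double gcGuard, double memGuard, double growth,
                       double recoveryStep, double maxFactor) {
        this.gcGuard = gcGuard;
        this.memGuard = memGuard;
        this.growth = growth;
        this.recoveryStep = recoveryStep;
        this.maxFactor = maxFactor;
    }

    // Called once per sampling tick with the latest measurements.
    synchronized double update(double gcTimeFraction, double deviceMemUtilization) {
        boolean pressured = gcTimeFraction > gcGuard || deviceMemUtilization > memGuard;
        if (pressured) {
            // Over a guard: grow the factor multiplicatively, up to the cap.
            factor = Math.min(maxFactor, factor * growth);
        } else {
            // Healthy: step back toward 1.0 and never go below it.
            factor = Math.max(1.0, factor - recoveryStep);
        }
        return factor;
    }

    synchronized double current() {
        return factor;
    }

    // Inflate a per-task memory estimate; a factor below 1.0 is treated as 1.0.
    static long scaledEstimate(long estimateBytes, double factor) {
        return (long) Math.ceil(estimateBytes * Math.max(1.0, factor));
    }
}
```

Under these assumed constants, a task whose estimate is 2 GiB is treated as 4 GiB once the factor reaches 2.0, so a gate that sizes admission by estimated memory would admit roughly half as many such tasks until the factor recovers.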
When the back-pressure factor is elevated (GC time or device memory over guard), also scale down the host-side in-flight byte budgets via PressureMonitor.scaleHostLimit:
- BytesInFlightLimiter (multithreaded shuffle writer serialization), and
- HostMemoryThrottle (async write/read TrafficController).

The GPU semaphore lever (Phase 1) cannot see off-GPU host/pinned memory consumed by shuffle serialization and async IO, which is the actual driver of GC-overhead executor loss under an over-subscribed executor.cores. Scaling these existing gates closes that gap: host work blocks at its in-flight gate sooner. Both gates always admit at least one in-flight item, so a shrunk budget throttles without deadlocking. No-op when the feature is disabled.

Adds scaleLimit unit tests.

Signed-off-by: Chong Gao <res_life@163.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
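The host-side lever described in this commit message can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: HostBudget, InFlightGate, and the divide-by-factor rule are hypothetical, and the gate here is non-blocking (tryAcquire) for brevity, whereas the message describes gates that block.

```java
// Hypothetical sketch, not the PR's code: shows budget scaling plus the
// "always admit at least one item" rule that avoids deadlock.
final class HostBudget {
    // Shrink a byte budget by the back-pressure factor; no-op at or below 1.0.
    static long scaleLimit(long baseLimitBytes, double factor) {
        if (factor <= 1.0) {
            return baseLimitBytes;
        }
        return Math.max(1L, (long) (baseLimitBytes / factor));
    }
}

final class InFlightGate {
    private final long baseLimitBytes;
    private long inFlightBytes = 0;

    InFlightGate(long baseLimitBytes) {
        this.baseLimitBytes = baseLimitBytes;
    }

    // Returns true if the item was admitted under the currently scaled budget.
    synchronized boolean tryAcquire(long bytes, double factor) {
        long limit = HostBudget.scaleLimit(baseLimitBytes, factor);
        // An empty gate always admits, so an item larger than the shrunk
        // budget can still make progress instead of waiting forever.
        if (inFlightBytes == 0 || inFlightBytes + bytes <= limit) {
            inFlightBytes += bytes;
            return true;
        }
        return false;
    }

    synchronized void release(long bytes) {
        inFlightBytes = Math.max(0L, inFlightBytes - bytes);
    }
}
```

With a 1000-byte base budget and a factor of 4.0, the effective budget in this sketch is 250 bytes: a second item is refused while one is in flight, yet a single 5000-byte item is still admitted when the gate is empty.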
revans2 (Collaborator) reviewed on Jun 26, 2026 and left a comment:
This is interesting. I want to see benchmark numbers both for the cases where it helps and for general workloads. I am concerned that there is no obvious direct connection between blocking more threads that are waiting on the GPU and reducing Java GC overhead on the CPU. I get that it indirectly helps, and may directly help if the GC is caused by lots of small batches, but if we have lots of small batches, buffers, etc. in our code, then shouldn't we be addressing that instead? I cannot tell if this is a band-aid on a problem, which I suspect it is, or if it is actually addressing the real problem.
What changed
This adds an opt-in executor-level adaptive back-pressure monitor for GPU executors.

The new PressureMonitor samples JVM GC time fraction and GPU device-memory utilization, then maintains an AIMD multiplicative factor. When pressure crosses configured guards, the factor increases; when the executor is healthy, it decays back to 1.0.

The factor feeds two existing admission gates:
- GpuSemaphore.permitsForEstimate, which inflates per-task GPU memory estimates so fewer tasks are admitted concurrently.
- PressureMonitor.scaleHostLimit, used by the shuffle writer BytesInFlightLimiter and async IO HostMemoryThrottle.

The feature is disabled by default behind internal configs under spark.rapids.adaptive.backpressure.*.

Why
Large GPU executors can inherit CPU-era spark.executor.cores settings that oversubscribe host-side resources. Many concurrent task threads can accumulate host/pinned memory pressure, increase JVM GC, miss heartbeats, lose executors, and trigger recompute cascades.

This change provides a stability layer for over-concurrent executors without requiring per-job hand tuning of executor.cores or concurrentGpuTasks.

Notes
Draft until a public issue is available to link and validation is rerun in an environment with the current private snapshot dependencies.
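
For readers unfamiliar with the "JVM GC time fraction" signal mentioned under What changed, the sketch below shows one way such a value could be approximated on a stock JVM using the standard java.lang.management API. It is an assumption about how the signal might be measured, not the PR's sampler: GcTimeSampler and its methods are hypothetical names, and getCollectionTime is only an approximate accumulated elapsed time whose meaning varies by collector, so the result is clamped to [0, 1].

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical sketch, not the PR's code: estimates the share of wall-clock
// time spent in GC between two consecutive calls to sample().
final class GcTimeSampler {
    private long lastGcMillis;
    private long lastWallNanos;

    GcTimeSampler() {
        lastGcMillis = totalGcMillis();
        lastWallNanos = System.nanoTime();
    }

    // Sum of accumulated collection time across all collectors, in ms.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = bean.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Pure helper: GC time divided by wall time, clamped to [0, 1].
    static double fraction(long gcDeltaMillis, long wallDeltaMillis) {
        if (wallDeltaMillis <= 0) {
            return 0.0;
        }
        double f = (double) gcDeltaMillis / wallDeltaMillis;
        return Math.min(1.0, Math.max(0.0, f));
    }

    // Intended to be called periodically, for example from a background thread.
    synchronized double sample() {
        long gcNow = totalGcMillis();
        long wallNow = System.nanoTime();
        long gcDelta = gcNow - lastGcMillis;
        long wallDeltaMillis = (wallNow - lastWallNanos) / 1_000_000L;
        lastGcMillis = gcNow;
        lastWallNanos = wallNow;
        return fraction(gcDelta, wallDeltaMillis);
    }
}
```

A monitor thread could call sample() on a fixed interval and compare the result against a guard threshold; the interval and the threshold are tuning choices that this sketch leaves open.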