Fix skip-merge shuffle handle lifetime #15064
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes NVIDIA#15018

Retain partial shuffle handles while managed buffers, input streams, and Netty file regions are active so `unregisterShuffle` cannot close data under in-flight readers. Harden lease cleanup and partial-file close handling so releases are exception-safe and interrupted close waits still finish resource cleanup before restoring the interrupt flag.

Add regression coverage for retained-buffer reads, Netty file-region release, and interrupted partial-file cleanup.
Greptile Summary
This PR fixes a skip-merge shuffle lifetime bug where
Confidence Score: 5/5
Safe to merge. The deferred-close logic is correctly guarded: `closed` is set atomically inside the handle's monitor via `markCloseIfReady()`, preventing `acquireRead` from slipping in after the decision to close. The `doClose()` path correctly waits for any in-progress spill and restores the interrupt flag, and the lease rollback in `ShuffleHandleLease.acquire()` is complete. The lock ordering is consistent (lease monitor then handle monitor, never reversed), `doClose()` is only ever dispatched once per handle (`closed=true` gates it), and the consumer release paths are all idempotent. The multithreaded regression test and the deterministic `isPhysicallyClosed` assertion give good coverage of the targeted race. No logic errors or resource-leak paths found. No files require special attention.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Reducer
participant Buffer as MultiBatchManagedBuffer
participant Lease as ShuffleHandleLease
participant Handle as SpillablePartialFileHandle
participant Catalog as MultithreadedShuffleBufferCatalog
Reducer->>Buffer: retain()
Buffer->>Lease: acquire(handles)
Lease->>Handle: "acquireRead() refCount=1"
Buffer-->>Reducer: this (retain lease stored)
Reducer->>Buffer: createInputStream()
Buffer->>Lease: acquire(handles)
Lease->>Handle: "acquireRead() refCount=2"
Buffer-->>Reducer: MultiSegmentInputStream (holds lease)
Note over Catalog: Concurrent unregisterShuffle
Catalog->>Handle: close()
Handle->>Handle: "closeRequested=true, refCount>0 defer"
Reducer->>Buffer: stream.read() handle still open
Reducer->>Buffer: stream.close()
Buffer->>Lease: close()
Lease->>Handle: "releaseRead() refCount=1"
Reducer->>Buffer: release()
Buffer->>Lease: close()
Lease->>Handle: "releaseRead() refCount=0 closeRequested=true doClose()"
Handle->>Handle: Physical close + file delete
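The deferred-close sequence in the diagram can be sketched in Java as follows. This is a minimal illustration, not the PR's code: `SketchHandle`, `physicalCloseCount()`, and the field names are hypothetical stand-ins for `SpillablePartialFileHandle`. Following the review summary, the sketch assumes new leases are refused only once the physical close has been decided, so a consumer can still take a lease while a requested close is deferred.

```java
// Hypothetical stand-in for SpillablePartialFileHandle. Names follow the
// diagram, not the actual spark-rapids source.
final class SketchHandle {
    private int refCount = 0;               // active read leases
    private boolean closeRequested = false; // close() was called at least once
    private boolean closed = false;         // physical close has been decided
    private int physicalCloses = 0;         // how many times doClose() ran

    /** Takes a read lease. Returns false once the physical close is decided. */
    synchronized boolean acquireRead() {
        if (closed) {
            return false;
        }
        refCount++;
        return true;
    }

    /** Drops a read lease. The last release performs a deferred close. */
    void releaseRead() {
        boolean closeNow;
        synchronized (this) {
            if (refCount <= 0) {
                throw new IllegalStateException("releaseRead without acquireRead");
            }
            refCount--;
            closeNow = markCloseIfReady();
        }
        if (closeNow) {
            doClose();
        }
    }

    /** Idempotent close request. Physically closes only when no lease is held. */
    void close() {
        boolean closeNow;
        synchronized (this) {
            closeRequested = true;
            closeNow = markCloseIfReady();
        }
        if (closeNow) {
            doClose();
        }
    }

    // Caller must hold the monitor. Returns true for exactly one caller.
    private boolean markCloseIfReady() {
        if (closeRequested && refCount == 0 && !closed) {
            closed = true;
            return true;
        }
        return false;
    }

    // Placeholder for the real work (channel close, temp-file delete).
    private synchronized void doClose() {
        physicalCloses++;
    }

    synchronized int physicalCloseCount() {
        return physicalCloses;
    }
}
```

Replaying the diagram against this sketch: two `acquireRead()` calls bring the count to 2, `close()` only records the request, and the second `releaseRead()` is the call that runs the physical close.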
Reviews (2): Last reviewed commit: "Merge branch 'main' into rapids-15018"
build
…test changes

Databricks pre-merge CI is conditional: per jenkins/Jenkinsfile-blossom.premerge it runs only when the PR title contains [databricks] or the diff touches a Databricks-shim path (sql-plugin/src/main/...db/ or a path containing "databricks"). The standard Linux pre-merge never runs Databricks.

This leaves a gap. A change can be correct on vanilla Spark yet behave differently on the Databricks Spark fork without touching any auto-trigger path -- e.g. integration tests that rely on filesystem/path semantics (local vs DBFS/abfss, file:// scheme, os.walk/os.path) or that assert on optimizer plan strings (alias names and plan rendering differ on DBR). Such a test merges green because the only job that would have exercised it on Databricks was never triggered, then surfaces as a failure later on an unrelated PR that does carry [databricks] -- making an innocent PR look broken and costing triage time.

To close the gap on the review side:
- .greptile/config.json: add the "databricks-ci-tag" rule (scoped to integration_tests/**, severity medium) so Greptile recommends adding [databricks] when an integration-test change looks Databricks-divergent and the PR title lacks the tag. It explicitly does not flag changes already under a *db* shim path (auto-covered) or doc-only changes, to avoid noise.
- .greptile/rules.md: split the vague [databricks] mention out of H7 into a focused H9 "Databricks coverage" checklist item.
- AGENTS.md: document when [databricks] is needed (not just how), so both humans and Greptile (whose instructions reference AGENTS.md) share one source of truth.

Scope is integration_tests/** only -- not the Scala unit-test dirs. The Databricks pre-merge builds with -DskipTests and runs only the Python integration tests (run_pyspark_from_build.sh); Scala unit tests never execute on Databricks, so the [databricks] tag cannot validate them and recommending it there would be misleading. Verified against the NVIDIA#15064 Databricks CI_PART1 console log: all shims built with -DskipTests, scalatest goal skipped (73x "Tests are skipped", zero ScalaTest/Surefire run summaries), followed only by run_pyspark_from_build.sh.

The rule is advisory -- it nudges; it does not gate merges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
The CI/CD failure is an unrelated problem reported in #15073.
abellina
left a comment
Thanks @amahussein. I'd like to ask if there is an alternative to this implementation. From the error you quoted, the problem is that the file channel is closed while we are reading (full stop). Did I understand that right? If the channel were closed before we started reading, we'd be OK?
If so, could we just make sure that when we are reading from the channel we consistently hold the file handle lock?
Sometimes, in the code I quoted above, we read it under the lock, sometimes we don't.
Wouldn't that just fix the race between close and read?
Thanks @abellina. Yes, you read it right: the failure is the file channel getting closed while a read is in flight. If close always landed strictly before any read started (and stayed closed only after all reads finished), we'd be fine.

The problem with fixing it via a read-time lock is what the failing consumer actually is. A monitor only covers a single in-method

And the produce→read gap here is real and wide, not a tight CPU race; remote serving is two-phase:

This is also why I don't think we can lean on the cleanup timing instead. Doc §8.1 requires handles to stay alive "until all reducers have finished reading," but the trigger we actually have is SQL-execution-end + a 1s poll, a heuristic running on a different clock than the reads (speculative/zombie reduce attempts, multi-shuffle executions, driver-end vs. in-flight transfer all overlap). So the lifetime has to be enforced at the handle. Refcounting is already the documented mechanism (§6.3, "each open stream increments ref count on handles"); this PR just extends it to the other two consumers the serving path produces: Netty file regions and retained buffers.

On your specific observation: you're right that the read path is inconsistent. The memory read is under the lock, the file read at
Thanks @amahussein. Then all the changes outside of the handle don't need to be made. Other than perhaps a call to a new method "acquire()" that incRefCounts the handle. The "close()" method would decRefCount until the lease is done. Thoughts?
Thanks @abellina. I'm on board with moving the ref count into the handle. It lets Two things I want to flag before I make the change:
If close() could check that refCount is == 1, throwing in that case, then ok. I don't think this is possible when you get to implementing it => since what we are talking about is that part of the code (whether it is the removal part or the reader) is going to actually release, it implies close() must not close, it must decRefCount.
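One way to read the suggestion above ("close() must not close, it must decRefCount") is a handle whose owner holds the first reference. The following Java sketch illustrates that reading only. `OwnerCountedHandle`, `acquire()`, and `physicalCloses()` are hypothetical names and none of this is taken from the PR.

```java
// Illustrative only: a handle where close() is a reference-count decrement.
// The creator (for example the catalog) starts out owning one reference.
final class OwnerCountedHandle {
    private int refCount = 1;        // the owner's reference
    private int physicalCloses = 0;  // how many times the resource was freed

    /** Adds a reference for a new reader. Fails once the handle is freed. */
    synchronized OwnerCountedHandle acquire() {
        if (refCount == 0) {
            throw new IllegalStateException("handle already closed");
        }
        refCount++;
        return this;
    }

    /** Drops one reference. The resource is freed when the count hits zero. */
    synchronized void close() {
        if (refCount == 0) {
            throw new IllegalStateException("closed more times than acquired");
        }
        refCount--;
        if (refCount == 0) {
            physicalCloses++; // placeholder for channel close and file delete
        }
    }

    synchronized int physicalCloses() {
        return physicalCloses;
    }
}
```

In this shape `close()` is not idempotent: every holder, the owner included, has to call it exactly once, and an extra call by one holder would release a reference that belongs to another.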
Thanks @abellina. On the two points:
1. On I looked at leaning on it instead of a handle refcount. It fits the in-memory case cleanly (incRefCount the host buffer, zero-copy) and we could drop our refcount there. It doesn't fit the file-backed case, which is the one this bug hits (repro is
2. On We agree the decrement belongs on the handle (that's
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes NVIDIA#15018

Reference-count partial shuffle handles on the handle itself so that while managed buffers, input streams, and Netty file regions are active, `unregisterShuffle` (and any other close caller) defers the physical close instead of freeing data under in-flight readers. `close()` is an idempotent close-request; `acquireRead`/`releaseRead` track active readers and the last release performs the deferred close. Release is exception-safe, and an interrupted close while waiting on an in-progress spill still finishes resource cleanup before restoring the interrupt flag.

Add regression coverage for the handle read-lease lifecycle (deferred close, repeated close while a lease is held, immediate close with no leases), retained-buffer reads across concurrent `unregisterShuffle`, Netty file-region release, and interrupted partial-file cleanup.
build
Signed-off-by: Ahmed Hussein (amahussein) a@ahussein.me
Fixes #15018.
Description
This fixes a skip-merge shuffle lifetime bug where `unregisterShuffle` could close a partial shuffle
file handle while a retained buffer, input stream, or Netty file region was still reading from it.
That could surface as failed shuffle reads when cleanup raced with in-flight fetch consumers.
The fix adds reference-counted lifecycle tracking for partial shuffle file handles. Catalog cleanup
now removes metadata immediately but defers the physical handle close until all active retained
buffers, streams, and file regions release their leases. Lease cleanup is hardened so all retained
handles are released even if one release throws.
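As a rough illustration of the lease behaviour described here, the Java sketch below takes leases on several handles all-or-nothing and, on release, keeps going even if one handle's release throws. It is a sketch under assumptions rather than the PR's `ShuffleHandleLease`: `LeasableHandle`, `SketchLease`, and `CountingHandle` are hypothetical names, and collecting later failures with `addSuppressed` is one possible policy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical handle contract; the real handle API may differ.
interface LeasableHandle {
    boolean acquireRead(); // false if the handle can no longer be read
    void releaseRead();
}

// Sketch of a lease that covers several handles at once.
final class SketchLease implements AutoCloseable {
    private final List<LeasableHandle> held;
    private boolean released = false;

    private SketchLease(List<LeasableHandle> held) {
        this.held = held;
    }

    /** Acquires every handle, or none: a failure rolls back earlier acquires. */
    static SketchLease acquire(List<? extends LeasableHandle> handles) {
        List<LeasableHandle> acquired = new ArrayList<>();
        try {
            for (LeasableHandle h : handles) {
                if (!h.acquireRead()) {
                    throw new IllegalStateException("handle closed before lease was taken");
                }
                acquired.add(h);
            }
            return new SketchLease(acquired);
        } catch (RuntimeException e) {
            throw releaseAll(acquired, e);
        }
    }

    /** Idempotent. Releases every handle even if some releases throw. */
    @Override
    public synchronized void close() {
        if (released) {
            return;
        }
        released = true;
        RuntimeException failure = releaseAll(held, null);
        if (failure != null) {
            throw failure;
        }
    }

    // Releases all handles; the first failure carries later ones as suppressed.
    private static RuntimeException releaseAll(List<LeasableHandle> handles,
                                               RuntimeException first) {
        for (LeasableHandle h : handles) {
            try {
                h.releaseRead();
            } catch (RuntimeException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        return first;
    }
}

// Test double: counts leases and can be told to fail on release.
final class CountingHandle implements LeasableHandle {
    int refs = 0;
    private final boolean open;
    private final boolean failOnRelease;

    CountingHandle(boolean open, boolean failOnRelease) {
        this.open = open;
        this.failOnRelease = failOnRelease;
    }

    @Override
    public boolean acquireRead() {
        if (!open) {
            return false;
        }
        refs++;
        return true;
    }

    @Override
    public void releaseRead() {
        refs--;
        if (failOnRelease) {
            throw new IllegalStateException("release failed");
        }
    }
}
```

With three handles where the middle one throws on release, closing the lease still brings all three counts back to zero before the exception reaches the caller.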
This also fixes an interrupt cleanup hole in `SpillablePartialFileHandle.doClose()`: if close is
interrupted while waiting for an in-progress spill, cleanup now still releases resources and deletes
the temp file before restoring the interrupt flag.
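The interrupt handling described in this paragraph follows a common Java pattern: remember the interrupt, finish cleanup, then re-assert the flag. A minimal sketch, assuming the in-progress spill can be modelled as a `CountDownLatch`; `InterruptSafeCloser` and its two boolean fields are hypothetical and only stand in for the real resource release and temp-file deletion.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative only: cleanup that survives an interrupt during the wait.
final class InterruptSafeCloser {
    boolean resourcesReleased = false; // stands in for closing buffers and channels
    boolean tempFileDeleted = false;   // stands in for deleting the partial file

    /** spillDone models the in-progress spill that close has to wait for. */
    void doClose(CountDownLatch spillDone) {
        boolean interrupted = false;
        try {
            spillDone.await();
        } catch (InterruptedException e) {
            // Remember the interrupt, but do not skip cleanup because of it.
            interrupted = true;
        } finally {
            try {
                resourcesReleased = true;
                tempFileDeleted = true;
            } finally {
                if (interrupted) {
                    // Restore the flag only after cleanup has run.
                    Thread.currentThread().interrupt();
                }
            }
        }
    }
}
```

Calling `doClose` on a thread whose interrupt flag is already set makes `await()` throw immediately, which exercises the interrupted path without needing a second thread.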
Tests added or updated:
- `MultithreadedShuffleBufferCatalogSuite`
  - `unregisterShuffle`
  - `convertToNetty` file-region release closes the retained handle exactly once
- `SpillablePartialFileHandleSuite`

Validation run:
- `git diff --check`
- `mvn install -pl sql-plugin -am -DskipTests -Dmaven.scaladoc.skip=true -Dbuildver=356`
- `SpillablePartialFileHandleSuite` passed, 18/18 tests
- `MultithreadedShuffleBufferCatalogSuite` passed, 13/13 tests

Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance