Skip to content

Conversation

@sirzooro
Copy link
Contributor

Fix a shutdown race where Association.Abort() could return without the ABORT being written, which in turn can delay peer-side close detection (e.g. webrtc.SCTPTransport.OnClose) and cause intermittent timeouts.

Problem

Association.Abort() is used as an immediate shutdown path. Today it:

  • sets willSendAbort and wakes writeLoop,
  • then immediately forces readLoop to exit via SetReadDeadline(time.Now()),
  • and waits for readLoop to finish.

However, readLoop teardown closes closeWriteLoopCh, and writeLoop may exit promptly on that channel closure. In some timing windows, this means the ABORT packet is never actually gathered/written before the association is torn down.

When ABORT is skipped, the remote side may not observe a clean SCTP shutdown promptly; downstream code that waits for SCTP closure (like pion/webrtc’s SCTPTransport.OnClose) can block until other layers time out, producing flakes.

This tends to show up as a "remote OnClose never fires" symptom: the peer still has an SCTP accept loop blocked on AcceptStream()/Accept() and won’t unblock until it gets a definitive association teardown signal.

Fix

Make ABORT emission deterministic and bounded:

  1. Track whether an ABORT write attempt happened by adding an abortSentCh that is closed when writeLoop attempts to write a packet whose first chunk type is ABORT.
  2. In Abort(), wait (with a short timeout) for the ABORT write attempt before forcing readLoop to exit, and also before returning.
  3. Keep the previously-added behavior where writeLoop prefers sending a pending ABORT even if closeWriteLoopCh is closed.

This preserves the current “prompt shutdown” behavior while significantly increasing the chance the peer observes ABORT promptly.

Implementation notes:

  • The waits are intentionally bounded (200ms) so Abort remains non-blocking during shutdown even if the underlying connection is already broken.
  • The abortSentCh is best-effort “write attempted” (not an ACK from the network), but it prevents the most problematic ordering where we tear down the loops before the ABORT ever hits net.Conn.Write.

Tests

  • TestAbortStillSendsWhenWriteLoopClosing: reproduces the race where closeWriteLoopCh is closed while ABORT is pending and asserts the ABORT packet is still written.
  • TestAbort_WaitsForAbortWriteAttempt: uses a deterministic net.Conn stub to assert Abort() does not return until an ABORT write has been attempted.

Notes / Backwards compatibility

  • The new waits are bounded (currently 200ms) and only apply to the Abort path.

@codecov
Copy link

codecov bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.78%. Comparing base (473ed5a) to head (7c6cf56).

Files with missing lines Patch % Lines
association.go 77.77% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #456      +/-   ##
==========================================
- Coverage   83.86%   83.78%   -0.09%     
==========================================
  Files          52       52              
  Lines        4060     4076      +16     
==========================================
+ Hits         3405     3415      +10     
- Misses        479      483       +4     
- Partials      176      178       +2     
Flag Coverage Δ
go 83.78% <77.77%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Fix a shutdown race where Association.Abort() could return without
the ABORT being written, which in turn can delay peer-side close
detection (e.g. webrtc.SCTPTransport.OnClose) and cause intermittent
timeouts.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition in SCTP association shutdown where Association.Abort() could return before the ABORT packet is actually written to the network. This race occurred when the readLoop exits and closes closeWriteLoopCh before writeLoop has a chance to send the pending ABORT, causing the peer to not detect the shutdown promptly.

Changes:

  • Added synchronization mechanism (abortSentCh channel and abortSentOnce sync.Once) to track when ABORT write attempts occur
  • Modified Abort() method to wait (with 200ms timeout) for the ABORT write attempt before forcing readLoop exit and before returning
  • Enhanced writeLoop to prefer sending pending ABORTs even when closeWriteLoopCh is closed, and to signal when ABORT packets are written
  • Added comprehensive tests to validate the fix

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
association.go Added ABORT write attempt tracking with abortSentCh and abortSentOnce; updated Abort() to wait for ABORT write; modified writeLoop to detect and signal ABORT writes and continue loop when ABORT is pending despite close signal
association_test.go Added test helper types (recordConn, abortOrderingConn) and two tests: TestAbortStillSendsWhenWriteLoopClosing validates ABORT is sent when writeLoop is closing, and TestAbort_WaitsForAbortWriteAttempt ensures Abort() waits for write attempt

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +764 to +768
isAbortPacket := len(raw) > int(commonHeaderSize) && raw[commonHeaderSize] == byte(ctAbort)
_, err := a.netConn.Write(raw)
if isAbortPacket {
a.abortSentOnce.Do(func() { close(a.abortSentCh) })
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If gatherAbortPacket returns an error (line 1204), the abortSentCh channel is never closed because no packet is written. This causes Abort() to always wait for the full 200ms timeout on both wait blocks (lines 673-676 and 686-689), resulting in a 400ms delay even though the ABORT cannot be sent.

Consider closing abortSentCh immediately when gatherAbortPacket fails, or handle this error case in the writeLoop to signal that the ABORT attempt completed (even if it failed). This ensures Abort() doesn't unnecessarily block when the ABORT cannot be marshaled.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant