-
Notifications
You must be signed in to change notification settings - Fork 87
Fixed shutdown race where ABORT is not sent #456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #456 +/- ##
==========================================
- Coverage 83.86% 83.78% -0.09%
==========================================
Files 52 52
Lines 4060 4076 +16
==========================================
+ Hits 3405 3415 +10
- Misses 479 483 +4
- Partials 176 178 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Fix a shutdown race where Association.Abort() could return without the ABORT being written, which in turn can delay peer-side close detection (e.g. webrtc.SCTPTransport.OnClose) and cause intermittent timeouts.
0316507 to
7c6cf56
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a race condition in SCTP association shutdown where Association.Abort() could return before the ABORT packet is actually written to the network. This race occurred when the readLoop exits and closes closeWriteLoopCh before writeLoop has a chance to send the pending ABORT, causing the peer to not detect the shutdown promptly.
Changes:
- Added synchronization mechanism (
abortSentChchannel andabortSentOncesync.Once) to track when ABORT write attempts occur - Modified
Abort()method to wait (with 200ms timeout) for the ABORT write attempt before forcing readLoop exit and before returning - Enhanced
writeLoopto prefer sending pending ABORTs even whencloseWriteLoopChis closed, and to signal when ABORT packets are written - Added comprehensive tests to validate the fix
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| association.go | Added ABORT write attempt tracking with abortSentCh and abortSentOnce; updated Abort() to wait for ABORT write; modified writeLoop to detect and signal ABORT writes and continue loop when ABORT is pending despite close signal |
| association_test.go | Added test helper types (recordConn, abortOrderingConn) and two tests: TestAbortStillSendsWhenWriteLoopClosing validates ABORT is sent when writeLoop is closing, and TestAbort_WaitsForAbortWriteAttempt ensures Abort() waits for write attempt |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| isAbortPacket := len(raw) > int(commonHeaderSize) && raw[commonHeaderSize] == byte(ctAbort) | ||
| _, err := a.netConn.Write(raw) | ||
| if isAbortPacket { | ||
| a.abortSentOnce.Do(func() { close(a.abortSentCh) }) | ||
| } |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If gatherAbortPacket returns an error (line 1204), the abortSentCh channel is never closed because no packet is written. This causes Abort() to always wait for the full 200ms timeout on both wait blocks (lines 673-676 and 686-689), resulting in a 400ms delay even though the ABORT cannot be sent.
Consider closing abortSentCh immediately when gatherAbortPacket fails, or handle this error case in the writeLoop to signal that the ABORT attempt completed (even if it failed). This ensures Abort() doesn't unnecessarily block when the ABORT cannot be marshaled.
Fix a shutdown race where
Association.Abort()could return without the ABORT being written, which in turn can delay peer-side close detection (e.g.webrtc.SCTPTransport.OnClose) and cause intermittent timeouts.Problem
Association.Abort()is used as an immediate shutdown path. Today it:willSendAbortand wakeswriteLoop,readLoopto exit viaSetReadDeadline(time.Now()),readLoopto finish.However,
readLoopteardown closescloseWriteLoopCh, andwriteLoopmay exit promptly on that channel closure. In some timing windows, this means the ABORT packet is never actually gathered/written before the association is torn down.When ABORT is skipped, the remote side may not observe a clean SCTP shutdown promptly; downstream code that waits for SCTP closure (like
pion/webrtc’sSCTPTransport.OnClose) can block until other layers time out, producing flakes.This tends to show up as a "remote OnClose never fires" symptom: the peer still has an SCTP accept loop blocked on
AcceptStream()/Accept()and won’t unblock until it gets a definitive association teardown signal.Fix
Make ABORT emission deterministic and bounded:
abortSentChthat is closed whenwriteLoopattempts to write a packet whose first chunk type isABORT.Abort(), wait (with a short timeout) for the ABORT write attempt before forcingreadLoopto exit, and also before returning.writeLoopprefers sending a pending ABORT even ifcloseWriteLoopChis closed.This preserves the current “prompt shutdown” behavior while significantly increasing the chance the peer observes ABORT promptly.
Implementation notes:
abortSentChis best-effort “write attempted” (not an ACK from the network), but it prevents the most problematic ordering where we tear down the loops before the ABORT ever hitsnet.Conn.Write.Tests
TestAbortStillSendsWhenWriteLoopClosing: reproduces the race wherecloseWriteLoopChis closed while ABORT is pending and asserts the ABORT packet is still written.TestAbort_WaitsForAbortWriteAttempt: uses a deterministicnet.Connstub to assertAbort()does not return until an ABORT write has been attempted.Notes / Backwards compatibility