Fixed shutdown race where ABORT is not sent #456

sirzooro · 2026-01-22T18:15:28Z

Fix a shutdown race where Association.Abort() could return without the ABORT being written, which in turn can delay peer-side close detection (e.g. webrtc.SCTPTransport.OnClose) and cause intermittent timeouts.

Problem

Association.Abort() is used as an immediate shutdown path. Today it:

sets willSendAbort and wakes writeLoop,
then immediately forces readLoop to exit via SetReadDeadline(time.Now()),
and waits for readLoop to finish.

However, readLoop teardown closes closeWriteLoopCh, and writeLoop may exit promptly on that channel closure. In some timing windows, this means the ABORT packet is never actually gathered/written before the association is torn down.

When ABORT is skipped, the remote side may not observe a clean SCTP shutdown promptly; downstream code that waits for SCTP closure (like pion/webrtc’s SCTPTransport.OnClose) can block until other layers time out, producing flakes.

This tends to show up as a "remote OnClose never fires" symptom: the peer still has an SCTP accept loop blocked on AcceptStream()/Accept() and won’t unblock until it gets a definitive association teardown signal.

Fix

Make ABORT emission deterministic and bounded:

Track whether an ABORT write attempt happened by adding an abortSentCh that is closed when writeLoop attempts to write a packet whose first chunk type is ABORT.
In Abort(), wait (with a short timeout) for the ABORT write attempt before forcing readLoop to exit, and also before returning.
Keep the previously-added behavior where writeLoop prefers sending a pending ABORT even if closeWriteLoopCh is closed.

This preserves the current “prompt shutdown” behavior while significantly increasing the chance the peer observes ABORT promptly.

Implementation notes:

The waits are intentionally bounded (200ms) so Abort remains non-blocking during shutdown even if the underlying connection is already broken.
The abortSentCh is best-effort “write attempted” (not an ACK from the network), but it prevents the most problematic ordering where we tear down the loops before the ABORT ever hits net.Conn.Write.

Tests

TestAbortStillSendsWhenWriteLoopClosing: reproduces the race where closeWriteLoopCh is closed while ABORT is pending and asserts the ABORT packet is still written.
TestAbort_WaitsForAbortWriteAttempt: uses a deterministic net.Conn stub to assert Abort() does not return until an ABORT write has been attempted.

Notes / Backwards compatibility

The new waits are bounded (currently 200ms) and only apply to the Abort path.

codecov · 2026-01-22T18:17:33Z

Codecov Report

❌ Patch coverage is 77.77778% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.78%. Comparing base (473ed5a) to head (7c6cf56).

Files with missing lines	Patch %	Lines
association.go	77.77%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #456      +/-   ##
==========================================
- Coverage   83.86%   83.78%   -0.09%     
==========================================
  Files          52       52              
  Lines        4060     4076      +16     
==========================================
+ Hits         3405     3415      +10     
- Misses        479      483       +4     
- Partials      176      178       +2

Flag	Coverage Δ
go	`83.78% <77.77%> (-0.09%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Fix a shutdown race where Association.Abort() could return without the ABORT being written, which in turn can delay peer-side close detection (e.g. webrtc.SCTPTransport.OnClose) and cause intermittent timeouts.

Copilot

Pull request overview

This PR fixes a race condition in SCTP association shutdown where Association.Abort() could return before the ABORT packet is actually written to the network. This race occurred when the readLoop exits and closes closeWriteLoopCh before writeLoop has a chance to send the pending ABORT, causing the peer to not detect the shutdown promptly.

Changes:

Added synchronization mechanism (abortSentCh channel and abortSentOnce sync.Once) to track when ABORT write attempts occur
Modified Abort() method to wait (with 200ms timeout) for the ABORT write attempt before forcing readLoop exit and before returning
Enhanced writeLoop to prefer sending pending ABORTs even when closeWriteLoopCh is closed, and to signal when ABORT packets are written
Added comprehensive tests to validate the fix

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
association.go	Added ABORT write attempt tracking with `abortSentCh` and `abortSentOnce`; updated `Abort()` to wait for ABORT write; modified `writeLoop` to detect and signal ABORT writes and continue loop when ABORT is pending despite close signal
association_test.go	Added test helper types (`recordConn`, `abortOrderingConn`) and two tests: `TestAbortStillSendsWhenWriteLoopClosing` validates ABORT is sent when writeLoop is closing, and `TestAbort_WaitsForAbortWriteAttempt` ensures Abort() waits for write attempt

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-22T18:30:05Z

association.go

+			isAbortPacket := len(raw) > int(commonHeaderSize) && raw[commonHeaderSize] == byte(ctAbort)
 			_, err := a.netConn.Write(raw)
+			if isAbortPacket {
+				a.abortSentOnce.Do(func() { close(a.abortSentCh) })
+			}


If gatherAbortPacket returns an error (line 1204), the abortSentCh channel is never closed because no packet is written. This causes Abort() to always wait for the full 200ms timeout on both wait blocks (lines 673-676 and 686-689), resulting in a 400ms delay even though the ABORT cannot be sent.

Consider closing abortSentCh immediately when gatherAbortPacket fails, or handle this error case in the writeLoop to signal that the ABORT attempt completed (even if it failed). This ensures Abort() doesn't unnecessarily block when the ABORT cannot be marshaled.

Fixed shutdown race where ABORT is not sent

7c6cf56

Fix a shutdown race where Association.Abort() could return without the ABORT being written, which in turn can delay peer-side close detection (e.g. webrtc.SCTPTransport.OnClose) and cause intermittent timeouts.

sirzooro force-pushed the fix_abort_race branch from 0316507 to 7c6cf56 Compare January 22, 2026 18:20

sirzooro requested review from JoTurk, Copilot and philipch07 January 22, 2026 18:25

Copilot started reviewing on behalf of sirzooro January 22, 2026 18:25 View session

Copilot AI reviewed Jan 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed shutdown race where ABORT is not sent #456

Fixed shutdown race where ABORT is not sent #456

Uh oh!

sirzooro commented Jan 22, 2026

Uh oh!

codecov bot commented Jan 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fixed shutdown race where ABORT is not sent #456

Are you sure you want to change the base?

Fixed shutdown race where ABORT is not sent #456

Uh oh!

Conversation

sirzooro commented Jan 22, 2026

Problem

Fix

Tests

Notes / Backwards compatibility

Uh oh!

codecov bot commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov bot commented Jan 22, 2026 •

edited

Loading