ai/live: Fix race on publish close, advance trickle seq on empty writes #3824
Conversation
The change in [1] introduced a bit of a race condition and uncovered a separate issue that would lead to dozens of rapid-fire GET calls from the orchestrator's local subscriber to the same nonexistent segment.

The race condition was this: on deleting a channel, the publisher first closes the preconnect to clean up its own state, which triggers a segment close on the server with zero bytes written. Then the publisher DELETEs the channel itself.

However, on closing zero-byte segments, the server did not increment the sequence number for the next expected segment. This would cause two problems:

* Subscribers that set the seq to the "next" segment (-1) would keep getting the same zero-byte segment back until the channel was deleted. This is what happened to us: the orchestrator runs a trickle local subscriber that continuously fetches the segment on the leading edge, but it would immediately return with zero bytes just before the channel is deleted. Because this is a local subscriber, it would repeat this dozens of times until the DELETE got through.
* Subscribers that handle their own sequence numbering (e.g., incrementing it after a successful read; there is nothing inherently wrong with a zero-byte segment) would see an error when fetching the next segment in the sequence, since the server does not allow preconnects more than one segment ahead.

Address this in two ways (sketched below):

* Have the publisher delete the channel and then close its own preconnect, rather than the other way around. This addresses the immediate issue of repeated retries: because the channel is marked as deleted first, any later retries see a nonexistent channel.
* Treat zero-byte segments as valid on the server and increment the expected sequence number once a zero-byte segment closes. This would also have prevented the issue even without the publisher fix (at the expense of one more preconnect) and lets us gracefully handle non-updated publishers or other scenarios with similar behavior.

[1] #3802
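A minimal Go sketch of the two fixes described above, assuming hypothetical names (Publisher, deleteChannel, closePreconnect, Stream, Segment, latestSeq) rather than the actual trickle package types:

```go
package main

import "sync"

// Illustrative stand-ins only; not the real trickle package API.

type Publisher struct {
	deleteChannel   func() error // issues the DELETE for the channel
	closePreconnect func()       // tears down the pending preconnect POST
}

// Close deletes the channel first, then closes the preconnect. With the old
// order (preconnect first), a zero-byte segment is finalized while the channel
// still exists, so subscribers can spin on that empty segment until the
// DELETE lands.
func (p *Publisher) Close() error {
	if err := p.deleteChannel(); err != nil {
		return err
	}
	p.closePreconnect()
	return nil
}

type Segment struct {
	seq          int
	bytesWritten int
}

type Stream struct {
	mu        sync.Mutex
	latestSeq int // sequence number of the next expected segment
}

// onSegmentClosed advances the expected sequence number even when the segment
// closed with zero bytes written: a zero-byte segment is still a valid
// segment, and skipping the advance is what left subscribers re-reading the
// same empty segment.
func (s *Stream) onSegmentClosed(seg *Segment) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if seg.seq == s.latestSeq {
		s.latestSeq++
	}
}

func main() {}
```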
Using a read-lock is not safe here since we may modify the segment list by precreating the segment if this request is for the next segment. Somehow never caught by the race detector until just now.
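A minimal sketch of the locking concern, assuming illustrative names (Server, segments, getOrCreateSegment) rather than the real handler code:

```go
package main

import "sync"

type Segment struct{ seq int }

type Server struct {
	mu       sync.RWMutex
	segments map[int]*Segment
}

// A read lock is not enough here: when the request is for the next segment,
// the lookup may precreate it, i.e. write to the map. Map writes under RLock
// race with concurrent readers and writers, which is what the race detector
// flags. Taking the full write lock fixes it.
func (s *Server) getOrCreateSegment(seq int) *Segment {
	s.mu.Lock() // was RLock, unsafe because of the insert below
	defer s.mu.Unlock()
	seg, ok := s.segments[seq]
	if !ok {
		seg = &Segment{seq: seq}
		s.segments[seq] = seg
	}
	return seg
}

func main() {}
```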
Added b4ac3f3: fixed another race condition uncovered by CI in the latest run ... this is not a new problem but for some reason hasn't triggered the race detector until now. Trickle tests all pass with
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@              Coverage Diff               @@
##                master       #3824      +/-  ##
=================================================
+ Coverage    31.65044%   31.67444%   +0.02400%
=================================================
  Files             159         159
  Lines           39020       39022          +2
=================================================
+ Hits            12350       12360         +10
+ Misses          25777       25772          -5
+ Partials          893         890          -3

... and 3 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
victorges left a comment:
LGTM! Thanks for the PR description, much easier to review with all that context.
trickle/local_subscriber_test.go
Outdated
if err != nil {
    break
}
Shouldn't this be a require?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah it returns an error when the stream closes ... right now the error is kind of un-semantic ("stream not found" instead of "end of stream" until delayed teardowns are in), but added something in 4184452
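For context, a standalone sketch of the loop shape discussed here; fakeSubscriber and the test body are hypothetical stand-ins, not the actual test code. The loop breaks only on the expected teardown error and fails on anything else:

```go
package trickle

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

// fakeSubscriber mimics the local subscriber for illustration: one "hello"
// read, then the (currently un-semantic) teardown error.
type fakeSubscriber struct{ done bool }

func (f *fakeSubscriber) Read(p []byte) (int, error) {
	if f.done {
		return 0, errors.New("stream not found") // what teardown surfaces today
	}
	f.done = true
	return copy(p, []byte("hello")), nil
}

func TestReadUntilStreamClosed(t *testing.T) {
	require := require.New(t)
	sub := &fakeSubscriber{}
	buf := make([]byte, 1024)
	total := 0
	for {
		n, err := sub.Read(buf)
		if err != nil {
			// Only the expected end-of-stream error may end the loop; any
			// other error fails the test instead of being silently swallowed.
			require.Contains(err.Error(), "stream not found")
			break
		}
		total += n
	}
	require.Equal(5, total) // the single "hello" write
}
```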
trickle/local_subscriber_test.go
Outdated
if i == 0 {
    require.Equal(5, int(n)) // first write - "hello"
} else {
    // second write latches on after first completes, but cancelled
More to understand this logic, but where do the second and third writes come from? I see only the hello write above.
Ah, good catch. The comment is wrong: there are only two POSTs (the single hello write plus the preconnect to precreate the next segment, which is never consummated) ... the term "write" is probably not the best one; they are POSTs but not necessarily content writes. Will fix.
An earlier version of this fix had three POSTs happening instead of two for this scenario, but I forgot to update the comments in the unit tests.
Improved the copy in 4184452; hopefully that's clearer.
victorges left a comment:
LGTM