ai/live: Fix race on publish close, advance trickle seq on empty writes #3824
Conversation
The change in [1] introduced a bit of a race condition and uncovered a separate issue that would lead to dozens of rapid-fire GET calls from the orchestrator's local subscriber to the same nonexistent segment.

The race condition was this: on deleting a channel, the publisher first closes the preconnect to clean up its own state, which triggers a segment close on the server with zero bytes written. Then the publisher DELETEs the channel itself.

However, on closing zero-byte segments, the server did not increment the sequence number for the next expected segment. This would cause two problems:

* Subscribers that set the seq to the "next" segment (-1) would keep getting the same zero-byte segment back until the channel was deleted. This is what happened to us: the orchestrator runs a trickle local subscriber that continuously fetches the segment on the leading edge, but it would immediately return with zero bytes just before the channel is deleted. Because this is a local subscriber, it would repeat this dozens of times until the DELETE got through.
* Subscribers that handle their own sequence numbering (e.g., incrementing it after a successful read; there is nothing inherently wrong with a zero-byte segment) would see an error when fetching the next segment in the sequence, since the server does not allow preconnects more than one segment ahead.

Address this in two ways (sketched below):

* Have the publisher delete the channel and then close its own preconnect, rather than the other way around. This addresses the immediate issue of repeated retries: because the channel is marked as deleted first, any later retries see a nonexistent channel.
* Treat zero-byte segments as valid on the server and increment the expected sequence number once a zero-byte segment closes. This would also have prevented the issue even without the publisher fix (at the expense of one more preconnect) and lets us gracefully handle non-updated publishers or other scenarios with similar behavior.

[1] #3802
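A minimal Go sketch of the two fixes described above, assuming hypothetical names (Publisher, deleteChannel, closePreconnect, Stream, Segment, latestSeq) rather than the actual trickle package types:

```go
package main

import "sync"

// Illustrative stand-ins only; not the real trickle package API.

type Publisher struct {
	deleteChannel   func() error // issues the DELETE for the channel
	closePreconnect func()       // tears down the pending preconnect POST
}

// Close deletes the channel first, then closes the preconnect. With the old
// order (preconnect first), a zero-byte segment is finalized while the channel
// still exists, so subscribers can spin on that empty segment until the
// DELETE lands.
func (p *Publisher) Close() error {
	if err := p.deleteChannel(); err != nil {
		return err
	}
	p.closePreconnect()
	return nil
}

type Segment struct {
	seq          int
	bytesWritten int
}

type Stream struct {
	mu        sync.Mutex
	latestSeq int // sequence number of the next expected segment
}

// onSegmentClosed advances the expected sequence number even when the segment
// closed with zero bytes written: a zero-byte segment is still a valid
// segment, and skipping the advance is what left subscribers re-reading the
// same empty segment.
func (s *Stream) onSegmentClosed(seg *Segment) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if seg.seq == s.latestSeq {
		s.latestSeq++
	}
}

func main() {}
```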
Using a read-lock is not safe here since we may modify the segment list by precreating the segment if this request is for the next segment. Somehow never caught by the race detector until just now.
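A minimal sketch of the locking concern, assuming illustrative names (Server, segments, getOrCreateSegment) rather than the real handler code:

```go
package main

import "sync"

type Segment struct{ seq int }

type Server struct {
	mu       sync.RWMutex
	segments map[int]*Segment
}

// A read lock is not enough here: when the request is for the next segment,
// the lookup may precreate it, i.e. write to the map. Map writes under RLock
// race with concurrent readers and writers, which is what the race detector
// flags. Taking the full write lock fixes it.
func (s *Server) getOrCreateSegment(seq int) *Segment {
	s.mu.Lock() // was RLock, unsafe because of the insert below
	defer s.mu.Unlock()
	seg, ok := s.segments[seq]
	if !ok {
		seg = &Segment{seq: seq}
		s.segments[seq] = seg
	}
	return seg
}

func main() {}
```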
Added b4ac3f3: fixed another race condition uncovered by CI in the latest run ... this is not a new problem but for some reason hasn't triggered the race detector until now. Trickle tests all pass with
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@              Coverage Diff               @@
##                master       #3824      +/-  ##
=================================================
+ Coverage    31.65044%   31.67444%   +0.02400%
=================================================
  Files             159         159
  Lines           39020       39022          +2
=================================================
+ Hits            12350       12360         +10
+ Misses          25777       25772          -5
+ Partials          893         890          -3

... and 3 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
victorges left a comment:
LGTM! Thanks for the PR description, much easier to review with all that context.
trickle/local_subscriber_test.go
Outdated
if err != nil {
    break
}
Shouldn't this be a require?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah it returns an error when the stream closes ... right now the error is kind of un-semantic ("stream not found" instead of "end of stream" until delayed teardowns are in), but added something in 4184452
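For context, a standalone sketch of the loop shape discussed here; fakeSubscriber and the test body are hypothetical stand-ins, not the actual test code. The loop breaks only on the expected teardown error and fails on anything else:

```go
package trickle

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

// fakeSubscriber mimics the local subscriber for illustration: one "hello"
// read, then the (currently un-semantic) teardown error.
type fakeSubscriber struct{ done bool }

func (f *fakeSubscriber) Read(p []byte) (int, error) {
	if f.done {
		return 0, errors.New("stream not found") // what teardown surfaces today
	}
	f.done = true
	return copy(p, []byte("hello")), nil
}

func TestReadUntilStreamClosed(t *testing.T) {
	require := require.New(t)
	sub := &fakeSubscriber{}
	buf := make([]byte, 1024)
	total := 0
	for {
		n, err := sub.Read(buf)
		if err != nil {
			// Only the expected end-of-stream error may end the loop; any
			// other error fails the test instead of being silently swallowed.
			require.Contains(err.Error(), "stream not found")
			break
		}
		total += n
	}
	require.Equal(5, total) // the single "hello" write
}
```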
trickle/local_subscriber_test.go
Outdated
if i == 0 {
    require.Equal(5, int(n)) // first write - "hello"
} else {
    // second write latches on after first completes, but cancelled
More to understand this logic, but where do the second and third writes come from? I see only the hello write above.
Ah, good catch. The comment is wrong: there are only two POSTs (the single hello write plus the preconnect to precreate the next segment, which is never consummated) ... the term "write" is probably not the best one; they are POSTs but not necessarily content writes. Will fix.
An earlier version of this fix had three POSTs happening instead of two for this scenario, but I forgot to update the comments in the unit tests.
Improved the copy in 4184452; hopefully that's clearer.
victorges left a comment:
LGTM