compute: handle future epochs in channel adapter #32017
Conversation
Previously the compute command channel adapter assumed that it was impossible to receive responses with epoch values greater than the channel adapter's current epoch. This assumption is wrong because the cluster-side Timely workers learn about new connection epochs by observing commands from worker 0, whose epoch can be ahead of the other workers' epochs. This commit removes the assumption and instead makes the channel adapter handle responses with future epochs by stashing them until sufficiently many client reconnects have been observed.
// Serve this connection until we see any of the channels disconnect.
loop {
    let (resp, resp_epoch) = match stashed_response.take() {
        Some(stashed) => stashed,
        None => match serve_rx_channels() {
            Ok(response) => response,
            Err(()) => break,
        },
    };

    if resp_epoch < epoch {
        // Response for a previous connection; discard it.
        continue;
    } else if resp_epoch > epoch {
        // Response for a future connection; stash it and reconnect.
        stashed_response = Some((resp, resp_epoch));
        break;
    } else {
        // Response for the current connection; forward it.
        if response_tx.send(resp).is_err() {
            break;
        }
    }
}
This would all be much nicer if we could have a crossbeam_channel::Receiver wrapper with a builtin stash that you could push_front to. Unfortunately crossbeam_channel::select doesn't support receiver types other than its own, so I had to move the stash outside and ended up with this somewhat convoluted control flow.
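For illustration, the kind of wrapper the comment wishes for might look like the sketch below. The StashingReceiver name and API are hypothetical, not an existing crossbeam_channel type. The catch is that crossbeam_channel::select! only accepts plain Receivers as its recv operands, so a wrapper like this could not participate in the select loop, which is why the stash ended up living alongside the loop instead.

```rust
use std::collections::VecDeque;

use crossbeam_channel::{Receiver, RecvError};

/// Hypothetical receiver-with-stash wrapper (not part of crossbeam_channel).
struct StashingReceiver<T> {
    stash: VecDeque<T>,
    rx: Receiver<T>,
}

impl<T> StashingReceiver<T> {
    fn new(rx: Receiver<T>) -> Self {
        Self { stash: VecDeque::new(), rx }
    }

    /// Put a message back so that the next `recv` returns it first.
    fn push_front(&mut self, msg: T) {
        self.stash.push_front(msg);
    }

    /// Drain the stash before blocking on the underlying channel.
    fn recv(&mut self) -> Result<T, RecvError> {
        match self.stash.pop_front() {
            Some(msg) => Ok(msg),
            None => self.rx.recv(),
        }
    }
}
```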
Looks good, thank you!
TFTR!
Previously the compute command channel adapter assumed that it was impossible to receive responses with epoch values greater than the channel adapter's current epoch. This assumption is wrong because the cluster-side Timely workers learn about new connection epochs by observing commands from worker 0, whose epoch can be ahead of the other workers' epochs.
This PR removes the assumption and instead makes the channel adapter handle responses with future epochs by stashing them until sufficiently many client reconnects have been observed.
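For illustration, the rule the adapter now applies to each response reduces to a three-way comparison between the response's epoch and the adapter's current connection epoch. The sketch below is illustrative only; the epoch is shown as a plain u64 and the names are placeholders, not the actual Materialize types:

```rust
use std::cmp::Ordering;

/// What to do with a response, based on its epoch (illustrative names).
enum ResponseFate {
    /// From a previous connection: discard.
    Discard,
    /// For the current connection: forward to the client.
    Forward,
    /// From a connection we have not caught up to yet (e.g. worker 0 is
    /// ahead): stash it and re-examine it after the next reconnect(s).
    Stash,
}

fn classify(resp_epoch: u64, current_epoch: u64) -> ResponseFate {
    match resp_epoch.cmp(&current_epoch) {
        Ordering::Less => ResponseFate::Discard,
        Ordering::Equal => ResponseFate::Forward,
        Ordering::Greater => ResponseFate::Stash,
    }
}
```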
Motivation
Fixes https://github.com/MaterializeInc/database-issues/issues/9124
Tips for reviewer
I've tried to write a regression test using mzcompose and toxiproxy, spawning a multi-process cluster and resetting its controller connection in a loop, but I didn't manage to reproduce the race that leads to the "epoch from the future" panic.
Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.