Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compute: handle future epochs in channel adapter #32017

Merged
merged 1 commit into from
Mar 27, 2025

Conversation

teskje
Copy link
Contributor

@teskje teskje commented Mar 26, 2025

Previously the compute command channel adapter assumed that it is impossible to receive responses with epoch values greater than the channel adapter's current epoch. This assumption is wrong because the Timely cluster-side of workers learns about new connection epochs by observing commands from worker 0, which can be ahead of other workers' epochs.

This PR removes the assumption and instead makes the channel adapter handle responses with future epochs by stashing them until sufficiently many client reconnects have been observed.

Motivation

  • This PR fixes a recognized bug.

Fixes https://github.com/MaterializeInc/database-issues/issues/9124

Tips for reviewer

I've tried to write a regression test using mzcompose and toxiproxy, spawning a multi-process cluster and resetting its controller connection in a loop. Didn't manage to reproduce the race that leads to the "epoch from the future" panic though.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

Previously the compute command channel adapter assumed that it is
impossible to receive responses with epoch values greater than the
channel adapter's current epoch. This assumption is wrong because
the Timely cluster-side of workers learns about new connection epochs by
observing commands from worker 0, which can be ahead of other workers'
epochs.

This commit removes the assumption and instead makes the channel adapter
handle responses with future epochs by stashing them until sufficiently
many client reconnects have been observed.
@teskje teskje marked this pull request as ready for review March 26, 2025 14:01
@teskje teskje requested a review from a team as a code owner March 26, 2025 14:01
@teskje teskje requested a review from antiguru March 26, 2025 14:01
Comment on lines +773 to +793
// Serve this connection until we see any of the channels disconnect.
loop {
let (resp, resp_epoch) = match stashed_response.take() {
Some(stashed) => stashed,
None => match serve_rx_channels() {
Ok(response) => response,
Err(()) => break,
},
};

if resp_epoch < epoch {
// Response for a previous connection; discard it.
continue;
} else if resp_epoch > epoch {
// Response for a future connection; stash it and reconnect.
stashed_response = Some((resp, resp_epoch));
break;
} else {
// Response for the current connection; forward it.
if response_tx.send(resp).is_err() {
break;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would all be much nicer if we could have a crossbeam_channel::Receiver wrapper with a builtin stash that you could push_front to. Unfortunately crossbeam_channel::select doesn't support other types so I had to move the stash outside and ended up with this somewhat convoluted control flow.

Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

@teskje
Copy link
Contributor Author

teskje commented Mar 27, 2025

TFTR!

@teskje teskje merged commit f9ed0fc into MaterializeInc:main Mar 27, 2025
82 checks passed
@teskje teskje deleted the compute-future-epoch branch March 27, 2025 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants