[design doc] optimizer release engineering #30233
### Supporting qualification

A capture-and-replay mode for events that stores the events in the customer's own environment would allow for (a) customers to easily test and qualify changes, and (b) a push-button way for us to spin up unbilled clusters to evaluate these replays in read-only mode.
(a) would be nice, but it would be a significant amount of work to add this feature.
An alternative way for customers to qualify the new optimizer version is to simply do a blue-green deployment where they change only the optimizer version, but don't cut over to the new version for something like a week; if it works fine over that week, then cut over. I think this would be minimal extra work for us.
## Open questions

- Does **unstable**/**hot** get a version number, or no? (Cf. Debian sid)
One should be able to specify unstable/hot when specifying the optimizer version for a cluster, but probably by just saying something like `unstable` rather than a version number. Having a version number for it seems problematic:
- If we want to increment it at every change (e.g., when we merge a minor optimizer change we go from v5.41 to v5.42), then the problem is that the old version (v5.41) is not available anymore, so if somebody specified that by version number then they are out of luck.
- If the version number stays the same across minor optimizer changes on unstable, that's a bit weird, because version numbers are generally supposed to refer to immutable versions.
  - Except if we do something like `v6-unstable` as a version number, which is ok if it doesn't refer to an immutable version. But I don't really see any advantage of this over simply `unstable`, and the disadvantage is that we'd have to modify it on experimental clusters at every release. E.g., maybe our test environments will have various clusters that should always be on `unstable`, and it would be good to avoid having to move them from `v6-unstable` to `v7-unstable`. (This move could also be automatic, but that would be essentially the same as simply `unstable`.)
Agreed, the Debian sid model seems right to me too.
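For illustration, here is a minimal Rust sketch of the Debian-sid-style scheme, where `unstable` is a mutable label and only stable releases get version numbers. All names are hypothetical; nothing here corresponds to actual Materialize code.

```rust
// Hypothetical sketch: "unstable" tracks main and is mutable (like Debian sid),
// while stable optimizer releases get immutable version numbers.
#[derive(Debug, Clone, PartialEq, Eq)]
enum OptimizerSelector {
    /// Whatever is currently on main; its meaning changes with every merge.
    Unstable,
    /// An immutable, numbered release, e.g. v6.
    Stable(u32),
}

fn parse_selector(s: &str) -> Option<OptimizerSelector> {
    match s {
        "unstable" => Some(OptimizerSelector::Unstable),
        _ => s
            .strip_prefix('v')?
            .parse()
            .ok()
            .map(OptimizerSelector::Stable),
    }
}

fn main() {
    assert_eq!(parse_selector("unstable"), Some(OptimizerSelector::Unstable));
    assert_eq!(parse_selector("v6"), Some(OptimizerSelector::Stable(6)));
    // There is deliberately no "v6-unstable": test clusters just say "unstable"
    // and never need to be moved at release time.
    assert_eq!(parse_selector("v6-unstable"), None);
}
```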
Looks good to me! (But I guess some other people should also take a look.)
Currently, users run on "Materialize weekly", but:

1. That cannot be the case forever, as self-hosted users may not upgrade immediately.
2. That is not always good for users, who can experience regressions when optimizer changes don't suit their workloads.
I think the connection between the weekly release schedule and the problem, as stated, isn't quite clear. (We've talked about it—it's just not written here!) Is the problem that users ever experience regressions? Is the problem that users have no way to test new versions of the optimizer against their workload? Because in theory we could allow users to test new versions of the optimizer each week (using something like the hot/warm/cold solution you've laid out below). But of course then the next problem would almost surely be that users don't have the time/willingness to test a new version of the optimizer each week.
What I'm driving at is that I don't think we have a clear answer as to what amount of breakage and on what cadence is acceptable to users. I'm personally worried that releasing new versions of the optimizer every six months will shake out to still result in too much breakage too frequently for users to be happy, and it will be much harder to pinpoint and fix regressions in a query if you need to look across six months of code changes rather than one week of code changes.
As we've talked about, I'm not opposed to trying the slower release cadence, but I wanted to record that my intuition here is that we are underestimating how painful it will be to migrate users from one version of the optimizer to the next if new versions are released on a six month cadence.
it will be much harder to pinpoint and fix regressions in a query if you need to look across six months of code changes rather than one week of code changes.
Yes, if there is 1 optimizer change in a given week, then it's immediately clear which change is to blame for a plan regression. With 6 months of changes, we'd need to do some non-trivial investigation. But I'm not too worried about this, because we have a powerful tool for it: Alexander's optimizer trace tool usually allows us to figure out quickly (within 1-2 hours) what's going on when an optimization is not going the way we imagined it. (Also note that I often need to run the trace tool anyway even when I know which PR is to blame, because it's often not clear how exactly the PR is breaking things, due to complex interactions between transforms.)
I'm personally worried that releasing new versions of the optimizer every six months will shake out to still result in too much breakage too frequently for users to be happy
If we have several months between getting to know about a regression and forcing a user to the regressing version, that gives us time to fix regressions:
- Sometimes another optimizer change can create an improvement for the same user, offsetting the regression, so it doesn't matter.
- Sometimes a regression is not inherently hard to fix once we know about the specific regression; maybe the regression happened just because we didn't think of some silly edge case. So, sometimes we'll be able to come up with an easy optimizer code fix.
- Sometimes it's a lot of work to fix a regression, but it can also align with our general optimizer evolution plans, so we'll do it anyway.
- Sometimes we'll opt to not fix a regression. Even in this case, there are still various other options for addressing the regression:
- We might be able to manually rewrite the user's query to force a better plan. (This can involve chopping up the problematic part of the query to several indexed views, which are each small enough to not give much freedom to the optimizer to make bad decisions.)
- We can compensate the user with credits, so they can simply scale up.
- (We can introduce a feature flag tweaking some optimizer behavior for just that user. I hope to almost always avoid this.)
- How precisely do we select features out of **unstable**/**hot** to cut a new **testing**/**warm**? If we literally copy the crate three times in the repo, `git cherry-pick` will not be straightforward to use and it will be easy to lose history.
I'm not actually sure this works, but I think you might be able to depend on the MaterializeInc/materialize repository from the repository itself. So the main branch of the repo would depend on the warm optimizer crate:
`optimizer-warm = { git = "https://github.com/MaterializeInc/materialize.git", branch = "optimizer-warm" }`
and the cold optimizer crate:
`optimizer-cold = { git = "https://github.com/MaterializeInc/materialize.git", branch = "optimizer-cold" }`
That way things would still function as a monorepo on `main`, but bugfixes could be backported into `optimizer-warm` and `optimizer-cold` as necessary.
If Cargo barfs on the above, we can automate something with `git subtree`. On reflection, `git subtree` sounds a lot less painful than actually copying the code, I think, and will preserve git history.
I had forgotten about `git subtree`! This would be a good way to both keep history and let us work with multiple versions. (We should confirm in a test repo that the approach works before committing to it.)
- Under use-case isolation, can we, say, select an optimizer per `environmentd` instead of per cluster?
The plan for use case isolation is to push query parsing, planning, and optimization into `clusterd`. So selecting an optimizer per cluster already achieves exactly what you want here, I think. (`environmentd` would vanish, and be replaced with a lightweight `controlplaned` that users don't interact with directly and that handles no query parsing/planning/optimization itself.)
Love this. This plan will have to address the various operations that currently process queries (and so run the optimizer) in `environmentd` (mentioned on L79). But figuring these things out independently of the optimizer strikes me as a good idea anyway!
- There are several occasions where the optimizer is called independently of the cluster: views, prepared statements, and as part of `INSERT`, `UPDATE`, and `DELETE`. (Look at `sequence_read_then_write` for an example.) Should these be moved off? Use the active cluster? Automatically optimize with all three?
Using the active cluster to determine what optimizer version to use makes sense to me, and dovetails nicely with our use case isolation plans (see prior comment).
So, then would we add a "cluster" property to views (similar to what MVs already have)? E.g.,
`CREATE VIEW ... IN CLUSTER ... AS SELECT ...`
I guess this would be needed for knowing what optimizer to use at bootstrap time, where we don't really have an active cluster. (The value of this property could default to the active cluster at the time of creating the view, similarly to MVs.)
The `optimizer` crate will contain the definitions of HIR, MIR, and LIR, along with the HIR-to-MIR and MIR-to-LIR lowerings and the MIR-to-MIR transformations. These come from the `sql` (HIR, HIR -> MIR), `expr` (MIR), `transform` (MIR -> MIR), and `compute-types` (LIR, MIR -> LIR) crates.

The `optimizer-types` crate will have traits and definitions that are global across all three live versions of the optimizer. The AST types may change version to version, but the traits will be more stable.

These crates do _not_ include:
+ the SQL parser
+ (\*) SQL -> HIR in `sql`
+ `sequence_*` methods in adapter
+ (\*) LIR -> dataflow in `compute`

The two bullets marked (\*) above are a tricky point in the interface. SQL will _always_ lower to **unstable**/**hot** HIR; we will have methods to convert **unstable**/**hot** HIR to the other two versions. Similarly, LIR lowers to dataflow from **unstable**/**hot** LIR; we will have methods to convert the other two versions to the latest.
This all tracks for me.
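To make the (\*) interface a bit more concrete, here is a minimal, compilable Rust sketch of the two conversion edges. All module and function names are invented for illustration; they don't correspond to anything in the actual crates.

```rust
// Hypothetical sketch of the (*) interface: SQL always lowers to the hot HIR,
// and dataflow rendering only consumes the hot LIR, so each colder optimizer
// version needs conversions at exactly those two edges.

mod hot {
    pub struct Hir; // stand-in for the unstable/hot HIR
    pub struct Lir; // stand-in for the unstable/hot LIR
}

mod warm {
    pub struct Hir;
    pub struct Lir;
}

/// Hot HIR -> warm HIR. Fails if the plan uses a feature the warm optimizer
/// doesn't know about (see the feature-support discussion further down).
fn downgrade_hir(_hir: hot::Hir) -> Result<warm::Hir, String> {
    // A real implementation would walk the tree and reject unknown nodes.
    Ok(warm::Hir)
}

/// Warm LIR -> hot LIR, so that rendering only ever has to understand the
/// latest LIR.
fn upgrade_lir(_lir: warm::Lir) -> hot::Lir {
    hot::Lir
}

fn main() {
    // Adapter plans SQL into hot HIR; a cluster on the warm optimizer then
    // downgrades it, optimizes it, and upgrades the resulting LIR for rendering.
    let _warm_hir = downgrade_hir(hot::Hir).expect("plan only uses warm-supported features");
    let _hot_lir = upgrade_lir(warm::Lir);
}
```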
## Alternatives

The thorniest thing here is the release cutting and copied code. Would it be better to split off the optimizer entirely into a separate repository? A separate _process_?
Left a thought about this under the second bullet point below.
Optimizer engineers will be able to develop features with confidence, namely:

- introducing new transforms
- updating existing transforms
- targeting new dataflow operators
- changing AST types for HIR, MIR, or LIR
A missing link for me is what we actually do to qualify a new release of the optimizer. In the long term it's clearer to me: 1) we have a robust suite of synthetic benchmarks that can be run on demand and automatically in CI, plus 2) a Snowtrail-like system that runs a release candidate against many (all?) real customer workloads in production and automatically compares the performance to the last release of the optimizer, plus 3) making the release available to users before forcing them onto it and allowing them to run their own qualification of the new optimizer version.
For me, these are the three things that would make it possible to gain confidence in an optimizer change. (2) seems the most powerful. (1) would be useful, but it depends on how representative the benchmarks we write are, and we'll need to be regularly adding new benchmarks based on our customers' workloads. And (3) is only useful to the extent that we can rely on users to actually run this qualification and report the results back.
I'm in particular nervous about relying too heavily on (3) because it depends on users being proactive, which empirically they are not, and even if they are proactive, it's a long feedback cycle. Users that only qualify warm releases can only give feedback once every six months, and that feedback hits only after the warm release has already been cut.
The solution proposed below (i.e., changing the release schedule) makes (3) possible, and perhaps makes (2) easier, because running the hypothetical Materializetrail might be expensive enough that we wouldn't want to do it every week, but only when qualifying a release. But it doesn't seem to impact (1)—doing (1) with weekly releases seems exactly as hard as doing (1) with twice-yearly releases.
I agree that (2) is the best possible thing, and it seems unavoidable. I just can't imagine this company without having that at some point in the next 1-2 years. Do you maybe have a ballpark estimate for how much effort it would be to develop (2)? I guess it would be an evolution of the 0-dt machinery, so that it could be run outside an actual release rollout, right?
We'll probably also do some amount of (1), because running synthetic benchmarks is more lightweight than either (2) or (3), so it will be great to run it on PRs. But I can't really imagine (1) ever being comprehensive enough to catch ~all user regressions. As we'll have more and more users, they'll just keep running into new edge cases.
(3) is ok only as a temporary measure before we develop (2). It would allow us to at least partially shift some blame to users for outages caused by "sudden" regressions, when they had several months to test a new version but didn't do it. But this is far from ideal.
(Btw. if we have (2), then we could also consider having only two parallel optimizer versions instead of three: An unstable, and just one stable (instead of two stables). This is because we could catch ~all regressions before promoting an unstable to stable. Still, the drawback of having only two parallel versions would be that sometimes the unstable-to-stable promotion would be delayed by possibly many weeks due to addressing some regression, in which time all the other users for whom there is no regression wouldn't be able to get the benefits of the new version. Another drawback of having just two versions would be that normal optimizer work would need to be put on hold during that period when we are fixing regressions just before an unstable-to-stable promotion, to avoid introducing more regressions.)
The core issue---the thing that prompted discussions of release engineering---is that we're not confident qualifying optimizer releases. There's planned work to improve the situation there, and it's possible that's all we need.
I guess it would be an evolution of the 0-dt machinery, so that it could be run outside an actual release rollout, right?
Yeah, exactly—much of the infrastructure already exists thanks to 0dt. @teskje has a "shadow replica" PR from a while back that does a good job demonstrating the approach. Honestly he's probably better positioned to estimate how much work it would be to dust that PR off and productionize it.
Here is the shadow replica PR: #21146. Though the use case for that is a bit different, I think: it allows testing of new `clusterd` versions within existing environments. So it can test changes to rendering, but not the optimizer. At least in its current state, it seems feasible to make the controller send different plans to the shadow replicas too, but this is a whole different can of worms!
The way I imagined an "Mztrail" would work is that we use the 0dt mechanism to spawn a second read-only environment (a shadow environment!) with a different Mz version and/or different feature flags and run our tests there. This way we could test the whole product (except for sources?), not just the cluster side, and afaict it doesn't require any special infrastructure we don't already get from 0dt.
I'm okay with this, certainly to start!
Another thing that would help qualification in certain cases is to smarten the upgrade-check tool to make it list all customer plan changes. I don't really know the details of how the upgrade-check tool works, but I imagine this to be a relatively easy thing (compared to, e.g., full mztrail). How much effort do you think this would be, @benesch?
Relatively easier than a full mztrail, for sure! But still a nontrivial amount of work. I don't think we store the old plans anywhere, so the upgrade check tool would need to somehow extract the plans from the old version in order to be able to compare them to the plans in the new version.
Could we run `mzexplore` to export catalogs before and after a 0dt release?
For sure, but to what end? If the goal is to have a human review the plans during release qualification, you're better off hooking into the upgrade checker workflow.
I should not have said 0dt release, you're right---yes, upgrade checker sounds like the right thing.
1. That cannot be the case forever, as self-hosted users may not upgrade immediately.
2. That is not always good for users, who can experience regressions when optimizer changes don't suit their workloads.
3. That is not always good for people working on the optimizer, who must size their work and gate it with feature flags to accommodate the weekly release schedule.
This is the point that gives me the most heartburn. In every context I've ever worked in, faster releases and continuous integration have been an almost unequivocal force for good. Fast releases do mean figuring out how to split work into small changesets and juggling feature flags, which, even with good discipline for removing old feature flags, results in some unavoidable additional code complexity—you always have at least a handful of feature flags live at any time. But in my experience that complexity has always been outweighed by 1) the time savings of not having long-running feature branches that have painful merge conflicts, 2) avoiding painful backwards compatibility requirements that require carefully staging new features across releases of components and testing across multiple versions, and 3) the fast iteration loop with real customers that comes from shipping early and often (synthetic benchmarks and tests won't surface what you didn't know you needed to measure/test).
It's possible to manage down the backwards compatibility pain (2) by choosing the right API boundaries, and I think you/we did a good job identifying those boundaries. But there is definitely going to be some overhead of adding new features now, where you have to think hard about the order in which you add support, and the cross product of combinations of planners/optimizers/renderers that can exist in the wild.
It is certainly possible that something about query optimizers makes them fundamentally ill-suited to the agile-style approach to releases we take. But I haven't myself been convinced that the weekly releases need to be much more painful for the optimizer than they are for other components of Materialize.
That said, it's been a long time since I've worked on the optimizer, so I ultimately defer to y'alls lived experience and intuition here. If possible, though, it'd be great to chat with folks who have worked on the optimizer at another continually deployed database (Snowflake comes to mind) and see if we can draw lessons from their experience. @antiguru mentioned he might have a connection on that front.
Yeah, if we could ask someone at some other similar company, that would be great.
The optimizer is crosscutting enough that qualifying releases is going to be complex no matter what---doing it weekly seems tough! Hearing how others are doing it would be interesting.
At a meta level, we arrived at this conversation because we're having trouble with the current weekly approach---qualification for us is frequently taking longer than a week!
The optimizer is crosscutting enough that qualifying releases is going to be complex no matter what---doing it weekly seems tough! Hearing how others are doing it would be interesting.
Yeah, I think for weekly qualification to be sustainable, you'd need a test suite that is both 1) automated and 2) yields extremely high confidence that the optimizer will not regress any existing customer workloads. And while we do have an automated test suite today (1), it definitely does not yield high confidence (2).
At a meta level, we arrived at this conversation because we're having trouble with the current weekly approach---qualification for us is frequently taking longer than a week!
Yeah, I guess the gap in my understanding is that we are able to do some qualification today outside of the weekly release cycle. And I imagine this is all facilitated by feature flags and unbilled replicas? How does that qualification process generalize to a world where we only ship the optimizer once every six months? Would we periodically turn on unbilled replicas for customer workloads using the `hot` version of the optimizer and slowly whittle down any regressions over the course of the development period?
Yeah, I guess the gap in my understanding is that we are able to do some qualification today outside of the weekly release cycle. And I imagine this is all facilitated by feature flags and unbilled replicas? How does that qualification process generalize to a world where we only ship the optimizer once every six months? Would we periodically turn on unbilled replicas for customer workloads using the hot version of the optimizer and slowly whittle down any regressions over the course of the development period?
Something like that, yes! This is why the customer-local record-and-replay we talked about feels useful here.
But also: in a six-month cycle, we spend more time crafting the workloads that yield high confidence. We need to do this anyway for the self-hosted release, since we won't be able to test on self-hosted customer workloads.
But also: in a six-month cycle, we spend more time crafting the workloads that yield high confidence.
Right, but once we've gone through that process once, it seems like we'd be able to re-run those workloads much more often than once every six months. Maybe not every week (unless they become fully automated), but every 6 weeks, or every month.
Hmm, yes, self-hosted might be the biggest reason why we can't simply rely on testing in shadow environments.
- Which fixes will we backport? Correctness only? When do we backport a fix and when do we backport a "problem detector" with a notice?
Do we need to consider new features in addition to bugfix backports? If we introduce some new SQL-level feature that requires optimizer support (like WMR), all three versions of the optimizer need to support that, right? Or is the idea that adapter will take the optimizer version into account when planning a user query and reject the query if the optimizer doesn't support it?
Or is the idea that adapter will take the optimizer version into account when planning a user query and reject the query if the optimizer doesn't support it?
This, yeah, although it won't be the adapter that checks the version. Here's my understanding. Adapter (i.e., the `sql`/`adapter` crates) always produces an HIR plan for the latest version of the optimizer. Each optimizer supports converting its HIR plan to the previous version's HIR plan. This conversion can fail if the plan uses features that are not available in the previous optimizer. E.g., if there's a new HIR node in hot that's not in warm and the user has requested the warm optimizer, the query will be rejected when trying to convert the hot plan to a warm plan.
There may occasionally be moments where implementing the version check in the HIR -> HIR conversion inside the optimizer is too painful, and we need to hoist the version check into the adapter, but ideally we can avoid that whenever possible.
Cool, that makes sense to me. We'll want to disallow changing the optimizer of an existing cluster, or at least disallow changing it into the colder direction. Otherwise we could end up with objects installed on a cluster that cannot be handled by that cluster's optimizer anymore during, e.g., a version upgrade.
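For concreteness, here is a small, hypothetical Rust sketch of the fallible hot-to-warm HIR conversion described above. None of these types exist in the codebase; WMR is used as the example feature only because it was mentioned above.

```rust
// Hypothetical sketch: the hot HIR has a node the warm HIR lacks, so the
// conversion fails and the error is surfaced to the user instead of sending
// an unsupported plan to a cluster pinned to the warm optimizer.

#[derive(Debug)]
enum HotHir {
    Select,                             // supported by every live optimizer version
    WithMutuallyRecursive(Box<HotHir>), // only the hot optimizer understands this
}

#[derive(Debug)]
enum WarmHir {
    Select,
}

#[derive(Debug)]
struct UnsupportedFeature(&'static str);

fn hot_to_warm(hir: &HotHir) -> Result<WarmHir, UnsupportedFeature> {
    match hir {
        HotHir::Select => Ok(WarmHir::Select),
        HotHir::WithMutuallyRecursive(_) => Err(UnsupportedFeature(
            "WITH MUTUALLY RECURSIVE requires the hot optimizer",
        )),
    }
}

fn main() {
    assert!(hot_to_warm(&HotHir::Select).is_ok());
    let wmr = HotHir::WithMutuallyRecursive(Box::new(HotHir::Select));
    // This error would become the "query not supported by this optimizer
    // version" message that the user sees.
    assert!(hot_to_warm(&wmr).is_err());
}
```

Disallowing a cluster from moving to a colder optimizer, as suggested above, would keep already-installed objects from ever hitting this error path after they have been created.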