[design doc] optimizer release engineering #30233
### Supporting qualification

A capture-and-replay mode for events that stores the events in the customer's own environment would allow for (a) customers to easily test and qualify changes, and (b) a push-button way for us to spin up unbilled clusters to evaluate these replays in read-only mode.
(a) would be nice, but it would be a significant amount of work to add this feature.
An alternative way for customers to qualify the new optimizer version is to simply do a blue-green deployment where they change only the optimizer version, but don't cut over to the new version for something like a week; if it works fine over that week, then cut over. I think this would be minimal extra work for us.
## Open questions

- Does **unstable**/**hot** get a version number, or no? (Cf. Debian sid)
One should be able to specify unstable/hot when specifying the optimizer version for a cluster, but probably by just saying something like `unstable` rather than a version number. Having a version number for it seems problematic:
- If we want to increment it at every change (e.g., when we merge a minor optimizer change we go from v5.41 to v5.42), then the problem is that the old version (v5.41) is not available anymore, so if somebody specified that by version number then they are out of luck.
- If the version number stays the same across minor optimizer changes on unstable, that's a bit weird, because version numbers are generally supposed to refer to immutable versions.
  - Except if we do something like `v6-unstable` as a version number, which is ok if it doesn't refer to an immutable version. But I don't really see any advantage of this over simply `unstable`, and the disadvantage is that we'd have to modify it on experimental clusters at every release. E.g., maybe our test environments will have various clusters that should always be on `unstable`, and it would be good to avoid having to move them from `v6-unstable` to `v7-unstable`. (This move could also be automatic, but that would be essentially the same as simply `unstable`.)
Agreed, the Debian sid model seems right to me too.
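For illustration, here is a minimal Rust sketch of the Debian-sid-style scheme, where `unstable` is a mutable label and only stable releases get version numbers. All names are hypothetical; nothing here corresponds to actual Materialize code.

```rust
// Hypothetical sketch: "unstable" tracks main and is mutable (like Debian sid),
// while stable optimizer releases get immutable version numbers.
#[derive(Debug, Clone, PartialEq, Eq)]
enum OptimizerSelector {
    /// Whatever is currently on main; its meaning changes with every merge.
    Unstable,
    /// An immutable, numbered release, e.g. v6.
    Stable(u32),
}

fn parse_selector(s: &str) -> Option<OptimizerSelector> {
    match s {
        "unstable" => Some(OptimizerSelector::Unstable),
        _ => s
            .strip_prefix('v')?
            .parse()
            .ok()
            .map(OptimizerSelector::Stable),
    }
}

fn main() {
    assert_eq!(parse_selector("unstable"), Some(OptimizerSelector::Unstable));
    assert_eq!(parse_selector("v6"), Some(OptimizerSelector::Stable(6)));
    // There is deliberately no "v6-unstable": test clusters just say "unstable"
    // and never need to be moved at release time.
    assert_eq!(parse_selector("v6-unstable"), None);
}
```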
Looks good to me! (But I guess some other people should also take a look.)
Currently, users run on "Materialize weekly", but:

1. That cannot be the case forever, as self-hosted users may not upgrade immediately.
2. That is not always good for users, who can experience regressions when optimizer changes don't suit their workloads.
I think the connection between the weekly release schedule and the problem, as stated, isn't quite clear. (We've talked about it—it's just not written here!) Is the problem that users ever experience regressions? Is the problem that users have no way to test new versions of the optimizer against their workload? Because in theory we could allow users to test new versions of the optimizer each week (using something like the hot/warm/cold solution you've laid out below). But of course then the next problem would almost surely be that users don't have the time/willingness to test a new version of the optimizer each week.
What I'm driving at is that I don't think we have a clear answer as to what amount of breakage and on what cadence is acceptable to users. I'm personally worried that releasing new versions of the optimizer every six months will shake out to still result in too much breakage too frequently for users to be happy, and it will be much harder to pinpoint and fix regressions in a query if you need to look across six months of code changes rather than one week of code changes.
As we've talked about, I'm not opposed to trying the slower release cadence, but I wanted to record that my intuition here is that we are underestimating how painful it will be to migrate users from one version of the optimizer to the next if new versions are released on a six month cadence.
it will be much harder to pinpoint and fix regressions in a query if you need to look across six months of code changes rather than one week of code changes.
Yes, if there is 1 optimizer change in a given week, then it's immediately clear which change is to blame for a plan regression. With 6 months of changes, we'd need to do some non-trivial investigation. But I'm not too worried about this, because we have a powerful tool for it: Alexander's optimizer trace tool usually allows us to figure out quickly (within 1-2 hours) what's going on when an optimization is not going the way we imagined it. (Also note that I often need to run the trace tool anyway even when I know which PR is to blame, because it's often not clear how exactly the PR is breaking things, due to complex interactions between transforms.)
I'm personally worried that releasing new versions of the optimizer every six months will shake out to still result in too much breakage too frequently for users to be happy
If we have several months between getting to know about a regression and forcing a user to the regressing version, that gives us time to fix regressions:
- Sometimes another optimizer change can create an improvement for the same user, offsetting the regression, so it doesn't matter.
- Sometimes a regression is not inherently hard to fix once we know about the specific regression; maybe the regression happened just because we didn't think of some silly edge case. So, sometimes we'll be able to come up with an easy optimizer code fix.
- Sometimes it's a lot of work to fix a regression, but it can also align with our general optimizer evolution plans, so we'll do it anyway.
- Sometimes we'll opt to not fix a regression. Even in this case, there are still various other options for addressing the regression:
- We might be able to manually rewrite the user's query to force a better plan. (This can involve chopping up the problematic part of the query to several indexed views, which are each small enough to not give much freedom to the optimizer to make bad decisions.)
- We can compensate the user with credits, so they can simply scale up.
- (We can introduce a feature flag tweaking some optimizer behavior for just that user. I hope to almost always avoid this.)
- How precisely do we select features out of **unstable**/**hot** to cut a new **testing**/**warm**? If we literally copy the crate three times in the repo, `git cherry-pick` will not be straightforward to use and it will be easy to lose history.
I'm not actually sure this works, but I think you might be able to depend on the MaterializeInc/materialize repository from the repository itself. So the main branch of the repo would depend on the warm optimizer crate:
`optimizer-warm = { git = "https://github.com/MaterializeInc/materialize.git", branch = "optimizer-warm" }`
and the cold optimizer crate:
`optimizer-cold = { git = "https://github.com/MaterializeInc/materialize.git", branch = "optimizer-cold" }`
That way things would still function as a monorepo on `main`, but bugfixes could be backported into `optimizer-warm` and `optimizer-cold` as necessary.
If Cargo barfs on the above, we can automate something with `git subtree`. On reflection, `git subtree` sounds a lot less painful than actually copying the code, I think, and will preserve git history.
I had forgotten about `git subtree`! This would be a good way to both keep history and let us work with multiple versions. (We should confirm in a test repo that the approach works before committing to it.)
- Under use-case isolation, can we, say, select an optimizer per `environmentd` instead of per cluster?
The plan for use case isolation is to push query parsing, planning, and optimization into `clusterd`. So selecting an optimizer per cluster already achieves exactly what you want here, I think. (`environmentd` would vanish, and be replaced with a lightweight `controlplaned` that users don't interact with directly and that handles no query parsing/planning/optimization itself.)
Love this. This plan will have to address the various operations that currently process queries (and so run the optimizer) in `environmentd` (mentioned on L79). But figuring these things out independently of the optimizer strikes me as a good idea anyway!
- There are several occasions where the optimizer is called independently of the cluster: views, prepared statements, and as part of `INSERT`, `UPDATE`, and `DELETE`. (Look at `sequence_read_then_write` for an example.) Should these be moved off? Use the active cluster? Automatically optimize with all three?
Using the active cluster to determine what optimizer version to use makes sense to me, and dovetails nicely with our use case isolation plans (see prior comment).
So, then would we add a "cluster" property to views (similar to what MVs already have)? E.g.,
`CREATE VIEW ... IN CLUSTER ... AS SELECT ...`
I guess this would be needed for knowing what optimizer to use at bootstrap time, where we don't really have an active cluster. (The value of this property could default to the active cluster at the time of creating the view, similarly to MVs.)
The `optimizer` crate will contain the definitions of HIR, MIR, and LIR, along with the HIR-to-MIR and MIR-to-LIR lowerings and the MIR-to-MIR transformations. These come from the `sql` (HIR, HIR -> MIR), `expr` (MIR), `transform` (MIR -> MIR), and `compute-types` (LIR, MIR -> LIR) crates.

The `optimizer-types` crate will have traits and definitions that are global across all three live versions of the optimizer. The AST types may change version to version, but the traits will be more stable.

These crates do _not_ include:
+ the SQL parser
+ (\*) SQL -> HIR in `sql`
+ `sequence_*` methods in adapter
+ (\*) LIR -> dataflow in `compute`

The two bullets marked (\*) above are a tricky point in the interface. SQL will _always_ lower to **unstable**/**hot** HIR; we will have methods to convert **unstable**/**hot** HIR to the other two versions. Similarly, LIR lowers to dataflow from **unstable**/**hot** LIR; we will have methods to convert the other two versions to the latest.
This all tracks for me.
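To make the (\*) interface a bit more concrete, here is a minimal, compilable Rust sketch of the two conversion edges. All module and function names are invented for illustration; they don't correspond to anything in the actual crates.

```rust
// Hypothetical sketch of the (*) interface: SQL always lowers to the hot HIR,
// and dataflow rendering only consumes the hot LIR, so each colder optimizer
// version needs conversions at exactly those two edges.

mod hot {
    pub struct Hir; // stand-in for the unstable/hot HIR
    pub struct Lir; // stand-in for the unstable/hot LIR
}

mod warm {
    pub struct Hir;
    pub struct Lir;
}

/// Hot HIR -> warm HIR. Fails if the plan uses a feature the warm optimizer
/// doesn't know about (see the feature-support discussion further down).
fn downgrade_hir(_hir: hot::Hir) -> Result<warm::Hir, String> {
    // A real implementation would walk the tree and reject unknown nodes.
    Ok(warm::Hir)
}

/// Warm LIR -> hot LIR, so that rendering only ever has to understand the
/// latest LIR.
fn upgrade_lir(_lir: warm::Lir) -> hot::Lir {
    hot::Lir
}

fn main() {
    // Adapter plans SQL into hot HIR; a cluster on the warm optimizer then
    // downgrades it, optimizes it, and upgrades the resulting LIR for rendering.
    let _warm_hir = downgrade_hir(hot::Hir).expect("plan only uses warm-supported features");
    let _hot_lir = upgrade_lir(warm::Lir);
}
```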
## Alternatives

The thorniest thing here is the release cutting and copied code. Would it be better to split off the optimizer entirely into a separate repository? A separate _process_?
Left a thought about this under the second bullet point below.
Optimizer engineers will be able to develop features with confidence, namely:

- introducing new transforms
- updating existing transforms
- targeting new dataflow operators
- changing AST types for HIR, MIR, or LIR
A missing link for me is what we actually do to qualify a new release of the optimizer. In the long term it's clearer to me: 1) we have a robust suite of synthetic benchmarks that can be run on demand and automatically in CI, plus 2) a Snowtrail-like system that runs a release candidate against many (all?) real customer workloads in production and automatically compares the performance to the last release of the optimizer, plus 3) making the release available to users before forcing them onto it and allowing them to run their own qualification of the new optimizer version.
For me, these are the three things that would make it possible to gain confidence in an optimizer change. (2) seems the most powerful. (1) would be useful, but it depends on how representative the benchmarks we write are, and we'll need to be regularly adding new benchmarks based on our customers' workloads. And (3) is only useful to the extent that we can rely on users to actually run this qualification and report the results back.
I'm in particular nervous about relying too heavily on (3) because it depends on users being proactive, which empirically they are not, and even if they are proactive, it's a long feedback cycle. Users that only qualify warm releases can only give feedback once every six months, and that feedback hits only after the warm release has already been cut.
The solution proposed below (i.e., changing the release schedule) makes (3) possible, and perhaps makes (2) easier, because running the hypothetical Materializetrail might be expensive enough that we wouldn't want to do it every week, but only when qualifying a release. But it doesn't seem to impact (1)—doing (1) with weekly releases seems exactly as hard as doing (1) with twice-yearly releases.
I agree that (2) is the best possible thing, and it seems unavoidable. I just can't imagine this company without having that at some point in the next 1-2 years. Do you maybe have a ballpark estimate for how much effort it would be to develop (2)? I guess it would be an evolution of the 0-dt machinery, so that it could be run outside an actual release rollout, right?
We'll probably also do some amount of (1), because running synthetic benchmarks is more lightweight than either (2) or (3), so it will be great to run it on PRs. But I can't really imagine (1) ever being comprehensive enough to catch ~all user regressions. As we'll have more and more users, they'll just keep running into new edge cases.
(3) is ok only as a temporary measure before we develop (2). It would allow us to at least partially shift some blame to users for outages caused by "sudden" regressions, when they had several months to test a new version but didn't do it. But this is far from ideal.
(Btw. if we have (2), then we could also consider having only two parallel optimizer versions instead of three: An unstable, and just one stable (instead of two stables). This is because we could catch ~all regressions before promoting an unstable to stable. Still, the drawback of having only two parallel versions would be that sometimes the unstable-to-stable promotion would be delayed by possibly many weeks due to addressing some regression, in which time all the other users for whom there is no regression wouldn't be able to get the benefits of the new version. Another drawback of having just two versions would be that normal optimizer work would need to be put on hold during that period when we are fixing regressions just before an unstable-to-stable promotion, to avoid introducing more regressions.)
The core issue---the thing that prompted discussions of release engineering---is that we're not confident qualifying optimizer releases. There's planned work to improve the situation there, and it's possible that's all we need.
I guess it would be an evolution of the 0-dt machinery, so that it could be run outside an actual release rollout, right?
Yeah, exactly—much of the infrastructure already exists thanks to 0dt. @teskje has a "shadow replica" PR from a while back that does a good job demonstrating the approach. Honestly he's probably better positioned to estimate how much work it would be to dust that PR off and productionize it.
Here is the shadow replica PR: #21146. Though the use case for that is a bit different, I think: it allows testing of new `clusterd` versions within existing environments. So it can test changes to rendering, but not the optimizer. At least in its current state, it seems feasible to make the controller send different plans to the shadow replicas too, but this is a whole different can of worms!
The way I imagined an "Mztrail" would work is that we use the 0dt mechanism to spawn a second read-only environment (a shadow environment!) with a different Mz version and/or different feature flags and run our tests there. This way we could test the whole product (except for sources?), not just the cluster side, and afaict it doesn't require any special infrastructure we don't already get from 0dt.
I'm okay with this, certainly to start!
Another thing that would help qualification in certain cases is to smarten the upgrade-check tool to make it list all customer plan changes. I don't really know the details of how the upgrade-check tool works, but I imagine this to be a relatively easy thing (compared to, e.g., full mztrail). How much effort do you think this would be, @benesch?
Relatively easier than a full mztrail, for sure! But still a nontrivial amount of work. I don't think we store the old plans anywhere, so the upgrade check tool would need to somehow extract the plans from the old version in order to be able to compare them to the plans in the new version.
Could we run `mzexplore` to export catalogs before and after a 0dt release?
For sure, but to what end? If the goal is to have a human review the plans during release qualification, you're better off hooking into the upgrade checker workflow.
I should not have said 0dt release, you're right---yes, upgrade checker sounds like the right thing.
1. That cannot be the case forever, as self-hosted users may not upgrade immediately.
2. That is not always good for users, who can experience regressions when optimizer changes don't suit their workloads.
3. That is not always good for people working on the optimizer, who must size their work and gate it with feature flags to accommodate the weekly release schedule.
This is the point that gives me the most heartburn. In every context I've ever worked in, faster releases and continuous integration have been an almost unequivocal force for good. Fast releases do mean figuring out how to split work into small changesets and juggling feature flags, which, even with good discipline for removing old feature flags, results in some unavoidable additional code complexity—you always have at least a handful of feature flags live at any time. But in my experience that complexity has always been outweighed by 1) the time savings of not having long-running feature branches that have painful merge conflicts, 2) avoiding painful backwards compatibility requirements that require carefully staging new features across releases of components and testing across multiple versions, and 3) the fast iteration loop with real customers that comes from shipping early and often (synthetic benchmarks and tests won't surface what you didn't know you needed to measure/test).
It's possible to manage down the backwards compatibility pain (2) by choosing the right API boundaries, and I think you/we did a good job identifying those boundaries. But there is definitely going to be some overhead of adding new features now, where you have to think hard about the order in which you add support, and the cross product of combinations of planners/optimizers/renderers that can exist in the wild.
It is certainly possible that something about query optimizers makes them fundamentally ill-suited to the agile-style approach to releases we take. But I haven't myself been convinced that the weekly releases need to be much more painful for the optimizer than they are for other components of Materialize.
That said, it's been a long time since I've worked on the optimizer, so I ultimately defer to y'alls lived experience and intuition here. If possible, though, it'd be great to chat with folks who have worked on the optimizer at another continually deployed database (Snowflake comes to mind) and see if we can draw lessons from their experience. @antiguru mentioned he might have a connection on that front.
Yeah, if we could ask someone at some other similar company, that would be great.
The optimizer is crosscutting enough that qualifying releases is going to be complex no matter what---doing it weekly seems tough! Hearing how others are doing it would be interesting.
At a meta level, we arrived at this conversation because we're having trouble with the current weekly approach---qualification for us is frequently taking longer than a week!
The optimizer is crosscutting enough that qualifying releases is going to be complex no matter what---doing it weekly seems tough! Hearing how others are doing it would be interesting.
Yeah, I think for weekly qualification to be sustainable, you'd need a test suite that is both 1) automated and 2) yields extremely high confidence that the optimizer will not regress any existing customer workloads. And while we do have an automated test suite today (1), it definitely does not yield high confidence (2).
At a meta level, we arrived at this conversation because we're having trouble with the current weekly approach---qualification for us is frequently taking longer than a week!
Yeah, I guess the gap in my understanding is that we are able to do some qualification today outside of the weekly release cycle. And I imagine this is all facilitated by feature flags and unbilled replicas? How does that qualification process generalize to a world where we only ship the optimizer once every six months? Would we periodically turn on unbilled replicas for customer workloads using the `hot` version of the optimizer and slowly whittle down any regressions over the course of the development period?
Yeah, I guess the gap in my understanding is that we are able to do some qualification today outside of the weekly release cycle. And I imagine this is all facilitated by feature flags and unbilled replicas? How does that qualification process generalize to a world where we only ship the optimizer once every six months? Would we periodically turn on unbilled replicas for customer workloads using the hot version of the optimizer and slowly whittle down any regressions over the course of the development period?
Something like that, yes! This is why the customer-local record-and-replay we talked about feels useful here.
But also: in a six-month cycle, we spend more time crafting the workloads that yield high confidence. We need to do this anyway for the self-hosted release, since we won't be able to test on self-hosted customer workloads.
But also: in a six-month cycle, we spend more time crafting the workloads that yield high confidence.
Right, but once we've gone through that process once, it seems like we'd be able to re-run those workloads much more often than once every six months. Maybe not every week (unless they become fully automated), but every 6 weeks, or every month.
Hmm, yes, self-hosted might be the biggest reason why we can't simply rely on testing in shadow environments.
- Which fixes will we backport? Correctness only? When do we backport a fix and when do we backport a "problem detector" with a notice?
Do we need to consider new features in addition to bugfix backports? If we introduce some new SQL-level feature that requires optimizer support (like WMR), all three versions of the optimizer need to support that, right? Or is the idea that adapter will take the optimizer version into account when planning a user query and reject the query if the optimizer doesn't support it?
Or is the idea that adapter will take the optimizer version into account when planning a user query and reject the query if the optimizer doesn't support it?
This, yeah, although it won't be the adapter that checks the version. Here's my understanding. Adapter (i.e., the `sql`/`adapter` crates) always produces an HIR plan for the latest version of the optimizer. Each optimizer supports converting its HIR plan to the previous version's HIR plan. This conversion can fail if the plan uses features that are not available in the previous optimizer. E.g., if there's a new HIR node in hot that's not in warm and the user has requested the warm optimizer, the query will be rejected when trying to convert the hot plan to a warm plan.
There may occasionally be moments where implementing the version check in the HIR -> HIR conversion inside the optimizer is too painful, and we need to hoist the version check into the adapter, but ideally we can avoid that whenever possible.
Cool, that makes sense to me. We'll want to disallow changing the optimizer of an existing cluster, or at least disallow changing it into the colder direction. Otherwise we could end up with objects installed on a cluster that cannot be handled by that cluster's optimizer anymore during, e.g., a version upgrade.
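For concreteness, here is a small, hypothetical Rust sketch of the fallible hot-to-warm HIR conversion described above. None of these types exist in the codebase; WMR is used as the example feature only because it was mentioned above.

```rust
// Hypothetical sketch: the hot HIR has a node the warm HIR lacks, so the
// conversion fails and the error is surfaced to the user instead of sending
// an unsupported plan to a cluster pinned to the warm optimizer.

#[derive(Debug)]
enum HotHir {
    Select,                             // supported by every live optimizer version
    WithMutuallyRecursive(Box<HotHir>), // only the hot optimizer understands this
}

#[derive(Debug)]
enum WarmHir {
    Select,
}

#[derive(Debug)]
struct UnsupportedFeature(&'static str);

fn hot_to_warm(hir: &HotHir) -> Result<WarmHir, UnsupportedFeature> {
    match hir {
        HotHir::Select => Ok(WarmHir::Select),
        HotHir::WithMutuallyRecursive(_) => Err(UnsupportedFeature(
            "WITH MUTUALLY RECURSIVE requires the hot optimizer",
        )),
    }
}

fn main() {
    assert!(hot_to_warm(&HotHir::Select).is_ok());
    let wmr = HotHir::WithMutuallyRecursive(Box::new(HotHir::Select));
    // This error would become the "query not supported by this optimizer
    // version" message that the user sees.
    assert!(hot_to_warm(&wmr).is_err());
}
```

Disallowing a cluster from moving to a colder optimizer, as suggested above, would keep already-installed objects from ever hitting this error path after they have been created.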