[design doc] optimizer release engineering #30233
# Optimizer Release Engineering

## The Problem

Currently, users run on "Materialize weekly", but:
1. That cannot be the case forever, as self-hosted users may not upgrade immediately.
2. That is not always good for users, who can experience regressions when optimizer changes don't suit their workloads.
3. That is not always good for people working on the optimizer, who must size their work and gate it with feature flags to accommodate the weekly release schedule.
This is the point that gives me the most heartburn. In every context I've ever worked, faster releases and continuous integration have been an almost unequivocal force for good. Fast releases do mean figuring out how to split work into small changesets and juggling feature flags, which, even with good discipline for removing old feature flags, results in some unavoidable additional code complexity: you always have at least a handful of feature flags live at any time. But in my experience that complexity has always been outweighed by 1) the time savings of not having long-running feature branches that have painful merge conflicts, 2) avoiding painful backwards-compatibility requirements that require carefully staging new features across releases of components and testing across multiple versions, and 3) the fast iteration loop with real customers that comes from shipping early and often (synthetic benchmarks and tests won't surface what you didn't know you needed to measure/test). It's possible to manage down the backwards-compatibility pain (2) by choosing the right API boundaries, and I think you/we did a good job identifying those boundaries. But there is definitely going to be some overhead to adding new features now, where you have to think hard about the order in which you add support, and the cross product of combinations of planners/optimizers/renderers that can exist in the wild. It is certainly possible that something about query optimizers makes them fundamentally ill-suited to the agile-style approach to releases we take. But I haven't myself been convinced that the weekly releases need to be much more painful for the optimizer than they are for other components of Materialize. That said, it's been a long time since I've worked on the optimizer, so I ultimately defer to y'all's lived experience and intuition here. If possible, though, it'd be great to chat with folks who have worked on the optimizer at another continually deployed database (Snowflake comes to mind) and see if we can draw lessons from their experience. @antiguru mentioned he might have a connection on that front.

Yeah, if we could ask someone at some other similar company, that would be great.

The optimizer is crosscutting enough that qualifying releases is going to be complex no matter what---doing it weekly seems tough! Hearing how others are doing it would be interesting. At a meta level, we arrived at this conversation because we're having trouble with the current weekly approach---qualification for us is frequently taking longer than a week!
Yeah, I think for weekly qualification to be sustainable, you'd need a test suite that both 1) is automated and 2) yields extremely high confidence that the optimizer will not regress any existing customer workloads. And while we do have an automated test suite today (1), it definitely does not yield high confidence (2).
Yeah, I guess the gap in my understanding is that we are able to do some qualification today outside of the weekly release cycle. And I imagine this is all facilitated by feature flags and unbilled replicas? How does that qualification process generalize to a world where we only ship the optimizer once every six months? Would we periodically turn on unbilled replicas for customer workloads using the
Something like that, yes! This is why the customer-local record-and-replay we talked about feels useful here. But also: in a six-month cycle, we spend more time crafting the workloads that yield high confidence. We need to do this anyway for the self-hosted release, since we won't be able to test on self-hosted customer workloads.
Right, but once we've gone through that process once, it seems like we'd be able to re-run those workloads much more often than once every six months. Maybe not every week (unless they become fully automated), but every 6 weeks, or every month.

Hmm, yes, self-hosted might be the biggest reason why we can't simply rely on testing in shadow environments.
## Success Criteria

Customers---self-hosted or cloud---will be able to qualify new optimizers before migrating to them.

Optimizer engineers will be able to develop features with confidence, namely:

- introducing new transforms
- updating existing transforms
- targeting new dataflow operators
- changing AST types for HIR, MIR, or LIR
A missing link for me is what we actually do to qualify a new release of the optimizer. In the long term it's more clear to me: 1) we have a robust suite of synthetic benchmarks that can be run on demand and automatically in CI, plus 2) a Snowtrail-like system that runs a release candidate against many (all?) real customer workloads in production and automatically compares the performance to the last release of the optimizer, plus 3) making the release available to users before forcing them onto it and allowing them to run their own qualification of the new optimizer version. For me, these are the three things that would make it possible to gain confidence in an optimizer change. (2) seems the most powerful. (1) would be useful, but it depends on how representative the benchmarks we write are, and we'll need to be regularly adding new benchmarks based on our customers' workloads. And (3) is only useful to the extent that we can rely on users to actually run this qualification and report the results back. I'm in particular nervous about relying too heavily on (3) because it depends on users being proactive, which empirically they are not, and even if they are proactive, it's a long feedback cycle. Users that only qualify warm releases can only give feedback once every six months, and that feedback hits only after the warm release has already been cut. The solution proposed below (i.e., changing the release schedule) makes (3) possible, and perhaps makes (2) easier, because running the hypothetical Materializetrail might be expensive enough that we wouldn't want to do it every week, but only when qualifying a release. But it doesn't seem to impact (1): doing (1) with weekly releases seems exactly as hard as doing (1) with twice-yearly releases.

I agree that (2) is the best possible thing, and it seems unavoidable. I just can't imagine this company without having that at some point in the next 1-2 years. Do you maybe have a ballpark estimate for how much effort it will be to develop (2)? I guess it would be an evolution of the 0-dt machinery, so that it could be run outside an actual release rollout, right? We'll probably also do some amount of (1), because running synthetic benchmarks is more lightweight than either (2) or (3), so it will be great to run it on PRs. But I can't really imagine (1) ever being comprehensive enough to catch ~all user regressions. As we'll have more and more users, they'll just keep running into new edge cases. (3) is ok only as a temporary measure before we develop (2). It would allow us to at least partially shift some blame to users for outages caused by "sudden" regressions, when they had several months to test a new version but didn't do it. But this is far from ideal. (Btw. if we have (2), then we could also consider having only two parallel optimizer versions instead of three: an unstable, and just one stable (instead of two stables). This is because we could catch ~all regressions before promoting an unstable to stable.
Still, the drawback of having only two parallel versions would be that sometimes the unstable-to-stable promotion would be delayed by possibly many weeks due to addressing some regression, during which time all the other users for whom there is no regression wouldn't be able to get the benefits of the new version. Another drawback of having just two versions would be that normal optimizer work would need to be put on hold during the period when we are fixing regressions just before an unstable-to-stable promotion, to avoid introducing more regressions.)

The core issue---the thing that prompted discussions of release engineering---is that we're not confident qualifying optimizer releases. There's planned work to improve the situation there, and it's possible that's all we need.
Yeah, exactly: much of the infrastructure already exists thanks to 0dt. @teskje has a "shadow replica" PR from a while back that does a good job demonstrating the approach. Honestly he's probably better positioned to estimate how much work it would be to dust that PR off and productionize it.

Here is the shadow replica PR: #21146. Though the use case for that is a bit different, I think: it allows testing of new

The way I imagined an "Mztrail" would work is that we use the 0dt mechanism to spawn a second read-only environment (a shadow environment!) with a different Mz version and/or different feature flags and run our tests there. This way we could test the whole product (except for sources?), not just the cluster side, and afaict it doesn't require any special infrastructure we don't already get from 0dt.

I'm okay with this, certainly to start!
Relatively easier than a full Mztrail, for sure! But still a nontrivial amount of work. I don't think we store the old plans anywhere, so the upgrade check tool would need to somehow extract the plans from the old version in order to be able to compare them to the plans in the new version.

Could we run

For sure, but to what end? If the goal is to have a human review the plans during release qualification, you're better off hooking into the upgrade checker workflow.

I should not have said 0dt release, you're right---yes, upgrade checker sounds like the right thing.
Optimizer engineers will be able to deploy hotfixes to cloud customers.

## Solution Proposal

We propose the following solution:

1. Refactor existing `sql`, `expr`, `compute-types`, and `transform` crates into `optimizer` and `optimizer-types` crates.
2. Version the `optimizer` crate. The `optimizer-types` crate will contain traits and types that are stable across all three versions (e.g., interfaces sufficient to calculate the dependencies of the various expression `enum`s without fixing any of their details).
3. Work on three versions of the `optimizer` crate at once:
   - **unstable**/**hot**: active development
   - **testing**/**warm**: most recently cut release of the optimizer
   - **stable**/**cold**: previously cut release of the optimizer

   These versions will be tracked as copies of the crate in the repo (but see ["Alternatives"](#alternatives)).
4. Rotate versions treadmill-style, in some way matching the cadence of the self-hosted release. (Likely, we will want to use the cloud release to qualify a self-hosted release.) A sketch of how per-cluster version selection could dispatch across the three crate copies follows this list.
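As a minimal sketch of the dispatch referenced above: the crate names (`optimizer_unstable`, `optimizer_warm`, `optimizer_cold`), the `ClusterConfig` type, and the `optimize` signatures are illustrative assumptions, not actual Materialize APIs; the stub modules stand in for the real versioned crate copies.

```rust
/// The three optimizer versions live at any given time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OptimizerVersion {
    /// unstable/hot: active development.
    Unstable,
    /// testing/warm: most recently cut release.
    Testing,
    /// stable/cold: previously cut release.
    Stable,
}

/// Hypothetical per-cluster setting, e.g. populated from
/// `ALTER CLUSTER ... SET (OPTIMIZER VERSION = ...)`.
struct ClusterConfig {
    optimizer_version: OptimizerVersion,
}

// Stub modules standing in for the three versioned crate copies.
mod optimizer_unstable {
    pub fn optimize(query: &str) -> String {
        format!("hot plan for: {query}")
    }
}
mod optimizer_warm {
    pub fn optimize(query: &str) -> String {
        format!("warm plan for: {query}")
    }
}
mod optimizer_cold {
    pub fn optimize(query: &str) -> String {
        format!("cold plan for: {query}")
    }
}

/// Dispatch a query to whichever optimizer version the cluster selected.
fn optimize_for_cluster(cfg: &ClusterConfig, query: &str) -> String {
    match cfg.optimizer_version {
        OptimizerVersion::Unstable => optimizer_unstable::optimize(query),
        OptimizerVersion::Testing => optimizer_warm::optimize(query),
        OptimizerVersion::Stable => optimizer_cold::optimize(query),
    }
}

fn main() {
    let cfg = ClusterConfig { optimizer_version: OptimizerVersion::Testing };
    println!("{}", optimize_for_cluster(&cfg, "SELECT 1"));
}
```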
## Minimal Viable Prototype

### Version names and numbering

Every version of the `optimizer` crate will be numbered. Customers can run SQL like:

```sql
ALTER CLUSTER name SET (OPTIMIZER VERSION = version)
```
where `version` is a version number (or possibly one of our three fixed names). Note that the optimizer is selected on a per-cluster basis; selecting the optimizer per cluster makes it easier to qualify optimizer releases.
### The `optimizer` and `optimizer-types` crates

The `optimizer` crate will contain the definitions of HIR, MIR, and LIR, along with the HIR-to-MIR and MIR-to-LIR lowerings and the MIR-to-MIR transformations. These come from the `sql` (HIR, HIR -> MIR), `expr` (MIR), `transform` (MIR -> MIR), and `compute-types` (LIR, MIR -> LIR) crates.
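As a rough sketch, assuming module names chosen to mirror the crates listed above (not a committed layout), the top of the new `optimizer` crate might look like this:

```rust
// Hypothetical top-level layout of the proposed `optimizer` crate.
// Module bodies are elided; comments note where the code lives today.

pub mod hir {
    // HIR types, currently in the `sql` crate.
}

pub mod mir {
    // MIR types, currently in the `expr` crate.
}

pub mod lir {
    // LIR types, currently in the `compute-types` crate.
}

pub mod lowering {
    pub mod hir_to_mir {
        // HIR -> MIR lowering, currently in `sql`.
    }
    pub mod mir_to_lir {
        // MIR -> LIR lowering, currently in `compute-types`.
    }
}

pub mod transform {
    // MIR -> MIR transformations, currently in the `transform` crate.
}
```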
The `optimizer-types` crate will have traits and definitions that are global across all three live versions of the optimizer. The AST types may change version to version, but the traits will be more stable.
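As an illustration of what such version-stable traits could look like, here is a minimal sketch; `ObjectId` is a simplified stand-in for Materialize's `GlobalId`, and the trait name and methods are hypothetical.

```rust
/// Simplified stand-in for Materialize's `GlobalId`.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct ObjectId(pub u64);

/// Implemented by every live version's HIR/MIR/LIR expression types.
/// The trait fixes only the ability to enumerate dependencies, not the
/// shape of the expression enums themselves.
pub trait CollectDependencies {
    /// Append the ids of all catalog objects this expression depends on.
    fn depends_on_into(&self, out: &mut Vec<ObjectId>);

    /// Convenience wrapper over `depends_on_into`.
    fn depends_on(&self) -> Vec<ObjectId> {
        let mut out = Vec::new();
        self.depends_on_into(&mut out);
        out
    }
}
```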
These crates do _not_ include:

+ the SQL parser
+ (\*) SQL -> HIR in `sql`
+ `sequence_*` methods in adapter
+ (\*) LIR -> dataflow in `compute`
The two bullets marked (\*) above are a tricky point in the interface. SQL will _always_ lower to **unstable**/**hot** HIR; we will have methods to convert **unstable**/**hot** HIR to the other two versions. Similarly, dataflows will always be rendered from **unstable**/**hot** LIR; we will have methods to convert the other two versions' LIR to the latest.
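A hedged sketch of what the hot-to-warm HIR conversion could look like, with invented, simplified expression types (the `NewFancyJoin` variant is purely illustrative): the conversion is total except for constructs the older optimizer does not know about, which is also a natural place to gate new SQL-level features.

```rust
// Hypothetical, simplified HIR types for two optimizer versions.
mod hot {
    pub enum HirExpr {
        Get { id: u64 },
        Filter { input: Box<HirExpr>, predicate: String },
        /// A construct added since the last release was cut.
        NewFancyJoin { inputs: Vec<HirExpr> },
    }
}

mod warm {
    pub enum HirExpr {
        Get { id: u64 },
        Filter { input: Box<HirExpr>, predicate: String },
    }
}

/// Why a hot expression could not be converted for an older optimizer.
#[derive(Debug)]
pub struct Unsupported(pub &'static str);

/// Convert hot HIR to warm HIR, rejecting constructs the warm optimizer
/// does not support.
pub fn hot_to_warm(expr: hot::HirExpr) -> Result<warm::HirExpr, Unsupported> {
    match expr {
        hot::HirExpr::Get { id } => Ok(warm::HirExpr::Get { id }),
        hot::HirExpr::Filter { input, predicate } => Ok(warm::HirExpr::Filter {
            input: Box::new(hot_to_warm(*input)?),
            predicate,
        }),
        hot::HirExpr::NewFancyJoin { .. } => Err(Unsupported(
            "this construct requires the unstable/hot optimizer",
        )),
    }
}
```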
### Supporting qualification

A capture-and-replay mode that stores captured events in the customer's own environment would give (a) customers an easy way to test and qualify changes, and (b) us a push-button way to spin up unbilled clusters that evaluate these replays in read-only mode.
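As a very rough illustration (all types and names invented, not a proposed schema), the capture side might record events like the following, to be replayed later against a read-only, unbilled cluster running a candidate optimizer version:

```rust
use std::time::SystemTime;

/// One event captured in the customer's environment (hypothetical schema).
pub enum CapturedEvent {
    /// DDL or other statements that shape the catalog.
    Statement { cluster: String, sql: String, at: SystemTime },
    /// A read whose latency we want to compare across optimizer versions.
    Peek { cluster: String, sql: String, observed_latency_ms: u64 },
}

/// Replay captured events against a candidate environment in read-only mode.
/// `run` is whatever executes SQL against the unbilled replay cluster and
/// reports the observed latency in milliseconds.
pub fn replay<F>(events: &[CapturedEvent], mut run: F) -> Vec<(String, u64)>
where
    F: FnMut(&str) -> u64,
{
    events
        .iter()
        .map(|event| match event {
            CapturedEvent::Statement { sql, .. } | CapturedEvent::Peek { sql, .. } => {
                (sql.clone(), run(sql))
            }
        })
        .collect()
}
```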
(a) would be nice, but adding this feature would be a significant amount of work. An alternative way for customers to qualify the new optimizer version is to simply do a blue-green deployment where they change only the optimizer version, but don't cut over to the new version for something like a week; if it works fine over that week, then cut over. I think this would be minimal extra work for us.
## Alternatives

The thorniest thing here is the release cutting and copied code. Would it be better to split off the optimizer entirely into a separate repository? A separate _process_?
Left a thought about this under the second bullet point below.
## Open questions

- Does **unstable**/**hot** get a version number, or no? (Cf. Debian sid)
One should be able to specify unstable/hot when specifying the optimizer version for a cluster, but probably by just saying something like
Agreed, the Debian sid model seems right to me too.
- How precisely do we select features out of **unstable**/**hot** to cut a new **testing**/**warm**? If we literally copy the crate three times in the repo, `git cherry-pick` will not be straightforward to use and it will be easy to lose history.
I'm not actually sure this works, but I think you might be able to depend on the warm optimizer crate with `optimizer-warm = { git = "https://github.com/MaterializeInc/materialize.git", branch = "optimizer-warm" }` and the cold optimizer crate with `optimizer-cold = { git = "https://github.com/MaterializeInc/materialize.git", branch = "optimizer-cold" }`. That way things would still function as a monorepo on

If Cargo barfs on the above, we can automate something with

I had forgotten about
- Under use-case isolation, can we, say, select an optimizer per `environmentd` instead of per cluster?
The plan for use case isolation is to push query parsing, planning, and optimization into

Love this. This plan will have to address the various operations that currently process queries (and so run the optimizer) in
- There are several occasions where the optimizer is called independently of any cluster: views, prepared statements, and as part of `INSERT`, `UPDATE`, and `DELETE`. (Look at `sequence_read_then_write` for an example.) Should these be moved off? Use the active cluster? Automatically optimize with all three?
Using the active cluster to determine what optimizer version to use makes sense to me, and dovetails nicely with our use case isolation plans (see prior comment).

So, then, would we add a "cluster" property to views (similar to what MVs already have)? E.g.,
I guess this would be needed for knowing what optimizer to use at bootstrap time, where we don't really have an active cluster. (The value of this property could default to the active cluster at the time of creating the view, similarly to MVs.)
- Which fixes will we backport? Correctness only? When do we backport a fix and when do we backport a "problem detector" with a notice?
Do we need to consider new features in addition to bugfix backports? If we introduce some new SQL-level feature that requires optimizer support (like WMR), all three versions of the optimizer need to support that, right? Or is the idea that the adapter will take the optimizer version into account when planning a user query and reject the query if the optimizer doesn't support it?
This, yeah, although it won't be the adapter that checks the version. Here's my understanding. Adapter (i.e., the

There may occasionally be moments where implementing the version check in the HIR -> HIR conversion inside the optimizer is too painful, and we need to hoist the version check into the adapter, but ideally we can avoid that whenever possible.

Cool, that makes sense to me. We'll want to disallow changing the optimizer of an existing cluster, or at least disallow changing it in the colder direction. Otherwise we could end up with objects installed on a cluster that cannot be handled by that cluster's optimizer anymore during, e.g., a version upgrade.
I think the connection between the weekly release schedule and the problem, as stated, isn't quite clear. (We've talked about it; it's just not written here!) Is the problem that users ever experience regressions? Is the problem that users have no way to test new versions of the optimizer against their workload? Because in theory we could allow users to test new versions of the optimizer each week (using something like the hot/warm/cold solution you've laid out below). But of course then the next problem would almost surely be that users don't have the time/willingness to test a new version of the optimizer each week.
What I'm driving at is that I don't think we have a clear answer as to what amount of breakage and on what cadence is acceptable to users. I'm personally worried that releasing new versions of the optimizer every six months will shake out to still result in too much breakage too frequently for users to be happy, and it will be much harder to pinpoint and fix regressions in a query if you need to look across six months of code changes rather than one week of code changes.
As we've talked about, I'm not opposed to trying the slower release cadence, but I wanted to record that my intuition here is that we are underestimating how painful it will be to migrate users from one version of the optimizer to the next if new versions are released on a six-month cadence.
Yes, if there is 1 optimizer change in a given week, then it's immediately clear which change is to blame for a plan regression. With 6 months of changes, we'd need to do some non-trivial investigation. But I'm not too worried about this, because we have a powerful tool for it: Alexander's optimizer trace tool usually allows us to figure out quickly (within 1-2 hours) what's going on when an optimization is not going the way we imagined. (Also note that I often need to run the trace tool anyway even when I know which PR is to blame, because it's often not clear how exactly the PR is breaking things, due to complex interactions between transforms.)
If we have several months between learning about a regression and forcing a user onto the regressing version, that gives us time to fix regressions: