[design doc] optimizer release engineering #30233

# Optimizer Release Engineering

## The Problem

Currently, users run on "Materialize weekly", but:

1. That cannot be the case forever, as self-hosted users may not upgrade immediately.
2. That is not always good for users, who can experience regressions when optimizer changes don't suit their workloads.
3. That is not always good for people working on the optimizer, who must size their work and gate it with feature flags to accommodate the weekly release schedule.

> This is the point that gives me the most heartburn. In every context I've ever worked in, faster releases and continuous integration have been an almost unequivocal force for good. Fast releases do mean figuring out how to split work into small changesets and juggling feature flags, which, even with good discipline for removing old feature flags, results in some unavoidable additional code complexity—you always have at least a handful of feature flags live at any time. But in my experience that complexity has always been outweighed by 1) the time savings of not having long-running feature branches that have painful merge conflicts, 2) avoiding painful backwards-compatibility requirements that require carefully staging new features across releases of components and testing across multiple versions, and 3) the fast iteration loop with real customers that comes from shipping early and often (synthetic benchmarks and tests won't surface what you didn't know you needed to measure/test). It's possible to manage down the backwards-compatibility pain (2) by choosing the right API boundaries, and I think you/we did a good job identifying those boundaries. But there is definitely going to be some overhead to adding new features now, where you have to think hard about the order in which you add support, and the cross product of combinations of planners/optimizers/renderers that can exist in the wild. It is certainly possible that something about query optimizers makes them fundamentally unsuited to the agile-style approach to releases that we take. But I haven't myself been convinced that weekly releases need to be much more painful for the optimizer than they are for other components of Materialize. That said, it's been a long time since I've worked on the optimizer, so I ultimately defer to y'all's lived experience and intuition here. If possible, though, it'd be great to chat with folks who have worked on the optimizer at another continually deployed database (Snowflake comes to mind) and see if we can draw lessons from their experience. @antiguru mentioned he might have a connection on that front.
>
> Yeah, if we could ask someone at some other similar company, that would be great.
>
> The optimizer is crosscutting enough that qualifying releases is going to be complex no matter what---doing it weekly seems tough! Hearing how others are doing it would be interesting. At a meta level, we arrived at this conversation because we're having trouble with the current weekly approach---qualification for us is frequently taking longer than a week!
>
> Yeah, I think for weekly qualification to be sustainable, you'd need a test suite that is both 1) automated and 2) yields extremely high confidence that the optimizer will not regress any existing customer workloads. And while we do have an automated test suite today (1), it definitely does not yield high confidence (2).
>
> Yeah, I guess the gap in my understanding is that we are able to do some qualification today outside of the weekly release cycle. And I imagine this is all facilitated by feature flags and unbilled replicas? How does that qualification process generalize to a world where we only ship the optimizer once every six months? Would we periodically turn on unbilled replicas for customer workloads using the …?
>
> Something like that, yes! This is why the customer-local record-and-replay we talked about feels useful here. But also: in a six-month cycle, we spend more time crafting the workloads that yield high confidence. We need to do this anyway for the self-hosted release, since we won't be able to test on self-hosted customer workloads.
>
> Right, but once we've gone through that process once, it seems like we'd be able to re-run those workloads much more often than once every six months. Maybe not every week (unless they become fully automated), but every 6 weeks, or every month.
>
> Hmm, yes, self-hosted might be the biggest reason why we can't simply rely on testing in shadow environments.

## Success Criteria

Customers---self-hosted or cloud---will be able to qualify new optimizers before migrating to them.

Optimizer engineers will be able to develop features with confidence, namely:

- introducing new transforms (see the sketch after this list)
- updating existing transforms
- targeting new dataflow operators
- changing AST types for HIR, MIR, or LIR

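To make the first two bullets concrete: a new or updated transform is, roughly, a type implementing a rewrite interface over MIR. The sketch below is illustrative only; the trait name and signatures are simplified stand-ins rather than the actual `mz_transform` API.

```rust
// Illustrative only: simplified stand-ins for the real MIR and transform types.
pub struct MirRelationExpr;
pub struct TransformCtx;

pub trait Transform {
    /// Rewrite the expression in place, or report why the rewrite failed.
    fn transform(&self, expr: &mut MirRelationExpr, ctx: &mut TransformCtx) -> Result<(), String>;
}

/// A hypothetical new transform; "introducing a new transform" amounts to
/// adding a type like this and registering it in the optimization pipeline.
pub struct MyNewRewrite;

impl Transform for MyNewRewrite {
    fn transform(&self, _expr: &mut MirRelationExpr, _ctx: &mut TransformCtx) -> Result<(), String> {
        // ...walk `expr` and apply the rewrite here...
        Ok(())
    }
}
```
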
> **Comment on lines +15 to +20:**
>
> A missing link for me is what we actually do to qualify a new release of the optimizer. In the long term it's more clear to me: 1) we have a robust suite of synthetic benchmarks that can be run on demand and automatically in CI, plus 2) a Snowtrail-like system that runs a release candidate against many (all?) real customer workloads in production and automatically compares the performance to the last release of the optimizer, plus 3) making the release available to users before forcing them onto it and allowing them to run their own qualification of the new optimizer version. For me, these are the three things that would make it possible to gain confidence in an optimizer change. (2) seems the most powerful. (1) would be useful, but it depends on how representative the benchmarks we write are, and we'll need to be regularly adding new benchmarks based on our customers' workloads. And (3) is only useful to the extent that we can rely on users to actually run this qualification and report the results back. I'm in particular nervous about relying too heavily on (3) because it depends on users being proactive, which empirically they are not, and even if they are proactive, it's a long feedback cycle. Users that only qualify warm releases can only give feedback once every six months, and that feedback hits only after the warm release has already been cut. The solution proposed below (i.e., changing the release schedule) makes (3) possible, and perhaps makes (2) easier, because running the hypothetical Materializetrail might be expensive enough that we wouldn't want to do it every week, but only when qualifying a release. But it doesn't seem to impact (1)—doing (1) with weekly releases seems exactly as hard as doing (1) with twice-yearly releases.
>
> I agree that (2) is the best possible thing, and it seems unavoidable. I just can't imagine this company not having that at some point in the next 1-2 years. Do you maybe have a ballpark estimate for how much effort it will be to develop (2)? I guess it would be an evolution of the 0dt machinery, so that it could be run outside an actual release rollout, right? We'll probably also do some amount of (1), because running synthetic benchmarks is more lightweight than either (2) or (3), so it will be great to run it on PRs. But I can't really imagine (1) ever being comprehensive enough to catch ~all user regressions. As we have more and more users, they'll just keep running into new edge cases. (3) is ok only as a temporary measure before we develop (2). It would allow us to at least partially shift some blame to users for outages caused by "sudden" regressions, when they had several months to test a new version but didn't do it. But this is far from ideal. (Btw. if we have (2), then we could also consider having only two parallel optimizer versions instead of three: an unstable, and just one stable (instead of two stables), because we could catch ~all regressions before promoting an unstable to stable. Still, one drawback of having only two parallel versions would be that sometimes the unstable-to-stable promotion would be delayed by possibly many weeks while we address some regression, during which time all the other users, for whom there is no regression, wouldn't be able to get the benefits of the new version. Another drawback of having just two versions would be that normal optimizer work would need to be put on hold during the period when we are fixing regressions just before an unstable-to-stable promotion, to avoid introducing more regressions.)
>
> The core issue---the thing that prompted discussions of release engineering---is that we're not confident qualifying optimizer releases. There's planned work to improve the situation there, and it's possible that's all we need.
>
> Yeah, exactly—much of the infrastructure already exists thanks to 0dt. @teskje has a "shadow replica" PR from a while back that does a good job demonstrating the approach. Honestly, he's probably better positioned to estimate how much work it would be to dust that PR off and productionize it.
>
> Here is the shadow replica PR: #21146. Though the use case for that is a bit different, I think: it allows testing of new … The way I imagined an "Mztrail" would work is that we use the 0dt mechanism to spawn a second read-only environment (a shadow environment!) with a different Mz version and/or different feature flags and run our tests there. This way we could test the whole product (except for sources?), not just the cluster side, and afaict it doesn't require any special infrastructure we don't already get from 0dt.
>
> I'm okay with this, certainly to start!
>
> Relatively easier than a full mztrail, for sure! But still a nontrivial amount of work. I don't think we store the old plans anywhere, so the upgrade check tool would need to somehow extract the plans from the old version in order to be able to compare them to the plans in the new version.
>
> Could we run …?
>
> For sure, but to what end? If the goal is to have a human review the plans during release qualification, you're better off hooking into the upgrade checker workflow.
>
> I should not have said 0dt release, you're right---yes, the upgrade checker sounds like the right thing.

Optimizer engineers will be able to deploy hotfixes to cloud customers.

## Solution Proposal

We propose the following solution:

1. Refactor the existing `sql`, `expr`, `compute-types`, and `transform` crates to extract `optimizer` and `optimizer-types` crates.
2. Version the `optimizer` crate. The `optimizer-types` crate will contain traits and types that are stable across all three versions (e.g., sufficient interfaces to calculate the dependencies of various expression `enum`s without fixing any of their details).
3. Work on three versions of the `optimizer` crate at once (see the sketch after this list):
   - **unstable**/**hot**: active development
   - **testing**/**warm**: the most recently cut release of the optimizer
   - **stable**/**cold**: the previously cut release of the optimizer

   These versions will be tracked as copies of the crate in the repo (but see ["Alternatives"](#alternatives)).
4. Rotate versions treadmill-style, in some way matching the cadence of the self-hosted release. (Likely, we will want to use the cloud release to qualify a self-hosted release.)

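As a sketch of item 3 and of the "mechanism for dynamically selecting between the two" mentioned below, the crate copies might be wired up roughly as follows. The copies are modeled as modules here so the example is self-contained; all names are hypothetical.

```rust
// Hypothetical sketch: in the repo these would be separate versioned crate
// copies (a V0 copy and the unstable working copy), not inline modules.
mod optimizer_v0 {
    pub fn optimize(query: &str) -> String {
        format!("v0 plan for: {query}")
    }
}

mod optimizer_unstable {
    pub fn optimize(query: &str) -> String {
        format!("unstable plan for: {query}")
    }
}

/// Treadmill positions; a `Testing` variant appears once the first release is
/// cut from `Unstable`.
#[derive(Clone, Copy, Debug)]
enum OptimizerVersion {
    Stable,   // stable/cold: previously cut release (V0 to start)
    Unstable, // unstable/hot: active development
}

/// Dispatch to the selected copy of the optimizer.
fn optimize(version: OptimizerVersion, query: &str) -> String {
    match version {
        OptimizerVersion::Stable => optimizer_v0::optimize(query),
        OptimizerVersion::Unstable => optimizer_unstable::optimize(query),
    }
}

fn main() {
    println!("{}", optimize(OptimizerVersion::Stable, "SELECT 1"));
    println!("{}", optimize(OptimizerVersion::Unstable, "SELECT 1"));
}
```
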
We do not need to commit to this process up front. We can instead begin by:

- Refactoring the crates (as described below)
- Cutting a **stable**/**cold** V0 release and creating a working **unstable**/**hot** branch
- Creating a mechanism for dynamically selecting between the two

We can aim for a six-month release from **unstable**/**hot** to **testing**/**warm**, but there's no need to hold to that timeline. If things slip much past six months, though, we should consider why we slipped and try to address that in the process going forward.

We will need to think very carefully about the proportion of work done on **stable**/**cold** (for customer needs) and on **unstable**/**hot** (for technical debt paydown and feature development).

## Minimal Viable Prototype

### Version names and numbering

Every version of the `optimizer` crate will be numbered except for the currently active development branch in **unstable**/**hot**, which never receives a number.

### The `optimizer` and `optimizer-types` crates

The `optimizer` crate will contain the definitions of HIR, MIR, and LIR, along with the HIR-to-MIR and MIR-to-LIR lowerings and the MIR-to-MIR transformations. These come from the `sql` (HIR, HIR -> MIR), `expr` (MIR), `transform` (MIR -> MIR), and `compute-types` (LIR, MIR -> LIR) crates.

The `optimizer-types` crate will have traits and definitions that are global across all three live versions of the optimizer. The AST types may change from version to version, but the traits will be more stable.

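As an illustration of the kind of version-stable interface `optimizer-types` could hold, here is a sketch of a dependency-collection trait in the spirit of the example above. The trait name, the `GlobalId` placeholder, and the method shapes are illustrative, not a settled design.

```rust
// Illustrative sketch of `optimizer-types`: traits every versioned AST can
// implement without the shared crate fixing any details of the AST itself.
use std::collections::BTreeSet;

/// Placeholder for a stable identifier of catalog objects, shared across
/// optimizer versions.
pub type GlobalId = u64;

/// Implemented by the HIR/MIR/LIR expression types in every live version.
pub trait CollectDependencies {
    /// Accumulate the IDs of all collections this expression depends on.
    fn depends_on_into(&self, out: &mut BTreeSet<GlobalId>);

    /// Convenience wrapper over `depends_on_into`.
    fn depends_on(&self) -> BTreeSet<GlobalId> {
        let mut out = BTreeSet::new();
        self.depends_on_into(&mut out);
        out
    }
}
```
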
These crates do _not_ include:

+ the SQL parser
+ (\*) SQL -> HIR in `sql`
+ `sequence_*` methods in adapter
+ (\*) LIR -> dataflow in `compute`

The two bullets marked (\*) above are a tricky point in the interface. SQL _always_ lowers to **unstable**/**hot** HIR; we will have methods to convert **unstable**/**hot** HIR to the other two versions. Similarly, dataflow is always rendered from **unstable**/**hot** LIR; we will have methods to convert the other two versions to the latest.

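The conversion boundary implied by the two (\*) bullets might look roughly like the sketch below. The module layout, type stubs, and function names are hypothetical; the point is only the direction of the conversions (downgrade HIR from **unstable**/**hot**, upgrade LIR to it).

```rust
// Hypothetical sketch of the cross-version conversion boundary; the modules
// stand in for the two crate copies and the AST types are stubs.
mod unstable {
    pub struct HirRelationExpr;
    pub struct LirPlan;
}

mod v0 {
    pub struct HirRelationExpr;
    pub struct LirPlan;
}

/// Error for constructs that an older optimizer version cannot represent.
#[derive(Debug)]
pub struct UnsupportedByVersion;

/// SQL always lowers to unstable HIR; downgrade it when the V0 optimizer is selected.
pub fn downgrade_hir(
    _hir: unstable::HirRelationExpr,
) -> Result<v0::HirRelationExpr, UnsupportedByVersion> {
    // ...map unstable HIR constructs onto their V0 equivalents, or fail...
    Ok(v0::HirRelationExpr)
}

/// Rendering consumes unstable LIR; upgrade V0 LIR before handing it to `compute`.
pub fn upgrade_lir(_plan: v0::LirPlan) -> unstable::LirPlan {
    // ...older LIR should always be expressible in the newer LIR...
    unstable::LirPlan
}
```
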
> **Comment on lines +55 to +65:** This all tracks for me.

### Versioning the `optimizer` crate

To create a new version of the `optimizer` crate, we will use `git subtree`, which should preserve the crate's history.

At first, there will be two versions: V0 **stable**/**cold** and **unstable**/**hot**.

### Supporting qualification

A capture-and-replay mode for events that stores the events in the customer's own environment would allow for (a) customers to easily test and qualify changes, and (b) a push-button way for us to spin up unbilled clusters to evaluate these replays in read-only mode.

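To make the capture-and-replay idea slightly more concrete, here is a sketch of what a captured event and a replay entry point could look like. Every name and field here is an assumption for illustration, not a committed design.

```rust
// Hypothetical sketch of a capture-and-replay interface for qualification.
use std::time::SystemTime;

/// One captured event: a statement the customer's environment executed,
/// stored in the customer's own environment.
pub struct CapturedStatement {
    pub occurred_at: SystemTime,
    pub session_user: String,
    pub sql: String,
}

/// Replay a capture against an unbilled, read-only cluster running the
/// candidate optimizer version, reporting a per-statement outcome.
pub fn replay(
    capture: &[CapturedStatement],
    candidate_optimizer: &str, // e.g., "testing"/"warm"
) -> Vec<Result<(), String>> {
    capture
        .iter()
        .map(|stmt| {
            // ...submit `stmt.sql` to the read-only replay cluster and compare
            // plans/results against the current stable optimizer...
            let _ = (&stmt.occurred_at, &stmt.session_user, candidate_optimizer);
            Ok(())
        })
        .collect()
}
```
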
> (a) would be nice, but it would be a significant amount of work to add this feature. An alternative way for customers to qualify the new optimizer version is to simply do a blue-green deployment where they change only the optimizer version, but don't cut over to the new version for something like a week; if it works fine over that week, then cut over. I think this would be minimal extra work for us.

## Alternatives

The thorniest thing here is the release cutting and copied code. Would it be better to split off the optimizer entirely into a separate repository? A separate _process_?

> Left a thought about this under the second bullet point below.

## Open questions

- How do we backport features? `git subtree`/`git filter-branch` makes it easy to copy the `optimizer` crate with history, but we will not be able to use `git cherry-pick`.

- Which fixes will we backport? Correctness only? When do we backport a fix and when do we backport a "problem detector" with a notice?

### Where do we select optimizer versions?

Are optimizer versions associated with a cluster or with an environment?

If the former, we'll have:

```sql
ALTER CLUSTER name SET (OPTIMIZER VERSION = version)
```

where `version` is a version number (or possibly one of our three fixed names). Note that the optimizer is selected on a per-cluster basis. Selecting the optimizer per cluster makes it easier to qualify optimizer releases, but raises questions about `CREATE VIEW`, `INSERT`, `UPDATE`, and `DELETE`, which run (a prefix of) the optimizer not on any particular cluster. (Look at `sequence_read_then_write` for an example.) Should these be moved off? Use the active cluster? Automatically optimize with all three?

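One way the per-cluster selection could resolve, including a possible fallback for statements that run on no particular cluster, is sketched below. The types and the fallback policy are illustrative only, not a design decision.

```rust
// Hypothetical sketch of per-cluster optimizer-version resolution.
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, Default)]
enum OptimizerVersion {
    #[default]
    Stable,
    Testing,
    Unstable,
}

struct ClusterId(u64);

/// Per-cluster settings from `ALTER CLUSTER ... SET (OPTIMIZER VERSION = ...)`.
struct Catalog {
    cluster_versions: BTreeMap<u64, OptimizerVersion>,
    environment_default: OptimizerVersion,
}

impl Catalog {
    /// Statements running on a cluster use that cluster's setting; statements
    /// that run on no particular cluster (e.g., the read-then-write path) fall
    /// back to an environment default, which is one possible answer to the
    /// open question above.
    fn optimizer_version(&self, cluster: Option<&ClusterId>) -> OptimizerVersion {
        match cluster {
            Some(ClusterId(id)) => *self
                .cluster_versions
                .get(id)
                .unwrap_or(&self.environment_default),
            None => self.environment_default,
        }
    }
}
```
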
If the latter, we'll have a flag to `environmentd`. Flagged environments make it easier to run mztrail-style testing in shadow environments and avoid questions about which version of the optimizer to run when not on any particular cluster, but they don't offer users a path to release qualification.

> I think the connection between the weekly release schedule and the problem, as stated, isn't quite clear. (We've talked about it—it's just not written here!) Is the problem that users ever experience regressions? Is the problem that users have no way to test new versions of the optimizer against their workload? Because in theory we could allow users to test new versions of the optimizer each week (using something like the hot/warm/cold solution you've laid out below). But of course then the next problem would almost surely be that users don't have the time/willingness to test a new version of the optimizer each week.
>
> What I'm driving at is that I don't think we have a clear answer as to what amount of breakage, and on what cadence, is acceptable to users. I'm personally worried that releasing new versions of the optimizer every six months will still shake out to too much breakage too frequently for users to be happy, and it will be much harder to pinpoint and fix regressions in a query if you need to look across six months of code changes rather than one week of code changes.
>
> As we've talked about, I'm not opposed to trying the slower release cadence, but I wanted to record that my intuition here is that we are underestimating how painful it will be to migrate users from one version of the optimizer to the next if new versions are released on a six-month cadence.
>
> Yes, if there is one optimizer change in a given week, then it's immediately clear which change is to blame for a plan regression. With six months of changes, we'd need to do some non-trivial investigation. But I'm not too worried about this, because we have a powerful tool for it: Alexander's optimizer trace tool usually allows us to figure out quickly (within 1-2 hours) what's going on when an optimization is not going the way we imagined it. (Also note that I often need to run the trace tool anyway even when I know which PR is to blame, because it's often not clear how exactly the PR is breaking things, due to complex interactions between transforms.)
>
> If we have several months between getting to know about a regression and forcing a user onto the regressing version, that gives us time to fix regressions: