From f7fcc0b23bded67d8b7bcc6172a2a5144b63ec22 Mon Sep 17 00:00:00 2001
From: Michael Greenberg
Date: Mon, 28 Oct 2024 16:34:55 -0400
Subject: [PATCH 1/4] design doc for optimizer releases, after conversations with nikhil, frank, and moritz

---
 .../design/20241028_optimizer_releases.md | 81 +++++++++++++++++++
 1 file changed, 81 insertions(+)
 create mode 100644 doc/developer/design/20241028_optimizer_releases.md

diff --git a/doc/developer/design/20241028_optimizer_releases.md b/doc/developer/design/20241028_optimizer_releases.md
new file mode 100644
index 0000000000000..65aa508f28a55
--- /dev/null
+++ b/doc/developer/design/20241028_optimizer_releases.md
@@ -0,0 +1,81 @@
+# Optimizer Release Engineering
+
+## The Problem
+
+Currently, users run on "Materialize weekly", but:
+
+ 1. That cannot be the case forever, as self-hosted users may not upgrade immediately.
+ 2. That is not always good for users, who can experience regressions when optimizer changes don't suit their workloads.
+ 3. That is not always good for people working on the optimizer, who must size their work and gate it with feature flags to accommodate the weekly release schedule.
+
+## Success Criteria
+
+Customers---self-hosted or cloud---will be able to qualify new optimizers before migrating to them.
+
+Optimizer engineers will be able to develop features with confidence, namely:
+
+ - introducing new transforms
+ - updating existing transforms
+ - targeting new dataflow operators
+ - changing AST types for HIR, MIR, or LIR
+
+Optimizer engineers will be able to deploy hotfixes to cloud customers.
+
+## Solution Proposal
+
+We propose the following solution:
+
+ 1. Refactor existing `sql`, `expr`, `compute-types`, and `transform` crates into `optimizer` and `optimizer-types` crates.
+ 2. Version the `optimizer` crate. The `optimizer-types` crate will contain traits and types that are stable across all three versions (e.g., sufficient interfaces to calculate the dependencies of various expression `enum`s without fixing any of their details).
+ 3. Work on three versions of the `optimizer` crate at once:
+ - **unstable**/**hot** active development
+ - **testing**/**warm** most recently cut release of the optimizer
+ - **stable**/**cold** previously cut release of the optimizer
+ These versions will be tracked as copies of the crate in the repo (but see ["Alternatives"](#alternatives)).
+ 4. Rotate versions treadmill style, in some way matching the cadence of the self-hosted release. (Likely, we will want to use the cloud release to qualify a self-hosted release.)
+
+## Minimal Viable Prototype
+
+### Version names and numbering
+
+Every version of the `optimizer` crate will be numbered. Customers can run SQL like:
+
+```sql
+ALTER CLUSTER name SET (OPTIMIZER VERSION = version)
+```
+
+where `version` is a version number (or, possibly one of our three fixed names). Note that the optimizer is selected on a per cluster basis. Selecting the optimizer per cluster makes it easier to qualify optimizer releases.
+
+### The `optimizer` and `optimizer-types` crates
+
+The `optimizer` create will containe the definitions of HIR, MIR, and LIR, along with the HIR-to-MIR and MIR-to_LIR lowerings and the MIR-to-MIR transformations. These come from the `sql` (HIR, HIR -> MIR), `expr` (MIR), `transform` (MIR -> MIR), and `compute-types` (LIR, MIR -> LIR) crates.
+
+The `optimizer-types` crate will have traits and definitions that are global across all three live versions of the optimizer. The AST types may change version to version, but the traits will be more stable.
+
+These crates do _not_ include:
+ + the SQL parser
+ + (\*) SQL -> HIR in `sql`
+ + `sequence_*` methods in adapter
+ + (\*) LIR -> dataflow in `compute`
+
+The two bullets marked (\*) above are a tricky point in the interface. SQL will _always_ lowers to **unstable**/**hot** HIR; we will have methods to convert **unstable**/**hot** HIR to the other two versions. Similarly, LIR lowers to dataflow from **unstable**/**hot** LIR; we will have methods to convert the other two versions to the latest.
+
+### Supporting qualification
+
+A capture-and-replay mode for events that stores the events in the customer's own environment would allow for (a) customers to easily test and qualify changes, and (b) a push-button way for us to spin up unbilled clusters to evaluate these replays in read-only mode.
+
+## Alternatives
+
+The thorniest thing here is the release cutting and copied code. Would it be better to split off the optimizer entirely into a separate repository? A separate _process_?
+
+## Open questions
+
+- Does **unstable**/**hot** get a version number, or no? (Cf. Debian sid)
+
+- How precisely do we select features out of **unstable**/**hot** to cut a new **testing**/**warm**? If we literally copy the crate three times in the repo, `git cherry-pick` will not be straightfoward to use and it will be easy to lose history.
+
+- Under use-case isolation, can we say select an optimizer per `environmentd` instead of per cluster?
+
+- There are several occasions where the optimizer is called independent of the cluster: views, prepared statements, and as part of `INSERT`, `UPDATE`, and `DELETE`. (Look at `sequence_read_then_write` for an example.) Should these be moved off? Use the active cluster? Automatically optimize with all three?
+
+- Which fixes will we backport? Correctness only? When do we backport a fix and when do we backport a "problem detector" with a notice?
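Note on the `optimizer-types` traits described in this patch: the sketch below is one possible shape for an interface that calculates dependencies without fixing the details of any version's expression `enum`s. It is illustrative only. `ObjectId`, `Dependencies`, `all_dependencies`, and the toy `v0`/`unstable` expression types are hypothetical names chosen for this example; they are not items from the existing `sql`, `expr`, `transform`, or `compute-types` crates.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a catalog identifier (an assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ObjectId(pub u64);

/// Hypothetical trait that could live in `optimizer-types`: it exposes the
/// dependencies of a plan without exposing any version's AST.
pub trait Dependencies {
    fn depends_on(&self) -> BTreeSet<ObjectId>;
}

/// Toy expression type standing in for a previously cut optimizer version.
pub mod v0 {
    use super::{Dependencies, ObjectId};
    use std::collections::BTreeSet;

    pub enum Expr {
        Get(ObjectId),
        Union(Vec<Expr>),
    }

    impl Dependencies for Expr {
        fn depends_on(&self) -> BTreeSet<ObjectId> {
            match self {
                Expr::Get(id) => BTreeSet::from([*id]),
                Expr::Union(inputs) => inputs.iter().flat_map(|e| e.depends_on()).collect(),
            }
        }
    }
}

/// Toy expression type standing in for the development version; it has a
/// variant that `v0` lacks, so the two ASTs are not interchangeable.
pub mod unstable {
    use super::{Dependencies, ObjectId};
    use std::collections::BTreeSet;

    pub enum Expr {
        Get(ObjectId),
        Join { left: Box<Expr>, right: Box<Expr> },
    }

    impl Dependencies for Expr {
        fn depends_on(&self) -> BTreeSet<ObjectId> {
            match self {
                Expr::Get(id) => BTreeSet::from([*id]),
                Expr::Join { left, right } => {
                    let mut deps = left.depends_on();
                    deps.extend(right.depends_on());
                    deps
                }
            }
        }
    }
}

/// Version-agnostic caller: it only sees the trait, never a concrete AST.
pub fn all_dependencies(plans: &[&dyn Dependencies]) -> BTreeSet<ObjectId> {
    plans.iter().flat_map(|plan| plan.depends_on()).collect()
}
```

The point of the design is that code outside the versioned crates (for example, dependency tracking) would compile against the trait alone, so adding or removing `enum` variants in one optimizer version would not ripple outward.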
From 2c67b279601ead0930692c924dc8807d3bdfb8a8 Mon Sep 17 00:00:00 2001
From: Michael Greenberg
Date: Wed, 30 Oct 2024 11:28:46 -0400
Subject: [PATCH 2/4] =?UTF-8?q?tweaks=20per=20g=C3=A1bor?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 doc/developer/design/20241028_optimizer_releases.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/developer/design/20241028_optimizer_releases.md b/doc/developer/design/20241028_optimizer_releases.md
index 65aa508f28a55..f01c8afdff35a 100644
--- a/doc/developer/design/20241028_optimizer_releases.md
+++ b/doc/developer/design/20241028_optimizer_releases.md
@@ -25,7 +25,7 @@ Optimizer engineers will be able to deploy hotfixes to cloud customers.
 
 We propose the following solution:
 
- 1. Refactor existing `sql`, `expr`, `compute-types`, and `transform` crates into `optimizer` and `optimizer-types` crates.
+ 1. Refactor existing `sql`, `expr`, `compute-types`, and `transform` crates to extract `optimizer` and `optimizer-types` crates.
  2. Version the `optimizer` crate. The `optimizer-types` crate will contain traits and types that are stable across all three versions (e.g., sufficient interfaces to calculate the dependencies of various expression `enum`s without fixing any of their details).
  3. Work on three versions of the `optimizer` crate at once:
  - **unstable**/**hot** active development
@@ -48,7 +48,7 @@ where `version` is a version number (or, possibly one of our three fixed names).
 
 ### The `optimizer` and `optimizer-types` crates
 
-The `optimizer` create will containe the definitions of HIR, MIR, and LIR, along with the HIR-to-MIR and MIR-to_LIR lowerings and the MIR-to-MIR transformations. These come from the `sql` (HIR, HIR -> MIR), `expr` (MIR), `transform` (MIR -> MIR), and `compute-types` (LIR, MIR -> LIR) crates.
+The `optimizer` crate will contain the definitions of HIR, MIR, and LIR, along with the HIR-to-MIR and MIR-to-LIR lowerings and the MIR-to-MIR transformations. These come from the `sql` (HIR, HIR -> MIR), `expr` (MIR), `transform` (MIR -> MIR), and `compute-types` (LIR, MIR -> LIR) crates.
 
 The `optimizer-types` crate will have traits and definitions that are global across all three live versions of the optimizer. The AST types may change version to version, but the traits will be more stable.
 
From 1cea23b6ce2723442d04812e123ad502f65d68f0 Mon Sep 17 00:00:00 2001
From: Michael Greenberg
Date: Wed, 13 Nov 2024 14:34:43 -0500
Subject: [PATCH 3/4] update doc based on feedback

---
 .../design/20241028_optimizer_releases.md | 42 +++++++++++++------
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/doc/developer/design/20241028_optimizer_releases.md b/doc/developer/design/20241028_optimizer_releases.md
index f01c8afdff35a..5f655d3d596b2 100644
--- a/doc/developer/design/20241028_optimizer_releases.md
+++ b/doc/developer/design/20241028_optimizer_releases.md
@@ -34,17 +34,21 @@ We propose the following solution:
  These versions will be tracked as copies of the crate in the repo (but see ["Alternatives"](#alternatives)).
  4. Rotate versions treadmill style, in some way matching the cadence of the self-hosted release. (Likely, we will want to use the cloud release to qualify a self-hosted release.)
 
-## Minimal Viable Prototype
+We do not need to commit to this process up front. We can instead begin by:
 
-### Version names and numbering
+ - Refactoring the crates (as described below)
+ - Cutting a **stable**/**cold** V0 release and creating a working **unstable**/**hot** branch
+ - Creating a mechanism for dynamically selecting between the two
 
-Every version of the `optimizer` crate will be numbered. Customers can run SQL like:
+We can aim for a six-month release from **unstable**/**hot** to **testing**/**warm**, but there's no need to hold to that timeline. If things slip much past six months, though, we should consider why we slipped and try to address that in the process going forward.
 
-```sql
-ALTER CLUSTER name SET (OPTIMIZER VERSION = version)
-```
+We will need to think very carefully about the proportion of work done on **stable**/**cold** (for customer needs) and on **unstable**/**hot** (for technical debt paydown and feature development).
 
-where `version` is a version number (or, possibly one of our three fixed names). Note that the optimizer is selected on a per cluster basis. Selecting the optimizer per cluster makes it easier to qualify optimizer releases.
+## Minimal Viable Prototype
+
+### Version names and numbering
+
+Every version of the `optimizer` crate will be numbered except for the currently active development branch in **unstable**/**hot**, which never receives a number.
 
 ### The `optimizer` and `optimizer-types` crates
 
@@ -60,6 +64,12 @@ These crates do _not_ include:
 
 The two bullets marked (\*) above are a tricky point in the interface. SQL will _always_ lowers to **unstable**/**hot** HIR; we will have methods to convert **unstable**/**hot** HIR to the other two versions. Similarly, LIR lowers to dataflow from **unstable**/**hot** LIR; we will have methods to convert the other two versions to the latest.
 
+### Versioning the `optimizer` crate
+
+To create a new version of the `optimizer` crate, we will use `git subtree`, which should preserve features.
+
+At first, there will be two versions: V0 **stable**/**cold** and **unstable**/**hot**.
+
 ### Supporting qualification
 
 A capture-and-replay mode for events that stores the events in the customer's own environment would allow for (a) customers to easily test and qualify changes, and (b) a push-button way for us to spin up unbilled clusters to evaluate these replays in read-only mode.
@@ -70,12 +80,20 @@ The thorniest thing here is the release cutting and copied code. Would it be bet
 
 ## Open questions
 
-- Does **unstable**/**hot** get a version number, or no? (Cf. Debian sid)
+- How do we backport features? `git subtree`/`git filter-branch` makes it easy to copy the `optimizer` crate with history, but we will not be able to use `git cherry-pick`.
 
-- How precisely do we select features out of **unstable**/**hot** to cut a new **testing**/**warm**? If we literally copy the crate three times in the repo, `git cherry-pick` will not be straightfoward to use and it will be easy to lose history.
+- Which fixes will we backport? Correctness only? When do we backport a fix and when do we backport a "problem detector" with a notice?
 
-- Under use-case isolation, can we say select an optimizer per `environmentd` instead of per cluster?
+### Where do we select optimizer versions?
 
-- There are several occasions where the optimizer is called independent of the cluster: views, prepared statements, and as part of `INSERT`, `UPDATE`, and `DELETE`. (Look at `sequence_read_then_write` for an example.) Should these be moved off? Use the active cluster? Automatically optimize with all three?
+Are optimizer versions associated with a cluster or with an environment?
 
-- Which fixes will we backport? Correctness only? When do we backport a fix and when do we backport a "problem detector" with a notice?
+If the former, we'll have:
+
+```sql
+ALTER CLUSTER name SET (OPTIMIZER VERSION = version)
+```
+
+where `version` is a version number (or, possibly one of our three fixed names). Note that the optimizer is selected on a per cluster basis. Selecting the optimizer per cluster makes it easier to qualify optimizer releases, but raises questions about `CREATE VIEW`, `INSERT`, `UPDATE`, and `DELETE`, which run (a prefix of) the optimizer not on any particular cluster. (Look at `sequence_read_then_write` for an example.) Should these be moved off? Use the active cluster? Automatically optimize with all three?
+
+If the latter, we'll have a flag to `environmentd`. Flagged environments make it easier to go mztrail in shadow environments and avoids questions about which version of the optimizer to run when not on any particular cluster, but doesn't offer a path to users for release qualification.
\ No newline at end of file
From cf9d377ebb16e683bfe4da5b734389d86437557e Mon Sep 17 00:00:00 2001
From: Michael Greenberg
Date: Wed, 13 Nov 2024 15:17:27 -0500
Subject: [PATCH 4/4] satisfy linter

---
 doc/developer/design/20241028_optimizer_releases.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/developer/design/20241028_optimizer_releases.md b/doc/developer/design/20241028_optimizer_releases.md
index 5f655d3d596b2..5ecb543d7a37c 100644
--- a/doc/developer/design/20241028_optimizer_releases.md
+++ b/doc/developer/design/20241028_optimizer_releases.md
@@ -96,4 +96,4 @@ ALTER CLUSTER name SET (OPTIMIZER VERSION = version)
 
 where `version` is a version number (or, possibly one of our three fixed names). Note that the optimizer is selected on a per cluster basis. Selecting the optimizer per cluster makes it easier to qualify optimizer releases, but raises questions about `CREATE VIEW`, `INSERT`, `UPDATE`, and `DELETE`, which run (a prefix of) the optimizer not on any particular cluster. (Look at `sequence_read_then_write` for an example.) Should these be moved off? Use the active cluster? Automatically optimize with all three?
 
-If the latter, we'll have a flag to `environmentd`. Flagged environments make it easier to go mztrail in shadow environments and avoids questions about which version of the optimizer to run when not on any particular cluster, but doesn't offer a path to users for release qualification.
\ No newline at end of file
+If the latter, we'll have a flag to `environmentd`. Flagged environments make it easier to go mztrail in shadow environments and avoid questions about which version of the optimizer to run when not on any particular cluster, but don't offer a path to users for release qualification.
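Note on the open question of where optimizer versions are selected: the following is a minimal sketch, assuming the two mechanisms coexist, with a per-cluster setting that falls back to an environment-wide default. The doc does not commit to this design. `OptimizerVersion`, `VersionConfig`, and `resolve` are hypothetical names introduced for illustration only.

```rust
use std::collections::HashMap;

/// Hypothetical identifier for an optimizer release.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OptimizerVersion {
    /// A numbered, previously cut release (for example, V0).
    Numbered(u32),
    /// The active development branch, which never receives a number.
    Unstable,
}

/// Hypothetical configuration combining both selection mechanisms.
pub struct VersionConfig {
    /// Environment-wide default, e.g. set by a flag to `environmentd`.
    pub environment_default: OptimizerVersion,
    /// Per-cluster overrides, keyed by cluster name.
    pub cluster_overrides: HashMap<String, OptimizerVersion>,
}

impl VersionConfig {
    /// Picks the optimizer version for a statement. Statements that are not
    /// tied to a cluster pass `None` and receive the environment default.
    pub fn resolve(&self, cluster: Option<&str>) -> OptimizerVersion {
        cluster
            .and_then(|name| self.cluster_overrides.get(name).copied())
            .unwrap_or(self.environment_default)
    }
}
```

Under this assumption, a user could qualify a release on a single cluster while statements such as `CREATE VIEW`, which are not tied to any particular cluster, keep a single well-defined answer.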