Commit 2145c54

start, introduce design doc for adding columns to tables
1 parent 1ff58ef commit 2145c54

# Add Columns to Tables

- Associated: [issue#8233](https://github.com/MaterializeInc/database-issues/issues/8233)
- Associated: [pr#29694](https://github.com/MaterializeInc/materialize/pull/29694)

## The Problem

We want to support adding columns to relations in Materialize, both tables and sources. Concretely,
for tables this means supporting Postgres’ syntax of `ALTER TABLE ... ADD COLUMN ...`, and for
sources supporting something like `ALTER SOURCE ... REFRESH SCHEMA ...` that will read the schema
from the upstream source and update the relations in Materialize accordingly.

When a column is added to a relation, it should not affect objects that depend on said relation.
For example:

```sql
CREATE TABLE t1 (a int);
INSERT INTO t1 VALUES (1), (2), (3);

CREATE VIEW v1 AS SELECT * FROM t1;

ALTER TABLE t1 ADD COLUMN b text;

-- view 'v1' does not have column 'b' since it was added after 'v1' was created.
SELECT * FROM v1;
 a
---
 1
 2
 3
```

The specific problem we’re aiming to address in this design doc is how we can support evolving the
`RelationDesc` of an object while upholding existing invariants around the
`GlobalId -> RelationDesc` mapping.

## Success Criteria

We have aligned on a design that allows us to evolve the `RelationDesc` (schema) of an object in
the Adapter, Compute, and Storage layers of Materialize. This design should either conform to
the existing [Formalism](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/platform/formalism.md#materialize-formalism),
or specifically describe how and why we will update the Formalism to support necessary changes.

## Out of Scope

- Schema evolution in Persist. For all intents and purposes you can assume that Persist supports
  evolving the schema of a shard and tracking the schemas of existing Parts.
- Other types of supported schema migrations. For all intents and purposes the only kind of schema
  migration we are concerned with is adding a nullable column.
- The syntax or implementation for supporting a feature like `ALTER SOURCE ... REFRESH SCHEMA ...`.
  For all intents and purposes we are only concerned with adding columns to tables.
- Unifying existing types of object IDs, i.e. [issue#6336](https://github.com/MaterializeInc/database-issues/issues/6336)

## Context

### `GlobalId`

Within Materialize a `GlobalId` generally identifies a single object and is used as the primary key
in the Catalog as well as numerous internal data structures. `GlobalId`s are also exposed to users
via catalog tables, e.g. [`mz_tables`](https://materialize.com/docs/sql/system-catalog/mz_catalog/#mz_tables),
where it is expected that they provide a stable mapping from ID to object name.

Additionally the [Formalism](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/platform/formalism.md#globalids) defines `GlobalId`s as:

> A `GlobalId` is a globally unique identifier used in Materialize. One of the things Materialize
> can identify with a `GlobalId` is a TVC. Every `GlobalId` corresponds to at most one TVC. This
> invariant holds over all wall-clock time: `GlobalId`s are never re-bound to different TVCs.

By changing the `RelationDesc` for an object you are arguably rebinding the `GlobalId` to a new
TVC. A number of places all across our code base rely on this mapping of `GlobalId -> RelationDesc`
being stable, so we can’t modify the `RelationDesc` for a given `GlobalId`. But we also need to
provide a stable external mapping of object ID to object name, so we can’t modify the `GlobalId`
for a given object.

## Solution Proposal

### SQL Persistence

Within the Catalog we persist objects with their `create_sql` string. To track when a column was
added to a table, and what version of a table a dependent object relies on, we plan to introduce a
`VERSION` keyword. For example, our internal `create_sql` persistence will look like:

```sql
CREATE TABLE t1 (a int, b text VERSION ADDED 1);

-- view 'v1' references 't1' when it had only column 'a'
CREATE VIEW v1 AS SELECT * FROM [u1 as "materialize"."public"."t1" VERSION 0];
```

This would allow us to track the versions of a table that exist, and what version dependent objects
were initially planned against.

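As a rough illustration of how `VERSION ADDED` could be interpreted, the sketch below tags each
column with the table version that introduced it and filters by the version a dependent object was
planned against. `VersionedColumn`, `columns_at`, and `example_t1` are hypothetical names invented
for this example, and it assumes original columns are treated as added at version 0; none of this
is taken from the Materialize code base.

```rust
/// Hypothetical model of a column tagged with the table version that added it.
#[derive(Debug, Clone, PartialEq)]
struct VersionedColumn {
    name: String,
    typ: String,
    /// Table version at which the column first appears (assumed 0 for original columns).
    added: u64,
}

/// Returns the names of the columns visible to an object planned against `version`.
fn columns_at(columns: &[VersionedColumn], version: u64) -> Vec<&str> {
    columns
        .iter()
        .filter(|c| c.added <= version)
        .map(|c| c.name.as_str())
        .collect()
}

/// Builds the example table `t1 (a int, b text VERSION ADDED 1)`.
fn example_t1() -> Vec<VersionedColumn> {
    vec![
        VersionedColumn { name: "a".to_string(), typ: "int".to_string(), added: 0 },
        VersionedColumn { name: "b".to_string(), typ: "text".to_string(), added: 1 },
    ]
}
```

Under these assumptions, `columns_at(&example_t1(), 0)` yields only `a`, which matches what view
`v1` should observe, while version 1 yields both `a` and `b`.
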
### New ID Mapping

Introduce a new `CatalogItemId` that will be a stable 1:1 mapping of object name to object ID and
keep the structure of `GlobalId` exactly how it exists currently. When adding a column to a table
we will allocate a new `GlobalId` that will be a unique reference to a `(CatalogItemId, VERSION)`.
In other words, a `(CatalogItemId, VERSION)` will uniquely identify a single TVC.

This new type will have two variants which are a subset of the variants of a `GlobalId`:

```rust
enum CatalogItemId {
    // System namespace.
    System(u64),
    // User namespace.
    User(u64),
}
```

### Relationships

This allows us to introduce the following relationships between our various types:

- 1 `CatalogItemId` can reference many `GlobalId`s
- 1 `GlobalId` will reference 1 `(CatalogItemId, VERSION)`
- 1 `GlobalId` will reference at most 1 `(ShardId, SchemaId)` (Persist)
- 1 `CatalogItemId` will reference at most 1 `ShardId` (Persist)

Using our example from before we’ll have the following:

- Name `"materialize"."public"."t1"`
  - `CatalogItemId`: `u1`
  - `GlobalId`s:
    - `u1` -> `(CatalogItemId(u1), RelationDesc('a' int))`
    - `u2` -> `(CatalogItemId(u1), RelationDesc('a' int, 'b' text))`
  - `ShardId`: `sXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX`

While not necessary and possibly out of scope of this design, with this new setup I begin to
imagine a `GlobalId` as uniquely referencing a collection; in other words, `GlobalId` could be
renamed to `CollectionId`.

> Despite the text representation of both a `CatalogItemId` and `GlobalId` being `u1`, they refer
> to different things. This discrepancy already exists in our code base, e.g. `RoleId` and
> `ClusterId` both have this same text representation but refer to different things.

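The relationships between `CatalogItemId`, `VERSION`, and `GlobalId` can be sketched as a pair of
maps kept in sync. This is only an illustration under stated assumptions: `VersionMap` and its
methods are hypothetical, the `GlobalId` here is a simplified stand-in with two variants, and
versions are assumed to be assigned densely starting at 0.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum CatalogItemId {
    System(u64),
    User(u64),
}

/// Simplified stand-in for the real `GlobalId`, which has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum GlobalId {
    System(u64),
    User(u64),
}

/// Hypothetical two-way index between `(CatalogItemId, VERSION)` and `GlobalId`.
#[derive(Debug, Default)]
struct VersionMap {
    by_version: BTreeMap<(CatalogItemId, u64), GlobalId>,
    by_global: BTreeMap<GlobalId, (CatalogItemId, u64)>,
}

impl VersionMap {
    /// Binds `id` to the next version of `item` and returns that version.
    /// Returns `None` if `id` is already bound, since a `GlobalId` is never re-bound.
    fn add_version(&mut self, item: CatalogItemId, id: GlobalId) -> Option<u64> {
        if self.by_global.contains_key(&id) {
            return None;
        }
        // Count the versions already recorded for this item to get the next one.
        let version = self.by_version.range((item, 0)..=(item, u64::MAX)).count() as u64;
        self.by_version.insert((item, version), id);
        self.by_global.insert(id, (item, version));
        Some(version)
    }

    /// One `(CatalogItemId, VERSION)` identifies at most one `GlobalId`.
    fn resolve(&self, item: CatalogItemId, version: u64) -> Option<GlobalId> {
        self.by_version.get(&(item, version)).copied()
    }

    /// One `GlobalId` references exactly one `(CatalogItemId, VERSION)`.
    fn item_for(&self, id: GlobalId) -> Option<(CatalogItemId, u64)> {
        self.by_global.get(&id).copied()
    }
}
```

With the `t1` example, registering `GlobalId::User(1)` and then `GlobalId::User(2)` for
`CatalogItemId::User(1)` produces versions 0 and 1, and both `GlobalId`s map back to the same item.
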
## Implementation

`GlobalId`s are used all over the codebase: at the time of writing there are >2,000 matches for
“GlobalId” in Rust files. I will need to begin prototyping before I can speak to specifics of
exactly where `CatalogItemId`s will replace `GlobalId`s, but at a high level:

### Adapter

All current references to `GlobalId` in the Catalog will get replaced with `CatalogItemId`. The
text representation for IDs that is persisted in `create_sql` will get parsed as `CatalogItemId`s.

Planning (or possibly name resolution) is where we will convert from the `CatalogItemId(u1)`
and `VERSION` syntax in `create_sql` to `GlobalId`s.

In the durable Catalog we will reuse the existing id allocator that currently mints `GlobalId`s to
mint `CatalogItemId`s. We will create a new id allocator specifically for `GlobalId`s that will be
initialized to the same value as the original allocator; this prevents accidental `GlobalId` re-use
if they are persisted outside the Catalog. Additionally we will extend the existing
[ItemValue](https://github.com/MaterializeInc/materialize/blob/b579caa68b6d287426dead8626c0adc885205740/src/catalog/protos/objects.proto#L119-L126)
protobuf type to include a map of `VERSION -> GlobalId`. Externally, to map between `CatalogItemId`s
and `GlobalId`s, we’ll introduce a new Catalog table, `mz_internal.mz_collection_ids`.

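The allocator split described above could look roughly like the following. `IdAllocator`,
`Allocators`, and `split_allocator` are hypothetical names for this sketch and do not reflect the
durable Catalog's actual allocator API; the only point illustrated is that the new `GlobalId`
allocator starts at the old allocator's next value, so no previously minted `GlobalId` is handed
out again.

```rust
/// Hypothetical monotonically increasing id allocator.
#[derive(Debug, Clone)]
struct IdAllocator {
    next: u64,
}

impl IdAllocator {
    fn new(next: u64) -> Self {
        IdAllocator { next }
    }

    /// Returns the next id and advances the allocator.
    fn allocate(&mut self) -> u64 {
        let id = self.next;
        self.next += 1;
        id
    }
}

/// The two allocators that exist after the migration.
#[derive(Debug)]
struct Allocators {
    /// The pre-existing allocator, which now mints `CatalogItemId`s.
    item_ids: IdAllocator,
    /// The new allocator, which mints `GlobalId`s.
    global_ids: IdAllocator,
}

/// Reuses `existing` for `CatalogItemId`s and seeds a new `GlobalId` allocator
/// with the same next value, so ids minted before the split are never reused.
fn split_allocator(existing: IdAllocator) -> Allocators {
    let global_ids = IdAllocator::new(existing.next);
    Allocators { item_ids: existing, global_ids }
}
```

After the split the two sequences advance independently, which is why a `CatalogItemId` and a
`GlobalId` can share a text representation such as `u3` while referring to different things.
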
### Storage

To me this is the largest unknown. The Storage Controller operates with `GlobalId`s which currently
have a 1:1 mapping with Persist’s `ShardId`s. This design calls for many `GlobalId`s to be able to
reference a single `ShardId`, which breaks the existing relationship.

The Storage Controller will need to continue to use `GlobalId`s for operations like rendering a
source, but it will also need some careful management of Persist Handles. For example, if there are
two open `WriteHandle`s to the same Persist Shard, writes to one of them would implicitly advance
the frontier of the other. Similarly, dropping a `GlobalId` must not finalize the underlying
Persist Shard if there are other `GlobalId`s that still reference said shard.

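One way to picture the drop behavior is a reference count from shards to the `GlobalId`s that use
them. The sketch below is an assumption-laden illustration, not the Storage Controller's design:
`ShardRefs` is a hypothetical type, `GlobalId` and `ShardId` are plain aliases standing in for the
real types, and it assumes each `GlobalId` is registered at most once.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-ins for the real types, used only for this sketch.
type GlobalId = u64;
type ShardId = String;

/// Hypothetical bookkeeping of which `GlobalId`s reference which shard.
#[derive(Debug, Default)]
struct ShardRefs {
    shard_of: BTreeMap<GlobalId, ShardId>,
    users_of: BTreeMap<ShardId, BTreeSet<GlobalId>>,
}

impl ShardRefs {
    /// Records that `id` reads from or writes to `shard`.
    fn register(&mut self, id: GlobalId, shard: &str) {
        self.shard_of.insert(id, shard.to_string());
        self.users_of.entry(shard.to_string()).or_default().insert(id);
    }

    /// Drops `id`. Returns the shard to finalize only if `id` was its last user.
    fn drop_id(&mut self, id: GlobalId) -> Option<ShardId> {
        let shard = self.shard_of.remove(&id)?;
        let users = self.users_of.get_mut(&shard)?;
        users.remove(&id);
        if users.is_empty() {
            self.users_of.remove(&shard);
            Some(shard)
        } else {
            // Other `GlobalId`s still reference the shard, so it must stay alive.
            None
        }
    }
}
```

Dropping the first of two `GlobalId`s that share a shard returns `None`, and only dropping the
second returns the shard as safe to finalize.
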
### Compute

Our Compute layer will operate entirely on `GlobalId`s. Other than some refactoring of what Catalog
APIs our Compute layer uses, I don’t anticipate any material changes here.

## Minimal Viable Prototype

So far I have prototyped two alternate approaches, and am currently working on implementing the
approach that introduces a new `CatalogItemId` type.

- [pr#29694](https://github.com/MaterializeInc/materialize/pull/29694) implements adding columns
  to tables by changing the `RelationDesc` associated with a Table and applying a projection on top
  to expose only the relevant columns.
- [pr#30018](https://github.com/MaterializeInc/materialize/pull/30018) is stacked on top of
  `pr#29694` (only look at the last commit). It implements adding columns to tables by adding
  `GlobalId` "aliases" to Tables, so when a Table is altered we create a new `GlobalId`, and thus
  multiple `GlobalId`s can be associated with a single table.

## Alternatives

### Always Apply a Projection on top of a Source

If changing the shape of data in a TVC is not considered as creating a new TVC, then arguably
changing the `RelationDesc` of an object in Materialize would not be rebinding the `GlobalId`
of the object. This shrinks the theoretical scope of the problem to just constraining what columns
are used when planning objects. For example, when restarting Materialize we need to make sure that
when re-planning objects, they’re planned against the same `RelationDesc` that was used when they
were originally created.

We can achieve this by threading through the correct `RelationDesc` in planning, and always
applying a projection on top of the operator that reads data. This technique has been prototyped in
[#29694](https://github.com/MaterializeInc/materialize/pull/29694). (Note: the test failures in
this PR are related to explain plans; notably, the new
[alter-table.slt](https://github.com/MaterializeInc/materialize/pull/29694/files#diff-dff9699da8a6f3f1934d56574b8b3c8e47088e8149cd10847a459e9343d18f56) passes.)

Practically, an issue with this solution is that a number of places within the codebase rely on
the `GlobalId -> RelationDesc` mapping being stable. For example, an issue not solved in that PR is
how to handle Indexes that are created on Tables. While existing test cases pass, there are plenty
more things that could break because of violating this invariant.

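To make the projection idea concrete, here is a minimal sketch that derives a projection from the
column names an object was planned against and applies it to rows read from the wider relation.
`projection_for` and `apply_projection` are hypothetical helpers written for this example; matching
columns by name is an assumption, and the prototype in the PR may thread the `RelationDesc` through
planning quite differently.

```rust
/// Computes, for each column the object was planned against, its index in the
/// current (possibly wider) relation. Returns `None` if a planned column is missing.
fn projection_for(current: &[&str], planned: &[&str]) -> Option<Vec<usize>> {
    planned
        .iter()
        .map(|name| current.iter().position(|c| c == name))
        .collect()
}

/// Applies a projection to one row, keeping only the selected columns in order.
/// Panics if an index is out of bounds for `row`.
fn apply_projection<T: Clone>(row: &[T], projection: &[usize]) -> Vec<T> {
    projection.iter().map(|&i| row[i].clone()).collect()
}
```

For view `v1`, planned against `['a']` while `t1` now has `['a', 'b']`, the projection is `[0]`,
so rows read from `t1` are narrowed back to the single column `a`.
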
### Update the representation of a `GlobalId`

Instead of introducing a new `CatalogItemId` we could extend `GlobalId` to include version
information. For example:

```rust
// Current
enum GlobalId {
    // ... snipped
    User(u64),
}

// Alternate Approach
enum GlobalId {
    // ... snipped
    User(u64, u64),
}
```

Where the second `u64` in `GlobalId::User` would contain this new version information.

Outside of tests, everything in our codebase handles `GlobalId`s opaquely; nothing looks at the
inner value. Just adding more to the `GlobalId::User` variant would be a relatively small change
compared to adding a new ID type, but it would require more logical changes in the Adapter and
Storage layers. The Adapter still needs to maintain a stable mapping from object ID to object name
and Storage probably still needs to de-duplicate between `GlobalId`s and Persist’s `ShardId`s, both
of which would require looking at the inner value of the otherwise opaque `GlobalId`.

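The cost described above, having to look inside the otherwise opaque ID, can be seen in a small
sketch. The `object_key` and `same_object` methods are hypothetical and only the `User` variant is
modeled; the example assumes the first `u64` is the stable object identifier and the second is the
version.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum GlobalId {
    // Other variants are omitted in this sketch.
    /// `(object id, version)`; the layout is an assumption for illustration.
    User(u64, u64),
}

impl GlobalId {
    /// The part of the id that stays stable when a column is added.
    fn object_key(&self) -> u64 {
        match self {
            GlobalId::User(object, _version) => *object,
        }
    }

    /// Two ids can differ as values yet still refer to the same table.
    fn same_object(&self, other: &GlobalId) -> bool {
        self.object_key() == other.object_key()
    }
}
```

Equality on the whole value no longer answers "is this the same table?", so callers such as the
Adapter's name mapping or Storage's shard de-duplication would need helpers like `same_object`.
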
### Re-use `GlobalId`, allow multiple `GlobalId`s to refer to a single Table.

This is a combination of the proposed approach and the above alternative: instead of creating a new
`CatalogItemId` type or modifying the existing `GlobalId` type, just allow multiple `GlobalId`s to
refer to a single object. This can be modeled as "aliases" to a single object.

This approach requires the fewest code changes, but introduces the most ambiguity into the code
base. There are existing code paths that expect a `GlobalId` to uniquely refer to an object, e.g.
in the Catalog when dropping an object or maintaining `CriticalSinceHandle`s in the Storage
Controller. If we allow multiple `GlobalId`s to refer to a single object then the onus of making
sure we pass the _right_ `GlobalId`, or don't pass multiple `GlobalId`s that refer to the same
object, is put on the programmer, whereas introducing a new `CatalogItemId` type designs away these
invalid states.

## Open questions

1. N/a
