adapter: Expression cache design doc #29908
Conversation
This commit adds a design doc for an optimized expression cache. Works towards resolving #MaterializeInc/database-issues/issues/8384
Flagging that this needs close coordination with the compute team.
- deployment generation
- object global ID
- expression type (local MIR, global MIR, LIR, etc.)
Keep in mind we don't have a serialization format for MIR, and the one for LIR is not designed to be cached.
One thing I should have emphasized is the fact that this is keyed by deployment generation. That means that deployment generation `n` will never look at a serialized expression from deployment generation `m`, s.t. `n != m`. As a consequence, the serialized representation does not need to be stable across versions and can be wildly different. So we can invent any serialization specifically for the cache without needing to add any guarantees across versions.
Yes, this seems fine. (Just be prepared that this will be a lot of typing.)
Is it not as simple as using bincode and relying on the existing `#[derive(Serialize, Deserialize)]` derivations?
> Just be prepared that this will be a lot of typing
Implementing a serialization format will be a lot of typing?
Oh, ok! I thought it would need to be manual protobuf (`impl RustType ...`). If it can simply be `#[derive(Serialize, Deserialize)]`, then it's easy.
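For illustration, a minimal sketch of the bincode route, using a hypothetical `CachedExpression` stand-in for the real plan types (which already carry the serde derives):

```Rust
use serde::{Deserialize, Serialize};

// Hypothetical stand-in for a cached expression; the real value would be one
// of the optimizer's plan types with the existing Serialize/Deserialize derives.
#[derive(Debug, PartialEq, Serialize, Deserialize)]
struct CachedExpression {
    global_id: u64,
    plan: Vec<String>,
}

fn roundtrip(expr: &CachedExpression) -> bincode::Result<CachedExpression> {
    // The encoding doesn't need to be stable across versions because entries
    // are never read across deploy generations.
    let bytes = bincode::serialize(expr)?;
    bincode::deserialize(&bytes)
}
```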
- No need to worry about coordination across K8s pods.
- Bulk deletion is a simple directory delete.

#### Cons
An idea that Aljoscha and I had in our 1:1 earlier today is to directly use persist's FileBlob for this. It's extremely well tested (most of CI uses it for persist) and solves at least some of these cons.
- Need to worry about mocking things in memory for tests.
- If we lose the pod, then we also lose the cache.

### Persist implementation
A second idea that Aljoscha and I had in our 1:1 today is to use persist, but not with the normal consensus and blob impls (e.g. use FileBlob, ??? on the Consensus impl). That could potentially get you a persist impl where you don't need to worry about coordination (e.g. if they're both pointed at local fs).
- Need to worry about flushing/fsync.
- Need to worry about concurrency.
- Need to worry about atomicity.
The standard approach here is to rely on the atomicity of `rename` to ensure that only a fully written/synced file will be considered part of the cache:

```Rust
use std::fs::{self, File};
use std::io::Write;

// Write to a temp file, sync it to disk, then atomically rename into place.
let mut f = File::create("path/to/entry.tmp")?;
f.write_all(contents)?;
f.sync_all()?;
fs::rename("path/to/entry.tmp", "path/to/entry")?;
```
- Need to worry about flushing/fsync.
- Need to worry about concurrency.
- Need to worry about atomicity.
- Need to worry about mocking things in memory for tests.
Tests can assume access to the filesystem, so I think this is as simple as creating a temporary scratch directory in the test harness.
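As a sketch of that, assuming the tempfile crate is available to the test harness (the file names here are purely illustrative):

```Rust
use tempfile::TempDir;

// Scratch directory that is removed when it goes out of scope at test end.
let cache_dir = TempDir::new()?;
let entry = cache_dir.path().join("entry");
std::fs::write(&entry, b"serialized expression")?;
assert!(entry.exists());
```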
- Need to worry about concurrency.
- Need to worry about atomicity.
- Need to worry about mocking things in memory for tests.
- If we lose the pod, then we also lose the cache.
This seems like the most substantive downside to me. Doesn't help you with a node failure.
Another potential implementation is via persist. Each cache entry would be keyed by `(deploy_generation, global_id, expression_type)` and the value would be a serialized version of the expression.
My worry with using a single persist shard here is that it seems like we might run into the same compaction problems that we did when we tried to remove the storage usage entries from the catalog shard. Do we have confidence that the usage pattern here won't result in a slow pileup of entries in the persist shard that have to be fetched/filtered through? The cache will be a lot less impactful if it winds up taking multiple seconds to read.
Another option here that uses persist is to mint a new shard for each deploy generation. That seems tricky to manage though, because you'd want to finalize the shards for any past/failed deploy generations, which is tricky to keep track of. You'd also slowly accumulate finalized shard tombstones in CRDB—one per deploy generation.
Added some text about this.
implementing.

```Rust
trait ExpressionCache {
```
What are the durability semantics of `insert_expression`? Does it guarantee an `fsync`/`compare_and_append`? That would be unusably slow if called sequentially.

I think what you probably want is best-effort semantics: `insert_expression` doesn't guarantee flushing the cache on its own, but there's a background task that periodically batches up writes/fsyncs to the cache. Should work well enough since the cache is just a perf optimization, and it's not the end of the world if not everything gets flushed to the cache before restart. In the common case for R/O there will be plenty of time for the cache to flush in the background while clusters are catching up.
What I was thinking is that insert_expression
could immediately update the in-memory cache and then return a future that completes once the insert has been made durable. Then it's up to the caller whether or not they want to wait or just send the future into the background.
I've also updated the method to accept multiple entries in a single call. I think to do additional batching in the background might be overkill since DDL will be fairly rare. Thoughts?
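A rough sketch of what that shape could look like; the names and types here are hypothetical placeholders, not the signature from the design doc:

```Rust
use std::future::Future;

// Placeholder types for the sketch; the real ones live in the adapter crate.
struct GlobalId(u64);
struct CacheEntry(Vec<u8>);

trait ExpressionCache {
    // Updates the in-memory cache immediately and returns a future that
    // completes once the new entries have been made durable; the caller can
    // either await it or push it into the background.
    fn insert_expressions(
        &mut self,
        entries: Vec<(GlobalId, CacheEntry)>,
    ) -> impl Future<Output = ()> + Send;
}
```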
Makes sense on all fronts! Whether additional batching happens or not is an implementation detail anyway that will depend on which cache implementation (persist vs files) you go with.
### DDL - Drop
1. Execute catalog transaction.
2. Invalidate cache entries via `ExpressionCache::invalidate_entries`.
3. Re-compute and repopulate cache entries that depended on dropped entries via
Is it worth the trouble of recomputing eagerly?
Heh, we had the same thought at exactly the same moment! #29908 (comment)
👯
I think we're going to leave this out for now, but my gut take is that it'll be worth it eventually. Re-calculating for a handful of objects will be cheap and it will help reduce recovery times in case of a crash.
Yes, agreed!
Below is a detailed set of steps that will happen in startup.

1. Call `ExpressionCache::reconcile` to remove any invalid entries.
2. While opening the catalog, for each object:
I'm a little confused by this. The design above, where `(generation, global_id, expr)` form a key that maps to a blob of optimized code, makes sense to drop in at various levels of calls to the `Optimizer`, which will, say, map HIR to one of the MIR newtypes, or MIR to LIR.

But the design here seems like we'd be working at a higher level, skipping the `Optimizer` entirely. In that case we could have a single, top-level cache from SQL straight to LIR or `FlatPlan`.

I think both designs are sensible. The former lets us make fewer changes; the latter reduces effort (if we have a cache hit from HIR to MIR, surely we can expect hits from the rest of the pipeline, too!).
> But the design here seems like we'd be working at a higher level, skipping the `Optimizer` entirely. In that case we could have a single, top-level cache from SQL straight to LIR or `FlatPlan`.

Yes, as I understand it the plan is to skip the `Optimizer` entirely. But we still need these intermediate plan types because the catalog hangs on to the intermediate plans:
materialize/src/adapter/src/catalog.rs
Lines 145 to 151 in bff2319
```Rust
#[derive(Default, Debug, Clone)]
pub struct CatalogPlans {
    optimized_plan_by_id: BTreeMap<GlobalId, Arc<DataflowDescription<OptimizedMirRelationExpr>>>,
    physical_plan_by_id: BTreeMap<GlobalId, Arc<DataflowDescription<mz_compute_types::plan::Plan>>>,
    dataflow_metainfos: BTreeMap<GlobalId, DataflowMetainfo<Arc<OptimizerNotice>>>,
    notices_by_dep_id: BTreeMap<GlobalId, SmallVec<[Arc<OptimizerNotice>; 4]>>,
}
```
I think it's for something with `EXPLAIN`.
I see. So we could think of the cache as mapping `(generation, global_id, sql_expression)` to a tuple `(optimized_mir, physical_plan, metainfo, notices)`, then?
> Yes, as I understand it the plan is to skip the `Optimizer` entirely. But we still need these intermediate plan types because the catalog hangs on to the intermediate plans:

Yes, exactly.

> I see. So we could think of the cache as mapping `(generation, global_id, sql_expression)` to a tuple `(optimized_mir, physical_plan, metainfo, notices)`, then?

Yes, but I don't think `sql_expression` is necessary here since within a deploy generation the SQL of a global ID can never change.
I pushed an update to the API to make this more explicit. The big change is that I've combined all the expression types into a single struct that will be stored as a single blob, instead of storing each expression type separately.
The benefit of the old approach was that when an index was dropped, we could re-write the global optimized expressions without having to also re-write the local optimized expressions. The benefit of the new approach is that the assumption that either all expression types for a given global ID exist or none exist becomes much more explicit.
I'm still thinking about which approach is better, but for now I think the new approach is.
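To illustrate the all-or-nothing shape (the field names and exact types below are guesses based on the `CatalogPlans` excerpt above, shown without imports; this is not the struct from the design doc), the combined blob could look roughly like:

```Rust
// Hypothetical combined cache value: all expression types for a global ID are
// written and invalidated together, so a cache hit implies every field exists.
#[derive(Serialize, Deserialize)]
struct Expressions {
    local_mir: OptimizedMirRelationExpr,
    global_mir: DataflowDescription<OptimizedMirRelationExpr>,
    physical_plan: DataflowDescription<Plan>,
    dataflow_metainfo: DataflowMetainfo,
    notices: Vec<OptimizerNotice>,
}
```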
Great! Another point in favor of the all-or-nothing approach: no implicit invariant around the validity of the generated intermediate expressions.
I like the new approach better as well.
Overall, it looks fine! But I think we need some more cache invalidations, unfortunately.
Unless someone feels strongly, I'm fully committed to the persist implementation. I've just pushed an update to describe the implementation in more detail. The main reason is to eliminate downtime when we lose the environmentd pod. The main motivation for startup time perf improvements is to eliminate downtime during 0dt upgrades AND during envd failures, otherwise we would have gone with the graceful cut-over approach. So I think this is a good motivation. CC @danhhz, @benesch, @aljoscha in case any of you have strong opinions.

Ben just merged the better version of the persist force compaction tool, so I think the biggest drawback of the persist shard approach ("what if we end up in the same situation as the catalog shard") is now pretty de-risked.

If persist is up for us using `dangerously_force_compaction` on the cache shard (and fixing any bugs that might remain), works for me!
I didn't trace through the exact steps of each phase here, but the general shape of this now looks great. Thanks very much for writing this design doc up and iterating on it. Feel really good about where we landed.
- `o`.
- All compute objects that depend directly on `o`.
- All compute objects that would directly depend on `o`, if all views were inlined.
There should be an optional "re-compute and repopulate cache entries that could use the new index" step here, yeah?
Good point, added.