compute: spillable MV correction buffer #30083

Open · wants to merge 3 commits into main from correction-lgalloc

Conversation

@teskje teskje commented Oct 18, 2024

This PR introduces a "correction v2" buffer that differs from the existing one in that it stores data in columnated regions that can be spilled to disk. It follows the general design of the arrangement merge batcher, with the major difference that it sorts updates by time first, in an attempt to deal more efficiently with updates in the future (commonly introduced by temporal filters).
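
For readers who want a mental model of the structure described above, here is a minimal, simplified sketch. It uses plain `Vec`s as stand-ins for the columnated, lgalloc-backed regions, and the names, chunk size, and timestamp/diff types are assumptions for exposition, not the PR's actual code.

```rust
// Simplified sketch of the "correction v2" layout described above. Plain `Vec`s
// stand in for the columnated, spillable regions; the real types differ.
type Timestamp = u64; // stand-in for the real timestamp type
type Diff = i64;

/// A fixed-size block of updates (spillable to disk in the real implementation).
struct Chunk<D> {
    data: Vec<(D, Timestamp, Diff)>,
}

/// A sequence of chunks whose updates are sorted by (time, data).
struct Chain<D> {
    chunks: Vec<Chunk<D>>,
}

/// The correction buffer: a set of chains, merged on demand, following the
/// general design of the arrangement merge batcher.
struct CorrectionV2<D> {
    chains: Vec<Chain<D>>,
}

impl<D: Ord + Clone> CorrectionV2<D> {
    /// Insert a batch of updates as a new sorted chain.
    fn insert(&mut self, mut updates: Vec<(D, Timestamp, Diff)>) {
        // Sorting by time first keeps future updates (e.g. from temporal filters)
        // at the end of each chain, where consolidation can often skip them.
        updates.sort_by(|(d1, t1, _), (d2, t2, _)| (t1, d1).cmp(&(t2, d2)));
        let chunks = updates
            .chunks(1024) // assumed chunk size, for illustration only
            .map(|c| Chunk { data: c.to_vec() })
            .collect();
        self.chains.push(Chain { chunks });
        // The real implementation would now merge chains to restore its invariants.
    }
}
```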

The new correction buffer can be switched on through a feature flag, enable_compute_correction_v2, and is switched off by default. The plan is to keep it disabled in production but have it available for emergencies where replicas fail to hydrate because of the memory spike caused by the MV correction buffer. Eventually we'll want to make the new correction buffer the default, but we should do more performance testing before that.

Motivation

  • This PR adds a known-desirable feature.

Part of https://github.com/MaterializeInc/database-issues/issues/8464

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@teskje teskje force-pushed the correction-lgalloc branch 2 times, most recently from ffdf64d to 77afd64 on October 21, 2024 08:31
@teskje teskje force-pushed the correction-lgalloc branch 2 times, most recently from 25b0987 to 0b2e359 on October 25, 2024 12:13
@teskje teskje force-pushed the correction-lgalloc branch 4 times, most recently from ffb5488 to 4697f9c on November 25, 2024 10:22
@teskje teskje force-pushed the correction-lgalloc branch 3 times, most recently from 4356350 to 185c99c on December 1, 2024 12:06
@teskje teskje force-pushed the correction-lgalloc branch 2 times, most recently from e95c32f to 49c0569 on January 3, 2025 11:39
teskje commented Jan 3, 2025

The feature benchmarks report a bunch of regressions. Some increased CPU and memory usage is expected, but the memory regressions here are worryingly large:

NAME                                | TYPE            |      THIS       |      OTHER      |  UNIT  | THRESHOLD  |  Regression?  | 'THIS' is
--------------------------------------------------------------------------------------------------------------------------------------------------------
FastPathOrderByLimit                | wallclock       |           0.827 |           0.871 |   s    |    10%     |      no       | better:  5.0% faster
FastPathOrderByLimit                | memory_mz       |         467.682 |         464.535 |   MB   |    20%     |      no       | worse:   0.7% more
FastPathOrderByLimit                | memory_clusterd |         959.396 |         149.727 |   MB   |    50%     |    !!YES!!    | worse:   6.4 TIMES more
OrderBy                             | wallclock       |           9.010 |           8.825 |   s    |    10%     |      no       | worse:   2.1% slower
OrderBy                             | memory_mz       |         473.213 |         473.499 |   MB   |    20%     |      no       | better:  0.1% less
OrderBy                             | memory_clusterd |        1179.695 |         307.178 |   MB   |    50%     |    !!YES!!    | worse:   3.8 TIMES more
DifferentialJoin                    | wallclock       |           1.396 |           1.401 |   s    |    10%     |      no       | better:  0.4% faster
DifferentialJoin                    | memory_mz       |         455.093 |         466.061 |   MB   |    20%     |      no       | better:  2.4% less
DifferentialJoin                    | memory_clusterd |         930.405 |         144.672 |   MB   |    50%     |    !!YES!!    | worse:   6.4 TIMES more
FastPathFilterIndex                 | wallclock       |           3.423 |           4.440 |   s    |    10%     |      no       | better: 22.9% faster
FastPathFilterIndex                 | memory_mz       |         470.448 |         465.775 |   MB   |    20%     |      no       | worse:   1.0% more
FastPathFilterIndex                 | memory_clusterd |         935.078 |         155.830 |   MB   |    50%     |    !!YES!!    | worse:   6.0 TIMES more
CrossJoin                           | wallclock       |           3.416 |           2.429 |   s    |    10%     |    !!YES!!    | worse:  40.7% slower
CrossJoin                           | memory_mz       |         468.540 |         463.772 |   MB   |    20%     |      no       | worse:   1.0% more
CrossJoin                           | memory_clusterd |         918.388 |         205.231 |   MB   |    50%     |    !!YES!!    | worse:   4.5 TIMES more
FullOuterJoin                       | wallclock       |           9.714 |           8.691 |   s    |    10%     |    !!YES!!    | worse:  11.8% slower
FullOuterJoin                       | memory_mz       |         461.483 |         468.254 |   MB   |    20%     |      no       | better:  1.4% less
FullOuterJoin                       | memory_clusterd |        1033.783 |         220.871 |   MB   |    50%     |    !!YES!!    | worse:   4.7 TIMES more
MFPPushdown                         | wallclock       |           0.356 |           0.363 |   s    |    10%     |      no       | better:  2.2% faster
MFPPushdown                         | memory_mz       |         470.924 |         466.919 |   MB   |    20%     |      no       | worse:   0.9% more
MFPPushdown                         | memory_clusterd |        6407.738 |         145.721 |   MB   |    50%     |    !!YES!!    | worse:  44.0 TIMES more
CountDistinct                       | wallclock       |           1.807 |           1.802 |   s    |    10%     |      no       | worse:   0.3% slower
CountDistinct                       | memory_mz       |         464.344 |         462.818 |   MB   |    20%     |      no       | worse:   0.3% more
CountDistinct                       | memory_clusterd |         727.177 |         150.299 |   MB   |    50%     |    !!YES!!    | worse:   4.8 TIMES more
AccumulateReductions                | wallclock       |          65.003 |          50.974 |   s    |    10%     |    !!YES!!    | worse:  27.5% slower
AccumulateReductions                | memory_mz       |        5145.073 |        5155.563 |   MB   |    20%     |      no       | better:  0.2% less
AccumulateReductions                | memory_clusterd |        7984.161 |         144.672 |   MB   |    50%     |    !!YES!!    | worse:  55.2 TIMES more
FastPathFilterNoIndex               | wallclock       |           1.309 |           1.316 |   s    |    10%     |      no       | better:  0.5% faster
FastPathFilterNoIndex               | memory_mz       |         501.537 |         488.853 |   MB   |    20%     |      no       | worse:   2.6% more
FastPathFilterNoIndex               | memory_clusterd |        6087.303 |         206.852 |   MB   |    50%     |    !!YES!!    | worse:  29.4 TIMES more
GroupBy                             | wallclock       |           4.553 |           4.545 |   s    |    10%     |      no       | worse:   0.2% slower
GroupBy                             | memory_mz       |         459.099 |         464.630 |   MB   |    20%     |      no       | better:  1.2% less
GroupBy                             | memory_clusterd |         960.350 |         244.617 |   MB   |    50%     |    !!YES!!    | worse:   3.9 TIMES more
FinishOrderByLimit                  | wallclock       |           1.361 |           1.386 |   s    |    10%     |      no       | better:  1.8% faster
FinishOrderByLimit                  | memory_mz       |         467.396 |         463.486 |   MB   |    20%     |      no       | worse:   0.8% more
FinishOrderByLimit                  | memory_clusterd |         916.767 |         158.215 |   MB   |    50%     |    !!YES!!    | worse:   5.8 TIMES more
GroupByMaintained                   | wallclock       |          16.215 |          16.302 |   s    |    10%     |      no       | better:  0.5% faster
GroupByMaintained                   | memory_mz       |         478.172 |         475.502 |   MB   |    20%     |      no       | worse:   0.6% more
GroupByMaintained                   | memory_clusterd |        1375.198 |         510.120 |   MB   |    50%     |    !!YES!!    | worse:   2.7 TIMES more
MinMaxMaintained                    | wallclock       |           6.351 |           6.444 |   s    |    10%     |      no       | better:  1.4% faster
MinMaxMaintained                    | memory_mz       |         465.298 |         468.063 |   MB   |    20%     |      no       | better:  0.6% less
MinMaxMaintained                    | memory_clusterd |        1131.058 |         296.307 |   MB   |    50%     |    !!YES!!    | worse:   3.8 TIMES more
MinMax                              | wallclock       |           1.538 |           1.582 |   s    |    10%     |      no       | better:  2.8% faster
MinMax                              | memory_mz       |         464.916 |         459.671 |   MB   |    20%     |      no       | worse:   1.1% more
MinMax                              | memory_clusterd |        1009.941 |         191.689 |   MB   |    50%     |    !!YES!!    | worse:   5.3 TIMES more
FastPathLimit                       | wallclock       |           0.315 |           0.322 |   s    |    10%     |      no       | better:  2.1% faster
FastPathLimit                       | memory_mz       |         469.971 |         470.257 |   MB   |    20%     |      no       | better:  0.1% less
FastPathLimit                       | memory_clusterd |        1018.524 |         216.293 |   MB   |    50%     |    !!YES!!    | worse:   4.7 TIMES more
DeltaJoinMaintained                 | wallclock       |           2.521 |           2.581 |   s    |    10%     |      no       | better:  2.3% faster
DeltaJoinMaintained                 | memory_mz       |         463.009 |         464.535 |   MB   |    20%     |      no       | better:  0.3% less
DeltaJoinMaintained                 | memory_clusterd |         989.914 |         203.991 |   MB   |    50%     |    !!YES!!    | worse:   4.9 TIMES more
DeltaJoin                           | wallclock       |           2.303 |           2.330 |   s    |    10%     |      no       | better:  1.2% faster
DeltaJoin                           | memory_mz       |         456.619 |         464.916 |   MB   |    20%     |      no       | better:  1.8% less
DeltaJoin                           | memory_clusterd |         776.482 |         161.076 |   MB   |    50%     |    !!YES!!    | worse:   4.8 TIMES more
Retraction                          | wallclock       |           5.739 |           4.276 |   s    |    10%     |    !!YES!!    | worse:  34.2% slower
Retraction                          | memory_mz       |         467.777 |         468.922 |   MB   |    20%     |      no       | better:  0.2% less
Retraction                          | memory_clusterd |        1175.880 |         282.764 |   MB   |    50%     |    !!YES!!    | worse:   4.2 TIMES more

I have a suspicion that this isn't an actual regression but a result of the usage of lgalloc, which holds onto allocated memory and only releases it slowly over time. The memory_clusterd measurement is made by inspecting the Docker memory usage after the test has completed, at which point the old implementation will have freed most of the correction buffer memory while the new implementation still retains some of it in lgalloc. I should be able to validate that suspicion by switching off lgalloc.

@teskje teskje force-pushed the correction-lgalloc branch from 49c0569 to e01a7dc on January 6, 2025 12:11
teskje commented Jan 8, 2025

I retried the feature benchmarks with lgalloc disabled and things do look better:

NAME                                | TYPE            |      THIS       |      OTHER      |  UNIT  | THRESHOLD  |  Regression?  | 'THIS' is
--------------------------------------------------------------------------------------------------------------------------------------------------------
CrossJoin                           | wallclock       |           4.623 |           2.237 |   s    |    10%     |    !!YES!!    | worse:   2.1 TIMES slower
CrossJoin                           | memory_mz       |         468.159 |         476.074 |   MB   |    20%     |      no       | better:  1.7% less
CrossJoin                           | memory_clusterd |         213.051 |         157.261 |   MB   |    50%     |      no       | worse:  35.5% more
FastPathOrderByLimit                | wallclock       |           0.846 |           0.845 |   s    |    10%     |      no       | worse:   0.2% slower
FastPathOrderByLimit                | memory_mz       |         463.104 |         474.739 |   MB   |    20%     |      no       | better:  2.5% less
FastPathOrderByLimit                | memory_clusterd |         133.514 |          87.414 |   MB   |    50%     |    !!YES!!    | worse:  52.7% more
FastPathFilterIndex                 | wallclock       |           3.614 |           3.142 |   s    |    10%     |    !!YES!!    | worse:  15.0% slower
FastPathFilterIndex                 | memory_mz       |         481.033 |         466.251 |   MB   |    20%     |      no       | worse:   3.2% more
FastPathFilterIndex                 | memory_clusterd |         139.046 |          93.822 |   MB   |    50%     |      no       | worse:  48.2% more
AccumulateReductions                | wallclock       |          79.178 |          51.314 |   s    |    10%     |    !!YES!!    | worse:  54.3% slower
AccumulateReductions                | memory_mz       |        5117.416 |        5090.714 |   MB   |    20%     |      no       | worse:   0.5% more
AccumulateReductions                | memory_clusterd |         563.717 |          88.825 |   MB   |    50%     |    !!YES!!    | worse:   6.3 TIMES more
FullOuterJoin                       | wallclock       |           9.918 |           8.866 |   s    |    10%     |    !!YES!!    | worse:  11.9% slower
FullOuterJoin                       | memory_mz       |         458.431 |         460.243 |   MB   |    20%     |      no       | better:  0.4% less
FullOuterJoin                       | memory_clusterd |         195.122 |         161.076 |   MB   |    50%     |      no       | worse:  21.1% more
MFPPushdown                         | wallclock       |           0.343 |           0.363 |   s    |    10%     |      no       | better:  5.4% faster
MFPPushdown                         | memory_mz       |         482.559 |         473.213 |   MB   |    20%     |      no       | worse:   2.0% more
MFPPushdown                         | memory_clusterd |         464.916 |          83.294 |   MB   |    50%     |    !!YES!!    | worse:   5.6 TIMES more
FastPathFilterNoIndex               | wallclock       |           1.275 |           1.503 |   s    |    10%     |      no       | better: 15.1% faster
FastPathFilterNoIndex               | memory_mz       |         508.404 |         494.099 |   MB   |    20%     |      no       | worse:   2.9% more
FastPathFilterNoIndex               | memory_clusterd |         537.968 |         139.618 |   MB   |    50%     |    !!YES!!    | worse:   3.9 TIMES more
Retraction                          | wallclock       |           6.622 |           4.453 |   s    |    10%     |    !!YES!!    | worse:  48.7% slower
Retraction                          | memory_mz       |         458.622 |         467.682 |   MB   |    20%     |      no       | better:  1.9% less
Retraction                          | memory_clusterd |         300.980 |         175.953 |   MB   |    50%     |    !!YES!!    | worse:  71.1% more

There is still a higher-than-expected memory regression for some of these; I want to look into those.

@teskje teskje force-pushed the correction-lgalloc branch from 9d51f8f to c8846d6 on January 15, 2025 12:30
@teskje teskje changed the title from "[WIP] compute: spillable MV correction buffer" to "compute: spillable MV correction buffer" on Jan 15, 2025
let mut heap = MergeHeap::from_iter(cursors);
let mut merged = Chain::default();
while let Some(cursor1) = heap.pop() {
let (data, time, mut diff) = cursor1.get();
teskje (Contributor Author) commented:

Here we'd have the option to reuse whole chunks, if all updates in the current chunk are less than the first update in the next cursor on the heap. However, if we did this naively we could end up with a lot of empty space in our chunks, and therefore with chains that have many more chunks than they would need if their updates were tightly packed.
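
To make the condition concrete, here is a rough sketch of that fast path. The slice-based cursor shape is an assumption for illustration and does not match the PR's actual cursor API.

```rust
// Hypothetical sketch of the chunk-reuse fast path discussed above; the
// slice-based "cursor" shape is assumed for illustration only.
fn can_reuse_chunk<D: Ord>(
    current_chunk: &[(D, u64, i64)],
    next_cursor_head: Option<&(D, u64, i64)>,
) -> bool {
    // The whole chunk could be appended to the merged output unchanged if its
    // largest update (the last one, since chunks are sorted by (time, data)) is
    // still smaller than the first update of the next cursor on the heap.
    match (current_chunk.last(), next_cursor_head) {
        (Some((d1, t1, _)), Some((d2, t2, _))) => (t1, d1) < (t2, d2),
        (Some(_), None) => true, // no competing cursor left
        (None, _) => false,      // empty chunk, nothing to reuse
    }
}
```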

Comment on lines +80 to +88
//! Unfortunately, performing consolidation as described above can break the chain invariant and we
//! might need to restore it by merging chains, including ones containing future updates. This is
//! something that would be great to fix! In the meantime the hope is that in steady state it
//! doesn't matter too much because either there are no future retractions and U is approximately
//! equal to N, or the amount of future retractions is much larger than the amount of current
//! changes, in which case removing the current changes has a good chance of leaving the chain
//! invariant intact.
teskje (Contributor Author) commented:

This is, I think, the main wrinkle in the current implementation. It'd be nice to never have to look at future updates when consolidating the current ones. All alternatives I've come up with so far are not great:

  • Don't consolidate the correction updates at all, just provide a read-only merging iterator (see the sketch below). That's sufficient for the MV sink to work, but it leaves open when the correction contents actually get consolidated. Inserts will trigger merges, but if few inserts happen, we can't rely on those. We'd need some form of idle merging, but that adds significant complexity.
  • Skip restoring the chain invariant and rely on subsequent inserts to do so. It's not clear to me whether that actually improves anything or just moves work from one operation to the other. It definitely makes the data structure harder to reason about, since the chain invariant then isn't an invariant anymore and we can no longer make crisp statements about the complexity of inserts.
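
To illustrate what the first alternative could look like, here is a minimal read-only k-way merge over sorted chains. The types are simplified stand-ins (sorted `Vec`s rather than the PR's chunked chains), so this is only a sketch of the idea.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Minimal sketch of a read-only merging iterator over sorted chains, leaving
// the chains themselves untouched. Types are simplified stand-ins.
fn merge_iter<'a, D: Ord>(
    chains: &'a [Vec<(D, u64, i64)>],
) -> impl Iterator<Item = &'a (D, u64, i64)> {
    // Seed a min-heap with the head of every chain, keyed by (time, data).
    let mut heap: BinaryHeap<_> = chains
        .iter()
        .enumerate()
        .filter_map(|(ci, chain)| chain.first().map(|(d, t, _)| Reverse((t, d, ci, 0usize))))
        .collect();

    std::iter::from_fn(move || {
        let Reverse((_, _, ci, idx)) = heap.pop()?;
        // Advance the chain we just consumed from.
        if let Some((d, t, _)) = chains[ci].get(idx + 1) {
            heap.push(Reverse((t, d, ci, idx + 1)));
        }
        Some(&chains[ci][idx])
    })
}
```

This yields updates in (time, data) order without mutating the chains, but, as noted in the first bullet, it doesn't by itself ever shrink the buffer.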

@@ -94,6 +94,7 @@ def get_default_system_parameters(
"enable_alter_swap": "true",
"enable_columnation_lgalloc": "true",
"enable_compute_chunked_stack": "true",
"enable_compute_correction_v2": "true",
teskje (Contributor Author) commented:

I plan to remove this before merging. Currently the new correction implementation regresses some of the feature benchmarks, and while some of that might be expected, I should probably do some profiling before deciding to accept the regressions.

@teskje teskje force-pushed the correction-lgalloc branch 4 times, most recently from 0993a98 to 6607ce2 on January 15, 2025 16:08
@antiguru antiguru self-requested a review January 15, 2025 16:12
@teskje teskje force-pushed the correction-lgalloc branch 3 times, most recently from c825ca6 to f0c6e77 on January 17, 2025 11:32
@teskje teskje force-pushed the correction-lgalloc branch from f0c6e77 to 2262dee on January 17, 2025 15:08
This commit introduces a "correction v2" buffer that differs from the
existing one in that it stores data in columnated regions that can be
spilled to disk. It follows the general design of the arrangement merge
batcher, with the major difference that it sorts updates by time first,
in an attempt to more efficiently deal with the presence of updates in
the future (commonly introduced by temporal filters).
This commit adds a new dyncfg, `enable_compute_correction_v2`, that
controls whether the MV sink v2 should use the old or the new
implementation of the correction buffer.
@teskje teskje force-pushed the correction-lgalloc branch from 2262dee to 70d0355 on January 17, 2025 18:06
@teskje teskje marked this pull request as ready for review January 20, 2025 08:59
@teskje teskje requested a review from a team as a code owner January 20, 2025 08:59
@teskje teskje requested a review from frankmcsherry January 20, 2025 09:08
@antiguru (Member) left a comment:

Some comments, but I'm still reading.

}

/// Merge the given chains, advancing times by the given `since` in the process.
fn merge_chains<D: Data>(chains: Vec<Chain<D>>, since: &Antichain<Timestamp>) -> Chain<D> {
antiguru (Member) commented:

Suggested change
fn merge_chains<D: Data>(chains: Vec<Chain<D>>, since: &Antichain<Timestamp>) -> Chain<D> {
fn merge_chains<D: Data>(chains: impl Iterator<Item = Chain<D>>, since: &Antichain<Timestamp>) -> Chain<D> {
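
For context, a hedged, generic illustration (not the PR's code) of what the iterator-based signature buys: callers can feed chains from whatever container they hold without first collecting into a `Vec`.

```rust
// Generic illustration of the suggested signature change; `Chain` is a stand-in
// type here, not the PR's.
struct Chain(Vec<(String, u64, i64)>);

// Vec-based signature: callers must hand over an owned Vec.
fn merge_chains_vec(chains: Vec<Chain>) -> Chain {
    merge_chains_iter(chains.into_iter())
}

// Iterator-based signature: callers can pass `vec.drain(..)`, `vec.into_iter()`,
// or any other iterator of chains without collecting first.
fn merge_chains_iter(chains: impl Iterator<Item = Chain>) -> Chain {
    let mut merged = Vec::new();
    for Chain(mut updates) in chains {
        merged.append(&mut updates);
    }
    // A real merge would interleave the sorted chains and consolidate; sorting
    // the concatenation is enough for this illustration.
    merged.sort();
    Chain(merged)
}
```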

Comment on lines +340 to +355
fn update_metrics(&mut self) {
let mut new_size = LengthAndCapacity::default();
for chain in &mut self.chains {
new_size += chain.get_size();
}

let old_size = self.total_size;
let len_delta = UpdateDelta::new(new_size.length, old_size.length);
let cap_delta = UpdateDelta::new(new_size.capacity, old_size.capacity);
self.metrics
.report_correction_update_deltas(len_delta, cap_delta);
self.worker_metrics
.report_correction_update_totals(new_size.length, new_size.capacity);

self.total_size = new_size;
}
antiguru (Member) commented:

We should use something similar to the arrangement size logging instead. This only exposes it through metrics. Can fix as a separate PR!

/// in spirit.
struct Chunk<D: Data> {
/// The contained updates.
data: StackWrapper<(D, Timestamp, Diff)>,
antiguru (Member) commented:

Since data has a static capacity, you probably want a different type here. Array might be that.

}

/// Return the chunk capacity implied by [`CHUNK_SIZE_BYTES`] and the update size.
fn capacity() -> usize {
antiguru (Member) commented:

When using Array, this should be desired_capacity, and all capacity checks should be against the array's capacity. Lgalloc returns memory with at least the requested capacity.
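
A small sketch of the pattern being suggested, with `Vec` standing in for the lgalloc-backed array and an assumed chunk size constant: request the computed capacity at allocation time, but check fullness against the allocation's actual capacity.

```rust
// Sketch of the suggested pattern; `Vec` stands in for an lgalloc-backed array
// and the chunk size constant is assumed for illustration.
const CHUNK_SIZE_BYTES: usize = 64 << 10;

/// Capacity to *request* when allocating a new chunk.
fn desired_capacity<D>() -> usize {
    let update_size = std::mem::size_of::<(D, u64, i64)>();
    std::cmp::max(1, CHUNK_SIZE_BYTES / update_size)
}

/// Push an update, reporting whether it fit. Fullness is checked against the
/// allocation's actual capacity, since the allocator may return more memory
/// than was requested.
fn push_update<D>(chunk: &mut Vec<(D, u64, i64)>, update: (D, u64, i64)) -> bool {
    if chunk.len() == chunk.capacity() {
        return false; // full: the caller starts a new chunk of `desired_capacity`
    }
    chunk.push(update);
    true
}
```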
