compute: spillable MV correction buffer #30083

Open · wants to merge 3 commits into main from correction-lgalloc

Conversation

@teskje teskje commented Oct 18, 2024

This PR introduces a "correction v2" buffer that differs from the existing one in that it stores data in columnated regions that can be spilled to disk. It follows the general design of the arrangement merge batcher, with the major difference that it sorts updates by time first, in an attempt to deal more efficiently with updates in the future (commonly introduced by temporal filters).
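
For readers who want a mental model of the structure described above, here is a minimal, simplified sketch. It uses plain `Vec`s as stand-ins for the columnated, lgalloc-backed regions, and the names, chunk size, and timestamp/diff types are assumptions for exposition, not the PR's actual code.

```rust
// Simplified sketch of the "correction v2" layout described above. Plain `Vec`s
// stand in for the columnated, spillable regions; the real types differ.
type Timestamp = u64; // stand-in for the real timestamp type
type Diff = i64;

/// A fixed-size block of updates (spillable to disk in the real implementation).
struct Chunk<D> {
    data: Vec<(D, Timestamp, Diff)>,
}

/// A sequence of chunks whose updates are sorted by (time, data).
struct Chain<D> {
    chunks: Vec<Chunk<D>>,
}

/// The correction buffer: a set of chains, merged on demand, following the
/// general design of the arrangement merge batcher.
struct CorrectionV2<D> {
    chains: Vec<Chain<D>>,
}

impl<D: Ord + Clone> CorrectionV2<D> {
    /// Insert a batch of updates as a new sorted chain.
    fn insert(&mut self, mut updates: Vec<(D, Timestamp, Diff)>) {
        // Sorting by time first keeps future updates (e.g. from temporal filters)
        // at the end of each chain, where consolidation can often skip them.
        updates.sort_by(|(d1, t1, _), (d2, t2, _)| (t1, d1).cmp(&(t2, d2)));
        let chunks = updates
            .chunks(1024) // assumed chunk size, for illustration only
            .map(|c| Chunk { data: c.to_vec() })
            .collect();
        self.chains.push(Chain { chunks });
        // The real implementation would now merge chains to restore its invariants.
    }
}
```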

The new correction buffer can be switched on through a feature flag, enable_compute_correction_v2, and is switched off by default. The plan is to keep it disabled in production but have it available for emergencies where replicas fail to hydrate because of the memory spike caused by the MV correction buffer. Eventually we'll want to make the new correction buffer the default, but we should do more performance testing before that.

Motivation

  • This PR adds a known-desirable feature.

Part of https://github.com/MaterializeInc/database-issues/issues/8464

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@teskje teskje force-pushed the correction-lgalloc branch 2 times, most recently from ffdf64d to 77afd64 on October 21, 2024 08:31
@teskje teskje force-pushed the correction-lgalloc branch 2 times, most recently from 25b0987 to 0b2e359 on October 25, 2024 12:13
@teskje teskje force-pushed the correction-lgalloc branch 4 times, most recently from ffb5488 to 4697f9c on November 25, 2024 10:22
@teskje teskje force-pushed the correction-lgalloc branch 3 times, most recently from 4356350 to 185c99c on December 1, 2024 12:06
@teskje teskje force-pushed the correction-lgalloc branch 2 times, most recently from e95c32f to 49c0569 on January 3, 2025 11:39
teskje commented Jan 3, 2025

The feature benchmarks report a bunch of regressions. Some increased CPU and memory usage is expected, but the memory regressions here are worryingly large:

NAME                                | TYPE            |      THIS       |      OTHER      |  UNIT  | THRESHOLD  |  Regression?  | 'THIS' is
--------------------------------------------------------------------------------------------------------------------------------------------------------
FastPathOrderByLimit                | wallclock       |           0.827 |           0.871 |   s    |    10%     |      no       | better:  5.0% faster
FastPathOrderByLimit                | memory_mz       |         467.682 |         464.535 |   MB   |    20%     |      no       | worse:   0.7% more
FastPathOrderByLimit                | memory_clusterd |         959.396 |         149.727 |   MB   |    50%     |    !!YES!!    | worse:   6.4 TIMES more
OrderBy                             | wallclock       |           9.010 |           8.825 |   s    |    10%     |      no       | worse:   2.1% slower
OrderBy                             | memory_mz       |         473.213 |         473.499 |   MB   |    20%     |      no       | better:  0.1% less
OrderBy                             | memory_clusterd |        1179.695 |         307.178 |   MB   |    50%     |    !!YES!!    | worse:   3.8 TIMES more
DifferentialJoin                    | wallclock       |           1.396 |           1.401 |   s    |    10%     |      no       | better:  0.4% faster
DifferentialJoin                    | memory_mz       |         455.093 |         466.061 |   MB   |    20%     |      no       | better:  2.4% less
DifferentialJoin                    | memory_clusterd |         930.405 |         144.672 |   MB   |    50%     |    !!YES!!    | worse:   6.4 TIMES more
FastPathFilterIndex                 | wallclock       |           3.423 |           4.440 |   s    |    10%     |      no       | better: 22.9% faster
FastPathFilterIndex                 | memory_mz       |         470.448 |         465.775 |   MB   |    20%     |      no       | worse:   1.0% more
FastPathFilterIndex                 | memory_clusterd |         935.078 |         155.830 |   MB   |    50%     |    !!YES!!    | worse:   6.0 TIMES more
CrossJoin                           | wallclock       |           3.416 |           2.429 |   s    |    10%     |    !!YES!!    | worse:  40.7% slower
CrossJoin                           | memory_mz       |         468.540 |         463.772 |   MB   |    20%     |      no       | worse:   1.0% more
CrossJoin                           | memory_clusterd |         918.388 |         205.231 |   MB   |    50%     |    !!YES!!    | worse:   4.5 TIMES more
FullOuterJoin                       | wallclock       |           9.714 |           8.691 |   s    |    10%     |    !!YES!!    | worse:  11.8% slower
FullOuterJoin                       | memory_mz       |         461.483 |         468.254 |   MB   |    20%     |      no       | better:  1.4% less
FullOuterJoin                       | memory_clusterd |        1033.783 |         220.871 |   MB   |    50%     |    !!YES!!    | worse:   4.7 TIMES more
MFPPushdown                         | wallclock       |           0.356 |           0.363 |   s    |    10%     |      no       | better:  2.2% faster
MFPPushdown                         | memory_mz       |         470.924 |         466.919 |   MB   |    20%     |      no       | worse:   0.9% more
MFPPushdown                         | memory_clusterd |        6407.738 |         145.721 |   MB   |    50%     |    !!YES!!    | worse:  44.0 TIMES more
CountDistinct                       | wallclock       |           1.807 |           1.802 |   s    |    10%     |      no       | worse:   0.3% slower
CountDistinct                       | memory_mz       |         464.344 |         462.818 |   MB   |    20%     |      no       | worse:   0.3% more
CountDistinct                       | memory_clusterd |         727.177 |         150.299 |   MB   |    50%     |    !!YES!!    | worse:   4.8 TIMES more
AccumulateReductions                | wallclock       |          65.003 |          50.974 |   s    |    10%     |    !!YES!!    | worse:  27.5% slower
AccumulateReductions                | memory_mz       |        5145.073 |        5155.563 |   MB   |    20%     |      no       | better:  0.2% less
AccumulateReductions                | memory_clusterd |        7984.161 |         144.672 |   MB   |    50%     |    !!YES!!    | worse:  55.2 TIMES more
FastPathFilterNoIndex               | wallclock       |           1.309 |           1.316 |   s    |    10%     |      no       | better:  0.5% faster
FastPathFilterNoIndex               | memory_mz       |         501.537 |         488.853 |   MB   |    20%     |      no       | worse:   2.6% more
FastPathFilterNoIndex               | memory_clusterd |        6087.303 |         206.852 |   MB   |    50%     |    !!YES!!    | worse:  29.4 TIMES more
GroupBy                             | wallclock       |           4.553 |           4.545 |   s    |    10%     |      no       | worse:   0.2% slower
GroupBy                             | memory_mz       |         459.099 |         464.630 |   MB   |    20%     |      no       | better:  1.2% less
GroupBy                             | memory_clusterd |         960.350 |         244.617 |   MB   |    50%     |    !!YES!!    | worse:   3.9 TIMES more
FinishOrderByLimit                  | wallclock       |           1.361 |           1.386 |   s    |    10%     |      no       | better:  1.8% faster
FinishOrderByLimit                  | memory_mz       |         467.396 |         463.486 |   MB   |    20%     |      no       | worse:   0.8% more
FinishOrderByLimit                  | memory_clusterd |         916.767 |         158.215 |   MB   |    50%     |    !!YES!!    | worse:   5.8 TIMES more
GroupByMaintained                   | wallclock       |          16.215 |          16.302 |   s    |    10%     |      no       | better:  0.5% faster
GroupByMaintained                   | memory_mz       |         478.172 |         475.502 |   MB   |    20%     |      no       | worse:   0.6% more
GroupByMaintained                   | memory_clusterd |        1375.198 |         510.120 |   MB   |    50%     |    !!YES!!    | worse:   2.7 TIMES more
MinMaxMaintained                    | wallclock       |           6.351 |           6.444 |   s    |    10%     |      no       | better:  1.4% faster
MinMaxMaintained                    | memory_mz       |         465.298 |         468.063 |   MB   |    20%     |      no       | better:  0.6% less
MinMaxMaintained                    | memory_clusterd |        1131.058 |         296.307 |   MB   |    50%     |    !!YES!!    | worse:   3.8 TIMES more
MinMax                              | wallclock       |           1.538 |           1.582 |   s    |    10%     |      no       | better:  2.8% faster
MinMax                              | memory_mz       |         464.916 |         459.671 |   MB   |    20%     |      no       | worse:   1.1% more
MinMax                              | memory_clusterd |        1009.941 |         191.689 |   MB   |    50%     |    !!YES!!    | worse:   5.3 TIMES more
FastPathLimit                       | wallclock       |           0.315 |           0.322 |   s    |    10%     |      no       | better:  2.1% faster
FastPathLimit                       | memory_mz       |         469.971 |         470.257 |   MB   |    20%     |      no       | better:  0.1% less
FastPathLimit                       | memory_clusterd |        1018.524 |         216.293 |   MB   |    50%     |    !!YES!!    | worse:   4.7 TIMES more
DeltaJoinMaintained                 | wallclock       |           2.521 |           2.581 |   s    |    10%     |      no       | better:  2.3% faster
DeltaJoinMaintained                 | memory_mz       |         463.009 |         464.535 |   MB   |    20%     |      no       | better:  0.3% less
DeltaJoinMaintained                 | memory_clusterd |         989.914 |         203.991 |   MB   |    50%     |    !!YES!!    | worse:   4.9 TIMES more
DeltaJoin                           | wallclock       |           2.303 |           2.330 |   s    |    10%     |      no       | better:  1.2% faster
DeltaJoin                           | memory_mz       |         456.619 |         464.916 |   MB   |    20%     |      no       | better:  1.8% less
DeltaJoin                           | memory_clusterd |         776.482 |         161.076 |   MB   |    50%     |    !!YES!!    | worse:   4.8 TIMES more
Retraction                          | wallclock       |           5.739 |           4.276 |   s    |    10%     |    !!YES!!    | worse:  34.2% slower
Retraction                          | memory_mz       |         467.777 |         468.922 |   MB   |    20%     |      no       | better:  0.2% less
Retraction                          | memory_clusterd |        1175.880 |         282.764 |   MB   |    50%     |    !!YES!!    | worse:   4.2 TIMES more

I have a suspicion that this isn't an actual regression but a result of the usage of lgalloc, which holds onto allocated memory and only releases it slowly over time. The memory_clusterd measurement is made by inspecting the Docker memory usage after the test has completed, at which point the old implementation will have freed most of the correction buffer memory while the new implementation still retains some of it in lgalloc. I should be able to validate that suspicion by switching off lgalloc.

@teskje teskje force-pushed the correction-lgalloc branch from 49c0569 to e01a7dc on January 6, 2025 12:11
teskje commented Jan 8, 2025

I retried the feature benchmarks with lgalloc disabled and things do look better:

NAME                                | TYPE            |      THIS       |      OTHER      |  UNIT  | THRESHOLD  |  Regression?  | 'THIS' is
--------------------------------------------------------------------------------------------------------------------------------------------------------
CrossJoin                           | wallclock       |           4.623 |           2.237 |   s    |    10%     |    !!YES!!    | worse:   2.1 TIMES slower
CrossJoin                           | memory_mz       |         468.159 |         476.074 |   MB   |    20%     |      no       | better:  1.7% less
CrossJoin                           | memory_clusterd |         213.051 |         157.261 |   MB   |    50%     |      no       | worse:  35.5% more
FastPathOrderByLimit                | wallclock       |           0.846 |           0.845 |   s    |    10%     |      no       | worse:   0.2% slower
FastPathOrderByLimit                | memory_mz       |         463.104 |         474.739 |   MB   |    20%     |      no       | better:  2.5% less
FastPathOrderByLimit                | memory_clusterd |         133.514 |          87.414 |   MB   |    50%     |    !!YES!!    | worse:  52.7% more
FastPathFilterIndex                 | wallclock       |           3.614 |           3.142 |   s    |    10%     |    !!YES!!    | worse:  15.0% slower
FastPathFilterIndex                 | memory_mz       |         481.033 |         466.251 |   MB   |    20%     |      no       | worse:   3.2% more
FastPathFilterIndex                 | memory_clusterd |         139.046 |          93.822 |   MB   |    50%     |      no       | worse:  48.2% more
AccumulateReductions                | wallclock       |          79.178 |          51.314 |   s    |    10%     |    !!YES!!    | worse:  54.3% slower
AccumulateReductions                | memory_mz       |        5117.416 |        5090.714 |   MB   |    20%     |      no       | worse:   0.5% more
AccumulateReductions                | memory_clusterd |         563.717 |          88.825 |   MB   |    50%     |    !!YES!!    | worse:   6.3 TIMES more
FullOuterJoin                       | wallclock       |           9.918 |           8.866 |   s    |    10%     |    !!YES!!    | worse:  11.9% slower
FullOuterJoin                       | memory_mz       |         458.431 |         460.243 |   MB   |    20%     |      no       | better:  0.4% less
FullOuterJoin                       | memory_clusterd |         195.122 |         161.076 |   MB   |    50%     |      no       | worse:  21.1% more
MFPPushdown                         | wallclock       |           0.343 |           0.363 |   s    |    10%     |      no       | better:  5.4% faster
MFPPushdown                         | memory_mz       |         482.559 |         473.213 |   MB   |    20%     |      no       | worse:   2.0% more
MFPPushdown                         | memory_clusterd |         464.916 |          83.294 |   MB   |    50%     |    !!YES!!    | worse:   5.6 TIMES more
FastPathFilterNoIndex               | wallclock       |           1.275 |           1.503 |   s    |    10%     |      no       | better: 15.1% faster
FastPathFilterNoIndex               | memory_mz       |         508.404 |         494.099 |   MB   |    20%     |      no       | worse:   2.9% more
FastPathFilterNoIndex               | memory_clusterd |         537.968 |         139.618 |   MB   |    50%     |    !!YES!!    | worse:   3.9 TIMES more
Retraction                          | wallclock       |           6.622 |           4.453 |   s    |    10%     |    !!YES!!    | worse:  48.7% slower
Retraction                          | memory_mz       |         458.622 |         467.682 |   MB   |    20%     |      no       | better:  1.9% less
Retraction                          | memory_clusterd |         300.980 |         175.953 |   MB   |    50%     |    !!YES!!    | worse:  71.1% more

There is still a higher-than-expected memory regression for some of these; I want to look into those.

@teskje teskje force-pushed the correction-lgalloc branch from 9d51f8f to c8846d6 on January 15, 2025 12:30
@teskje teskje changed the title from "[WIP] compute: spillable MV correction buffer" to "compute: spillable MV correction buffer" on Jan 15, 2025
let mut heap = MergeHeap::from_iter(cursors);
let mut merged = Chain::default();
while let Some(cursor1) = heap.pop() {
let (data, time, mut diff) = cursor1.get();
teskje (Contributor Author) commented:

Here we'd have the option to reuse whole chunks, if all updates in the current chunk are less than the first update in the next cursor on the heap. However, if we did this naively we could end up with a lot of empty space in our chunks, and therefore with chains that have many more chunks than they would need if their updates were tightly packed.
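
To make the condition concrete, here is a rough sketch of that fast path. The slice-based cursor shape is an assumption for illustration and does not match the PR's actual cursor API.

```rust
// Hypothetical sketch of the chunk-reuse fast path discussed above; the
// slice-based "cursor" shape is assumed for illustration only.
fn can_reuse_chunk<D: Ord>(
    current_chunk: &[(D, u64, i64)],
    next_cursor_head: Option<&(D, u64, i64)>,
) -> bool {
    // The whole chunk could be appended to the merged output unchanged if its
    // largest update (the last one, since chunks are sorted by (time, data)) is
    // still smaller than the first update of the next cursor on the heap.
    match (current_chunk.last(), next_cursor_head) {
        (Some((d1, t1, _)), Some((d2, t2, _))) => (t1, d1) < (t2, d2),
        (Some(_), None) => true, // no competing cursor left
        (None, _) => false,      // empty chunk, nothing to reuse
    }
}
```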

Comment on lines +80 to +88
//! Unfortunately, performing consolidation as described above can break the chain invariant and we
//! might need to restore it by merging chains, including ones containing future updates. This is
//! something that would be great to fix! In the meantime the hope is that in steady state it
//! doesn't matter too much because either there are no future retractions and U is approximately
//! equal to N, or the amount of future retractions is much larger than the amount of current
//! changes, in which case removing the current changes has a good chance of leaving the chain
//! invariant intact.
teskje (Contributor Author) commented:

This is, I think, the main wrinkle in the current implementation. It'd be nice to never have to look at future updates when consolidating the current ones. All alternatives I've come up with so far are not great:

  • Don't consolidate the correction updates at all, just provide a read-only merging iterator (see the sketch below). That's sufficient for the MV sink to work, but it leaves open when the correction contents actually get consolidated. Inserts will trigger merges, but if few inserts happen, we can't rely on those. We'd need some form of idle merging, but that adds significant complexity.
  • Skip restoring the chain invariant and rely on subsequent inserts to do so. It's not clear to me whether that actually improves anything or just moves work from one operation to the other. It definitely makes the data structure harder to reason about, since the chain invariant then isn't an invariant anymore and we can no longer make crisp statements about the complexity of inserts.
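
To illustrate what the first alternative could look like, here is a minimal read-only k-way merge over sorted chains. The types are simplified stand-ins (sorted `Vec`s rather than the PR's chunked chains), so this is only a sketch of the idea.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Minimal sketch of a read-only merging iterator over sorted chains, leaving
// the chains themselves untouched. Types are simplified stand-ins.
fn merge_iter<'a, D: Ord>(
    chains: &'a [Vec<(D, u64, i64)>],
) -> impl Iterator<Item = &'a (D, u64, i64)> {
    // Seed a min-heap with the head of every chain, keyed by (time, data).
    let mut heap: BinaryHeap<_> = chains
        .iter()
        .enumerate()
        .filter_map(|(ci, chain)| chain.first().map(|(d, t, _)| Reverse((t, d, ci, 0usize))))
        .collect();

    std::iter::from_fn(move || {
        let Reverse((_, _, ci, idx)) = heap.pop()?;
        // Advance the chain we just consumed from.
        if let Some((d, t, _)) = chains[ci].get(idx + 1) {
            heap.push(Reverse((t, d, ci, idx + 1)));
        }
        Some(&chains[ci][idx])
    })
}
```

This yields updates in (time, data) order without mutating the chains, but, as noted in the first bullet, it doesn't by itself ever shrink the buffer.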

@@ -94,6 +94,7 @@ def get_default_system_parameters(
"enable_alter_swap": "true",
"enable_columnation_lgalloc": "true",
"enable_compute_chunked_stack": "true",
"enable_compute_correction_v2": "true",
teskje (Contributor Author) commented:

I plan to remove this before merging. Currently the new correction implementation regresses some of the feature benchmarks, and while some of that might be expected, I should probably do some profiling before deciding to accept the regressions.

@teskje teskje force-pushed the correction-lgalloc branch 4 times, most recently from 0993a98 to 6607ce2 on January 15, 2025 16:08
@antiguru antiguru self-requested a review January 15, 2025 16:12
@teskje teskje force-pushed the correction-lgalloc branch 3 times, most recently from c825ca6 to f0c6e77 on January 17, 2025 11:32
@teskje teskje force-pushed the correction-lgalloc branch from f0c6e77 to 2262dee on January 17, 2025 15:08
This commit introduces a "correction v2" buffer that differs from the
existing one in that it stores data in columnated regions that can be
spilled to disk. It follows the general design of the arrangement merge
batcher, with the major difference that it sorts updates by time first,
in an attempt to more efficiently deal with the presence of updates in
the future (commonly introduced by temporal filters).
This commit adds a new dyncfg, `enable_compute_correction_v2`, that
controls whether the MV sink v2 should use the old or the new
implementation of the correction buffer.
@teskje teskje force-pushed the correction-lgalloc branch from 2262dee to 70d0355 on January 17, 2025 18:06
@teskje teskje marked this pull request as ready for review January 20, 2025 08:59
@teskje teskje requested a review from a team as a code owner January 20, 2025 08:59
@teskje teskje requested a review from frankmcsherry January 20, 2025 09:08
@antiguru (Member) left a comment:

Some comments, but I'm still reading.

}

/// Merge the given chains, advancing times by the given `since` in the process.
fn merge_chains<D: Data>(chains: Vec<Chain<D>>, since: &Antichain<Timestamp>) -> Chain<D> {
antiguru (Member) commented:

Suggested change
fn merge_chains<D: Data>(chains: Vec<Chain<D>>, since: &Antichain<Timestamp>) -> Chain<D> {
fn merge_chains<D: Data>(chains: impl Iterator<Item = Chain<D>>, since: &Antichain<Timestamp>) -> Chain<D> {
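
For context, a hedged, generic illustration (not the PR's code) of what the iterator-based signature buys: callers can feed chains from whatever container they hold without first collecting into a `Vec`.

```rust
// Generic illustration of the suggested signature change; `Chain` is a stand-in
// type here, not the PR's.
struct Chain(Vec<(String, u64, i64)>);

// Vec-based signature: callers must hand over an owned Vec.
fn merge_chains_vec(chains: Vec<Chain>) -> Chain {
    merge_chains_iter(chains.into_iter())
}

// Iterator-based signature: callers can pass `vec.drain(..)`, `vec.into_iter()`,
// or any other iterator of chains without collecting first.
fn merge_chains_iter(chains: impl Iterator<Item = Chain>) -> Chain {
    let mut merged = Vec::new();
    for Chain(mut updates) in chains {
        merged.append(&mut updates);
    }
    // A real merge would interleave the sorted chains and consolidate; sorting
    // the concatenation is enough for this illustration.
    merged.sort();
    Chain(merged)
}
```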

Comment on lines +340 to +355
fn update_metrics(&mut self) {
let mut new_size = LengthAndCapacity::default();
for chain in &mut self.chains {
new_size += chain.get_size();
}

let old_size = self.total_size;
let len_delta = UpdateDelta::new(new_size.length, old_size.length);
let cap_delta = UpdateDelta::new(new_size.capacity, old_size.capacity);
self.metrics
.report_correction_update_deltas(len_delta, cap_delta);
self.worker_metrics
.report_correction_update_totals(new_size.length, new_size.capacity);

self.total_size = new_size;
}
antiguru (Member) commented:

We should use something similar to the arrangement size logging instead. This only exposes it through metrics. Can fix as a separate PR!

/// in spirit.
struct Chunk<D: Data> {
/// The contained updates.
data: StackWrapper<(D, Timestamp, Diff)>,
antiguru (Member) commented:

Since data has a static capacity, you probably want a different type here. Array might be that.

}

/// Return the chunk capacity implied by [`CHUNK_SIZE_BYTES`] and the update size.
fn capacity() -> usize {
antiguru (Member) commented:

When using Array, this should be desired_capacity, and all capacity checks should be against the array's capacity. Lgalloc returns memory with at least the requested capacity.
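
A small sketch of the pattern being suggested, with `Vec` standing in for the lgalloc-backed array and an assumed chunk size constant: request the computed capacity at allocation time, but check fullness against the allocation's actual capacity.

```rust
// Sketch of the suggested pattern; `Vec` stands in for an lgalloc-backed array
// and the chunk size constant is assumed for illustration.
const CHUNK_SIZE_BYTES: usize = 64 << 10;

/// Capacity to *request* when allocating a new chunk.
fn desired_capacity<D>() -> usize {
    let update_size = std::mem::size_of::<(D, u64, i64)>();
    std::cmp::max(1, CHUNK_SIZE_BYTES / update_size)
}

/// Push an update, reporting whether it fit. Fullness is checked against the
/// allocation's actual capacity, since the allocator may return more memory
/// than was requested.
fn push_update<D>(chunk: &mut Vec<(D, u64, i64)>, update: (D, u64, i64)) -> bool {
    if chunk.len() == chunk.capacity() {
        return false; // full: the caller starts a new chunk of `desired_capacity`
    }
    chunk.push(update);
    true
}
```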
