Rr/sc 60366 sparse global order reader merge #5417

rroelke · 2025-01-02T20:35:16Z

The story contains more details, but in brief this pull request adds an additional mode to the sparse global order reader in which we pre-process the minimum bounding rectangles of all tiles from all fragments to determine a single global order in which all of the tiles must be loaded.

This pre-processing step is implemented using a "parallel merge" algorithm which merges the tiles from each fragment (which are arranged in global order within the fragment).

Parallel Merge

The parallel merge code lives in tiledb/common/algorithm/parallel_merge.h. It is written generically to merge streams of a copyable type T using any type which can compare T (default is std::less<T> of course). An explanation of the algorithm is provided within the file.

The top-level function parallel_merge is asynchronous, i.e. it returns a future which can be polled to see how much of the merge has already completed. This enables callers to begin processing merged data from the head of the eventual output before the tail of the eventual output has finished.

Sparse Global Order Reader

We extend the sparse global order reader with a new configuration sm.query.sparse_global_order.preprocess_tile_merge. If nonzero, the sparse global order reader will run a parallel merge on the fragments to find the unified tile order and then use that to populate result tiles.

preprocess_compute_result_tile_order kicks off the parallel merge.
create_result_tiles_using_preprocess advances along the global tile order to create result tiles.

The fields which are used for the old "per fragment result tiles" mode have been encapsulated into their own struct to emphasize that their use does not overlap with this new mode.

create_result_tiles_using_preprocess does not need a per-fragment memory budget; instead it pulls tiles off of the globally ordered tile list until it has saturated the memory budget as much as it can.

Tiles in the unified global order are arranged on their lower bound. The upper bounds of the tiles in the list may be out of order. To prevent cells from tile A to be emitted out of order with cells from tile B, we augment add_next_cell_to_queue to check the lower bound of the tiles which have not populated result tiles yet.

The value of sm.query.sparse_global_order.preprocess_tile_merge configures the minimum amount of work that each parallel unit of the merge will do. This is so we can benchmark with different values without re-compiling; we will either want to recommend a value to customers, or choose one and flip this to a boolean.

Serialization

The unified global tile order is state which must be communicated back and forth between the client and REST server. We can either serialize this whole list (16 bytes per tile across all fragments) or we can re-compute the parallel merge each time we run a submit on the REST server side. ~~The current implementation chooses the latter, assuming that smaller messages are preferred to the additional CPU overhead.~~ It turns out that we must serialize the tile order. The parallel merge algorithm should be deterministic, but it turns out that some aspect of the REST server state sometimes causes the subarray qualifying tile ranges to vary from one iteration to the next, which means that we cannot recompute the tile order in the same way.

Testing

Testing of all changes is augmented using rapidcheck. With this library, rather than writing some test data examples, we write properties which contain generic claims about what the expected output must look like for a given input. The rapidcheck runtime generates arbitrary inputs to the property to test our claims.

The parallel merge algorithm is tested in unit_parallel_merge.cc and has rapidcheck properties implemented for each step of the algorithm.

The sparse global order reader tests are in unit-sparse-global-order-reader.cc. The gist is that we have a generic function CSparseGlobalOrderFx::run which writes a bunch of fragments, and then reads the data back in global order, comparing against an expected result. There's a fair bit of refactoring to support this. For 1D arrays we have tests Sparse global order reader: fragment skew, fragment interleave, and fragment many overlap which set up inputs which are expected to exercise some of the edge cases in the global order reader. And then we add rapidcheck 1D and rapidcheck 2D tests which generate totally arbitrary 1D and 2D inputs respectively.

Performance Results

I still have more to do here, but things are looking pretty good... will fill in more details here as I have them. Notes are here.

TYPE: FEATURE | BUG | IMPROVEMENT
DESC: sparse global order reader determine global order of result tiles

…nction

… first pass

…l SECTION

rroelke · 2025-02-06T19:06:23Z

@ypatia

I did manage to find a bug today (inspired by SC-61471).

In brief, the recent round of changes all is about the merge bound. If duplicates are not allowed, then the merge bound comparison must be a strict inequality, not <= as it is now. If a loaded tile contains a coordinate which is equal to the merge bound, then it is incorrect to emit it when duplicates are off since the next un-loaded tile might also contain that coordinate.

This change was pretty small in the reader but propagated to the tests as the addition of the new merge bound duplication test, as well as randomizing allow_dups in some of the existing rapidcheck tests.

This reverts commit b6ba300.

…obal-order-reader-merge

ypatia · 2025-02-10T13:57:11Z

/backport to release-2.27

github-actions · 2025-02-10T13:57:25Z

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13242518400

…5417) (#5443) Backport of #5417 to release-2.27 --- Sc-60366 Implements "pre-process tile order" mode for the sparse global order reader. In brief, this does a first pass over the fragment metadata tile MBRs to create a single unified list of all the (fragment ID, tile ID) pairs arranged approximately in global order. When running natively, the tile order is computed a single time the first time the query is submitted, and then kept in memory. When running on the REST server, the tile order is recomputed for each query message. Evidence indicates that the overhead of this is low; we expect that it is lower than serializing the tile order would be. - Add parallel merge algorithm - Add rapidcheck library and some common scaffolding code - Add `tiledb_submit_a_b` performance testing binary - Add preprocess tile order mode to sparse global order reader, on by default --- TYPE: FEATURE | BUG | IMPROVEMENT DESC: sparse global order reader determine global order of result tiles (cherry picked from commit 3c617e3)

teo-tsirpanis · 2025-02-11T17:39:17Z

tiledb/sm/query/readers/sparse_global_order_reader.cc

+        // this is a tile which qualified for the subarray and was
+        // a created result tile, we must continue processing it
+        bool budget_exceeded;
+        while ((budget_exceeded = add_result_tile(


This line caused a warning-turned-error in the backport PR on VS2019 (which I suppressed). The add_result_tile function is pre-existing, but I must say that returning a boolean indicating whether the budget was exceeded is very confusing.

…5417) (#5443) Backport of #5417 to release-2.27 --- TYPE: IMPROVEMENT DESC: sparse global order reader determine global order of result tiles --------- Co-authored-by: Ypatia Tsavliri <ypatia@tiledb.com> Co-authored-by: Ryan Roelke <rroelke@users.noreply.github.com> Co-authored-by: Isaiah Norton <ihnorton@users.noreply.github.com> Co-authored-by: Theodore Tsirpanis <teo@tsirpanis.gr>

teo-tsirpanis · 2025-02-12T13:33:53Z

/backport to release-2.27

(messed up the previous backport)

github-actions · 2025-02-12T13:34:06Z

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13286600898

…5417) (#5448) Backport of #5417 to release-2.27 --- TYPE: IMPROVEMENT DESC: sparse global order reader determine global order of result tiles --------- Co-authored-by: Ryan Roelke <ryan.roelke@tiledb.com> Co-authored-by: Theodore Tsirpanis <teo@tsirpanis.gr>

#5417 made several changes to `SparseGlobalOrderReader::merge_result_cell_slabs`. One of these changes erroneously caused coordinates which do not pass the query condition to be emitted. This pull request restores the correct behavior. The crux of the change is ``` length = 1; ``` to ``` length = to_process.max_slab_length(1); ``` The affected line of code occurs during coordinate de-duplication. `merge_result_cell_slabs` puts coordinates from the leading tile of each fragment into a priority queue and pops them off. If the coordinate at the head of the queue matches what we just pulled off, then we continue popping until there is no longer a match. This de-duplication can cause a fragment to run out of loaded tiles, in which case it is necessary to end the loop to get more data so that we do not emit coordinates out of order. That's how we get to this `length = 1` code, which signals that we should emit a result slab of length 1. In contrast, `max_slab_length(1)` checks the query condition bitmap to see if the duplicated coordinate actually belongs in the query result. This was found by a customer data set which has many fragments which are essentially copies of the whole coordinate domain. This scenario de-duplicates many coordinates and is likely to hit the condition described. The new test `Sparse global order reader: fragment full copy 1d` simulates this set-up. A large fraction of the changes in this PR are the test scaffolding changes to support evaluating query conditions upon our `FxRun1D` and `FxRun2D` structures, so we can continue to check whether the expected and actual results match in the presence of a query condition. --- TYPE: BUG DESC: Fix query condition regression introduced by #5417

#5417 made several changes to `SparseGlobalOrderReader::merge_result_cell_slabs`. One of these changes erroneously caused coordinates which do not pass the query condition to be emitted. This pull request restores the correct behavior. The crux of the change is ``` length = 1; ``` to ``` length = to_process.max_slab_length(1); ``` The affected line of code occurs during coordinate de-duplication. `merge_result_cell_slabs` puts coordinates from the leading tile of each fragment into a priority queue and pops them off. If the coordinate at the head of the queue matches what we just pulled off, then we continue popping until there is no longer a match. This de-duplication can cause a fragment to run out of loaded tiles, in which case it is necessary to end the loop to get more data so that we do not emit coordinates out of order. That's how we get to this `length = 1` code, which signals that we should emit a result slab of length 1. In contrast, `max_slab_length(1)` checks the query condition bitmap to see if the duplicated coordinate actually belongs in the query result. This was found by a customer data set which has many fragments which are essentially copies of the whole coordinate domain. This scenario de-duplicates many coordinates and is likely to hit the condition described. The new test `Sparse global order reader: fragment full copy 1d` simulates this set-up. A large fraction of the changes in this PR are the test scaffolding changes to support evaluating query conditions upon our `FxRun1D` and `FxRun2D` structures, so we can continue to check whether the expected and actual results match in the presence of a query condition. --- TYPE: BUG DESC: Fix query condition regression introduced by #5417 (cherry picked from commit 022af98)

rroelke added 30 commits December 4, 2024 10:48

Add cpp parallel merge implementation

f3ea201

Move template parallel merge impl into header and template Compare fu…

ba90274

…nction

Domain::dimensions

b90353b

Remove trailing newline

54a7de2

TransientReadState::contains_tile

ef488f8

void SparseGlobalOrderReader<BitmapType>::compute_result_tile_order()…

900ebb2

… first pass

Fix uint64_t in ParallelMerge

334eb0d

Fix errors in compute_result_tile_order

399023b

Fix indexing error in GlobalOrderMBRCmp

0593e30

Change test section names so that -c option can be used

007e43d

Passing 'qc removes full tile' unit test with compute_result_tile_order

169b20e

Remove per-fragment memory fields

a8cd4b0

TEST_CASE parallel merge rapidcheck verify_split_point_stream_bounds

a9d78d0

ParallelMergeException

0299ca7

VerifyIdentifyMergeUnit

804fc6f

VerifyTournamentMerge

0285add

VerifyParallelMerge

54cbed8

Add more parallel merge test cases for uint8_t and numbers in range 0,10

0183686

Remove commented-out code

da5d191

Fix exceeding memory budget

1225c21

Add UntypedDatum methods to Range

3833eaa

Refactor GlobalCmp

5a20d6b

Add fragment skew, fragment interleave tests

3c6e4b2

Compare against next tile MBR in add_next_cell_to_queue

d281adb

Fill in memory budget exception

b43381d

Add TODO for test case

d2d06fa

tdb::pmr::emplace_unique

a55ddf3

Use pmr in parallel merge

fc8936e

Test tweaks

a2c6a8d

Relax condition for throwing 'Cannot load a single tile'

9023c1c

rroelke added 6 commits February 6, 2025 10:50

Remove unused capnp ResultTileId

03c540b

Refactor stdx::select

b6ba300

::run_execute does not permit 'Cannot load tile offsets' for top-leve…

d455c02

…l SECTION

Add merge bound duplication test

0b3fd1c

add_next_cell checks equality with merge bound

221c79e

Toggle allow_dups in some of the rapidcheck tests

2a6839b

rroelke added 6 commits February 6, 2025 15:24

Fix rapidcheck allow_dups issues

d9f7610

Fix uniqueness of rapidcheck 1d test

8451b66

Revert "Refactor stdx::select"

9c15ce3

This reverts commit b6ba300.

msvc errors

76fe5e3

Merge remote-tracking branch 'origin/main' into rr/sc-60366-sparse-gl…

8b3d834

…obal-order-reader-merge

Rapidcheck ignore 'Cannot allocate space for preprocess'

7992ba4

rroelke merged commit 3c617e3 into main Feb 7, 2025
59 checks passed

rroelke deleted the rr/sc-60366-sparse-global-order-reader-merge branch February 7, 2025 02:15

github-actions bot mentioned this pull request Feb 10, 2025

[Backport release-2.27] Rr/sc 60366 sparse global order reader merge (#5417) #5443

Merged

teo-tsirpanis reviewed Feb 11, 2025

View reviewed changes

github-actions bot mentioned this pull request Feb 12, 2025

[Backport release-2.27] Rr/sc 60366 sparse global order reader merge (#5417) #5448

Merged

jdblischak mentioned this pull request Feb 14, 2025

Nightly feedstock build failed TileDB-Inc/conda-forge-nightly-controller#167

Open

rroelke mentioned this pull request Feb 17, 2025

Fix query condition regression introduced by #5417 #5461

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rr/sc 60366 sparse global order reader merge #5417

Rr/sc 60366 sparse global order reader merge #5417

rroelke commented Jan 2, 2025 •

edited

Loading

rroelke commented Feb 6, 2025

ypatia commented Feb 10, 2025

github-actions bot commented Feb 10, 2025

teo-tsirpanis Feb 11, 2025

teo-tsirpanis commented Feb 12, 2025 •

edited

Loading

github-actions bot commented Feb 12, 2025

Rr/sc 60366 sparse global order reader merge #5417

Rr/sc 60366 sparse global order reader merge #5417

Conversation

rroelke commented Jan 2, 2025 • edited Loading

Parallel Merge

Sparse Global Order Reader

Serialization

Testing

Performance Results

rroelke commented Feb 6, 2025

ypatia commented Feb 10, 2025

github-actions bot commented Feb 10, 2025

teo-tsirpanis Feb 11, 2025

Choose a reason for hiding this comment

teo-tsirpanis commented Feb 12, 2025 • edited Loading

github-actions bot commented Feb 12, 2025

rroelke commented Jan 2, 2025 •

edited

Loading

teo-tsirpanis commented Feb 12, 2025 •

edited

Loading