Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix query condition regression introduced by #5417 #5461

Merged

Conversation

rroelke
Copy link
Member

@rroelke rroelke commented Feb 17, 2025

#5417 made several changes to SparseGlobalOrderReader::merge_result_cell_slabs. One of these changes erroneously caused coordinates which do not pass the query condition to be emitted. This pull request restores the correct behavior.

The crux of the change is

            length = 1;

to

            length = to_process.max_slab_length(1);

The affected line of code occurs during coordinate de-duplication. merge_result_cell_slabs puts coordinates from the leading tile of each fragment into a priority queue and pops them off. If the coordinate at the head of the queue matches what we just pulled off, then we continue popping until there is no longer a match.

This de-duplication can cause a fragment to run out of loaded tiles, in which case it is necessary to end the loop to get more data so that we do not emit coordinates out of order. That's how we get to this length = 1 code, which signals that we should emit a result slab of length 1. In contrast, max_slab_length(1) checks the query condition bitmap to see if the duplicated coordinate actually belongs in the query result.

This was found by a customer data set which has many fragments which are essentially copies of the whole coordinate domain. This scenario de-duplicates many coordinates and is likely to hit the condition described. The new test Sparse global order reader: fragment full copy 1d simulates this set-up.

A large fraction of the changes in this PR are the test scaffolding changes to support evaluating query conditions upon our FxRun1D and FxRun2D structures, so we can continue to check whether the expected and actual results match in the presence of a query condition.


TYPE: BUG
DESC: Fix query condition regression introduced by #5417

@rroelke rroelke force-pushed the rr/sc-61471-sparse-global-order-query-condition-regression branch from 9fbc7a0 to b8e615c Compare February 18, 2025 02:00
@rroelke rroelke requested a review from ypatia February 18, 2025 03:39
// no side effects. Internally we will modify `base::pos_` because
// the comparators need that to get the right coordinate.
// This is only a problem if this is called concurrently (and even then
// it would have been a problem anyway).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean that it would have been a problem anyway? Before this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why this comment is unclear. I wanted to declare it const because it has no side effects if it is only called in a serializable way. But that requires the const_cast because we have to temporarily modify state, hence the discussion of thread safety.

I have sympathies for the notion that because of the temporary state change this should not be declared const. What I intended to highlight with this comment is that whether or not this is const the thread safety issue exists, because the risk still exists that thread A saves and restores a temporary position used by thread B.

Comment on lines +278 to +280
* @postcondition this method has no side-effects if it is never called
* concurrently by multiple threads. If multiple threads call this method
* concurrently then results may be undefined.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for highlighting this. We need to be more aware that not all current reader steps are designed for concurrency and that they can silently cause undefined behavior.

@rroelke rroelke merged commit 022af98 into main Feb 18, 2025
59 checks passed
@rroelke rroelke deleted the rr/sc-61471-sparse-global-order-query-condition-regression branch February 18, 2025 14:34
@rroelke
Copy link
Member Author

rroelke commented Feb 18, 2025

/backport to release-2.27

(to fix the regression introduced over there)

Copy link
Contributor

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13393039545

Copy link
Contributor

@rroelke an error occurred while backporting to "release-2.27", please check the run log for details!

Error: @rroelke is not a repo collaborator, backporting is not allowed. If you're a collaborator please make sure your TileDB-Inc team membership visibility is set to Public on https://github.com/orgs/TileDB-Inc/people?query=rroelke

@rroelke
Copy link
Member Author

rroelke commented Feb 18, 2025

/backport to release-2.27

(take 2)

Copy link
Contributor

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13394045510

Copy link
Contributor

@rroelke an error occurred while backporting to "release-2.27", please check the run log for details!

Error: @rroelke is not a repo collaborator, backporting is not allowed. If you're a collaborator please make sure your TileDB-Inc team membership visibility is set to Public on https://github.com/orgs/TileDB-Inc/people?query=rroelke

@rroelke
Copy link
Member Author

rroelke commented Feb 18, 2025

/backport to release-2.27

(take 3)

Copy link
Contributor

Started backporting to release-2.27: https://github.com/TileDB-Inc/TileDB/actions/runs/13394134690

ihnorton pushed a commit that referenced this pull request Feb 18, 2025
#5417 made several changes to
`SparseGlobalOrderReader::merge_result_cell_slabs`. One of these changes
erroneously caused coordinates which do not pass the query condition to
be emitted. This pull request restores the correct behavior.

The crux of the change is
```
            length = 1;
```
to
```
            length = to_process.max_slab_length(1);
```

The affected line of code occurs during coordinate de-duplication.
`merge_result_cell_slabs` puts coordinates from the leading tile of each
fragment into a priority queue and pops them off. If the coordinate at
the head of the queue matches what we just pulled off, then we continue
popping until there is no longer a match.

This de-duplication can cause a fragment to run out of loaded tiles, in
which case it is necessary to end the loop to get more data so that we
do not emit coordinates out of order. That's how we get to this `length
= 1` code, which signals that we should emit a result slab of length 1.
In contrast, `max_slab_length(1)` checks the query condition bitmap to
see if the duplicated coordinate actually belongs in the query result.

This was found by a customer data set which has many fragments which are
essentially copies of the whole coordinate domain. This scenario
de-duplicates many coordinates and is likely to hit the condition
described. The new test `Sparse global order reader: fragment full copy
1d` simulates this set-up.

A large fraction of the changes in this PR are the test scaffolding
changes to support evaluating query conditions upon our `FxRun1D` and
`FxRun2D` structures, so we can continue to check whether the expected
and actual results match in the presence of a query condition.

---
TYPE: BUG
DESC: Fix query condition regression introduced by #5417

(cherry picked from commit 022af98)
ihnorton pushed a commit that referenced this pull request Feb 19, 2025
#5417 made several changes to
`SparseGlobalOrderReader::merge_result_cell_slabs`. One of these changes
erroneously caused coordinates which do not pass the query condition to
be emitted. This pull request restores the correct behavior.

The crux of the change is
```
            length = 1;
```
to
```
            length = to_process.max_slab_length(1);
```

The affected line of code occurs during coordinate de-duplication.
`merge_result_cell_slabs` puts coordinates from the leading tile of each
fragment into a priority queue and pops them off. If the coordinate at
the head of the queue matches what we just pulled off, then we continue
popping until there is no longer a match.

This de-duplication can cause a fragment to run out of loaded tiles, in
which case it is necessary to end the loop to get more data so that we
do not emit coordinates out of order. That's how we get to this `length
= 1` code, which signals that we should emit a result slab of length 1.
In contrast, `max_slab_length(1)` checks the query condition bitmap to
see if the duplicated coordinate actually belongs in the query result.

This was found by a customer data set which has many fragments which are
essentially copies of the whole coordinate domain. This scenario
de-duplicates many coordinates and is likely to hit the condition
described. The new test `Sparse global order reader: fragment full copy
1d` simulates this set-up.

A large fraction of the changes in this PR are the test scaffolding
changes to support evaluating query conditions upon our `FxRun1D` and
`FxRun2D` structures, so we can continue to check whether the expected
and actual results match in the presence of a query condition.

---
TYPE: BUG
DESC: Fix query condition regression introduced by #5417

(cherry picked from commit 022af98)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants