-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream #15348
base: main
Are you sure you want to change the base?
Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream #15348
Conversation
…servingMergeStream
} | ||
|
||
fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool { | ||
unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a 'safety:' note to say why is is ok to use unsafe here. An example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Omega359 for review, good example, i will address it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be good to justify the use of unchecked (which I think is ok here)
The docs say https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.compare_unchecked
SO maybe the safety argument is mostly "The left/right_idx must within range of each array"
It also seems like we need to be comparing the Null masks too 🤔 like checking if the values are null before comparing
Given that this comparison is typically the hottest part of a merge operation maybe we should try using unchecked comparisions elswhere
Thank you for the work on better Utf8View support. I tried one sort benchmark with sort-preserving merging on a single Reproducer
main: 8s According to the flamegraph, an extra overhead of |
Thank you @2010YOUY01 for review, i may know the problem about the above Reproducer:
SELECT l_shipmode, l_comment, l_partkey
FROM lineitem
ORDER BY l_shipmode; It will show the performance improvement. And finally, i think we need to create a follow-up ticket to improve and investigate the regression case. It will be valuable for us to improve it. Thanks! |
Updated the result for short string sort which will benefit a lot from StringView type, add here is the Q 11 for sort test: - const SORT_QUERIES: [&'static str; 10] = [
+ const SORT_QUERIES: [&'static str; 11] = [
// Q1: 1 sort key (type: INTEGER, cardinality: 7) + 1 payload column
r#"
SELECT l_linenumber, l_partkey
@@ -159,6 +159,12 @@ impl RunOpt {
FROM lineitem
ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment
"#,
+ // Q11: 1 sort key (type: VARCHAR, cardinality: 4.5M) + 1 payload column
+ r#"
+ SELECT l_shipmode, l_comment, l_partkey
+ FROM lineitem
+ ORDER BY l_shipmode;
+ "#,
]; This PR: Q11 iteration 0 took 5645.3 ms and returned 59986052 rows
Q11 iteration 1 took 5641.1 ms and returned 59986052 rows
Q11 iteration 2 took 5520.6 ms and returned 59986052 rows
Q11 avg time: 5602.33 ms The main: Q11 iteration 0 took 6687.5 ms and returned 59986052 rows
Q11 iteration 1 took 6504.5 ms and returned 59986052 rows
Q11 iteration 2 took 6544.6 ms and returned 59986052 rows
Q11 avg time: 6578.87 ms About 20% performance improvement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas -- this looks pretty sweet. I think we need to sort out nulls and safety comment and this will be good to go
} | ||
|
||
fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool { | ||
unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it would be good to justify the use of unchecked (which I think is ok here)
The docs say https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.compare_unchecked
SO maybe the safety argument is mostly "The left/right_idx must within range of each array"
It also seems like we need to be comparing the Null masks too 🤔 like checking if the values are null before comparing
Given that this comparison is typically the hottest part of a merge operation maybe we should try using unchecked comparisions elswhere
Thank you @alamb for review, good suggestion, and i checked the nullable check is checked in the parent wrapper call, for example: impl<T: CursorValues> CursorValues for ArrayValues<T> {
fn len(&self) -> usize {
self.values.len()
}
fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
match (l.is_null(l_idx), r.is_null(r_idx)) {
(true, true) => true,
(false, false) => T::eq(&l.values, l_idx, &r.values, r_idx),
_ => false,
}
}
fn eq_to_previous(cursor: &Self, idx: usize) -> bool {
assert!(idx > 0);
match (cursor.is_null(idx), cursor.is_null(idx - 1)) {
(true, true) => true,
(false, false) => T::eq(&cursor.values, idx, &cursor.values, idx - 1),
_ => false,
}
}
fn compare(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> Ordering {
match (l.is_null(l_idx), r.is_null(r_idx)) {
(true, true) => Ordering::Equal,
(true, false) => match l.options.nulls_first {
true => Ordering::Less,
false => Ordering::Greater,
},
(false, true) => match l.options.nulls_first {
true => Ordering::Greater,
false => Ordering::Less,
},
(false, false) => match l.options.descending {
true => T::compare(&r.values, r_idx, &l.values, l_idx),
false => T::compare(&l.values, l_idx, &r.values, r_idx),
},
}
}
} I try to address comments and suggestions in latest PR. And for longer string compare regression for StringView, #15348 (comment) |
Added some new testing, we need to improve High Cardinality Performance for sorting with utf8_view, and the most performance regression is with sort_partitioned. Comparison: UTF8 vs UTF8_VIEW Sorting PerformanceBased on the benchmark results, we compare Low Cardinality Performance
Observations
High Cardinality Performance
Observations
Key Takeaways
|
I did some POC of the automatically concat_batches which is totally another improvement ticket besides this PR: Very good performance improvement i can see, need more testing and investigation. And it's not limited to utf8_view enabled, i did not apply this PR to the testing for above comments result. |
Which issue does this PR close?
Rationale for this change
Support Utf8View datatype single column comparisons for SortPreservingMergeStream
What changes are included in this PR?
Support Utf8View datatype single column comparisons for SortPreservingMergeStream
Are these changes tested?
Yes
Are there any user-facing changes?
Support Utf8View datatype single column comparisons for SortPreservingMergeStream