Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream #15348

zhuqi-lucas · 2025-03-21T09:54:25Z

Which issue does this PR close?

Part of #15096

Rationale for this change

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

What changes are included in this PR?

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

Are these changes tested?

Yes

Are there any user-facing changes?

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

Omega359 · 2025-03-21T11:32:08Z

datafusion/physical-plan/src/sorts/cursor.rs

+    }
+
+    fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
+        unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() }


Please add a 'safety:' note to say why it is ok to use unsafe here. An example

Thank you @Omega359 for the review. Good example, I will address it.

2010YOUY01 · 2025-03-21T11:52:49Z

Thank you for the work on better Utf8View support. I tried one sort benchmark with sort-preserving merging on a single Utf8View column, but it gets slower:

Reproducer

cargo run --profile release-nonlto --bin dfbench -- sort-tpch -p /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -q 3

main: 8s
pr: 10s

According to the flamegraph, extra overhead from _platform_memcmp (libsystem_platform.dylib) showed up inside SortPreservingMergeStream.
It's not obvious why; I'll try to help figure it out later.

flamegraphs.zip

zhuqi-lucas · 2025-03-21T12:35:50Z

Thank you for the work on better Utf8View support. I tried one sort benchmark with sort-preserving merging on a single Utf8View column, but it gets slower:

Reproducer
cargo run --profile release-nonlto --bin dfbench -- sort-tpch -p /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -q 3
main: 8s pr: 10s

According to the flamegraph, an extra overhead of libsystem_platform.dylib_platform_memcmp showed up inside SortPreservingMergeStream It's not obvious why, I'll try to help figure it out later.

flamegraphs.zip

Thank you @2010YOUY01 for the review. I think I know what causes the regression in the reproducer above:

Q3 of the sort benchmark is a special case. It sorts by l_comment, whose values are mostly strings longer than 12 bytes, and many of them share the same prefix. That means the 4-byte prefix stored in the view is also the same, so the comparison falls through to its last step and compares the full strings in the data buffers, which causes the regression.
You can try sorting a more typical case, where the strings are mostly 12 bytes or shorter. Strings longer than 12 bytes also benefit whenever the 4-byte prefix in the view is enough to decide the comparison. For example, change Q3 to the following SQL, which orders by a short string column:

SELECT l_shipmode, l_comment, l_partkey
        FROM lineitem
        ORDER BY l_shipmode;

This query shows the performance improvement.

Finally, I think we should create a follow-up ticket to investigate and improve the regression case. It would be valuable to fix. Thanks!
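To illustrate the explanation above, here is a minimal sketch of a view-style comparison. The `View` enum, `make_view`, and `compare` are hypothetical names for this illustration, not the arrow-rs layout or API. The sketch only assumes what the comment describes: strings of up to 12 bytes live inside the view, while longer strings keep a 4-byte prefix in the view and their full bytes in a separate buffer. The returned boolean reports whether the comparison had to read the buffer.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified model of a string view (not the arrow-rs type).
enum View {
    /// Strings of at most 12 bytes are stored entirely inside the view.
    Inline { len: u8, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix plus their location in a data buffer.
    Buffer { prefix: [u8; 4], offset: usize, len: usize },
}

/// Builds a view for `s`, appending long strings to `buffer`.
fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    if bytes.len() <= 12 {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len: bytes.len() as u8, data }
    } else {
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        let offset = buffer.len();
        buffer.extend_from_slice(bytes);
        View::Buffer { prefix, offset, len: bytes.len() }
    }
}

/// Full bytes of the string behind a view.
fn bytes_of<'a>(v: &'a View, buffer: &'a [u8]) -> &'a [u8] {
    match v {
        View::Inline { len, data } => &data[..*len as usize],
        View::Buffer { offset, len, .. } => &buffer[*offset..*offset + *len],
    }
}

/// First four bytes of the string, zero-padded for very short strings.
fn prefix_of(v: &View) -> [u8; 4] {
    match v {
        View::Inline { data, .. } => [data[0], data[1], data[2], data[3]],
        View::Buffer { prefix, .. } => *prefix,
    }
}

/// Returns the ordering and whether the data buffer had to be read.
fn compare(l: &View, r: &View, buffer: &[u8]) -> (Ordering, bool) {
    // Fast path 1: both strings are inline, no buffer access needed.
    if let (View::Inline { .. }, View::Inline { .. }) = (l, r) {
        return (bytes_of(l, buffer).cmp(bytes_of(r, buffer)), false);
    }
    // Fast path 2: differing prefixes decide the comparison.
    let by_prefix = prefix_of(l).cmp(&prefix_of(r));
    if by_prefix != Ordering::Equal {
        return (by_prefix, false);
    }
    // Slow path: equal prefixes force a full comparison of the buffer bytes.
    (bytes_of(l, buffer).cmp(bytes_of(r, buffer)), true)
}
```

In this model, two long comments that start with the same four bytes always reach the slow path, which matches the l_comment case, while short values such as l_shipmode never touch the buffer.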

zhuqi-lucas · 2025-03-22T09:54:34Z

Updated with results for sorting short strings, which benefits a lot from the StringView type. Here is the new Q11 added to the sort benchmark:

-    const SORT_QUERIES: [&'static str; 10] = [
+    const SORT_QUERIES: [&'static str; 11] = [
         // Q1: 1 sort key (type: INTEGER, cardinality: 7) + 1 payload column
         r#"
         SELECT l_linenumber, l_partkey
@@ -159,6 +159,12 @@ impl RunOpt {
         FROM lineitem
         ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment
         "#,
+        // Q11: 1 sort key (type: VARCHAR, cardinality: 4.5M) + 1 payload column
+        r#"
+        SELECT l_shipmode, l_comment, l_partkey
+        FROM lineitem
+        ORDER BY l_shipmode;
+        "#,
     ];

This PR:

Q11 iteration 0 took 5645.3 ms and returned 59986052 rows
Q11 iteration 1 took 5641.1 ms and returned 59986052 rows
Q11 iteration 2 took 5520.6 ms and returned 59986052 rows
Q11 avg time: 5602.33 ms

The main:

Q11 iteration 0 took 6687.5 ms and returned 59986052 rows
Q11 iteration 1 took 6504.5 ms and returned 59986052 rows
Q11 iteration 2 took 6544.6 ms and returned 59986052 rows
Q11 avg time: 6578.87 ms

About 15% less time on average (6578.87 ms down to 5602.33 ms, roughly 1.17x faster).
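A small sketch for turning the iteration times above into a relative improvement. It assumes the same convention the benchmark tables later in this thread appear to use, `(baseline - candidate) / baseline`; the function names `mean` and `improvement_pct` are illustrative only.

```rust
/// Mean of a non-empty slice of timings in milliseconds.
fn mean(times_ms: &[f64]) -> f64 {
    times_ms.iter().sum::<f64>() / times_ms.len() as f64
}

/// Relative reduction in time versus the baseline, in percent.
/// Positive means the candidate is faster; negative means it is slower.
fn improvement_pct(baseline_ms: f64, candidate_ms: f64) -> f64 {
    (baseline_ms - candidate_ms) / baseline_ms * 100.0
}
```

Feeding the three `main` iterations and the three PR iterations through `mean`, then passing both averages to `improvement_pct`, gives the relative change for Q11 under that convention.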

alamb

Thank you @zhuqi-lucas -- this looks pretty sweet. I think we need to sort out nulls and safety comment and this will be good to go

alamb · 2025-03-22T14:30:48Z

datafusion/physical-plan/src/sorts/cursor.rs

+    }
+
+    fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
+        unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() }


I agree it would be good to justify the use of unchecked (which I think is ok here)

The docs say https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.compare_unchecked

So maybe the safety argument is mostly "The left/right_idx must be within range of each array"

It also seems like we need to be comparing the null masks too 🤔 like checking if the values are null before comparing

Given that this comparison is typically the hottest part of a merge operation, maybe we should try using unchecked comparisons elsewhere
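As a sketch of what such a safety argument can look like, the following std-only example keeps an "index is always in range" invariant inside a small cursor type and relies on it for an unchecked read. `Cursor`, `new`, `advance`, and `compare` are hypothetical names for this illustration and are not the DataFusion cursor API; the real argument for `compare_unchecked` would need to be made against the actual cursor code.

```rust
use std::cmp::Ordering;

/// Hypothetical cursor over sorted values; `offset < values.len()` always holds.
struct Cursor {
    values: Vec<u64>,
    offset: usize,
}

impl Cursor {
    /// Returns `None` for empty input so the invariant holds from the start.
    fn new(values: Vec<u64>) -> Option<Self> {
        if values.is_empty() {
            None
        } else {
            Some(Self { values, offset: 0 })
        }
    }

    /// Moves to the next value; returns `false` (and does not move) at the end.
    fn advance(&mut self) -> bool {
        if self.offset + 1 < self.values.len() {
            self.offset += 1;
            true
        } else {
            false
        }
    }

    /// Compares the current values of two cursors without bounds checks.
    fn compare(&self, other: &Self) -> Ordering {
        // SAFETY: `new` rejects empty input and `advance` never moves past the
        // last element, so `offset < values.len()` holds for both cursors.
        let (a, b) = unsafe {
            (
                self.values.get_unchecked(self.offset),
                other.values.get_unchecked(other.offset),
            )
        };
        a.cmp(b)
    }
}
```

The point of the pattern is that the `SAFETY:` comment names the invariant and where it is enforced, so a reviewer can check the unsafe block without reading every caller.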

zhuqi-lucas · 2025-03-23T10:07:26Z

Thank you @zhuqi-lucas -- this looks pretty sweet. I think we need to sort out nulls and safety comment and this will be good to go

Thank you @alamb for the review, good suggestion. I checked, and the null check is already done in the parent wrapper, for example:

impl<T: CursorValues> CursorValues for ArrayValues<T> {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
        match (l.is_null(l_idx), r.is_null(r_idx)) {
            (true, true) => true,
            (false, false) => T::eq(&l.values, l_idx, &r.values, r_idx),
            _ => false,
        }
    }

    fn eq_to_previous(cursor: &Self, idx: usize) -> bool {
        assert!(idx > 0);
        match (cursor.is_null(idx), cursor.is_null(idx - 1)) {
            (true, true) => true,
            (false, false) => T::eq(&cursor.values, idx, &cursor.values, idx - 1),
            _ => false,
        }
    }

    fn compare(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> Ordering {
        match (l.is_null(l_idx), r.is_null(r_idx)) {
            (true, true) => Ordering::Equal,
            (true, false) => match l.options.nulls_first {
                true => Ordering::Less,
                false => Ordering::Greater,
            },
            (false, true) => match l.options.nulls_first {
                true => Ordering::Greater,
                false => Ordering::Less,
            },
            (false, false) => match l.options.descending {
                true => T::compare(&r.values, r_idx, &l.values, l_idx),
                false => T::compare(&l.values, l_idx, &r.values, r_idx),
            },
        }
    }
}

I tried to address the comments and suggestions in the latest commit. As for the regression when comparing longer strings with StringView, #15348 (comment),
I still need time to investigate more, and I am willing to create a new ticket to dig into it. Thanks.
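For readers who want to try the null handling in isolation, here is a runnable sketch of the same logic that uses `Option<&str>` in place of the null mask. `SortOptions` is a hypothetical local stand-in with only the two fields used by the code quoted above, and `compare_nullable` is an illustrative name.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for the sort options used by the wrapper above.
#[derive(Clone, Copy)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

/// Orders two nullable values: nulls are placed by `nulls_first`,
/// and non-null values are compared in the requested direction.
fn compare_nullable(l: Option<&str>, r: Option<&str>, opts: SortOptions) -> Ordering {
    match (l, r) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if opts.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if opts.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        // Only this arm reaches the value comparison, so the inner
        // comparator never sees a null slot.
        (Some(a), Some(b)) => {
            if opts.descending { b.cmp(a) } else { a.cmp(b) }
        }
    }
}
```

As in the quoted wrapper, `descending` only swaps the operands of the value comparison, and null placement depends on `nulls_first` alone.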

zhuqi-lucas · 2025-03-24T05:21:34Z

Added some new benchmarks. We need to improve high cardinality performance for sorting with utf8_view; the largest regression is in sort partitioned.

Comparison: UTF8 vs UTF8_VIEW Sorting Performance

Based on the benchmark results, we compare utf8 and utf8_view across different sorting methods, including low cardinality and high cardinality cases.

Low Cardinality Performance

Sorting Method	`utf8` Time	`utf8_view` Time	`utf8_view` Improvement
merge sorted	3.8926 ms	3.6713 ms	5.7% faster
sort merge	3.9152 ms	3.6265 ms	7.4% faster
sort	6.0351 ms	5.7904 ms	4.1% faster
sort partitioned	236.24 µs	167.18 µs	29.2% faster

Observations

utf8_view is consistently faster across all sorting methods.
The most significant improvement is in sort partitioned (29.2% faster).
sort merge also benefits significantly (7.4% faster), likely due to utf8_view reducing memory allocations or copies.

High Cardinality Performance

Sorting Method	`utf8` Time	`utf8_view` Time	`utf8_view` Improvement
merge sorted	4.6662 ms	5.0999 ms	-9.3% (slower)
sort merge	4.7102 ms	5.7224 ms	-21.5% (slower)
sort	7.0020 ms	6.3274 ms	9.6% faster
sort partitioned	242.99 µs	679.86 µs	-180% (much slower)

Observations

utf8_view performs worse for high cardinality cases:
- merge sorted is 9.3% slower.
- sort merge is 21.5% slower.
- sort partitioned is 180% slower, a drastic drop.
However, utf8_view still improves the sort method by 9.6%, likely due to reduced string operations.

Key Takeaways

For low cardinality, utf8_view is the better choice, especially for sort partitioned (29.2% faster) and sort merge (7.4% faster).
For high cardinality, utf8_view underperforms in merge sorted, sort merge, and especially sort partitioned, making it a worse choice.

zhuqi-lucas · 2025-03-24T07:25:15Z

I compared the flamegraphs of the sort partitioned benchmark for utf8 and utf8_view with high cardinality:

The utf8_view: [flamegraph image]

The utf8: [flamegraph image]

It looks like the utf8 case reserves less memory than the utf8_view case, so it stays under sort_in_place_threshold_bytes and takes the optimized concat_batches path:

        // If less than sort_in_place_threshold_bytes, concatenate and sort in place
        if self.reservation.size() < self.sort_in_place_threshold_bytes {
            // Concatenate memory batches together and sort
            let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
            self.in_mem_batches.clear();
            self.reservation
                .try_resize(get_reserved_byte_for_record_batch(&batch))?;
            let reservation = self.reservation.take();
            return self.sort_batch_stream(batch, metrics, reservation);
        }

So it is much faster. I still need to dig into why Utf8View reserves more memory for each partition.

Update: when I change the sort_in_place_threshold_bytes default value from 1M to 2M, sort partitioned for utf8_view improves hugely, from 679.86 µs to 179.79 µs:

sort partitioned utf8 view high cardinality
                        time:   [178.27 µs 179.79 µs 181.19 µs]

Created a follow-up ticket for this improvement:

#15375
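The effect of the threshold can be sketched without DataFusion. The example below is a simplified model under stated assumptions: batches are plain `Vec<String>`, the reserved size is passed in as a number, and `sort_in_memory`, `merge_sorted`, and `Strategy` are hypothetical names. It only shows the branching described above: below the threshold everything is concatenated and sorted once, otherwise each batch is sorted and the results are merged.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Debug, PartialEq)]
enum Strategy {
    ConcatAndSort,
    SortThenMerge,
}

/// K-way merge of already sorted batches using a min-heap.
fn merge_sorted(batches: &[Vec<String>]) -> Vec<String> {
    let mut heap = BinaryHeap::new();
    for (b, batch) in batches.iter().enumerate() {
        if let Some(first) = batch.first() {
            heap.push(Reverse((first.as_str(), b, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((value, b, i))) = heap.pop() {
        out.push(value.to_string());
        if let Some(next) = batches[b].get(i + 1) {
            heap.push(Reverse((next.as_str(), b, i + 1)));
        }
    }
    out
}

/// Chooses the sort path from the reserved size, mirroring the quoted check.
fn sort_in_memory(
    mut batches: Vec<Vec<String>>,
    reserved_bytes: usize,
    threshold_bytes: usize,
) -> (Vec<String>, Strategy) {
    if reserved_bytes < threshold_bytes {
        // Small input: concatenate all batches and sort once.
        let mut all: Vec<String> = batches.into_iter().flatten().collect();
        all.sort();
        (all, Strategy::ConcatAndSort)
    } else {
        // Larger input: sort each batch, then merge the sorted runs.
        for batch in &mut batches {
            batch.sort();
        }
        (merge_sorted(&batches), Strategy::SortThenMerge)
    }
}
```

With a made-up reservation of 1,500,000 bytes, a 1,000,000-byte threshold selects `SortThenMerge` and a 2,000,000-byte threshold selects `ConcatAndSort`. Both return the same sorted output, so only the cost differs, which is consistent with the benchmark change reported above.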

zhuqi-lucas · 2025-03-24T09:48:06Z

I did a POC of automatically applying concat_batches, which is a separate improvement from this PR:

#15375 (comment)

I can see a very good performance improvement, but it needs more testing and investigation. It is not limited to utf8_view, and this PR was not applied when testing for the results in the comment linked above.

Perf: Support Utf8View datatype single column comparisons for SortPre…

8343d5e

…servingMergeStream

Omega359 reviewed Mar 21, 2025

alamb reviewed Mar 22, 2025

Add safety and bench sql

93e46fb

zhuqi-lucas added 3 commits March 23, 2025 18:09

fix

1a3857c

Fix

d3808c1

Add benchmark testing

ef32003

github-actions bot added the core Core DataFusion crate label Mar 24, 2025

zhuqi-lucas mentioned this pull request Mar 24, 2025

Perf: Support automatically concat_batches for sort which will improve performance #15375

Open

Weijun-H changed the title ~~Perf: Support Utf8View datatype single column comparisons for SortPre…~~ Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream Mar 24, 2025
