Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream #15348

Status: Open (5 commits, base: main)


@zhuqi-lucas zhuqi-lucas commented Mar 21, 2025

Which issue does this PR close?

Rationale for this change

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

What changes are included in this PR?

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

Are these changes tested?

Yes

Are there any user-facing changes?

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

}

fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() }

Please add a 'safety:' note to say why it is ok to use unsafe here. An example


Thank you @Omega359 for the review. Good example, I will address it.


I agree it would be good to justify the use of unchecked (which I think is ok here)

The docs say https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.compare_unchecked

So maybe the safety argument is mostly "The left/right_idx must within range of each array"

It also seems like we need to be comparing the null masks too 🤔, i.e. checking whether the values are null before comparing

Given that this comparison is typically the hottest part of a merge operation, maybe we should try using unchecked comparisons elsewhere
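
A minimal sketch of the kind of safety note being discussed, written against plain slices from the standard library rather than arrow. The functions `eq_unchecked` and `eq` are hypothetical stand-ins for the cursor code; the point is only the shape of the `// SAFETY:` comment, which states the in-range precondition and who upholds it.

```rust
/// Compares two elements without bounds checks.
///
/// # Safety
/// `l_idx` must be less than `l.len()` and `r_idx` must be less than `r.len()`.
unsafe fn eq_unchecked(l: &[&str], l_idx: usize, r: &[&str], r_idx: usize) -> bool {
    // SAFETY: the caller guarantees that both indices are in range (see above).
    unsafe { l.get_unchecked(l_idx) == r.get_unchecked(r_idx) }
}

/// Safe entry point (hypothetical): validates the indices once, then uses
/// the unchecked comparison.
fn eq(l: &[&str], l_idx: usize, r: &[&str], r_idx: usize) -> bool {
    assert!(l_idx < l.len() && r_idx < r.len());
    // SAFETY: both indices were checked against their slice lengths just above.
    unsafe { eq_unchecked(l, l_idx, r, r_idx) }
}
```

In a merge cursor the range check would presumably come from an invariant of the cursor itself (the index never passes `len()`), and the comment should name that invariant rather than re-checking on every call.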

@2010YOUY01

Thank you for the work on better Utf8View support. I tried one sort benchmark with sort-preserving merging on a single Utf8View column, but it gets slower:

Reproducer

cargo run --profile release-nonlto --bin dfbench -- sort-tpch -p /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -q 3

main: 8s
pr: 10s

According to the flamegraph, an extra overhead of libsystem_platform.dylib_platform_memcmp showed up inside SortPreservingMergeStream.
It's not obvious why; I'll try to help figure it out later.

flamegraphs.zip

@zhuqi-lucas

zhuqi-lucas commented Mar 21, 2025

Thank you @2010YOUY01 for the review. I think I know the cause of the regression in the reproducer above:

  1. The Q3 sort benchmark is a special case: it sorts by l_comment, whose values are mostly long strings (more than 12 bytes), and many of them share the same prefix. The 4-byte prefix stored in the view is then identical too, so the comparison falls through to its last step and compares the full data in the buffers, which causes the regression.
  2. You can try sorting a more typical column, where most strings are 12 bytes or fewer. Even when some values are longer than 12 bytes, the comparison can often still be decided from the 4-byte prefix in the view. For example, change Q3 to the following SQL, which orders by a short string column:
SELECT l_shipmode, l_comment, l_partkey
        FROM lineitem
        ORDER BY l_shipmode;

This shows a performance improvement.

Finally, I think we should create a follow-up ticket to investigate and improve the regression case. That would be valuable. Thanks!
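
To make the two cases above concrete, here is a rough model of the comparison, using only the standard library. The `View` enum is a simplified assumption for illustration, not arrow's actual 16-byte layout or its `compare_unchecked` implementation. The second tuple element returned by `compare` reports whether the slow path that reads the shared buffer was needed.

```rust
use std::cmp::Ordering;

/// Simplified model of a string view (hypothetical, for illustration only).
enum View {
    /// Strings of at most 12 bytes live entirely inside the view.
    Inline { len: usize, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix plus a location in a shared buffer.
    Buffered { prefix: [u8; 4], offset: usize, len: usize },
}

impl View {
    fn new(s: &str, buffer: &mut Vec<u8>) -> View {
        let bytes = s.as_bytes();
        if bytes.len() <= 12 {
            let mut data = [0u8; 12];
            data[..bytes.len()].copy_from_slice(bytes);
            View::Inline { len: bytes.len(), data }
        } else {
            let prefix = [bytes[0], bytes[1], bytes[2], bytes[3]];
            let offset = buffer.len();
            buffer.extend_from_slice(bytes);
            View::Buffered { prefix, offset, len: bytes.len() }
        }
    }

    /// First four bytes, zero-padded for strings shorter than four bytes.
    fn prefix(&self) -> [u8; 4] {
        match self {
            View::Inline { data, .. } => [data[0], data[1], data[2], data[3]],
            View::Buffered { prefix, .. } => *prefix,
        }
    }

    /// Full string bytes; buffered views have to read the shared buffer.
    fn bytes<'a>(&'a self, buffer: &'a [u8]) -> &'a [u8] {
        match self {
            View::Inline { len, data } => &data[..*len],
            View::Buffered { offset, len, .. } => &buffer[*offset..*offset + *len],
        }
    }
}

/// Returns the ordering and whether the shared buffer had to be read.
fn compare(l: &View, r: &View, buffer: &[u8]) -> (Ordering, bool) {
    // Fast path: differing prefixes decide the order without touching the buffer.
    let (lp, rp) = (l.prefix(), r.prefix());
    if lp != rp {
        return (lp.cmp(&rp), false);
    }
    // Slow path: equal prefixes force a comparison of the full bytes.
    let reads_buffer =
        matches!(l, View::Buffered { .. }) || matches!(r, View::Buffered { .. });
    (l.bytes(buffer).cmp(r.bytes(buffer)), reads_buffer)
}
```

With values such as "carefully final deposits" and "carefully final requests" the prefixes match and the buffer is read, which is the l_comment situation; short values such as "MAIL" and "SHIP" are decided from the view alone.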

@zhuqi-lucas

zhuqi-lucas commented Mar 22, 2025

Updated with results for a short-string sort, which benefits a lot from the StringView type. I added Q11 to the sort benchmark:

-    const SORT_QUERIES: [&'static str; 10] = [
+    const SORT_QUERIES: [&'static str; 11] = [
         // Q1: 1 sort key (type: INTEGER, cardinality: 7) + 1 payload column
         r#"
         SELECT l_linenumber, l_partkey
@@ -159,6 +159,12 @@ impl RunOpt {
         FROM lineitem
         ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment
         "#,
+        // Q11: 1 sort key (type: VARCHAR, cardinality: 4.5M) + 1 payload column
+        r#"
+        SELECT l_shipmode, l_comment, l_partkey
+        FROM lineitem
+        ORDER BY l_shipmode;
+        "#,
     ];

This PR:

Q11 iteration 0 took 5645.3 ms and returned 59986052 rows
Q11 iteration 1 took 5641.1 ms and returned 59986052 rows
Q11 iteration 2 took 5520.6 ms and returned 59986052 rows
Q11 avg time: 5602.33 ms

The main:

Q11 iteration 0 took 6687.5 ms and returned 59986052 rows
Q11 iteration 1 took 6504.5 ms and returned 59986052 rows
Q11 iteration 2 took 6544.6 ms and returned 59986052 rows
Q11 avg time: 6578.87 ms

About 15% less time on average (5602.33 ms vs. 6578.87 ms).

@alamb alamb left a comment

Thank you @zhuqi-lucas -- this looks pretty sweet. I think we need to sort out nulls and safety comment and this will be good to go

@zhuqi-lucas

zhuqi-lucas commented Mar 23, 2025

Thank you @alamb for the review, good suggestion. I checked, and the null check is already done in the parent wrapper, for example:

impl<T: CursorValues> CursorValues for ArrayValues<T> {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
        match (l.is_null(l_idx), r.is_null(r_idx)) {
            (true, true) => true,
            (false, false) => T::eq(&l.values, l_idx, &r.values, r_idx),
            _ => false,
        }
    }

    fn eq_to_previous(cursor: &Self, idx: usize) -> bool {
        assert!(idx > 0);
        match (cursor.is_null(idx), cursor.is_null(idx - 1)) {
            (true, true) => true,
            (false, false) => T::eq(&cursor.values, idx, &cursor.values, idx - 1),
            _ => false,
        }
    }

    fn compare(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> Ordering {
        match (l.is_null(l_idx), r.is_null(r_idx)) {
            (true, true) => Ordering::Equal,
            (true, false) => match l.options.nulls_first {
                true => Ordering::Less,
                false => Ordering::Greater,
            },
            (false, true) => match l.options.nulls_first {
                true => Ordering::Greater,
                false => Ordering::Less,
            },
            (false, false) => match l.options.descending {
                true => T::compare(&r.values, r_idx, &l.values, l_idx),
                false => T::compare(&l.values, l_idx, &r.values, r_idx),
            },
        }
    }
}

I tried to address the comments and suggestions in the latest commits. As for the regression when comparing longer strings with StringView (#15348 (comment)), I still need time to investigate; I am willing to create a new ticket to dig into it. Thanks.
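
The division of labor in the wrapper can be modeled with plain `Option` values, as a rough sketch. `SortOptions` here is a local struct defined for the example, and `compare_nullable` is a hypothetical name; the inner `cmp` plays the role of `T::compare` and is only reached when both sides are non-null, which is why the unchecked comparison never has to look at the null mask itself.

```rust
use std::cmp::Ordering;

/// Local stand-in for the sort options used by the wrapper (assumption).
#[derive(Clone, Copy)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

/// Nulls are resolved here; the value comparison runs only for two non-null inputs.
fn compare_nullable<T: Ord>(l: Option<&T>, r: Option<&T>, options: SortOptions) -> Ordering {
    match (l, r) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if options.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if options.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        // Descending order is implemented by swapping the operands.
        (Some(a), Some(b)) => {
            if options.descending { b.cmp(a) } else { a.cmp(b) }
        }
    }
}
```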

@github-actions github-actions bot added the core Core DataFusion crate label Mar 24, 2025
@zhuqi-lucas

zhuqi-lucas commented Mar 24, 2025

I added some new benchmarks. We need to improve high cardinality performance for sorting with utf8_view; the largest regression is in sort partitioned.

Comparison: UTF8 vs UTF8_VIEW Sorting Performance

Based on the benchmark results, we compare utf8 and utf8_view across different sorting methods, including low cardinality and high cardinality cases.


Low Cardinality Performance

| Sorting Method | utf8 Time | utf8_view Time | utf8_view Improvement |
|---|---|---|---|
| merge sorted | 3.8926 ms | 3.6713 ms | 5.7% faster |
| sort merge | 3.9152 ms | 3.6265 ms | 7.4% faster |
| sort | 6.0351 ms | 5.7904 ms | 4.1% faster |
| sort partitioned | 236.24 µs | 167.18 µs | 29.2% faster |

Observations

  • utf8_view is consistently faster across all sorting methods.
  • The most significant improvement is in sort partitioned (29.2% faster).
  • sort merge also benefits significantly (7.4% faster), likely due to utf8_view reducing memory allocations or copies.

High Cardinality Performance

| Sorting Method | utf8 Time | utf8_view Time | utf8_view Improvement |
|---|---|---|---|
| merge sorted | 4.6662 ms | 5.0999 ms | -9.3% (slower) |
| sort merge | 4.7102 ms | 5.7224 ms | -21.5% (slower) |
| sort | 7.0020 ms | 6.3274 ms | 9.6% faster |
| sort partitioned | 242.99 µs | 679.86 µs | -180% (much slower) |

Observations

  • utf8_view performs worse for high cardinality cases:
    • merge sorted is 9.3% slower.
    • sort merge is 21.5% slower.
    • sort partitioned is 180% slower (about 2.8x the utf8 time), a drastic drop.
  • However, utf8_view still improves the sort method by 9.6%, likely due to reduced string operations.

Key Takeaways

  • For low cardinality, utf8_view is the better choice, especially for sort partitioned and sort merge, with 7.4% to 29.2% improvements.
  • For high cardinality, utf8_view underperforms in merge sorted, sort merge, and especially sort partitioned, making it a worse choice.
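
For reference, the percentages in the tables appear to be the relative change in time against utf8; that reading is an assumption on my part, but it reproduces the reported figures. A small sketch with hypothetical helper names:

```rust
/// Relative change in time of `new` versus `base`, in percent.
/// Negative values mean `new` is faster, positive values mean it is slower.
fn percent_change(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}

/// How many times longer `new` takes than `base`.
fn slowdown_factor(base: f64, new: f64) -> f64 {
    new / base
}
```

For the sort partitioned rows, `percent_change(236.24, 167.18)` is roughly -29.2 (low cardinality) and `percent_change(242.99, 679.86)` is roughly +180 (high cardinality), i.e. a `slowdown_factor` of about 2.8.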

@zhuqi-lucas

zhuqi-lucas commented Mar 24, 2025

I compared the sort partitioned flamegraphs for utf8 and utf8_view in the high cardinality case:

The utf8_view:

[image: flamegraph, utf8_view]

The utf8:

[image: flamegraph, utf8]

It looks like the utf8 sort partitioned case reserves less memory than utf8_view, so it takes the optimized path that uses concat_batches:

// If less than sort_in_place_threshold_bytes, concatenate and sort in place
if self.reservation.size() < self.sort_in_place_threshold_bytes {
    // Concatenate memory batches together and sort
    let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
    self.in_mem_batches.clear();
    self.reservation
        .try_resize(get_reserved_byte_for_record_batch(&batch))?;
    let reservation = self.reservation.take();
    return self.sort_batch_stream(batch, metrics, reservation);
}

That path is much faster. I still need to dig into why Utf8View reserves more memory for each partition.

Update: when I change the default value of sort_in_place_threshold_bytes from 1M to 2M, sort partitioned for utf8_view improves hugely, from 679.86 µs to 179.79 µs:

sort partitioned utf8 view high cardinality
                        time:   [178.27 µs 179.79 µs 181.19 µs]

I created a follow-up ticket for this improvement:

#15375
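
A toy model of the threshold decision, using `Vec<String>` batches instead of record batches. Everything here is a hypothetical sketch: the names `SortStrategy`, `choose_strategy`, and `sort_in_memory` are made up, the fallback branch (sort each batch, then merge) is my assumption about what the non-concat path does, and the reservation size is passed in as a plain number.

```rust
#[derive(Debug, PartialEq)]
enum SortStrategy {
    /// Concatenate all in-memory batches and sort once.
    ConcatAndSortInPlace,
    /// Sort every batch on its own, then merge the sorted runs.
    SortBatchesThenMerge,
}

fn choose_strategy(reserved_bytes: usize, sort_in_place_threshold_bytes: usize) -> SortStrategy {
    if reserved_bytes < sort_in_place_threshold_bytes {
        SortStrategy::ConcatAndSortInPlace
    } else {
        SortStrategy::SortBatchesThenMerge
    }
}

/// Merges two sorted runs into one sorted run.
fn merge_sorted(a: Vec<String>, b: Vec<String>) -> Vec<String> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut a, mut b) = (a.into_iter().peekable(), b.into_iter().peekable());
    loop {
        let take_left = match (a.peek(), b.peek()) {
            (Some(x), Some(y)) => x <= y,
            (Some(_), None) => true,
            (None, Some(_)) => false,
            (None, None) => break,
        };
        let next = if take_left { a.next() } else { b.next() };
        out.extend(next);
    }
    out
}

fn sort_in_memory(
    batches: Vec<Vec<String>>,
    reserved_bytes: usize,
    sort_in_place_threshold_bytes: usize,
) -> Vec<String> {
    match choose_strategy(reserved_bytes, sort_in_place_threshold_bytes) {
        SortStrategy::ConcatAndSortInPlace => {
            let mut all: Vec<String> = batches.into_iter().flatten().collect();
            all.sort();
            all
        }
        SortStrategy::SortBatchesThenMerge => {
            batches.into_iter().fold(Vec::new(), |acc, mut batch| {
                batch.sort();
                merge_sorted(acc, batch)
            })
        }
    }
}
```

With a made-up reservation of 1,500,000 bytes, a 1 MiB threshold selects the merge path while a 2 MiB threshold selects the concat path; both produce the same sorted output, only the amount of work differs.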

@Weijun-H Weijun-H changed the title Perf: Support Utf8View datatype single column comparisons for SortPre… Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream Mar 24, 2025
@zhuqi-lucas

zhuqi-lucas commented Mar 24, 2025

I did a POC of automatically applying concat_batches, which is a separate improvement from this PR:

#15375 (comment)

I can see a very good performance improvement, but it needs more testing and investigation. It is not limited to utf8_view, and I did not apply this PR when producing the results in the comments above.
