Improve performance of `first_value` by implementing special `GroupsAccumulator` #15266

UBarney · 2025-03-17T09:46:55Z

Which issue does this PR close?

Rationale for this change

| benchmark sql | main | this PR |
|---|---|---|
| `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4;` | 35.6s | 7s |
| `select l_shipmode, first_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;` | 0.979s | 0.86s |

What changes are included in this PR?

add FirstGroupsAccumulator

Are these changes tested?

Yes. Added new unit tests and a fuzz test.

Are there any user-facing changes?

Dandandan · 2025-03-18T09:26:13Z

datafusion/functions-aggregate/src/first_last.rs

@@ -179,6 +292,423 @@ impl AggregateUDFImpl for FirstValue {
    }
 }

+struct FirstGroupsAccumulator<T>


Suggested change

struct FirstGroupsAccumulator<T>

struct FirstPrimitiveGroupsAccumulator<T>

?

2010YOUY01 · 2025-03-18T09:36:45Z

datafusion/functions-aggregate/src/first_last.rs

+
+        let mut ordering_buf = Vec::with_capacity(self.ordering_req.len());
+
+        for (group_idx, idx) in self


(Just took a quick look, please correct me if I'm wrong)
Inside this function, it seems to:

1. 'Compress' the current input batch with get_filtered_min_of_each_group() (if there are multiple entries for the same group, only keep the smallest one according to the specified order)
2. Update the global state for the minimal value corresponding to all seen groups

Why is it split into two steps instead of directly updating the global state?

Inside this function, it seems to

'compress' the current input batch with get_filtered_min_of_each_group() (if there are multiple entries for the same group, only keep the smallest one according to the specified order)
Update the global state for the minimal value corresponding to all seen groups

Yes. You are right.

Why is it split into two steps instead of directly updating the global state?

According to this

Returns the first element in an aggregation group according to the requested ordering

The reason for splitting it into two steps is that it performs better when cardinality is low.
benchmark sql: `select l_shipmode, first_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;`

| version | time |
|---|---|
| main | 0.979s |
| this PR | 0.86s |
| without get_filtered_min_of_each_group | 1.25s |

extract_row_at_idx_to_buf has relatively high overhead. Calling get_filtered_min_of_each_group first avoids calling extract_row_at_idx_to_buf multiple times when the same group_idx appears more than once in a batch.

At first, I implemented it by updating the global state directly, but the performance actually got worse.
(At that time, I had added a benchmark in datafusion/core/benches/aggregate_query_sql.rs, and its time degraded from 3.9ms to 7ms.) 😂
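
The two-step flow discussed above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: a single i64 ordering key stands in for the ORDER BY columns, a HashMap holds the per-batch winners, and `GroupState`, `min_row_of_each_group`, and this `update_batch` signature are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;

/// Step 1: for one input batch, keep only the row with the smallest
/// ordering key per group. Ties keep the earliest row.
fn min_row_of_each_group(group_indices: &[usize], order_keys: &[i64]) -> HashMap<usize, usize> {
    let mut best: HashMap<usize, usize> = HashMap::new();
    for (row, &group) in group_indices.iter().enumerate() {
        let replace = match best.get(&group) {
            Some(&prev) => order_keys[row] < order_keys[prev],
            None => true,
        };
        if replace {
            best.insert(group, row);
        }
    }
    best
}

/// Per-group global state: the smallest ordering key seen so far and its value.
#[derive(Debug, Clone, PartialEq)]
struct GroupState {
    order_key: i64,
    value: i64,
}

/// Step 2: merge the per-batch winners into the global state, so the
/// (comparatively expensive) state update runs at most once per group per batch.
fn update_batch(
    state: &mut Vec<Option<GroupState>>,
    group_indices: &[usize],
    order_keys: &[i64],
    values: &[i64],
    total_num_groups: usize,
) {
    if state.len() < total_num_groups {
        state.resize(total_num_groups, None);
    }
    for (group, row) in min_row_of_each_group(group_indices, order_keys) {
        let replace = match &state[group] {
            Some(current) => order_keys[row] < current.order_key,
            None => true,
        };
        if replace {
            state[group] = Some(GroupState {
                order_key: order_keys[row],
                value: values[row],
            });
        }
    }
}
```

With few distinct groups per batch, step 1 reduces many rows to a handful of candidates before the global state is touched, which is consistent with the low-cardinality benchmark result reported above.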

blaginin · 2025-03-18T14:35:41Z

datafusion/functions-aggregate/src/first_last.rs

+    // Once we see the first value, we set the `is_sets[group_idx]` flag
+    is_sets: BooleanBufferBuilder,
+    // null_builder[group_idx] == false => vals[group_idx] is null
+    null_builder: BooleanBufferBuilder,


should we use NullState for that?

No. NullState does not pass NULL values to value_fn (see this). However, once RESPECT NULLS is supported, we cannot filter out rows where values[0] is null in update_batch.
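
To illustrate why two separate flags are kept, here is a minimal sketch in which Vec<bool> stands in for BooleanBufferBuilder. `FirstState`, `record_first`, and `get` are hypothetical names, and the ordering logic is omitted so only the "set" versus "null" distinction is visible.

```rust
/// Simplified per-group state; Vec<bool> replaces BooleanBufferBuilder here.
struct FirstState {
    vals: Vec<i64>,
    /// true once any row (NULL or not) has been recorded for the group
    is_sets: Vec<bool>,
    /// false => the recorded value for the group is NULL
    non_null: Vec<bool>,
}

impl FirstState {
    fn new(num_groups: usize) -> Self {
        FirstState {
            vals: vec![0; num_groups],
            is_sets: vec![false; num_groups],
            non_null: vec![false; num_groups],
        }
    }

    /// Records the first row seen for `group`, even when that row is NULL.
    fn record_first(&mut self, group: usize, value: Option<i64>) {
        if self.is_sets[group] {
            return;
        }
        self.is_sets[group] = true;
        self.non_null[group] = value.is_some();
        self.vals[group] = value.unwrap_or_default();
    }

    /// None: group never saw a row. Some(None): the first row was NULL.
    fn get(&self, group: usize) -> Option<Option<i64>> {
        if !self.is_sets[group] {
            return None;
        }
        Some(if self.non_null[group] {
            Some(self.vals[group])
        } else {
            None
        })
    }
}
```

A single validity bitmap could not distinguish "no row seen yet" from "the first row was NULL", which is the case that matters when nulls are respected.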

blaginin · 2025-03-18T14:37:28Z

datafusion/functions-aggregate/src/first_last.rs

+        if self.is_sets.len() < new_size {
+            self.is_sets.append_n(new_size - self.is_sets.len(), false);
+        }


i think you can use .resize?

Yes, using .resize is a better approach. 56b11c4

2010YOUY01

Thank you for the nice work, the implementation is very readable.

An alternative, I think, is to store the order keys in Row format instead of Vec<ScalarValue> to accelerate comparison. However, I don't think using Rows will be faster: here the number of comparisons for each value is lower than in sorting, so the row conversion overhead might dominate.

There is only one thing I want to make sure: Is the following null group case covered by existing tests? If not we should include them

select first(a order by b) group by c, d;

a  b  c     d
_  _  null null
_  _  1    null
_  _  null 1

2010YOUY01 · 2025-03-19T03:52:29Z

datafusion/functions-aggregate/src/first_last.rs

+{
+    fn update_batch(
+        &mut self,
+        values_with_orderings: &[ArrayRef],


nit: At first, I thought the input was ordered because of the name of this argument. Perhaps we can use values_and_order_cols?
Also, we can add a comment with an example, e.g. for first_value(a order by b), values_and_order_cols will be [a, b]

Done. Also added a test. 297de26
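
As a small illustration of the layout described in the review comment, the sketch below splits such a slice into the value column and the ordering columns. `Column` is a plain Vec<i64> stand-in for ArrayRef, and `split_values_and_order_cols` is a hypothetical helper, not a function from the PR.

```rust
/// Stand-in for an Arrow array; the real argument holds ArrayRef values.
type Column = Vec<i64>;

/// For `first_value(a order by b)`, the input is `[a, b]`:
/// the first column holds the values, the rest are the ORDER BY columns.
fn split_values_and_order_cols(values_and_order_cols: &[Column]) -> Option<(&Column, &[Column])> {
    values_and_order_cols.split_first()
}
```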

Dandandan · 2025-03-21T08:32:47Z

datafusion/functions-aggregate/src/first_last.rs

+                continue;
+            }
+
+            if !result.contains_key(&group_idx)


This could be optimized to 1 lookup (e.g. using HashMap::entry). From your profile, it looks like this is a hot function.

(Now it's 3 lookups worst case)
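
A sketch of the single-lookup pattern suggested here, using the standard library's entry API. The function name `min_row_per_group_entry` and the closure-based comparator are assumptions made for the example; the PR's comparator and filtering logic are not reproduced.

```rust
use std::cmp::Ordering;
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Maps each group index to the row that sorts first under `compare`,
/// hashing each group key only once per row.
fn min_row_per_group_entry<F>(group_indices: &[usize], compare: F) -> HashMap<usize, usize>
where
    F: Fn(usize, usize) -> Ordering,
{
    let mut result = HashMap::with_capacity(group_indices.len());
    for (row, &group) in group_indices.iter().enumerate() {
        match result.entry(group) {
            // First row seen for this group.
            Entry::Vacant(slot) => {
                slot.insert(row);
            }
            // Replace only when the stored row sorts after the new one.
            Entry::Occupied(mut slot) => {
                if compare(*slot.get(), row) == Ordering::Greater {
                    slot.insert(row);
                }
            }
        }
    }
    result
}
```

Compared with a contains_key check followed by get and insert, the entry call locates the bucket once and reuses it for both the read and the write.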

Dandandan · 2025-03-21T08:44:26Z

datafusion/functions-aggregate/src/first_last.rs

+
+            if !result.contains_key(&group_idx)
+                || comparator
+                    .compare(*result.get(&group_idx).unwrap(), idx_in_val)


Does this really happen? idx_in_val is increasing monotonically using enumerate ?

Could you clarify what specific scenario you're referring to with "Does this really happen?" Are you concerned about:

a. idx_in_val decreasing (being smaller than a previous value) within the loop using enumerate()?

b. result[group_idx] increasing monotonically ?

For a, as I understand it, idx_in_val would only potentially decrease if group_indices.len() > usize::MAX.

For b, yes. In the fuzz test, result[group_idx] will increase.

I meant that idx_in_val is a strictly increasing number from enumerate, so it seems a previous idx is never greater than a new one. That case should then never be hit (so I would expect the code not to be there), or the check should always be false.

In some cases, comparator.compare(*result.get(&group_idx).unwrap(), idx_in_val) may return true

It compares the values of the wrapped columns at the given indices, and the array may not be sorted by the ORDER BY fields.

We can verify that by adding assert!(false); at L564. The fuzz test fails once it is added.
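
The point can be shown with a tiny example: row indices from enumerate always increase, but the comparison is on the ordering values at those indices. The helper below is hypothetical and uses a plain i64 slice as the ordering column.

```rust
use std::cmp::Ordering;

/// Returns the row indices at which a later row sorts before the best
/// row seen so far, i.e. where the "replace" branch is taken.
fn rows_that_replace_current_best(order_col: &[i64]) -> Vec<usize> {
    let mut replaced_at = Vec::new();
    let mut best = 0usize;
    for (idx_in_val, _) in order_col.iter().enumerate().skip(1) {
        // Compare the values at the two indices, not the indices themselves.
        if order_col[best].cmp(&order_col[idx_in_val]) == Ordering::Greater {
            best = idx_in_val;
            replaced_at.push(idx_in_val);
        }
    }
    replaced_at
}
```

For an ordering column of [3, 1, 2, 0] the branch is taken at rows 1 and 3, while for an already sorted column it is never taken.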

Dandandan · 2025-03-21T08:48:13Z

datafusion/functions-aggregate/src/first_last.rs

+        vals: &PrimitiveArray<T>,
+        is_set_arr: Option<&BooleanArray>,
+    ) -> Result<HashMap<usize, usize>> {
+        let mut result = HashMap::with_capacity(orderings.len()); // group_idx -> idx_in_orderings


I am wondering if we can remove the use of a hashmap here... It shouldn't be necessary to do it like this?

Are you concerned about performance overhead from using HashMap? We could also make this function return (group_idx_to_idx_in_orderings: Vec<usize>, mask: BooleanBufferBuilder) and check whether there's a performance improvement by running the benchmark SQL queries. It would return a BooleanBufferBuilder because this function contains filtering logic.

But if total_num_groups is large, group_idx_to_idx_in_orderings: Vec<usize> may consume a lot of memory.

Yes I think a large portion of current overhead comes from the use of HashMap.

Thinking about it more, I think probably the most efficient approach would be changing the accumulator state directly rather than going through hashmap creation.

After adding min_of_each_group_buf: (Vec<usize>, BooleanBufferBuilder) to FirstPrimitiveGroupsAccumulator, it runs slightly faster.

| benchmark sql | d63 | 44a |
|---|---|---|
| `select id2, id4, first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4;` | 7s | 6.83s |
| `select l_shipmode, first_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;` | 0.86s | 0.79s |
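
The buffer-based variant mentioned above might look roughly like the sketch below. It is an assumption-laden illustration: Vec<bool> replaces BooleanBufferBuilder, a single i64 key replaces the ordering columns, and `MinOfEachGroupBuf`, `fill`, and `winners` are hypothetical names.

```rust
/// Reusable scratch buffers indexed by group: `rows[g]` is the winning row
/// for group `g` and is meaningful only where `seen[g]` is true.
struct MinOfEachGroupBuf {
    rows: Vec<usize>,
    seen: Vec<bool>,
}

impl MinOfEachGroupBuf {
    fn new() -> Self {
        MinOfEachGroupBuf { rows: Vec::new(), seen: Vec::new() }
    }

    /// Recomputes the per-group winners for one batch, reusing allocations.
    fn fill(&mut self, group_indices: &[usize], order_keys: &[i64], total_num_groups: usize) {
        self.rows.clear();
        self.rows.resize(total_num_groups, 0);
        self.seen.clear();
        self.seen.resize(total_num_groups, false);
        for (row, &group) in group_indices.iter().enumerate() {
            if !self.seen[group] || order_keys[row] < order_keys[self.rows[group]] {
                self.seen[group] = true;
                self.rows[group] = row;
            }
        }
    }

    /// Lists (group, winning row) pairs for groups present in the last batch.
    fn winners(&self) -> Vec<(usize, usize)> {
        let mut out = Vec::new();
        for (group, &was_seen) in self.seen.iter().enumerate() {
            if was_seen {
                out.push((group, self.rows[group]));
            }
        }
        out
    }
}
```

Indexing by group avoids hashing entirely, at the cost of buffers sized by total_num_groups, which is the memory concern raised earlier in this thread.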

2010YOUY01 · 2025-03-25T03:38:06Z

I haven't been following the recent conversations regarding the hashmap optimization, but I also feel that if it needs pre-aggregation to make the low-cardinality case run faster, there might be some inefficiency inside the global state update implementation.
Since this PR already delivers a significant performance improvement, we can experiment with further optimization in follow-up PRs. I plan to merge it shortly unless there are other considerations.

…ccumulator` (apache#15266) * Improve speed of first_value by implementing special GroupsAccumulator * rename and other improvements * `append_n` -> `resize` * address comment * use HashMap::entry * remove hashMap in get_filtered_min_of_each_group

github-actions bot added core Core DataFusion crate functions Changes to functions implementation labels Mar 17, 2025

UBarney force-pushed the group_first_val branch from 9b0908b to 0e7bf1d Compare March 18, 2025 07:44

github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Mar 18, 2025

UBarney changed the title ~~Improve speed of first_value by implementing special GroupsAccumulator~~ Improve performance of first_value by implementing special GroupsAccumulator Mar 18, 2025

UBarney marked this pull request as ready for review March 18, 2025 08:21

Dandandan reviewed Mar 18, 2025

2010YOUY01 reviewed Mar 18, 2025

blaginin reviewed Mar 18, 2025

UBarney added 3 commits March 19, 2025 03:25

Improve speed of first_value by implementing special GroupsAccumulator

d633c29

rename and other improvements

94e44d5

append_n -> resize

56b11c4

UBarney force-pushed the group_first_val branch from f35d2da to 56b11c4 Compare March 19, 2025 03:25

2010YOUY01 reviewed Mar 19, 2025

address comment

297de26

2010YOUY01 approved these changes Mar 20, 2025

Merge remote-tracking branch 'upstream/main' into u_group_first_val

95b354d

Dandandan reviewed Mar 21, 2025

use HashMap::entry

699bcac

alamb added the performance Make DataFusion faster label Mar 21, 2025

remove hashMap in get_filtered_min_of_each_group

44a212c

UBarney requested a review from 2010YOUY01 March 25, 2025 02:28

2010YOUY01 merged commit 923bfb7 into apache:main Mar 26, 2025
27 checks passed

xudong963 mentioned this pull request Mar 26, 2025

Release DataFusion 47.0.0 (April 2025) #15072

Open

38 tasks

UBarney mentioned this pull request Apr 3, 2025

Improve performance of last_value by implementing special GroupsAccumulator #15542

Merged

andygrove mentioned this pull request Apr 8, 2025

Improve performance of dropDuplicates apache/datafusion-comet#1275

Open
