Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify array_has UDF to InList expr when haystack is constant #15354

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

davidhewitt
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This PR indirectly addresses #14533 not by actually changing the array_has evaluation but instead by simplifying it to the equivalent InList expression where the haystack is not varying per-row.

The array_has udf has to operate row-by-row because it may have a varying haystack. The InList expression, on the other hand, can operate in a columnar fashion by evaluating each of the N haystack items for equality against the needle and OR the results. It looks to me list InList also supports some kind of Set optimization.

What changes are included in this PR?

Add a simplify implementation to array_has UDF which will produce an InList expr when the haystack is a literal list.

Are these changes tested?

Yes, see test additions in the diff.

I also reran the original example from #14533 and we see that now the last two statements are now on equivalent performance as the others.

> CREATE TABLE test AS (SELECT substr(md5(i)::text, 1, 32) as haystack FROM generate_series(1, 100000) t(i));
0 row(s) fetched. 
Elapsed 0.027 seconds.

> SELECT * FROM test limit 1;
+----------------------------------+
| haystack                         |
+----------------------------------+
| 7f4b18de3cfeb9b4ac78c381ee2ad278 |
+----------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> SELECT count(*) FROM test WHERE haystack = '7f4b18de3cfeb9b4ac78c381ee2ad278';
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

> SELECT count(*) FROM test WHERE haystack IN ('7f4b18de3cfeb9b4ac78c381ee2ad278');
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

> SELECT count(*) FROM test WHERE haystack = ANY(['7f4b18de3cfeb9b4ac78c381ee2ad278']);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

> SELECT count(*) FROM test WHERE array_has(['7f4b18de3cfeb9b4ac78c381ee2ad278'], haystack);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row(s) fetched. 
Elapsed 0.002 seconds.

I would be happy to contribute a benchmark, but because this involves first simplifying the UDF expression this looked somewhat nontrivial and I'd welcome advice on where to place it.

Are there any user-facing changes?

Simplification results will change.

Comment on lines +157 to +158
// TODO: support LargeList / FixedSizeList?
// (not supported by `convert_array_to_scalar_vec`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Input wanted on whether this should be resolved (maybe by handling those data types in convert_aray_to_scalar_vec).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest doing the work in a follow on ticket / PR. So that would mean:

  1. Filing a ticket with a reproducer (perhaps modified from array_has UDF performance is slow for smaller number of needles #14533)
  2. leaving a TODO with a link to that ticket here

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @davidhewitt -- this looks great to me other than needing some end to end .slt tests

Ideally the tests would

  1. Run the queries in array_has UDF performance is slow for smaller number of needles #14533 (or point to where they are already run)
  2. Do an EXPLAIN ... that shows the rewrite happening

The rationale is to make sure that this code is hooked up correctly when running queries

It looks to me list InList also supports some kind of Set optimization.

Yes the inlist uses a hash table for larger element needles

Comment on lines +157 to +158
// TODO: support LargeList / FixedSizeList?
// (not supported by `convert_array_to_scalar_vec`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest doing the work in a follow on ticket / PR. So that would mean:

  1. Filing a ticket with a reproducer (perhaps modified from array_has UDF performance is slow for smaller number of needles #14533)
  2. leaving a TODO with a link to that ticket here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

array_has UDF performance is slow for smaller number of needles
2 participants