simplify `array_has` UDF to `InList` expr when haystack is constant #15354
+97
−1
Which issue does this PR close?
`array_has` UDF performance is slow for smaller number of needles #14533
Rationale for this change
This PR indirectly addresses #14533, not by changing how `array_has` is evaluated but by simplifying it to the equivalent `InList` expression when the haystack does not vary per row.
The `array_has` UDF has to operate row by row because it may have a varying haystack. The `InList` expression, on the other hand, can operate in a columnar fashion by evaluating each of the N haystack items for equality against the needle and ORing the results. It looks to me like `InList` also supports some kind of `Set` optimization.
What changes are included in this PR?
Add a `simplify` implementation to the `array_has` UDF, which will produce an `InList` expr when the haystack is a literal list.
Are these changes tested?
Yes, see test additions in the diff.
I also reran the original example from #14533, and the last two statements now perform on par with the others.
I would be happy to contribute a benchmark, but because this involves first simplifying the UDF expression this looked somewhat nontrivial and I'd welcome advice on where to place it.
Are there any user-facing changes?
Simplification results will change: an `array_has` call whose haystack is a literal list is now simplified to an `InList` expr.
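For readers who want a concrete picture of the rewrite, here is a minimal sketch in plain Rust. It is a toy model, not DataFusion code: the `Expr` enum, `simplify`, and `eval_in_list` below are hypothetical stand-ins invented for illustration, integers are the only supported type, and null handling is left out entirely. It shows the two ideas from the description: a literal-list haystack is rewritten to an IN-list, and the IN-list can then be evaluated with one equality pass over the needle column per list item, with the results ORed together.

```rust
// Toy model only: these types are hypothetical and are NOT DataFusion's `Expr` API.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A column reference, e.g. `x`.
    Column(String),
    /// A scalar integer literal.
    Literal(i64),
    /// A literal list such as `[1, 2, 3]`.
    ListLiteral(Vec<i64>),
    /// `array_has(haystack, needle)`.
    ArrayHas { haystack: Box<Expr>, needle: Box<Expr> },
    /// `expr IN (list...)`.
    InList { expr: Box<Expr>, list: Vec<Expr> },
}

/// Rewrite `array_has(<literal list>, needle)` to `needle IN (...)`.
/// Any other expression, including a non-literal haystack, is returned unchanged.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::ArrayHas { haystack, needle } => match *haystack {
            Expr::ListLiteral(items) => Expr::InList {
                expr: needle,
                list: items.into_iter().map(Expr::Literal).collect(),
            },
            other => Expr::ArrayHas { haystack: Box::new(other), needle },
        },
        other => other,
    }
}

/// Columnar evaluation of `needle IN (list)`: one equality pass over the whole
/// needle column per list item, OR-ed into the result mask.
fn eval_in_list(needles: &[i64], list: &[i64]) -> Vec<bool> {
    let mut mask = vec![false; needles.len()];
    for item in list {
        for (m, n) in mask.iter_mut().zip(needles) {
            *m |= n == item;
        }
    }
    mask
}
```

In this sketch, `array_has([1, 2, 3], x)` simplifies to `x IN (1, 2, 3)`, while a haystack that is itself a column is left as `array_has`, since it may differ from row to row.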