
Restore IN_LIST performance -- Implement specialized StaticFilters for different data types #18824

@alamb


In #18449 we made InList set handling generic for all types. However, I think we lost some type-specific specializations in the process, which may have slowed things down.

The idea is to improve IN_LIST performance by using specialized HashSets for the different data types, thus avoiding dynamic dispatch over the value type.

In #18449 we implemented such a specialization for Int32, but we should probably do it for all the types that previously had a specialization:

  1. All primitive types (Int8, Int32, etc)
  2. Boolean
  3. Utf8/LargeUtf8/Utf8View
  4. Binary/LargeBinary/BinaryView
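To make the idea concrete, here is a minimal sketch in plain Rust using only the standard library. All names in it (`Column`, `Literals`, `SetFilter`, `Int32Filter`, `Utf8Filter`, `build_filter`) are hypothetical and are not DataFusion or Arrow APIs; real code would operate on Arrow arrays rather than `Vec`s. It also ignores NULLs inside the IN list. The point is only that the concrete set type is chosen once when the filter is built, so evaluating a batch costs one virtual call followed by a tight loop over a typed `HashSet`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an Arrow array: one variant per data type.
pub enum Column {
    Int32(Vec<Option<i32>>),
    Utf8(Vec<Option<String>>),
}

/// Hypothetical stand-in for the literal values of the IN list.
pub enum Literals {
    Int32(Vec<i32>),
    Utf8(Vec<String>),
}

/// Hypothetical filter interface: called once per batch, not once per value.
pub trait SetFilter {
    /// Returns `Some(true/false)` per row, or `None` when the input row is NULL.
    fn contains(&self, column: &Column) -> Vec<Option<bool>>;
}

/// Specialization for Int32: the set stores plain `i32`s.
pub struct Int32Filter {
    set: HashSet<i32>,
}

impl SetFilter for Int32Filter {
    fn contains(&self, column: &Column) -> Vec<Option<bool>> {
        match column {
            // The per-row loop is monomorphic: no dispatch inside it.
            Column::Int32(values) => values
                .iter()
                .map(|v| v.map(|x| self.set.contains(&x)))
                .collect(),
            _ => panic!("Int32Filter applied to a non-Int32 column"),
        }
    }
}

/// Specialization for Utf8: lookups borrow `&str`, so no per-row allocation.
pub struct Utf8Filter {
    set: HashSet<String>,
}

impl SetFilter for Utf8Filter {
    fn contains(&self, column: &Column) -> Vec<Option<bool>> {
        match column {
            Column::Utf8(values) => values
                .iter()
                .map(|v| v.as_deref().map(|s| self.set.contains(s)))
                .collect(),
            _ => panic!("Utf8Filter applied to a non-Utf8 column"),
        }
    }
}

/// Picks the specialized filter once, when the expression is built.
pub fn build_filter(list: Literals) -> Box<dyn SetFilter> {
    match list {
        Literals::Int32(v) => Box::new(Int32Filter { set: v.into_iter().collect() }),
        Literals::Utf8(v) => Box::new(Utf8Filter { set: v.into_iter().collect() }),
    }
}
```

A caller would build the filter once with `build_filter(Literals::Int32(vec![1, 3, 5]))` and then call `contains` on each incoming batch.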

As @adriangb says:

I'm surprised that doing dynamic dispatch once per batch we evaluate as opposed to twice per batch we evaluate makes that much of a difference. What would make sense that makes a difference to me is doing it once per element vs. once per batch. But I guess that's what benchmarks say!

That does leave me with a question... could we squeeze out even more performance if we specialize for ~ all scalar types? It wouldn't be that hard to write a macro and have AI do the copy pasta of implementing it for all of the types... I'll open a follow up ticket.

Originally posted by @adriangb in #18449 (comment)
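The macro suggested in the quote could look roughly like the sketch below. This is an assumption about one possible shape, not the code from #18449: `primitive_set_filter!` and the generated type names are hypothetical, and slices stand in for Arrow arrays. Floating-point types are left out because `f32` and `f64` do not implement `Eq` or `Hash`, so they would need separate handling.

```rust
use std::collections::HashSet;

/// Hypothetical macro: generates one concrete filter type per primitive type,
/// each backed by a `HashSet` of that exact type.
macro_rules! primitive_set_filter {
    ($($name:ident => $ty:ty),* $(,)?) => {
        $(
            pub struct $name {
                set: HashSet<$ty>,
            }

            impl $name {
                /// Builds the set once from the IN list literals.
                pub fn new(values: impl IntoIterator<Item = $ty>) -> Self {
                    Self { set: values.into_iter().collect() }
                }

                /// Probes every value of a batch against the typed set.
                pub fn contains_all(&self, batch: &[$ty]) -> Vec<bool> {
                    batch.iter().map(|v| self.set.contains(v)).collect()
                }
            }
        )*
    };
}

// One line per type instead of one hand-written impl per type.
primitive_set_filter! {
    Int8SetFilter => i8,
    Int16SetFilter => i16,
    Int32SetFilter => i32,
    Int64SetFilter => i64,
    UInt8SetFilter => u8,
    UInt16SetFilter => u16,
    UInt32SetFilter => u32,
    UInt64SetFilter => u64,
}
```

Each generated type is fully monomorphic, so adding another integer type is a one-line change to the macro invocation.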


Labels: help wanted, performance, regression
