Partially implement MATCH_RECOGNIZE for Advanced Pattern Matching #16685
Which issue does this PR close?
MATCH_RECOGNIZE for Advanced Pattern Matching #13583.

What is this?
Proof of Concept PR of MATCH_RECOGNIZE.
The full MATCH_RECOGNIZE spec is pretty complex, so for an initial implementation I limited it to the case where the DEFINE clause is not aware of the current match attempt boundaries (the same restriction as the current Snowflake implementation, according to https://docs.snowflake.com/en/sql-reference/constructs/match_recognize). This means defined symbols are independent of each other (no `DEFINE B AS B.PRICE > LAST(A.price)`), and window functions in DEFINE need an explicit window frame through OVER.

This restriction allows processing each clause sequentially:
- `DEFINE` is converted to a Projection of each symbol's expression onto a virtual Boolean column `__mr_symbol_<symbol>` (possibly preceded by a Window node).
- `PATTERN` is converted to a custom `MatchRecognizePattern` node, which compiles the pattern to an NFA and runs it on the Boolean `__mr_symbol_<symbol>` columns. It also emits metadata columns (`__mr_match_number`, `__mr_match_sequence_number`, `__mr_classifier`, ...) which are used by the MEASURES clause.
- `MEASURES` is converted to Window functions operating on the rows emitted by the previous `PATTERN` clause and leveraging the metadata columns (in particular, partitioned by the `__mr_match_number` column to run measures independently on each match). For the `ONE ROW PER MATCH` option, only the last row of each partition is kept (ideally this should be done by running the `MEASURES` as Aggregations; I just took a POC shortcut here).

How to review?
- The test files in `datafusion/sqllogictest/test_files/match_recognize` cover extensive `MATCH_RECOGNIZE` cases, both as `EXPLAIN` and actual queries. They should be a good starting point to understand what is supported and how the plan is built. They have not been validated, so the actual queries definitely contain critical correctness bugs.
- The conversion of the `MATCH_RECOGNIZE` AST into a Logical Plan, which happens in `datafusion/sql/src/relation/mod.rs`. This is the most important flow to review as it will condition all future development of `MATCH_RECOGNIZE`.
- `pattern_matcher.rs`. It's meant as a generic NFA operating on the virtual `__mr_symbol_<symbol>` columns. Large parts of this file are AI generated and will need to be entirely rewritten for production quality code, so don't look too much into detail here!
- The `MATCH_RECOGNIZE` functions PREV/NEXT/FIRST/LAST were AI generated as well, with a prompt similar to "write PREV/NEXT/FIRST/LAST following the existing Window functions structure". The goal was just to get something more or less working to hook into queries: don't review in depth either!

MATCH_RECOGNIZE in DataFusion
A walk-through of the current draft implementation
This note explains how a `MATCH_RECOGNIZE` statement is translated from SQL into DataFusion's logical and physical plans.

The supported features are based off the current Snowflake documentation, and limited by the `sqlparser-rs` crate's support of MATCH_RECOGNIZE (no FINAL/RUNNING for MEASURES, EXCLUDE and PERMUTE of symbols only).

1. High-level flow
1. SQL parsing: the SQL module recognises the new grammar (table factor `MATCH_RECOGNIZE (…)`) and produces a `TableFactor::MatchRecognize` AST node.
2. Logical planning: the SQL planner turns the AST into a hierarchy of logical plan nodes. The clauses `DEFINE`, `PATTERN`, `MEASURES`, `ROWS PER MATCH` and `AFTER MATCH` are expressed with ordinary `Projection`, `WindowAgg` and `Filter` nodes plus one `LogicalPlan::MatchRecognizePattern` node that carries the compiled pattern.
3. Physical planning: the core planner detects the new logical node and produces a `MatchRecognizePatternExec`. All remaining operators (projection, window, filter, repartition, …) are produced exactly the same way as for "ordinary" SQL.
4. Execution: `MatchRecognizePatternExec` implements pattern matching at runtime, augments every output record batch with five metadata columns and yields the augmented stream. The projections, filters and windows above it in the plan consume those virtual columns.

The rest of this document focuses on step 2 – how the planner constructs the logical plan.
2. SQL planner extensions
2.1 New planner context
`PlannerContext` now contains an optional `MatchRecognizeContext`. When the planner descends into a `MATCH_RECOGNIZE` clause it enables the context to enforce the special scoping rules for `symbol.column`, e.g. `A.price`. The context also exposes the `PARTITION BY`, `ORDER BY` and `ROWS PER MATCH` clauses so that helper functions can derive default window frames or adjust partitioning.

2.2 Handling `DEFINE`
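For every symbol reference found in the pattern the planner must be able to supply a predicate expression:
- a symbol that has a `DEFINE` entry uses its planned `Expr`;
- a symbol without one defaults to `TRUE`.

The result is a list `defines : Vec<(Expr, String)>`, each entry carrying the predicate as well as the symbol name.

The planner then inserts a projection immediately above the input, producing one Boolean `__mr_symbol_<symbol>` column per entry.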
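
As a rough illustration of what this projection computes, the sketch below evaluates one Boolean column per symbol over a handful of rows. It is not DataFusion code: `Row`, `Predicate`, `resolve_defines`, `project_symbols` and the example predicate are hypothetical stand-ins (the real planner works with `Expr` and Arrow batches), and the `TRUE` default for undefined symbols follows the rule described above.

```rust
/// Hypothetical input row; a real plan works on Arrow record batches.
struct Row {
    price: f64,
}

/// Stand-in for a planned DataFusion `Expr` that evaluates to a Boolean.
type Predicate = fn(&Row) -> bool;

/// Predicate used for symbols that appear in PATTERN but not in DEFINE.
fn always_true(_row: &Row) -> bool {
    true
}

/// Example DEFINE predicate, standing in for `B AS price > 100`.
fn price_above_100(row: &Row) -> bool {
    row.price > 100.0
}

/// Name of the virtual Boolean column generated for a symbol.
fn symbol_column(symbol: &str) -> String {
    format!("__mr_symbol_{}", symbol)
}

/// Pair every pattern symbol with a predicate, defaulting to TRUE.
fn resolve_defines(
    pattern_symbols: &[&str],
    defined: &[(Predicate, &str)],
) -> Vec<(Predicate, String)> {
    pattern_symbols
        .iter()
        .map(|symbol| {
            let predicate = defined
                .iter()
                .find(|(_, name)| name == symbol)
                .map(|(p, _)| *p)
                .unwrap_or(always_true as Predicate);
            (predicate, symbol.to_string())
        })
        .collect()
}

/// Evaluate the projection: one Boolean column per symbol, one value per row.
fn project_symbols(rows: &[Row], defines: &[(Predicate, String)]) -> Vec<(String, Vec<bool>)> {
    defines
        .iter()
        .map(|(predicate, symbol)| {
            let values: Vec<bool> = rows.iter().map(|row| (*predicate)(row)).collect();
            (symbol_column(symbol), values)
        })
        .collect()
}
```

Calling `resolve_defines(&["A", "B"], &[(price_above_100 as Predicate, "B")])` and passing the result to `project_symbols` yields two columns, `__mr_symbol_A` (all true, because `A` has no `DEFINE` entry) and `__mr_symbol_B`.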
Those columns serve one single purpose: they are consumed by the pattern matcher at execution time.
If a `DEFINE` expression itself contains window functions, the planner first inserts a window node underneath this projection and rebases the expressions onto it, so the overall shape becomes the symbol projection on top of a window node on top of the input.

2.3 Handling `PATTERN`
`PATTERN` is compiled into a nested value of the enum `datafusion_expr::match_recognize::Pattern` (symbol, concatenation, alternation, repetition, …).

The planner then creates a dedicated logical node, `LogicalPlan::MatchRecognizePattern`.
The node itself is purely declarative – it only describes the pattern; the projection added earlier already made all predicates available.
2.4 Handling `MEASURES`
`MEASURES` is conceptually just another projection applied after pattern detection.

Each measure expression is individually planned through `sql_to_expr_with_match_recognize_measures_context`. That function
- enables the special context so that `A.price` is valid,
- implicitly appends hidden columns expected by specialised functions (`FIRST`, `LAST`, `PREV`, `NEXT`, `CLASSIFIER`, …) and
- asks every registered `ExprPlanner` to post-process the expression.

The default planners turn symbol predicates into the dedicated window UDF calls (`mr_first`, `mr_prev`, `classifier`, …) and rewrite aggregate functions.

If at least one measure contains a window function, another window node is pushed below the final projection (including a sort and repartition identical to ordinary SQL).
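
To make the "measures as window functions partitioned by match" idea concrete, here is a minimal sketch that computes the first and last price of each match and broadcasts the result to every row of that match. The function name and the plain slices are assumptions for illustration; the real plan uses DataFusion window nodes, and the frame semantics here (the whole match) are a simplification.

```rust
use std::collections::HashMap;

/// For each row, return the first and last `price` of the match it belongs to.
/// `match_numbers[i]` plays the role of `__mr_match_number` for row `i`;
/// rows are assumed to arrive already sorted by the query's ORDER BY.
fn first_last_per_match(match_numbers: &[u64], prices: &[f64]) -> Vec<(f64, f64)> {
    assert_eq!(match_numbers.len(), prices.len());

    // First pass: record the first and last price seen in every partition.
    let mut bounds: HashMap<u64, (f64, f64)> = HashMap::new();
    for (m, p) in match_numbers.iter().zip(prices) {
        bounds
            .entry(*m)
            .and_modify(|b| b.1 = *p) // later rows overwrite "last"
            .or_insert((*p, *p)); // the first row sets both ends
    }

    // Second pass: broadcast the partition result to each row,
    // the way a window function returns one value per input row.
    match_numbers.iter().map(|m| bounds[m]).collect()
}
```

Because every row of a match carries the same value, keeping only the last row of each match afterwards gives the `ONE ROW PER MATCH` result.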
Finally the planner calls the `rows_filter` and `rows_projection` helpers to apply the semantics of `ROWS PER MATCH`:
- `ONE ROW` → filter on `__mr_is_last_match_row`
- `ALL ROWS SHOW` → filter on `__mr_is_included_row`
- `ALL ROWS OMIT EMPTY` → `__mr_is_included_row` and classifier ≠ `'(empty)'`
- `WITH UNMATCHED` → no additional filter

and to choose the projection list (last-row only or all input columns).
The complete logical plan therefore has the following skeleton (greatly simplified):
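
One possible reading of that skeleton, assembled from sections 2.2 to 2.4, is sketched below as a top-down list of operators with a small helper that indents it like an `EXPLAIN` plan. The operator labels and their exact order are an interpretation of the text above, not output copied from DataFusion; the optional nodes appear only when the query needs them.

```rust
/// Operators from the root of the plan down to the input (illustrative labels).
const SKELETON: [&str; 7] = [
    "Projection: measures and ROWS PER MATCH projection list",
    "Filter: ROWS PER MATCH predicate (when required)",
    "WindowAgg: window functions used by MEASURES (when required)",
    "MatchRecognizePattern: compiled PATTERN",
    "Projection: __mr_symbol_<symbol> columns from DEFINE",
    "WindowAgg: window functions used by DEFINE (when required)",
    "Input: the relation MATCH_RECOGNIZE is applied to",
];

/// Render a single-child operator chain with one indentation level per depth.
fn render_plan(operators: &[&str]) -> String {
    operators
        .iter()
        .enumerate()
        .map(|(depth, op)| format!("{}{}\n", "  ".repeat(depth), op))
        .collect()
}
```

`render_plan(&SKELETON)` returns the seven operators one per line, each indented two spaces deeper than its parent.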
2.5 Virtual columns
The pattern executor generates five metadata columns:

| Column | Type |
| --- | --- |
| `__mr_classifier` | `Utf8` (may hold `'(empty)'`) |
| `__mr_match_number` | `UInt64` |
| `__mr_match_sequence_number` | `UInt64` |
| `__mr_is_last_match_row` | `Boolean` |
| `__mr_is_included_row` | `Boolean` |

They are appended to the schema in `pattern_schema()` and used directly by filters, partitioning and measures.
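
A minimal sketch of that schema extension, using plain Rust types instead of Arrow's `Schema` and `Field`. The `DataType` and `Field` definitions and the function names below are simplified stand-ins chosen for this example; only the five column names and types come from the table above.

```rust
/// Simplified stand-in for Arrow data types (only what this example needs).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    UInt64,
    Boolean,
    Float64,
}

/// Simplified stand-in for an Arrow field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// The five metadata columns generated by the pattern executor.
fn metadata_fields() -> Vec<Field> {
    vec![
        field("__mr_classifier", DataType::Utf8),
        field("__mr_match_number", DataType::UInt64),
        field("__mr_match_sequence_number", DataType::UInt64),
        field("__mr_is_last_match_row", DataType::Boolean),
        field("__mr_is_included_row", DataType::Boolean),
    ]
}

/// Output schema of the pattern node: input fields followed by the metadata.
fn pattern_schema_sketch(input: &[Field]) -> Vec<Field> {
    let mut fields = input.to_vec();
    fields.extend(metadata_fields());
    fields
}
```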
3. Physical planning and execution
The core planner recognises `LogicalPlan::MatchRecognizePattern` and instantiates `MatchRecognizePatternExec`.

`MatchRecognizePatternExec`
- carries the `Pattern`, `partition_by` and `order_by`, which play the same role as in `WindowAggExec`,
- implements `execute()` by running a pattern matcher (`PatternMatcher`) which scans the partition once, emits matches and populates the metadata columns,
- takes `AFTER MATCH SKIP` and `ROWS PER MATCH` into account.

All projections / window aggregates / filters produced earlier continue to behave exactly as they do for ordinary queries.
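
To give a feel for what the matcher does with the Boolean symbol columns, here is a deliberately naive sketch. It uses backtracking enumeration rather than a compiled NFA, assumes greedy quantifiers and `AFTER MATCH SKIP PAST LAST ROW`, ignores empty matches, and emits metadata only for matched rows. All types and function names are hypothetical stand-ins, not the ones in `pattern_matcher.rs`.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `datafusion_expr::match_recognize::Pattern`.
enum Pattern {
    Symbol(String),
    Concat(Vec<Pattern>),
    Alt(Vec<Pattern>),
    Repeat { inner: Box<Pattern>, min: usize, max: Option<usize> },
}

/// Boolean `__mr_symbol_<symbol>` columns of one partition, keyed by symbol.
type SymbolColumns = HashMap<String, Vec<bool>>;

/// One candidate match: exclusive end row plus the symbol matched by each row.
type Candidate = (usize, Vec<String>);

/// Metadata emitted for a matched row (a subset of the five columns).
#[derive(Debug, PartialEq)]
struct MatchedRow {
    row: usize,
    classifier: String,
    match_number: u64,
    match_sequence_number: u64,
    is_last_match_row: bool,
}

/// Every way `p` can match starting at `start`, preferred candidate first.
fn candidates(p: &Pattern, cols: &SymbolColumns, start: usize) -> Vec<Candidate> {
    match p {
        Pattern::Symbol(s) => {
            let hit = cols.get(s).and_then(|c| c.get(start)).copied().unwrap_or(false);
            if hit { vec![(start + 1, vec![s.clone()])] } else { Vec::new() }
        }
        Pattern::Concat(parts) => {
            let mut acc: Vec<Candidate> = vec![(start, Vec::new())];
            for part in parts {
                let mut next = Vec::new();
                for (pos, labels) in &acc {
                    for (end, more) in candidates(part, cols, *pos) {
                        let mut all = labels.clone();
                        all.extend(more);
                        next.push((end, all));
                    }
                }
                acc = next;
            }
            acc
        }
        Pattern::Alt(options) => {
            options.iter().flat_map(|o| candidates(o, cols, start)).collect()
        }
        Pattern::Repeat { inner, min, max } => repeat(inner, cols, start, 0, *min, *max),
    }
}

/// Greedy repetition: try one more iteration before accepting the current count.
fn repeat(
    inner: &Pattern, cols: &SymbolColumns, pos: usize,
    count: usize, min: usize, max: Option<usize>,
) -> Vec<Candidate> {
    let mut out = Vec::new();
    if max.map_or(true, |m| count < m) {
        for (end, labels) in candidates(inner, cols, pos) {
            if end == pos {
                continue; // skip empty iterations so the recursion always advances
            }
            for (final_end, rest) in repeat(inner, cols, end, count + 1, min, max) {
                let mut all = labels.clone();
                all.extend(rest);
                out.push((final_end, all));
            }
        }
    }
    if count >= min {
        out.push((pos, Vec::new()));
    }
    out
}

/// Scan one partition once, left to right, and emit metadata for matched rows.
fn scan(p: &Pattern, cols: &SymbolColumns, num_rows: usize) -> Vec<MatchedRow> {
    let mut out = Vec::new();
    let mut match_number = 0u64;
    let mut start = 0;
    while start < num_rows {
        match candidates(p, cols, start).into_iter().find(|(end, _)| *end > start) {
            Some((end, labels)) => {
                match_number += 1;
                for (i, classifier) in labels.into_iter().enumerate() {
                    out.push(MatchedRow {
                        row: start + i,
                        classifier,
                        match_number,
                        match_sequence_number: i as u64 + 1,
                        is_last_match_row: start + i + 1 == end,
                    });
                }
                start = end; // resume after the last row of the match
            }
            None => start += 1,
        }
    }
    out
}
```

For `PATTERN (A B+)` with `A` true on rows 0 and 3 and `B` true on rows 1 and 2, `scan` reports a single match covering rows 0 to 2; row 3 starts no match because no `B` follows it. Enumerating every candidate is exponential in the worst case, which is one reason a production matcher compiles the pattern to an automaton instead.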
4. Summary
`MATCH_RECOGNIZE` is implemented entirely as a normal combination of projections, filters and window aggregates plus one dedicated pattern-matching node.
- `DEFINE` ⇒ boolean columns (`__mr_symbol_*`)
- `PATTERN` ⇒ `MatchRecognizePattern` node
- `MEASURES` ⇒ projection of window / aggregate functions over metadata

Everything above the pattern node reuses DataFusion's existing machinery; physical execution differs only in the single custom executor that performs the row-wise NFA scan.