
Partially implement MATCH_RECOGNIZE for Advanced Pattern Matching #16685


Draft – wants to merge 5 commits into base: main
Conversation

@geoffreyclaude (Contributor) commented Jul 4, 2025

Which issue does this PR close?

What is this?

Proof of Concept PR of MATCH_RECOGNIZE.

The full MATCH_RECOGNIZE specification is quite complex, so for an initial implementation I limited it to the case where the DEFINE clause is not aware of the current match attempt's boundaries (the same restriction as the current Snowflake implementation, per https://docs.snowflake.com/en/sql-reference/constructs/match_recognize). This means defined symbols are independent of each other (no DEFINE B AS B.price > LAST(A.price)), and window functions in DEFINE need an explicit window frame through OVER.

This restriction allows processing each clause sequentially:

  • DEFINE is converted to a Projection of each symbol's expression onto a virtual Boolean column __mr_symbol_<symbol> (with a Window node inserted beneath it when needed)
  • PATTERN is converted to a custom MatchRecognizePattern node, which compiles the pattern to an NFA and runs it on the Boolean __mr_symbol_<symbol> columns. It also emits metadata columns (__mr_match_number, __mr_match_sequence_number, __mr_classifier, …) which are used by the MEASURES clause
  • MEASURES is converted to Window functions operating on the rows emitted by the preceding PATTERN clause, leveraging the metadata columns (in particular, partitioned by the __mr_match_number column to run measures independently for each match). For the ONE ROW PER MATCH option, only the last row of each partition is kept (ideally this should be done by running the MEASURES as Aggregations; I just took a POC shortcut here.)

How to review?

  • The sqllogictest files in datafusion/sqllogictest/test_files/match_recognize cover extensive MATCH_RECOGNIZE cases, both as EXPLAIN and as actual queries. They are a good starting point to understand what is supported and how the plan is built. The expected results have not been validated, so the actual queries almost certainly contain critical correctness bugs.
  • The core change is the translation of the MATCH_RECOGNIZE AST into a Logical Plan, which happens in datafusion/sql/src/relation/mod.rs. This is the most important flow to review, as it will shape all future development of MATCH_RECOGNIZE.
  • The actual pattern matching happens in pattern_matcher.rs. It is meant as a generic NFA operating on the virtual __mr_symbol_<symbol> columns. Large parts of this file are AI generated and will need to be entirely rewritten for production-quality code, so don't look too closely at the details here!
  • Similarly, the MATCH_RECOGNIZE-specific functions PREV/NEXT/FIRST/LAST were AI generated as well, with a prompt along the lines of "write PREV/NEXT/FIRST/LAST following the existing Window functions structure". The goal was just to get something more or less working to hook into queries; don't review these in depth either!
  • The main question a preliminary review should answer is whether the generated Logical and Execution Plans make sense, leveraging existing DataFusion nodes, or whether this would be better done almost from scratch inside a single large node.

MATCH_RECOGNIZE in DataFusion

A walk-through of the current draft implementation

⚠️ WARNING – Experimental Feature
This implementation is not production ready. Large portions of the code – especially pattern_matcher.rs and functions-window/src/match_recognize.rs – were vibe-coded with minimal review.

Expect missing edge-case handling, correctness issues, and performance problems.

This document itself is AI-generated and may contain inaccuracies or omissions. Use it only as a starting point; always verify details against the source code before relying on them.

This note explains how a MATCH_RECOGNIZE statement is translated from SQL into DataFusion's logical and physical plans.

The supported features are based on the current Snowflake documentation, and limited by the sqlparser-rs crate's support of MATCH_RECOGNIZE (no FINAL/RUNNING for MEASURES; EXCLUDE and PERMUTE apply to symbols only).

⚠️ Symbol predicates inside the DEFINE clause are not yet supported.


1. High-level flow

  1. SQL parsing – the SQL module recognises the new grammar (table factor MATCH_RECOGNIZE (…)) and produces a TableFactor::MatchRecognize AST node.

  2. Logical planning – the SQL planner turns the AST into a hierarchy of logical plan nodes:

    • normalisation of DEFINE, PATTERN, MEASURES, ROWS PER MATCH, AFTER MATCH
    • explicit Projection, WindowAgg, Filter
    • a new LogicalPlan::MatchRecognizePattern node that carries the compiled pattern.
  3. Physical planning – the core planner detects the new logical node and produces a MatchRecognizePatternExec.
    All remaining operators (projection, window, filter, repartition, …) are produced exactly the same way as for "ordinary" SQL.

  4. Execution – MatchRecognizePatternExec implements pattern matching at runtime, augments every output record batch with five metadata columns and yields the augmented stream.
    The projections / filters / windows above it consume those virtual columns.

The rest of this document focuses on step 2 – how the planner constructs the logical plan.


2. SQL planner extensions

2.1 New planner context

PlannerContext now contains an optional MatchRecognizeContext.
When the planner descends into a MATCH_RECOGNIZE clause it enables the context to enforce the special scoping rules for

  • qualified identifiers – they must be of the form symbol.column, e.g. A.price;
  • window and aggregate functions – extra implicit arguments are added (see below).

The context also exposes the PARTITION BY, ORDER BY and ROWS PER MATCH clauses so that helper functions can derive default window frames or adjust partitioning.

2.2 Handling DEFINE

For every symbol reference found in the pattern the planner must be able to supply a predicate expression:

DEFINE
    A AS price < 50,
    B AS price > 60

  • All symbols that appear in the pattern are collected first.
  • If a symbol has an explicit definition, that definition is planned into a regular Expr.
  • If it does not have a definition, the planner synthesises the constant expression TRUE.
  • All predicates are gathered in defines: Vec<(Expr, String)>, each carrying the predicate as well as the symbol name.

The planner then inserts a projection immediately above the input:

… → Projection {
        // unchanged input columns
        company,
        price_date,
        price,
        // one boolean column per symbol
        predicate_for_A AS __mr_symbol_A,
        predicate_for_B AS __mr_symbol_B,
        …
    }
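Concretely, this projection just evaluates each symbol's predicate against every row. A minimal sketch of the semantics, using a hypothetical `Row` type and the example predicates above (the real planner builds `Expr` trees evaluated over Arrow arrays, not plain Rust closures):

```rust
/// A row of the input relation (only the field the example predicates need).
struct Row {
    price: f64,
}

/// Evaluate the DEFINE predicates for every row, yielding the virtual
/// `__mr_symbol_A` and `__mr_symbol_B` boolean columns side by side.
fn project_symbols(rows: &[Row]) -> (Vec<bool>, Vec<bool>) {
    // A AS price < 50
    let a: Vec<bool> = rows.iter().map(|r| r.price < 50.0).collect();
    // B AS price > 60
    let b: Vec<bool> = rows.iter().map(|r| r.price > 60.0).collect();
    (a, b)
}
```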

Those columns serve a single purpose: they are consumed by the pattern matcher at execution time.

If a DEFINE expression contains window functions itself the planner inserts a window node underneath this projection first; after rebasing the expressions the overall shape becomes:

Projection(add __mr_symbol_…)
  Window
    Input

2.3 Handling PATTERN

PATTERN is compiled into a nested value of the enum
datafusion_expr::match_recognize::Pattern (symbol, concatenation, alternation, repetition, …).

The planner then creates a dedicated logical node

LogicalPlan::MatchRecognizePattern {
    input:      Projection(with __mr_symbol_…)
    partition_by: […]
    order_by: […]
    pattern:    Pattern     // compiled tree
    after_skip: Option<…>
    rows_per_match: Option<…>
    symbols:    Vec<String> // ["A","B",…] in declaration order
}

The node itself is purely declarative – it only describes the pattern; the projection added earlier already made all predicates available.
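To illustrate what "compiled tree" means here, below is a minimal sketch of such a pattern tree and a naive matcher over the boolean symbol columns. The real enum in datafusion_expr::match_recognize and the NFA in pattern_matcher.rs differ in detail; this is only meant to convey the shape of the problem:

```rust
/// Illustrative pattern tree (not the actual DataFusion enum).
enum Pattern {
    Symbol(usize),        // index into the __mr_symbol_* columns
    Concat(Vec<Pattern>), // A B
    Alt(Vec<Pattern>),    // A | B
    Star(Box<Pattern>),   // A*
}

/// All row positions a match of `p` can end at (exclusive), starting at
/// row `start`, given one boolean column per symbol.
fn ends(p: &Pattern, cols: &[Vec<bool>], start: usize) -> Vec<usize> {
    match p {
        Pattern::Symbol(s) => {
            if start < cols[*s].len() && cols[*s][start] {
                vec![start + 1]
            } else {
                vec![]
            }
        }
        Pattern::Concat(ps) => {
            // Thread the set of reachable positions through each sub-pattern.
            let mut frontier = vec![start];
            for sub in ps {
                let mut next = Vec::new();
                for f in frontier {
                    next.extend(ends(sub, cols, f));
                }
                next.sort();
                next.dedup();
                frontier = next;
            }
            frontier
        }
        Pattern::Alt(ps) => {
            let mut out: Vec<usize> = ps.iter().flat_map(|s| ends(s, cols, start)).collect();
            out.sort();
            out.dedup();
            out
        }
        Pattern::Star(sub) => {
            // Zero or more repetitions: expand until no new position appears.
            let mut out = vec![start];
            let mut frontier = vec![start];
            loop {
                let mut next = Vec::new();
                for f in &frontier {
                    next.extend(ends(sub, cols, *f));
                }
                next.sort();
                next.dedup();
                next.retain(|e| !out.contains(e));
                if next.is_empty() {
                    break;
                }
                out.extend(next.iter().copied());
                frontier = next;
            }
            out.sort();
            out
        }
    }
}
```

For example, matching `A* B` against columns where A holds on rows 0–1 and B on row 2 yields a single match ending after row 2.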

2.4 Handling MEASURES

MEASURES is conceptually just another projection applied after pattern detection.

  1. Each measure expression is individually planned through
    sql_to_expr_with_match_recognize_measures_context.
    That function

    • enables the special context so that A.price is valid,

    • implicitly appends hidden columns expected by specialised functions
      (FIRST, LAST, PREV, NEXT, CLASSIFIER, …) and

    • asks every registered ExprPlanner to post-process the expression.
      The default planners turn symbol predicates into the dedicated
      window UDF calls (mr_first, mr_prev, classifier, …) and rewrite
      aggregate functions such as

      COUNT(A.*)  ->  COUNT( CASE WHEN __mr_classifier = 'A' THEN 1 END )
      SUM(A.price) -> SUM( CASE WHEN __mr_classifier = 'A' THEN A.price END )
      
  2. If at least one measure contains a window function, another window node is pushed below the final projection (including a sort & repartition identical to ordinary SQL).

  3. Finally the planner calls rows_filter and rows_projection helpers to apply the semantics of ROWS PER MATCH:

    • default (ONE ROW) → filter on __mr_is_last_match_row
    • ALL ROWS SHOW → filter on __mr_is_included_row
    • ALL ROWS OMIT EMPTY → filter on __mr_is_included_row and classifier ≠ '(empty)'
    • WITH UNMATCHED → no additional filter

    and to choose the projection list (last-row only or all input columns).
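To make the classifier-based aggregate rewrite from step 1 concrete, here is a small sketch of the semantics of COUNT(A.*) and SUM(A.price) after rewriting: each aggregate simply filters rows by the classifier column. The helper names are hypothetical, and the sketch uses plain Rust values rather than DataFusion Exprs:

```rust
/// COUNT(A.*) after rewrite: count rows whose __mr_classifier equals the symbol,
/// mirroring COUNT(CASE WHEN __mr_classifier = 'A' THEN 1 END).
fn count_symbol(classifier: &[&str], symbol: &str) -> usize {
    classifier.iter().filter(|c| **c == symbol).count()
}

/// SUM(A.price) after rewrite: sum `price` only on rows classified as the symbol,
/// mirroring SUM(CASE WHEN __mr_classifier = 'A' THEN price END).
fn sum_symbol(classifier: &[&str], price: &[f64], symbol: &str) -> f64 {
    classifier
        .iter()
        .zip(price)
        .filter(|(c, _)| **c == symbol)
        .map(|(_, p)| *p)
        .sum()
}
```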

The complete logical plan therefore has the following skeleton (greatly simplified):

01)Filter <- ROWS PER MATCH filter
02)--Projection <- MEASURES projections
03)----WindowAggr <- MEASURES Window functions
04)------MatchRecognizePattern <- Runs PATTERN on symbols, emits metadata virtual columns
05)--------Projection <- DEFINE symbols projected to virtual columns `__mr_symbol_<…>`
06)----------WindowAggr <- DEFINE Window functions
07)------------TableScan <- Input

2.5 Virtual columns

The pattern executor generates five metadata columns:

| Name | Type | Meaning |
|------|------|---------|
| __mr_classifier | Utf8 | symbol the current row matched (or '(empty)') |
| __mr_match_number | UInt64 | running match counter, starts at 1 |
| __mr_match_sequence_number | UInt64 | position inside current match, starts at 1 |
| __mr_is_last_match_row | Boolean | true on the final row of every match |
| __mr_is_included_row | Boolean | true if row is not excluded |

They are appended to the schema in pattern_schema() and used directly
by filters, partitioning and measures.
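The derivation of the per-row counters can be sketched as follows for three of the five columns (a hypothetical helper, not the actual pattern_schema()/executor code; it takes the length of each detected match and emits the column values per output row):

```rust
/// For a list of match lengths, derive `__mr_match_number`,
/// `__mr_match_sequence_number` and `__mr_is_last_match_row` per output row.
fn metadata(match_lens: &[usize]) -> (Vec<u64>, Vec<u64>, Vec<bool>) {
    let mut num = Vec::new();
    let mut seq = Vec::new();
    let mut last = Vec::new();
    for (m, &len) in match_lens.iter().enumerate() {
        for i in 0..len {
            num.push(m as u64 + 1); // match counter starts at 1
            seq.push(i as u64 + 1); // position inside the match starts at 1
            last.push(i + 1 == len); // true on the final row of every match
        }
    }
    (num, seq, last)
}
```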


3. Physical planning and execution

  • The core planner recognises LogicalPlan::MatchRecognizePattern and
    instantiates MatchRecognizePatternExec.

  • MatchRecognizePatternExec

    • receives the compiled Pattern, partition_by and order_by
    • exposes ordering/partitioning requirements identical to WindowAggExec
    • implements execute() by
      1. materialising a completely ordered partition,
      2. running the pattern matcher (PatternMatcher) which scans the partition once, emits matches and populates the metadata columns,
      3. honouring AFTER MATCH SKIP and ROWS PER MATCH.
  • All projections / window aggregates / filters produced earlier continue to behave exactly as they do for ordinary queries.
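The single-pass scan with AFTER MATCH SKIP PAST LAST ROW semantics amounts to the following loop (illustrative only; the hypothetical `match_len_at` closure stands in for running the NFA at a given row, and the real executor also handles the other SKIP modes and ROWS PER MATCH):

```rust
/// One pass over an ordered partition: try to match at each row, record the
/// match as an inclusive (start, end) row range, then honour
/// AFTER MATCH SKIP PAST LAST ROW by resuming just past the match.
fn scan<F: Fn(usize) -> Option<usize>>(n_rows: usize, match_len_at: F) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    let mut i = 0;
    while i < n_rows {
        match match_len_at(i) {
            // Non-empty match: record it and skip past its last row.
            Some(len) if len > 0 => {
                matches.push((i, i + len - 1));
                i += len;
            }
            // No match (or an empty one): advance by a single row.
            _ => i += 1,
        }
    }
    matches
}
```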


4. Summary

  1. MATCH_RECOGNIZE is implemented entirely as a normal combination of projections, filters and window aggregates plus one dedicated pattern-matching node.

  2. DEFINE ⇒ boolean columns (__mr_symbol_*)
    PATTERN ⇒ MatchRecognizePattern node
    MEASURES ⇒ projection of window / aggregate functions over metadata

  3. Everything above the pattern node reuses DataFusion's existing machinery; physical execution differs only in the single custom executor that performs the row-wise NFA scan.

A review comment on the ROWS PER MATCH filter helper:

/// Helper: determine optional filter predicate based on ROWS PER MATCH
fn rows_filter(rpm: &Option<RowsPerMatch>) -> Option<Expr> {
    match rpm {
        // ONE ROW PER MATCH (default): keep last match row only

@geoffreyclaude (Contributor, Author): Implementation "shortcut" for the proof of concept: a more efficient implementation would be to run a real aggregation for ONE ROW PER MATCH, grouped by the partition keys and match number.