feat: upgrade nom to version 8.0.0 and accelerate expr_element using the first token. #18935

KKould · 2025-11-06T09:38:47Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Upgrade nom to version 8.0.0. Use a first-token check to reduce branch traversal in expr_element.

before:
├─ deep_function_call  802.2 µs      │ 1.207 ms      │ 842 µs        │ 850.6 µs      │ 100     │ 100
├─ deep_query          242.3 µs      │ 426.3 µs      │ 254.2 µs      │ 257.3 µs      │ 100     │ 100
├─ large_query         1.104 ms      │ 1.264 ms      │ 1.14 ms       │ 1.142 ms      │ 100     │ 100
├─ large_statement     1.097 ms      │ 1.2 ms        │ 1.15 ms       │ 1.148 ms      │ 100     │ 100
╰─ wide_expr           282.4 µs      │ 368.6 µs      │ 298 µs        │ 298.7 µs      │ 100     │ 100

after updating nom to 8.0.0:
├─ deep_function_call  747.4 µs      │ 1.1 ms        │ 771 µs        │ 776.5 µs      │ 100     │ 100
├─ deep_query          102.4 µs      │ 171.2 µs      │ 108 µs        │ 109.4 µs      │ 100     │ 100
├─ large_query         630.8 µs      │ 733 µs        │ 650 µs        │ 652.4 µs      │ 100     │ 100
├─ large_statement     621.5 µs      │ 687.5 µs      │ 642.9 µs      │ 645 µs        │ 100     │ 100
╰─ wide_expr           212.4 µs      │ 461.1 µs      │ 223.4 µs      │ 229.8 µs      │ 100     │ 100

after (nom_language version):
├─ deep_function_call  242.8 µs      │ 525.3 µs      │ 258.9 µs      │ 262.8 µs      │ 100     │ 100
├─ deep_query          235.6 µs      │ 364.8 µs      │ 244.8 µs      │ 249.3 µs      │ 100     │ 100
├─ large_query         362.9 µs      │ 451.6 µs      │ 376.5 µs      │ 379.7 µs      │ 100     │ 100
├─ large_statement     364.8 µs      │ 418.4 µs      │ 380.2 µs      │ 382.8 µs      │ 100     │ 100
╰─ wide_expr           96.97 µs      │ 270.2 µs      │ 102.8 µs      │ 105.3 µs      │ 100     │ 100

after (pratt parser version, current):
├─ deep_function_call  81.55 µs      │ 290.1 µs      │ 91.42 µs      │ 94.28 µs      │ 100     │ 100
├─ deep_query          237.3 µs      │ 460.7 µs      │ 250.4 µs      │ 255.7 µs      │ 100     │ 100
├─ large_query         150.1 µs      │ 282.1 µs      │ 163.9 µs      │ 167 µs        │ 100     │ 100
├─ large_statement     149.7 µs      │ 185.4 µs      │ 161.4 µs      │ 161.8 µs      │ 100     │ 100
├─ wide_embedding      2.435 ms      │ 2.8 ms        │ 2.503 ms      │ 2.514 ms      │ 100     │ 100
╰─ wide_expr           30.73 µs      │ 45.14 µs      │ 31.18 µs      │ 31.78 µs      │ 100     │ 100
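
For a rough sense of scale, the sketch below computes speedup factors from the tables above. It assumes the third numeric column is the median (the tables have no header row) and compares the "before" run with the pratt parser run. The names `MEDIANS_US`, `speedup`, and `report` are made up for illustration.

```rust
/// (benchmark, before median, after median) in microseconds, copied from the
/// "before" and "pratt parser version" tables. ASSUMPTION: the third numeric
/// column of each table is the median.
const MEDIANS_US: [(&str, f64, f64); 5] = [
    ("deep_function_call", 842.0, 91.42),
    ("deep_query", 254.2, 250.4),
    ("large_query", 1140.0, 163.9),
    ("large_statement", 1150.0, 161.4),
    ("wide_expr", 298.0, 31.18),
];

/// Speedup factor: how many times smaller the new timing is than the old one.
fn speedup(before_us: f64, after_us: f64) -> f64 {
    before_us / after_us
}

/// One formatted line per benchmark, in the form "<name>: <factor>x".
fn report() -> Vec<String> {
    MEDIANS_US
        .iter()
        .map(|(name, before, after)| format!("{}: {:.2}x", name, speedup(*before, *after)))
        .collect()
}
```

Printing each line of `report()` gives one ratio per benchmark; a factor close to 1 means the median barely moved between the two runs.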

By mapping the first token of the current input to the branches that can possibly match it, we avoid trying every branch. This brings a significant improvement, cutting the time spent in the wide_expr benchmark by almost half.

if let Some(token_0) = i.tokens.first() {
    use TokenKind::*;

    // Run the branch selected by the first token. Return early on success or on
    // an unrecoverable failure; otherwise fall through to the generic path below.
    macro_rules! try_dispatch {
        ($($pat:pat => $body:expr),+ $(,)?) => {{
            if let Some(result) = try_token!(token_0, $($pat => $body),+) {
                if matches!(&result, Ok(_) | Err(nom::Err::Failure(_))) {
                    return result;
                }
            }
        }};
    }

    try_dispatch!(
        IS => with_span!(rule!(#is_null | #is_distinct_from)).parse(i),
        IN => with_span!(rule!(#in_list | #in_subquery)).parse(i),
        LIKE => with_span!(rule!(#like_subquery | #binary_op)).parse(i),
        EXISTS => with_span!(exists).parse(i),
        BETWEEN => with_span!(between).parse(i),
        CAST | TRY_CAST => with_span!(cast).parse(i),
        // .... (remaining branches elided)
    );
}
// Speculatively parsing a function call is very expensive and can easily overflow the stack,
// so we manually check whether the second token is LParen to avoid entering the loop.
if i.tokens.get(1).is_some_and(|token| token.kind == LParen) {
    return with_span!(function_call).parse(i);
}

with_span!(alt((rule!(
    #column_ref : "<column>"
    | #map_access : "[<key>] | .<key> | :<key>"
    | #literal : "<literal>"
),)))
.parse(i)
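
To make the idea concrete outside of nom, here is a minimal std-only sketch of first-token dispatch with a second-token lookahead. Everything in it (`TokenKind` variants, `Element`, `parse_try_all`, `parse_dispatch`, the attempt counter) is a hypothetical toy, not the PR's code. Unlike the macro above, it returns the keyword branch's result directly instead of falling through on recoverable errors.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind { Exists, Cast, Ident, LParen, Number }

#[derive(Debug, PartialEq)]
enum Element { Exists, Cast, FunctionCall, ColumnRef, Literal }

type ParseResult = Result<Element, &'static str>;
type Branch = fn(&[TokenKind]) -> ParseResult;

// Toy branch parsers: each one only inspects the leading token(s).
fn expect_first(t: &[TokenKind], kind: TokenKind, out: Element) -> ParseResult {
    if t.first() == Some(&kind) { Ok(out) } else { Err("no match") }
}
fn parse_exists(t: &[TokenKind]) -> ParseResult { expect_first(t, TokenKind::Exists, Element::Exists) }
fn parse_cast(t: &[TokenKind]) -> ParseResult { expect_first(t, TokenKind::Cast, Element::Cast) }
fn parse_column_ref(t: &[TokenKind]) -> ParseResult { expect_first(t, TokenKind::Ident, Element::ColumnRef) }
fn parse_literal(t: &[TokenKind]) -> ParseResult { expect_first(t, TokenKind::Number, Element::Literal) }
fn parse_function_call(t: &[TokenKind]) -> ParseResult {
    match t {
        [TokenKind::Ident, TokenKind::LParen, ..] => Ok(Element::FunctionCall),
        _ => Err("not a function call"),
    }
}

/// Tries each branch in order until one succeeds; returns the attempt count too.
fn try_in_order(branches: &[Branch], tokens: &[TokenKind]) -> (ParseResult, usize) {
    let mut attempts = 0;
    for branch in branches.iter() {
        attempts += 1;
        if let Ok(element) = branch(tokens) {
            return (Ok(element), attempts);
        }
    }
    (Err("no branch matched"), attempts)
}

/// Baseline: every branch is a candidate for every input.
fn parse_try_all(tokens: &[TokenKind]) -> (ParseResult, usize) {
    let all = [
        parse_exists as Branch,
        parse_cast as Branch,
        parse_function_call as Branch,
        parse_column_ref as Branch,
        parse_literal as Branch,
    ];
    try_in_order(&all, tokens)
}

/// Dispatch: the first token picks a keyword branch, the second token detects
/// a function call, and only the remaining generic branches are tried in order.
fn parse_dispatch(tokens: &[TokenKind]) -> (ParseResult, usize) {
    let keyword_branch = match tokens.first() {
        Some(TokenKind::Exists) => Some(parse_exists as Branch),
        Some(TokenKind::Cast) => Some(parse_cast as Branch),
        _ => None,
    };
    if let Some(branch) = keyword_branch {
        return (branch(tokens), 1);
    }
    if tokens.get(1) == Some(&TokenKind::LParen) {
        return (parse_function_call(tokens), 1);
    }
    try_in_order(&[parse_column_ref as Branch, parse_literal as Branch], tokens)
}
```

Both functions return the same element for the same input; the second tuple field shows how many branches were attempted, which is the cost the dispatch is meant to cut.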

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

Copilot

Pull Request Overview

This PR upgrades the nom parser combinator library from version 7 to version 8, along with updating nom-rule from 0.4 to 0.5.1. The upgrade includes API migrations to accommodate nom 8's new trait system and error reporting improvements.

Key changes:

Migration from nom 7 to nom 8 API, including the new Input trait and Parser trait with associated types
Addition of .parse() calls throughout the codebase to invoke parsers using nom 8's Parser trait
Implementation of custom Input trait for the token slice type
Performance optimizations through dispatch macros that reduce backtracking
Improved error messages in test outputs with more specific token expectations
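
As a rough illustration of why explicit `.parse()` calls show up once parsers are values implementing a trait with associated types, here is a toy std-only sketch. It is not nom's actual definition: `PResult`, this `Parser` trait, `Map`, `digits`, and `parse_number` are all simplified stand-ins invented for the example.

```rust
/// Toy parse result: remaining input plus the parsed value, or an error message.
type PResult<'a, O> = Result<(&'a str, O), String>;

/// Hypothetical trait: the output is an associated type and a parser is run
/// through the `parse` method.
trait Parser<'a> {
    type Output;
    fn parse(&mut self, input: &'a str) -> PResult<'a, Self::Output>;
}

/// Plain functions and closures count as parsers as well.
impl<'a, O, F> Parser<'a> for F
where
    F: FnMut(&'a str) -> PResult<'a, O>,
{
    type Output = O;
    fn parse(&mut self, input: &'a str) -> PResult<'a, O> {
        self(input)
    }
}

/// A combinator represented as a struct. It is not callable as `p(input)`;
/// it can only be run through `Parser::parse`.
struct Map<P, G> {
    parser: P,
    f: G,
}

impl<'a, P, G, O2> Parser<'a> for Map<P, G>
where
    P: Parser<'a>,
    G: FnMut(P::Output) -> O2,
{
    type Output = O2;
    fn parse(&mut self, input: &'a str) -> PResult<'a, O2> {
        let (rest, value) = self.parser.parse(input)?;
        Ok((rest, (self.f)(value)))
    }
}

/// Consumes one or more leading ASCII digits.
fn digits<'a>(input: &'a str) -> PResult<'a, &'a str> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected digits at {:?}", input));
    }
    Ok((&input[end..], &input[..end]))
}

/// Digits mapped to a number (`None` if the value does not fit in `u64`).
fn parse_number<'a>(input: &'a str) -> PResult<'a, Option<u64>> {
    let mut number = Map {
        parser: digits,
        f: |s: &str| s.parse::<u64>().ok(),
    };
    // `number(input)` would not compile: `Map` is a struct, so `.parse` is required.
    number.parse(input)
}
```

In this toy model a bare function such as `digits` can still be called directly, while anything built from a combinator has to go through `.parse(...)`.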

Reviewed Changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

File	Description
Cargo.toml	Updated nom to 8.0.0 and nom-rule to 0.5.1
Cargo.lock	Locked updated dependency versions
src/query/ast/src/parser/input.rs	Implemented nom 8's `Input` trait for custom `Input` type
src/query/ast/src/parser/common.rs	Updated parser signatures to use nom 8's `Parser` trait with associated types, added `parser_fn` helper
src/query/ast/src/parser/expr.rs	Contains bugs: Incorrect use of `INVERTED` keyword instead of `INTERVAL` for interval literal parsing, added dispatch optimizations
src/query/ast/src/parser/*.rs	Added `.parse()` calls to invoke parsers using nom 8's API
src/query/ast/tests/it/testdata/*.txt	Updated error messages reflecting nom 8's improved error reporting
src/query/ast/benches/bench.rs	Updated benchmark results showing significant performance improvements

Comments suppressed due to low confidence (1)

src/query/ast/src/parser/expr.rs:1492

The variable name inverted_expr is misleading. This parser handles the INVERTED keyword followed by a literal string and casts it to an Interval type, but the name suggests it's related to inverted indexes or inverted data structures. The previous code used interval_expr for handling INTERVAL <literal_string> syntax. Consider renaming this to interval_string_expr or similar to clarify that it parses interval literals in string form.

    let inverted_expr = map(
        rule! {
            INVERTED ~ #consumed(literal_string)
        },
        |(_, (span, date))| ExprElement::Cast {
            expr: Box::new(Expr::Literal {
                span: transform_span(span.tokens),
                value: Literal::String(date),
            }),
            target_type: TypeName::Interval,
        },
    );

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

KKould changed the title ~~perf: improve the performance of time functions by parsing them using…~~ perf: improve the performance of time functions by parsing them using hard code Nov 6, 2025

KKould changed the title ~~perf: improve the performance of time functions by parsing them using hard code~~ : improve the performance of time functions by parsing them using hard code Nov 6, 2025

KKould changed the title ~~: improve the performance of time functions by parsing them using hard code~~ feat: improve the performance of time functions by parsing them using hard code Nov 6, 2025

github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) Nov 6, 2025

KKould self-assigned this Nov 6, 2025

refactor: upgrade nom to version 8.0.0 and replace the pratt parser with nom_language. Use the first token check to reduce branch traversal in expr_element.

e4187d3

KKould force-pushed the perf/parse_time_function branch from 6114ae0 to e4187d3 Compare November 10, 2025 06:54

KKould changed the title ~~feat: improve the performance of time functions by parsing them using hard code~~ feat: upgrade nom to version 8.0.0 and accelerate expr_element using the first token. Nov 10, 2025

KKould added 2 commits November 10, 2025 15:00

chore: codefmt

222c1bc

refactor: replace nom-language to pratt-parser

41203ff

KKould requested review from TCeason, b41sh, Copilot and sundy-li and removed request for sundy-li November 10, 2025 11:23

Copilot AI reviewed Nov 10, 2025

src/query/ast/src/parser/expr.rs (outdated, resolved)

src/query/ast/src/parser/expr.rs (outdated, resolved)

chore: fix unit test

962c24f

TCeason reviewed Nov 11, 2025

src/query/ast/src/parser/expr.rs (resolved)

src/query/ast/src/parser/expr.rs (resolved)

perf: optimize parse embedding

21eb23f

KKould force-pushed the perf/parse_time_function branch from 82aecc5 to 21eb23f Compare November 11, 2025 03:44

KKould added 3 commits November 11, 2025 12:39

chore: fix unit test

e3911c4

perf: use try_dispatch to categorize the statement_body, reducing the number of branches and stack usage (otherwise, stack overflow is extremely likely).

27c66ce

chore: codefmt

e6e01b6

KKould force-pushed the perf/parse_time_function branch from 5dbc507 to e6e01b6 Compare November 11, 2025 18:45

KKould added 2 commits November 12, 2025 13:33

fix: remove parse cut on statement_body

a8595d8

fix: remove parse cut on statement_body

4560edb

KKould marked this pull request as ready for review November 12, 2025 07:36

chatgpt-codex-connector bot reviewed Nov 12, 2025

src/query/ast/src/parser/expr.rs (resolved)

b41sh reviewed Nov 12, 2025

src/query/ast/src/parser/expr.rs (outdated, resolved)

src/query/ast/src/parser/expr.rs (outdated, resolved)

chore: integrate bracket_map_access into the array

595aec1

KKould force-pushed the perf/parse_time_function branch 2 times, most recently from ccb06c2 to 1d1a8e4 Compare November 13, 2025 04:01

chore: optimize array parsing: keep column_id/literal, add fast path in try_dispatch, create array_number function for numeric arrays, handle negative numbers directly

6b623ab

KKould force-pushed the perf/parse_time_function branch from 1d1a8e4 to 6b623ab Compare November 13, 2025 05:45

chore: fix scalars test

1ea88b3

KKould requested a review from b41sh November 13, 2025 08:59

b41sh approved these changes Nov 14, 2025

BohuTANG merged commit 525ef26 into databendlabs:main Nov 14, 2025
87 checks passed
