
Conversation

@muellerj2
Contributor

@muellerj2 muellerj2 commented Nov 13, 2025

Until now, the full state of the capturing groups was stored in two vectors in each stack frame. This is wasteful: The capturing group state usually doesn't change completely between two stack frames, and storing it causes two allocations per stack frame whenever the regex contains at least one capturing group, even if the associated unwinding opcode never actually uses the stored state.

This PR removes one of the two vectors stored in stack frames, the vector that stores the extents of the capturing groups. It is replaced by two new unwinding opcodes, _Capture_restore_begin and _Capture_restore_end, which are assigned to stack frames pushed while processing _N_capture and _N_end_capture nodes so that the capturing group extents can be restored during unwinding. Changes to the extents of capturing groups are thus now represented as new frames on the stack instead of in vectors stored in every stack frame.
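
To make the mechanism concrete, here is a minimal sketch with hypothetical names and layout (the real frames in <regex> carry more state than shown here):

```c++
#include <string>
#include <vector>

enum class unwind_op {
    ordinary,              // any other backtracking work
    capture_restore_begin, // restore one group's begin iterator
    capture_restore_end,   // restore one group's end iterator
};

struct group_extent {
    std::string::const_iterator begin;
    std::string::const_iterator end;
};

struct frame {
    unwind_op op = unwind_op::ordinary;
    unsigned int group_index = 0;             // which group the restore applies to
    std::string::const_iterator saved_iter{}; // the iterator value to put back
    // ... whatever else an ordinary backtracking frame needs ...
};

// During stack unwinding, a restore frame simply writes its saved iterator back,
// so the group extents no longer have to be copied into every frame.
inline void apply_restore(const frame& fr, std::vector<group_extent>& groups) {
    if (fr.op == unwind_op::capture_restore_begin) {
        groups[fr.group_index].begin = fr.saved_iter;
    } else if (fr.op == unwind_op::capture_restore_end) {
        groups[fr.group_index].end = fr.saved_iter;
    }
}
```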

Note that this restoration of the extents is always performed in the general stack unwinding loop, whether or not the regex matched successfully. This is necessary for leftmost-longest mode, which has to keep trying matches even after a successful match has been found. However, we do not want this to happen when the regex or a positive lookahead assertion matches successfully in ECMAScript mode; otherwise, all capturing group information would be erased from the final result. This is mostly no longer an issue after #5828 and #5835, because stack unwinding is now skipped in these cases. (Skipping unwinding for negative lookahead assertions is not required for this, but it was a natural byproduct of #5835.)

For positive lookahead assertions, though, an adjustment is needed: While the capturing groups inside the assertion are guaranteed to be unmatched when processing of the assertion starts, their ranges might become meaningful again during backtracking. This means the begin and end iterators of the capturing groups have to be restored correctly while backtracking, so the _Capture_restore_begin and _Capture_restore_end frames must be retained. However, all other stack frames pushed while the lookahead assertion was processed should still be thrown out.
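
Reusing the hypothetical types from the sketch above, the filtering described here could look roughly like this (the actual matcher may implement it differently):

```c++
#include <algorithm>
#include <cstddef>
#include <vector>

// Drop every frame pushed during the assertion except the capture-restore frames,
// which must survive so that later backtracking restores the group extents.
inline void discard_lookahead_frames(std::vector<frame>& frames, std::size_t assertion_base) {
    frames.erase(
        std::remove_if(frames.begin() + static_cast<std::ptrdiff_t>(assertion_base), frames.end(),
            [](const frame& fr) {
                return fr.op != unwind_op::capture_restore_begin
                    && fr.op != unwind_op::capture_restore_end;
            }),
        frames.end());
}
```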

We do not have to do the same for negative lookahead assertions, because the capturing groups are always unmatched once the lookahead assertion has completed (whether it succeeded or failed), so the begin and end iterators are meaningless.

Benchmarks

First, the results of the existing regex_search benchmark:

| benchmark | before [ns] | after [ns] | "speedup" |
| --- | --- | --- | --- |
| bm_lorem_search/"^bibe"/2 | 59.99 | 64.17 | 0.93 |
| bm_lorem_search/"^bibe"/3 | 62.78 | 62.78 | 1.00 |
| bm_lorem_search/"^bibe"/4 | 59.38 | 64.52 | 0.92 |
| bm_lorem_search/"bibe"/2 | 3069.20 | 3138.95 | 0.98 |
| bm_lorem_search/"bibe"/3 | 5998.88 | 6417.41 | 0.93 |
| bm_lorem_search/"bibe"/4 | 11474.60 | 12207.00 | 0.94 |
| bm_lorem_search/"bibe".collate/2 | 3138.95 | 3208.71 | 0.98 |
| bm_lorem_search/"bibe".collate/3 | 5998.88 | 5998.88 | 1.00 |
| bm_lorem_search/"bibe".collate/4 | 11718.80 | 11718.80 | 1.00 |
| bm_lorem_search/"(bibe)"/2 | 5440.85 | 9207.55 | 0.59 |
| bm_lorem_search/"(bibe)"/3 | 10881.70 | 18833.90 | 0.58 |
| bm_lorem_search/"(bibe)"/4 | 21763.10 | 37666.70 | 0.58 |
| bm_lorem_search/"(bibe)+"/2 | 13113.80 | 17648.00 | 0.74 |
| bm_lorem_search/"(bibe)+"/3 | 26227.70 | 34494.00 | 0.76 |
| bm_lorem_search/"(bibe)+"/4 | 54687.50 | 69754.50 | 0.78 |
| bm_lorem_search/"(?:bibe)+"/2 | 7672.99 | 6835.94 | 1.12 |
| bm_lorem_search/"(?:bibe)+"/3 | 15066.90 | 14439.10 | 1.04 |
| bm_lorem_search/"(?:bibe)+"/4 | 29157.30 | 27866.80 | 1.05 |
| bm_lorem_search/R"(\bbibe)"/2 | 96256.90 | 112305.00 | 0.86 |
| bm_lorem_search/R"(\bbibe)"/3 | 199507.00 | 224609.00 | 0.89 |
| bm_lorem_search/R"(\bbibe)"/4 | 392369.00 | 449219.00 | 0.87 |
| bm_lorem_search/R"(\Bibe)"/2 | 230164.00 | 273438.00 | 0.84 |
| bm_lorem_search/R"(\Bibe)"/3 | 515625.00 | 544085.00 | 0.95 |
| bm_lorem_search/R"(\Bibe)"/4 | 962182.00 | 1025390.00 | 0.94 |
| bm_lorem_search/R"((?=....)bibe)"/2 | 5625.00 | 5312.50 | 1.06 |
| bm_lorem_search/R"((?=....)bibe)"/3 | 10253.90 | 10498.00 | 0.98 |
| bm_lorem_search/R"((?=....)bibe)"/4 | 20856.30 | 22216.50 | 0.94 |
| bm_lorem_search/R"((?=bibe)....)"/2 | 5189.74 | 5000.00 | 1.04 |
| bm_lorem_search/R"((?=bibe)....)"/3 | 10044.60 | 9521.48 | 1.05 |
| bm_lorem_search/R"((?=bibe)....)"/4 | 19949.50 | 19252.40 | 1.04 |
| bm_lorem_search/R"((?!lorem)bibe)"/2 | 5000.00 | 4649.14 | 1.08 |
| bm_lorem_search/R"((?!lorem)bibe)"/3 | 10044.60 | 10044.60 | 1.00 |
| bm_lorem_search/R"((?!lorem)bibe)"/4 | 19670.90 | 18415.30 | 1.07 |

I guess these look as if this change is actually a pessimization. But the worse performance in the most affected benchmarks happens for two reasons:

  • These regexes are very simple and never perform more than two iterations of a loop on this input, so the stack stays small. At such small sizes, (almost) every addition to the underlying vector results in a reallocation. When a capturing group gets matched after this PR, there are two more frames on the stack and thus two more reallocations. This especially affects the regex (bibe): The number of stack frames increases from zero to two (and thus from zero stack allocations to two).
  • Each stack frame still stores the capturing group validity vector. Since there are more stack frames when a capturing group gets matched, more of these validity vectors get allocated for regexes with captures. This hurts a regex like (a)* the most: Per iteration, there is one allocation fewer because the extent vector is gone, but also two additional stack frames and thus two additional allocations because each of them stores a validity vector. In total, this regex performs one more allocation per iteration than before.

The first cause can be remedied by implementing some kind of small vector optimization, the second by removing the validity vectors as well (which is what the next PR will do).
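
To illustrate the first remedy, a rough sketch of such a small-buffer stack (hypothetical and simplified for exposition, not the planned implementation): the first N frames live in an inline buffer, so shallow stacks never touch the heap.

```c++
#include <array>
#include <cstddef>
#include <vector>

// Simplified: requires T to be default-constructible and copy-assignable.
template <class T, std::size_t N>
class small_stack {
public:
    void push(const T& value) {
        if (size_ < N) {
            inline_buf_[size_] = value; // no heap allocation for shallow stacks
        } else {
            overflow_.push_back(value); // heap only once the inline buffer is full
        }
        ++size_;
    }

    void pop() {
        --size_;
        if (size_ >= N) {
            overflow_.pop_back();
        }
    }

    T& top() {
        return size_ <= N ? inline_buf_[size_ - 1] : overflow_[size_ - N - 1];
    }

    bool empty() const {
        return size_ == 0;
    }

private:
    std::array<T, N> inline_buf_{};
    std::vector<T> overflow_;
    std::size_t size_ = 0;
};
```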

In the regex_search benchmark specifically, four more allocations are performed per regex_search() call for the regexes (bibe) and (bibe)+, which is the main reason for the observed increase in running time.
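
Not part of the PR, but a rough way to observe such per-call allocation counts is a global operator new override that bumps a counter (all names below are mine):

```c++
#include <cstdio>
#include <cstdlib>
#include <new>
#include <regex>
#include <string>

namespace {
    std::size_t allocations = 0; // not thread-safe; fine for a single-threaded experiment
}

void* operator new(std::size_t size) {
    ++allocations;
    if (void* ptr = std::malloc(size)) {
        return ptr;
    }
    throw std::bad_alloc{};
}

void operator delete(void* ptr) noexcept {
    std::free(ptr);
}

void operator delete(void* ptr, std::size_t) noexcept {
    std::free(ptr);
}

int main() {
    const std::regex re("(bibe)");
    const std::string subject = "lorem ipsum bibe dolor sit amet";
    std::smatch match;

    const std::size_t before = allocations;
    std::regex_search(subject, match, re); // count includes match_results bookkeeping
    std::printf("allocations during regex_search: %zu\n", allocations - before);
}
```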

To see that this change can actually substantially improve the running time for longer matches, I added a new benchmark for regex_match() matching a long sequence of a's with different regexes:

| name | before [ns] | after [ns] | speedup |
| --- | --- | --- | --- |
| bm_match_sequence_of_as/"a*"/100 | 5859.38 | 4708.44 | 1.24 |
| bm_match_sequence_of_as/"a*"/200 | 10009.80 | 8021.76 | 1.25 |
| bm_match_sequence_of_as/"a*"/400 | 34379.00 | 16497.00 | 2.08 |
| bm_match_sequence_of_as/"a*?"/100 | 3609.79 | 3369.15 | 1.07 |
| bm_match_sequence_of_as/"a*?"/200 | 6696.43 | 6452.29 | 1.04 |
| bm_match_sequence_of_as/"a*?"/400 | 13183.50 | 13113.80 | 1.01 |
| bm_match_sequence_of_as/"(?:a)*"/100 | 6277.90 | 4649.14 | 1.35 |
| bm_match_sequence_of_as/"(?:a)*"/200 | 10498.00 | 8196.15 | 1.28 |
| bm_match_sequence_of_as/"(?:a)*"/400 | 36272.30 | 16392.30 | 2.21 |
| bm_match_sequence_of_as/"(a)*"/100 | 20089.50 | 32889.70 | 0.61 |
| bm_match_sequence_of_as/"(a)*"/200 | 38505.00 | 71498.30 | 0.54 |
| bm_match_sequence_of_as/"(a)*"/400 | 102539.00 | 128691.00 | 0.80 |
| bm_match_sequence_of_as/"(?:b\|a)*"/100 | 9277.34 | 7498.60 | 1.24 |
| bm_match_sequence_of_as/"(?:b\|a)*"/200 | 17159.80 | 13497.40 | 1.27 |
| bm_match_sequence_of_as/"(?:b\|a)*"/400 | 39236.90 | 26994.90 | 1.45 |
| bm_match_sequence_of_as/"(b\|a)*"/100 | 24588.20 | 32087.10 | 0.77 |
| bm_match_sequence_of_as/"(b\|a)*"/200 | 43945.30 | 62779.00 | 0.70 |
| bm_match_sequence_of_as/"(b\|a)*"/400 | 97656.20 | 136719.00 | 0.71 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*"/100 | 22949.20 | 13392.90 | 1.71 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*"/200 | 41712.60 | 24065.00 | 1.73 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*"/400 | 89979.20 | 51562.50 | 1.75 |
| bm_match_sequence_of_as/"(a)(b\|a)*"/100 | 22460.90 | 30029.80 | 0.75 |
| bm_match_sequence_of_as/"(a)(b\|a)*"/200 | 42968.80 | 61383.90 | 0.70 |
| bm_match_sequence_of_as/"(a)(b\|a)*"/400 | 96256.90 | 125558.00 | 0.77 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*c"/100 | 22495.60 | 14439.10 | 1.56 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*c"/200 | 45515.60 | 27273.90 | 1.67 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*c"/400 | 100442.00 | 53013.40 | 1.89 |

This shows performance improvements across the board, except for the problematic regexes mentioned above, which iterate a capturing group around a simple repeated pattern such as (a)*.

@muellerj2 muellerj2 requested a review from a team as a code owner November 13, 2025 21:30
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Nov 13, 2025
@muellerj2 muellerj2 force-pushed the regex-remove-capture-extent-vectors-from-stack-frames branch from b481d3f to 36044b4 Compare November 13, 2025 21:32
@StephanTLavavej StephanTLavavej added performance Must go faster regex meow is a substring of homeowner labels Nov 13, 2025
@StephanTLavavej StephanTLavavej self-assigned this Nov 13, 2025
@muellerj2
Contributor Author

> This relies on one subtlety: Clearly, the extents of capturing groups are no longer restored to prior values for negative lookahead assertions (when the assertion failed) and positive lookahead assertions (when the assertion succeeded, but the matcher had to unwind the stack beyond the assertion again). But this is unproblematic because the capturing groups are then in an unmatched state as well, so no meaning is assigned to the positions in the input string the begin and end iterators of these capturing groups point to. When the capturing groups do get matched again later, the iterators have been reassigned to the actual extents of the matched groups.

I thought about this again and it turns out this claim is wrong for positive lookahead assertions: A positive lookahead assertion might be surrounded by a loop and thus be reentrant, and then the values of the begin and end iterators do have to be restored even if the capturing groups are unmatched when the positive lookahead assertion is entered again.

I fixed this and also added some tests that fail when the begin and end iterators are not restored correctly during backtracking. (This also shows that the test coverage is still lacking for lookahead assertions, particularly when they are placed inside loops.)
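
For illustration only (this is not one of the tests added in the PR), this is the kind of pattern where the restoration matters: a positive lookahead with a capturing group inside a repeated group, so the assertion is re-entered on every iteration and the matcher can backtrack into it.

```c++
#include <iostream>
#include <regex>
#include <string>

int main() {
    // The lookahead (?=(a+)) runs once per loop iteration, capturing into group 1
    // each time; the reported extent of group 1 depends on correct restoration of
    // its begin/end iterators while the matcher backtracks.
    const std::regex re(R"((?:(?=(a+))ab?)+)");
    const std::string subject = "aaab";

    std::smatch m;
    if (std::regex_search(subject, m, re)) {
        std::cout << "full match: \"" << m[0] << "\"\n";
        std::cout << "group 1:    \"" << m[1] << "\"\n";
    }
}
```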

Comment on lines +3968 to +3969

```c++
_Frames[_Frame_idx]._Match_state._Cur = _Group._Begin;
_Group._Begin = _Tgt_state._Cur;
```
Member

No change requested: I observe that _STD exchange could be used for this code pattern, although we don't universally use it. (This also occurs immediately below for _N_end_capture.)
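
For illustration, this is what the suggestion would look like (exchange lives in <utility>; this is not a change made in this PR):

```c++
// Save-then-overwrite expressed as a single expression; the _N_end_capture case
// below would collapse the same way.
_Frames[_Frame_idx]._Match_state._Cur = _STD exchange(_Group._Begin, _Tgt_state._Cur);
```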

```c++
using namespace std;
using namespace regex_constants;

void bm_match_sequence_of_as(benchmark::State& state, const char* pattern, syntax_option_type syntax = ECMAScript) {
```
Member

No change requested: The syntax is never customized, but I see that it's imitating regex_search.cpp, and I suppose it's not too confusing to leave as-is.
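
For context, a plausible shape for this benchmark as a whole (a sketch only, assuming it mirrors the existing regex_search.cpp benchmarks; the actual file added by the PR may differ): the subject is a run of 'a' characters whose length is the benchmark argument.

```c++
#include <benchmark/benchmark.h>
#include <regex>
#include <string>

using namespace std;
using namespace regex_constants;

void bm_match_sequence_of_as(benchmark::State& state, const char* pattern, syntax_option_type syntax = ECMAScript) {
    const regex re(pattern, syntax);
    const string subject(static_cast<size_t>(state.range(0)), 'a');
    for (auto _ : state) {
        smatch match;
        benchmark::DoNotOptimize(regex_match(subject, match, re));
    }
}

BENCHMARK_CAPTURE(bm_match_sequence_of_as, "a*", "a*")->Arg(100)->Arg(200)->Arg(400);
BENCHMARK_CAPTURE(bm_match_sequence_of_as, "(a)*", "(a)*")->Arg(100)->Arg(200)->Arg(400);

BENCHMARK_MAIN();
```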

@StephanTLavavej StephanTLavavej removed their assignment Nov 18, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Nov 18, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Nov 19, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 44c4abc into microsoft:main Nov 19, 2025
44 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Nov 19, 2025
@StephanTLavavej
Member

Thanks for this latest entry in the Regex Cinematic Universe! 🦸 🎥 🍿
