
Conversation

@muellerj2
Contributor

@muellerj2 muellerj2 commented Nov 13, 2025

Until now, the full state of the capturing groups was stored in two vectors in each stack frame. This is wasteful: The capturing group state usually doesn't change completely between two stack frames, and storing it causes two allocations per stack frame whenever the regex contains at least one capturing group, even if the associated unwinding opcode never actually uses the stored state.

This PR removes one of the two vectors stored in stack frames, the vector that stores the extents of the capturing groups. It is replaced by two new unwinding opcodes, _Capture_restore_begin and _Capture_restore_end, which are assigned to stack frames pushed while processing _N_capture and _N_end_capture nodes so that the capturing group extents can be restored during unwinding. Changes to the extents of capturing groups are thus now represented as new frames on the stack instead of in vectors stored in every stack frame.
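
To make the mechanism concrete, here is a minimal sketch with hypothetical names and layout (the real frames in <regex> carry more state than shown here):

```c++
#include <string>
#include <vector>

enum class unwind_op {
    ordinary,              // any other backtracking work
    capture_restore_begin, // restore one group's begin iterator
    capture_restore_end,   // restore one group's end iterator
};

struct group_extent {
    std::string::const_iterator begin;
    std::string::const_iterator end;
};

struct frame {
    unwind_op op = unwind_op::ordinary;
    unsigned int group_index = 0;             // which group the restore applies to
    std::string::const_iterator saved_iter{}; // the iterator value to put back
    // ... whatever else an ordinary backtracking frame needs ...
};

// During stack unwinding, a restore frame simply writes its saved iterator back,
// so the group extents no longer have to be copied into every frame.
inline void apply_restore(const frame& fr, std::vector<group_extent>& groups) {
    if (fr.op == unwind_op::capture_restore_begin) {
        groups[fr.group_index].begin = fr.saved_iter;
    } else if (fr.op == unwind_op::capture_restore_end) {
        groups[fr.group_index].end = fr.saved_iter;
    }
}
```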

Note that this restoration of the extents is always performed in the general stack unwinding loop, whether or not the regex matched successfully. This is necessary for leftmost-longest mode, which has to keep trying matches even after a successful match has been found. However, we do not want this to happen when the regex or a positive lookahead assertion matches successfully in ECMAScript mode; otherwise, all capturing group information would be erased from the final result. This is mostly no longer an issue after #5828 and #5835, because stack unwinding is now skipped in these cases. (Skipping unwinding for negative lookahead assertions is not required for this, but it was a natural byproduct of #5835.)

For positive lookahead assertions, though, an adjustment is needed: While the capturing groups inside the assertion are guaranteed to be unmatched when processing of the assertion starts, their ranges might become meaningful again during backtracking. This means the begin and end iterators of the capturing groups have to be restored correctly while backtracking, so the _Capture_restore_begin and _Capture_restore_end frames must be retained. However, all other stack frames pushed while the lookahead assertion was processed should still be thrown out.
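
Reusing the hypothetical types from the sketch above, the filtering described here could look roughly like this (the actual matcher may implement it differently):

```c++
#include <algorithm>
#include <cstddef>
#include <vector>

// Drop every frame pushed during the assertion except the capture-restore frames,
// which must survive so that later backtracking restores the group extents.
inline void discard_lookahead_frames(std::vector<frame>& frames, std::size_t assertion_base) {
    frames.erase(
        std::remove_if(frames.begin() + static_cast<std::ptrdiff_t>(assertion_base), frames.end(),
            [](const frame& fr) {
                return fr.op != unwind_op::capture_restore_begin
                    && fr.op != unwind_op::capture_restore_end;
            }),
        frames.end());
}
```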

We do not have to do the same for negative lookahead assertions, because the capturing groups are always unmatched once the lookahead assertion has completed (whether it succeeded or failed), so the begin and end iterators are meaningless.

Benchmarks

First, the results of the existing regex_search benchmark:

| benchmark | before [ns] | after [ns] | "speedup" |
| --- | --- | --- | --- |
| bm_lorem_search/"^bibe"/2 | 59.99 | 64.17 | 0.93 |
| bm_lorem_search/"^bibe"/3 | 62.78 | 62.78 | 1.00 |
| bm_lorem_search/"^bibe"/4 | 59.38 | 64.52 | 0.92 |
| bm_lorem_search/"bibe"/2 | 3069.20 | 3138.95 | 0.98 |
| bm_lorem_search/"bibe"/3 | 5998.88 | 6417.41 | 0.93 |
| bm_lorem_search/"bibe"/4 | 11474.60 | 12207.00 | 0.94 |
| bm_lorem_search/"bibe".collate/2 | 3138.95 | 3208.71 | 0.98 |
| bm_lorem_search/"bibe".collate/3 | 5998.88 | 5998.88 | 1.00 |
| bm_lorem_search/"bibe".collate/4 | 11718.80 | 11718.80 | 1.00 |
| bm_lorem_search/"(bibe)"/2 | 5440.85 | 9207.55 | 0.59 |
| bm_lorem_search/"(bibe)"/3 | 10881.70 | 18833.90 | 0.58 |
| bm_lorem_search/"(bibe)"/4 | 21763.10 | 37666.70 | 0.58 |
| bm_lorem_search/"(bibe)+"/2 | 13113.80 | 17648.00 | 0.74 |
| bm_lorem_search/"(bibe)+"/3 | 26227.70 | 34494.00 | 0.76 |
| bm_lorem_search/"(bibe)+"/4 | 54687.50 | 69754.50 | 0.78 |
| bm_lorem_search/"(?:bibe)+"/2 | 7672.99 | 6835.94 | 1.12 |
| bm_lorem_search/"(?:bibe)+"/3 | 15066.90 | 14439.10 | 1.04 |
| bm_lorem_search/"(?:bibe)+"/4 | 29157.30 | 27866.80 | 1.05 |
| bm_lorem_search/R"(\bbibe)"/2 | 96256.90 | 112305.00 | 0.86 |
| bm_lorem_search/R"(\bbibe)"/3 | 199507.00 | 224609.00 | 0.89 |
| bm_lorem_search/R"(\bbibe)"/4 | 392369.00 | 449219.00 | 0.87 |
| bm_lorem_search/R"(\Bibe)"/2 | 230164.00 | 273438.00 | 0.84 |
| bm_lorem_search/R"(\Bibe)"/3 | 515625.00 | 544085.00 | 0.95 |
| bm_lorem_search/R"(\Bibe)"/4 | 962182.00 | 1025390.00 | 0.94 |
| bm_lorem_search/R"((?=....)bibe)"/2 | 5625.00 | 5312.50 | 1.06 |
| bm_lorem_search/R"((?=....)bibe)"/3 | 10253.90 | 10498.00 | 0.98 |
| bm_lorem_search/R"((?=....)bibe)"/4 | 20856.30 | 22216.50 | 0.94 |
| bm_lorem_search/R"((?=bibe)....)"/2 | 5189.74 | 5000.00 | 1.04 |
| bm_lorem_search/R"((?=bibe)....)"/3 | 10044.60 | 9521.48 | 1.05 |
| bm_lorem_search/R"((?=bibe)....)"/4 | 19949.50 | 19252.40 | 1.04 |
| bm_lorem_search/R"((?!lorem)bibe)"/2 | 5000.00 | 4649.14 | 1.08 |
| bm_lorem_search/R"((?!lorem)bibe)"/3 | 10044.60 | 10044.60 | 1.00 |
| bm_lorem_search/R"((?!lorem)bibe)"/4 | 19670.90 | 18415.30 | 1.07 |

I guess these look as if this change is actually a pessimization. But the worse performance in the most affected benchmarks happens for two reasons:

  • These regexes are very simple and never perform more than two iterations of a loop on this input, so the stack stays small. At such small sizes, (almost) every addition to the underlying vector results in a reallocation. When a capturing group gets matched after this PR, there are two more frames on the stack and thus two more reallocations. This especially affects the regex (bibe): The number of stack frames increases from zero to two (and thus from zero stack allocations to two).
  • Each stack frame still stores the capturing group validity vector. Since there are more stack frames when a capturing group gets matched, more of these validity vectors get allocated for regexes with captures. This hurts a regex like (a)* the most: Per iteration, there is one allocation fewer because the extent vector is gone, but also two additional stack frames and thus two additional allocations because each of them stores a validity vector. In total, this regex performs one more allocation per iteration than before.

The first cause can be remedied by implementing some kind of small vector optimization, the second by removing the validity vectors as well (which is what the next PR will do).
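
To illustrate the first remedy, a rough sketch of such a small-buffer stack (hypothetical and simplified for exposition, not the planned implementation): the first N frames live in an inline buffer, so shallow stacks never touch the heap.

```c++
#include <array>
#include <cstddef>
#include <vector>

// Simplified: requires T to be default-constructible and copy-assignable.
template <class T, std::size_t N>
class small_stack {
public:
    void push(const T& value) {
        if (size_ < N) {
            inline_buf_[size_] = value; // no heap allocation for shallow stacks
        } else {
            overflow_.push_back(value); // heap only once the inline buffer is full
        }
        ++size_;
    }

    void pop() {
        --size_;
        if (size_ >= N) {
            overflow_.pop_back();
        }
    }

    T& top() {
        return size_ <= N ? inline_buf_[size_ - 1] : overflow_[size_ - N - 1];
    }

    bool empty() const {
        return size_ == 0;
    }

private:
    std::array<T, N> inline_buf_{};
    std::vector<T> overflow_;
    std::size_t size_ = 0;
};
```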

In the regex_search benchmark specifically, four more allocations are performed per regex_search() call for the regexes (bibe) and (bibe)+, which is the main reason for the observed increase in running time.
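
Not part of the PR, but a rough way to observe such per-call allocation counts is a global operator new override that bumps a counter (all names below are mine):

```c++
#include <cstdio>
#include <cstdlib>
#include <new>
#include <regex>
#include <string>

namespace {
    std::size_t allocations = 0; // not thread-safe; fine for a single-threaded experiment
}

void* operator new(std::size_t size) {
    ++allocations;
    if (void* ptr = std::malloc(size)) {
        return ptr;
    }
    throw std::bad_alloc{};
}

void operator delete(void* ptr) noexcept {
    std::free(ptr);
}

void operator delete(void* ptr, std::size_t) noexcept {
    std::free(ptr);
}

int main() {
    const std::regex re("(bibe)");
    const std::string subject = "lorem ipsum bibe dolor sit amet";
    std::smatch match;

    const std::size_t before = allocations;
    std::regex_search(subject, match, re); // count includes match_results bookkeeping
    std::printf("allocations during regex_search: %zu\n", allocations - before);
}
```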

To see that this change can actually substantially improve the running time for longer matches, I added a new benchmark for regex_match() matching a long sequence of a's with different regexes:

| name | before [ns] | after [ns] | speedup |
| --- | --- | --- | --- |
| bm_match_sequence_of_as/"a*"/100 | 5859.38 | 4708.44 | 1.24 |
| bm_match_sequence_of_as/"a*"/200 | 10009.80 | 8021.76 | 1.25 |
| bm_match_sequence_of_as/"a*"/400 | 34379.00 | 16497.00 | 2.08 |
| bm_match_sequence_of_as/"a*?"/100 | 3609.79 | 3369.15 | 1.07 |
| bm_match_sequence_of_as/"a*?"/200 | 6696.43 | 6452.29 | 1.04 |
| bm_match_sequence_of_as/"a*?"/400 | 13183.50 | 13113.80 | 1.01 |
| bm_match_sequence_of_as/"(?:a)*"/100 | 6277.90 | 4649.14 | 1.35 |
| bm_match_sequence_of_as/"(?:a)*"/200 | 10498.00 | 8196.15 | 1.28 |
| bm_match_sequence_of_as/"(?:a)*"/400 | 36272.30 | 16392.30 | 2.21 |
| bm_match_sequence_of_as/"(a)*"/100 | 20089.50 | 32889.70 | 0.61 |
| bm_match_sequence_of_as/"(a)*"/200 | 38505.00 | 71498.30 | 0.54 |
| bm_match_sequence_of_as/"(a)*"/400 | 102539.00 | 128691.00 | 0.80 |
| bm_match_sequence_of_as/"(?:b\|a)*"/100 | 9277.34 | 7498.60 | 1.24 |
| bm_match_sequence_of_as/"(?:b\|a)*"/200 | 17159.80 | 13497.40 | 1.27 |
| bm_match_sequence_of_as/"(?:b\|a)*"/400 | 39236.90 | 26994.90 | 1.45 |
| bm_match_sequence_of_as/"(b\|a)*"/100 | 24588.20 | 32087.10 | 0.77 |
| bm_match_sequence_of_as/"(b\|a)*"/200 | 43945.30 | 62779.00 | 0.70 |
| bm_match_sequence_of_as/"(b\|a)*"/400 | 97656.20 | 136719.00 | 0.71 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*"/100 | 22949.20 | 13392.90 | 1.71 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*"/200 | 41712.60 | 24065.00 | 1.73 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*"/400 | 89979.20 | 51562.50 | 1.75 |
| bm_match_sequence_of_as/"(a)(b\|a)*"/100 | 22460.90 | 30029.80 | 0.75 |
| bm_match_sequence_of_as/"(a)(b\|a)*"/200 | 42968.80 | 61383.90 | 0.70 |
| bm_match_sequence_of_as/"(a)(b\|a)*"/400 | 96256.90 | 125558.00 | 0.77 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*c"/100 | 22495.60 | 14439.10 | 1.56 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*c"/200 | 45515.60 | 27273.90 | 1.67 |
| bm_match_sequence_of_as/"(a)(?:b\|a)*c"/400 | 100442.00 | 53013.40 | 1.89 |

This shows performance improvements across the board, except for the problematic regexes mentioned above, which iterate a capturing group around a simple repeated pattern such as (a)*.

@muellerj2 muellerj2 requested a review from a team as a code owner November 13, 2025 21:30
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Nov 13, 2025
@muellerj2 muellerj2 force-pushed the regex-remove-capture-extent-vectors-from-stack-frames branch from b481d3f to 36044b4 Compare November 13, 2025 21:32
@StephanTLavavej StephanTLavavej added performance Must go faster regex meow is a substring of homeowner labels Nov 13, 2025
@StephanTLavavej StephanTLavavej self-assigned this Nov 13, 2025
@muellerj2
Contributor Author

> This relies on one subtlety: Clearly, the extents of capturing groups are no longer restored to prior values for negative lookahead assertions (when the assertion failed) and positive lookahead assertions (when the assertion succeeded, but the matcher had to unwind the stack beyond the assertion again). But this is unproblematic because the capturing groups are then in an unmatched state as well, so no meaning is assigned to the positions in the input string the begin and end iterators of these capturing groups point to. When the capturing groups do get matched again later, the iterators have been reassigned to the actual extents of the matched groups.

I thought about this again and it turns out this claim is wrong for positive lookahead assertions: A positive lookahead assertion might be surrounded by a loop and thus be reentrant, and then the values of the begin and end iterators do have to be restored even if the capturing groups are unmatched when the positive lookahead assertion is entered again.

I fixed this and also added some tests that fail when the begin and end iterators are not restored correctly during backtracking. (This also shows that the test coverage is still lacking for lookahead assertions, particularly when they are placed inside loops.)
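
For illustration only (this is not one of the tests added in the PR), this is the kind of pattern where the restoration matters: a positive lookahead with a capturing group inside a repeated group, so the assertion is re-entered on every iteration and the matcher can backtrack into it.

```c++
#include <iostream>
#include <regex>
#include <string>

int main() {
    // The lookahead (?=(a+)) runs once per loop iteration, capturing into group 1
    // each time; the reported extent of group 1 depends on correct restoration of
    // its begin/end iterators while the matcher backtracks.
    const std::regex re(R"((?:(?=(a+))ab?)+)");
    const std::string subject = "aaab";

    std::smatch m;
    if (std::regex_search(subject, m, re)) {
        std::cout << "full match: \"" << m[0] << "\"\n";
        std::cout << "group 1:    \"" << m[1] << "\"\n";
    }
}
```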

Comment on lines +3968 to +3969

```c++
_Frames[_Frame_idx]._Match_state._Cur = _Group._Begin;
_Group._Begin = _Tgt_state._Cur;
```
Member

No change requested: I observe that _STD exchange could be used for this code pattern, although we don't universally use it. (This also occurs immediately below for _N_end_capture.)
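
For illustration, this is what the suggestion would look like (exchange lives in <utility>; this is not a change made in this PR):

```c++
// Save-then-overwrite expressed as a single expression; the _N_end_capture case
// below would collapse the same way.
_Frames[_Frame_idx]._Match_state._Cur = _STD exchange(_Group._Begin, _Tgt_state._Cur);
```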

```c++
using namespace std;
using namespace regex_constants;

void bm_match_sequence_of_as(benchmark::State& state, const char* pattern, syntax_option_type syntax = ECMAScript) {
```
Member

No change requested: The syntax is never customized, but I see that it's imitating regex_search.cpp, and I suppose it's not too confusing to leave as-is.
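
For context, a plausible shape for this benchmark as a whole (a sketch only, assuming it mirrors the existing regex_search.cpp benchmarks; the actual file added by the PR may differ): the subject is a run of 'a' characters whose length is the benchmark argument.

```c++
#include <benchmark/benchmark.h>
#include <regex>
#include <string>

using namespace std;
using namespace regex_constants;

void bm_match_sequence_of_as(benchmark::State& state, const char* pattern, syntax_option_type syntax = ECMAScript) {
    const regex re(pattern, syntax);
    const string subject(static_cast<size_t>(state.range(0)), 'a');
    for (auto _ : state) {
        smatch match;
        benchmark::DoNotOptimize(regex_match(subject, match, re));
    }
}

BENCHMARK_CAPTURE(bm_match_sequence_of_as, "a*", "a*")->Arg(100)->Arg(200)->Arg(400);
BENCHMARK_CAPTURE(bm_match_sequence_of_as, "(a)*", "(a)*")->Arg(100)->Arg(200)->Arg(400);

BENCHMARK_MAIN();
```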

@StephanTLavavej StephanTLavavej removed their assignment Nov 18, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Nov 18, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Nov 19, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 44c4abc into microsoft:main Nov 19, 2025
44 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Nov 19, 2025
@StephanTLavavej
Member

Thanks for this latest entry in the Regex Cinematic Universe! 🦸 🎥 🍿
