-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<regex>: Remove capture extent vectors from stack frames
#5865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<regex>: Remove capture extent vectors from stack frames
#5865
Conversation
b481d3f to
36044b4
Compare
I thought about this again and it turns out this claim is wrong for positive lookahead assertions: A positive lookahead assertion might be surrounded by a loop and thus be reentrant, and then the values of the begin and end iterators do have to be restored even if the capturing groups are unmatched when the positive lookahead assertion is entered again. I fixed this and also added some tests that fail when the begin and end iterators are not restored correctly during backtracking. (This also shows that the test coverage is still lacking for lookahead assertions, particularly when they are placed inside loops.) |
| _Frames[_Frame_idx]._Match_state._Cur = _Group._Begin; | ||
| _Group._Begin = _Tgt_state._Cur; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No change requested: I observe that _STD exchange could be used for this code pattern, although we don't universally use it. (This also occurs immediately below for _N_end_capture.)
| using namespace std; | ||
| using namespace regex_constants; | ||
|
|
||
| void bm_match_sequence_of_as(benchmark::State& state, const char* pattern, syntax_option_type syntax = ECMAScript) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No change requested: The syntax is never customized, but I see that it's imitating regex_search.cpp, and I suppose it's not too confusing to leave as-is.
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
|
Thanks for this latest entry in the Regex Cinematic Universe! 🦸 🎥 🍿 |
Until now, the full state of capturing groups was stored in each stack frame in two vectors. This is wasteful: Capturing groups usually don't change completely between two stack frames, and it results in two allocations per stack frame when the regex contains at least one capturing group, even if the associated unwinding opcode does not actually make use of the stored capturing group state.
This PR removes one of the two vectors stored in stack frames, the vector that stores the extents of the capturing groups. It is replaced by two new unwinding opcodes,
_Capture_restore_beginand_Capture_restore_end, which are assigned to stack frames pushed on the stack while processing_N_captureand_N_capture_endnodes for the purpose of restoring the capturing group extents during unwinding. This means that changes to the extents of capturing groups are now represented as new frames on the stack and no longer in vectors stored in each stack frame.Note that this restoration of the extents is always performed in the general stack unwinding loop, whether the regex matched successfully or not. This is necessary to support leftmost-longest mode correctly, which has to keep trying matches even when a successful match has already been found. But we do not want this to happen when the regex or a positive lookahead assertion matches successfully in ECMAScript mode, otherwise all capturing group information would be erased in the final result. But this is mostly no longer an issue following #5828 and #5835 because stack unwinding is now skipped in these cases. (Skipping for negative lookahead assertions is not required for this, but it was a natural byproduct in #5835.)
For positive lookahead assertions, though, we have to make an adjustment: While it is guaranteed that the capture groups inside the assertion are unmatched when processing the assertion is started, the ranges of the capturing groups might become meaningful again while backtracking. This means we have to restore the begin and end iterators of capturing groups correctly during backtracking, so we have to retain the
_Capture_restore_beginand_Capture_restore_endframes. However, we should still throw out all other stack frames that got pushed while the lookahead assertion was processed.We do not have to do the same for negative lookahead assertions because the capturing groups are always unmatched when the lookahead assertion was completed (whether by succeeding or failing), so the begin and end iterators are meaningless.
Benchmarks
First, the results of the existing
regex_searchbenchmark:I guess these look as if this change is actually a pessimization. But the worse performance in the most affected benchmarks happen for two reasons:
(bibe): The number of stack frames increases from zero to two (and thus from zero stack allocations to two).(a)*: Per iteration, there is one allocation less because the extent vector is gone, but there are also two stack frames and thus two allocations more because each of them stores a validity vector. In total, this means that this regex results in one more allocation per iteration than before.The first cause can be remedied by implementing some kind of small vector optimization, the second by removing the validity vectors as well (which is what the next PR will do).
In the
regex_searchbenchmark specifically, four more allocations are done for regexes(bibe)and(bibe)+perregex_search()call, which is the main reason for the observed worse running time.To see that this change can actually substantially improve the running time for longer matches, I added a new benchmark for
regex_match()matching a long sequence of a's with different regexes:This shows performance improvement across the board, except for the mentioned problematic regexes iterating a capturing group surrounding a simple repeated pattern like
(a)*.