Skip to content

Atomic sequence support? #108

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Andersama opened this issue Sep 12, 2020 · 9 comments
Open

Atomic sequence support? #108

Andersama opened this issue Sep 12, 2020 · 9 comments

Comments

@Andersama
Copy link
Contributor

Atomic groups much like possessive modifiers assist in preventing excessive backtracking.

If I'm not too far off I think this roughly models an atomic group*. How to modify pcre.hpp to add the ll1 parser is a bit beyond me.

// matching atomic sequence in patterns
template <typename R, typename Iterator, typename EndIterator, typename HeadContent, typename... TailContent, typename... Tail>
constexpr CTRE_FORCE_INLINE R evaluate(const Iterator begin, Iterator current, const EndIterator end, R captures, ctll::list<atomic_sequence<HeadContent, TailContent...>, Tail...>) noexcept {
	if constexpr (sizeof...(TailContent) > 0) {
		if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent, sequence<TailContent...>())) { //find the first match like a regular sequence*
			current = r1.get_end_position();
			return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
		} else {
			return not_matched;
		}
	} else { //if nothing follows an atomic group / sequence, (?>abc|def) it's like evaluating abc|def
		return evaluate(begin, current, end, captures, ctll::list<HeadContent, Tail...>());
	}
}
@Andersama
Copy link
Contributor Author

Andersama commented Oct 28, 2020

Looking back, slightly off, the function would end up looking like:

template <typename R, typename Iterator, typename EndIterator, typename HeadContent, typename... TailContent, typename... Tail>
constexpr CTRE_FORCE_INLINE R evaluate(const Iterator begin, Iterator current, const EndIterator end, R captures, ctll::list<atomic_sequence<HeadContent, TailContent...>, Tail...>) noexcept {
	if constexpr (sizeof...(TailContent) > 0) {
		if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent, atomic_sequence<TailContent...>>())) { //find the first match like a regular sequence*
			current = r1.get_end_position();
			return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
		} else {
			return not_matched;
		}
	} else {
		if (auto r1 = evaluate(begin, current, end, captures, ctll::list<HeadContent>())) {
			current = r1.get_end_position();
			return evaluate(begin, current, end, captures, ctll::list<Tail...>()); //continue w/o being able to backtrack past the atomic group
		} else {
			return not_matched;
		}
	}
}

@hanickadot
Copy link
Owner

I looked into it and I think this is all needed for atomic support:
ce4e8c0#diff-b459c698ea25ff48eed5ed8109277cb61094f779b437a8942123840d4db469e0R387-R399

@Andersama
Copy link
Contributor Author

Does it not change how certain things inside behave? Wouldn't it change sequences to atomic groups etc? Be glad to be wrong, especially if it was that easy.

@hanickadot
Copy link
Owner

Based on this: https://www.regular-expressions.info/atomic.html the atomic_group behaves as starts_with function, and the rest of regex continues where the inner part ended. But I can be wrong.

@Andersama
Copy link
Contributor Author

Andersama commented Nov 9, 2020

The examples it gives w/ a select would probably be a good static assert to test against:
a(?>bc|b)c and \b(integer|insert|in)\b

Playing around with a similar example on regex101: a(?>a*(?:bc|b))c
abc (no match)
abcc (matches)
aabc (no match)
aabcc (matches)

I think that means it's changing how the parts inside (?>) function. The select inside the atomic group never reaches the second choice (because the first character overlaps). It appears anything inside an atomic group gets the atomic behavior.

@hanickadot
Copy link
Owner

These tests added 263cb1f65fbfe772c5b3748ca6c180cafa61fb2b. Btw that's exactly the essence of bug #136 where inner part of loops behaves as atomic, which is nice optimization but exactly cases like (a|ab)+ fails on ab

@Andersama
Copy link
Contributor Author

Andersama commented Nov 9, 2020

Should've probably seen that coming, an atomic group would behave differently if mutually exclusive or not.

@Andersama
Copy link
Contributor Author

Andersama commented Jan 6, 2021

I ran into this issue when writing optimizations for select*

The intuition / fix is something along the lines of this:
select<sequence<T...>,sequence<T...,B...>,...>
->
select<sequence<T...,optional<B...>>,...>

I think I spotted some code of yours to do w/ testing against select<>'s and empty/epsilons, it's roughly equivalent to:
select<sequence<T...>,select<epsilon, sequence<B...>>>

Also makes the other way around make sense as to why that works, (ab|a)+ is a one to one of (ab?)+

@Andersama
Copy link
Contributor Author

@hanickadot I'm not exactly the best template meta programmer out there*, however if there were a way to order the select<> branches by their starting characters ranges and then by their expression lengths that would be one way to handle this. Might be possible by doing some character length analysis like I've done in #166

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants