Update Grammar in Architecture #41
Comments
How exciting to have a Rust port! ❤️ This looks good to me. Any feedback @mpkorstanje? @ilslv would you be interested in donating your parser implementation to this repo (under a |
@aslakhellesoy my implementation may be incompatible with the test suite in this repository. As far as I can see, tests under |
Do you think it would be possible to make your implementation compatible with the official test suite? If not, do you think there is something we could change in the official implementation's test suite (or implementation) in order to make it easier for your implementation to be compatible? |
@aslakhellesoy you added the EBNF phrase at 0875a40#diff-8f6366fd8e7a5fa30f7f05879999195b7cf5fa1e3d157822eb37f0e91d226cfdR7. Note that the code highlighting should still use EBNF because that makes the pseudo-grammar somewhat readable but the grammar itself is not EBNF. |
Yes, but it would introduce some performance overhead, which we would like to avoid.
It would be possible, but I don't think it is worth the effort. As I mentioned, only a subset of the official test suite is incompatible. The most important tests for final matching and regex transformation are ok. I even think that only this test case would be incompatible. This happens because other implementations are using current |
@ilslv the grammar as written is context-sensitive and not in EBNF form (EBNF also can't express context sensitive grammars).
Users of Cucumber are not typically familiar with the concepts involved in parsers or grammars. As such, the typical error messages from a parser are incomprehensible. By accepting a superset of all Cucumber expressions we can provide clearer error messages for typical mistakes, such as using optional parameter types, empty optionals, etc., by pointing at the entire AST node that is in the wrong place rather than at a single character.
This puts each character into a single AST node. This isn't great for languages with garbage collection and makes debugging much harder.
It is no longer obvious that
The Now I've only given this a cursory review, but other than the issues mentioned above I don't see any immediate technical problem with the provided EBNF. Might be worth updating the documentation somewhat. |
Yes, but only if you are using some sort of parser generator. The Rust implementation produces exactly the same errors, which can point at the troublesome character or range of characters. All this while preserving minimal overhead (1 dynamic allocation on every nested expression, and even that can be avoided, I just don't think it is worth the effort for now).
There is no need to translate provided grammar 1-to-1. I propose adding it only to provide some formal specification for the My proposal is the following: let's add a formal EBNF specification for the language while preserving the old one, and add documentation saying that the old one exists only as an implementation suggestion. This would allow the Rust implementation (or any other one) to strictly follow it, because for now there is no way for me to reason that the Rust implementation is correct, as there is no formal specification for |
No. If we only have a formal specification without a closely matching implementation and test set in this repository, we are only going to see uncontrolled drift. Though I don't believe a formal specification has to be in EBNF. As long as the context-sensitive form can be procedurally rewritten to EBNF, it can serve as a formal specification for your purposes. |
It looks like there is a misunderstanding, as I never proposed to remove anything, only to add a formal specification. Let me sum up everything discussed in this issue. What are the proposed changes? I would like to see a formal specification with some context-free grammar. The original issue shows the EBNF example.
It's not obvious, but
text-in-alternative = (- alternative-to-escape) | ('\', alternative-to-escape)
alternative-to-escape = ' ' | '(' | '{' | '/' | '\'
You have to escape whitespace for it to match. I agree that it's hard to spot. That's why I'm proposing only to add a formal specification, leaving the existing one as one of the implementation possibilities. Why do we need this change? As there is currently no formal specification, only some non-existing grammar, I can't reason about why the Rust implementation is correct. Why not use the existing guidance from
|
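To make the escaping rule quoted above concrete, here is a minimal sketch of a scanner for `text-in-alternative`. The function name `text_in_alternative` and the constant `TO_ESCAPE` are hypothetical and are not taken from any existing implementation; the sketch only assumes the two rules as quoted, so a backslash followed by a character outside `alternative-to-escape` simply stops the scan.

```rust
/// Characters that must be escaped inside an alternative, per the quoted rule.
const TO_ESCAPE: [char; 5] = [' ', '(', '{', '/', '\\'];

/// Consumes the longest prefix of `input` made of `text-in-alternative`
/// items and returns the unescaped text plus the unconsumed remainder.
fn text_in_alternative(input: &str) -> (String, &str) {
    let mut text = String::new();
    let mut chars = input.char_indices().peekable();
    while let Some(&(idx, c)) = chars.peek() {
        if c == '\\' {
            // A backslash only matches when an escapable character follows it.
            let mut lookahead = chars.clone();
            lookahead.next();
            match lookahead.next() {
                Some((_, next)) if TO_ESCAPE.contains(&next) => {
                    text.push(next);
                    chars = lookahead;
                }
                _ => return (text, &input[idx..]),
            }
        } else if TO_ESCAPE.contains(&c) {
            // An unescaped reserved character ends the text.
            return (text, &input[idx..]);
        } else {
            text.push(c);
            chars.next();
        }
    }
    (text, "")
}
```

With this reading, an unescaped space ends the alternative text, while `\ ` contributes a literal space to it.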
@ilslv could you please send a pull request with your proposed changes? |
There is no misunderstanding. From the discussion in cucumber-rs/cucumber-expressions#1 I see that you have been asked to implement the parser in a way that closely matches the specification. You are now looking to update the specification to match your implementation. However, the proposed grammar does not accept the superset of all valid Cucumber expressions. As such we do not have a reference implementation or test set that accurately describes it. Additionally, if the EBNF cannot be derived from the context-sensitive grammar, it cannot be said to be correct. If it can be derived from the context-sensitive grammar, it need not be explicitly included. Currently the EBNF can't be derived from the context-sensitive grammar because the latter is not well specified. So I'm specifically requesting a fix to the context-sensitive grammar rather than the addition of an otherwise untested, unverified (in this repo) piece of specification. |
No, I'm trying to introduce formal specification as there is none yet.
Yes, that's the point to describe
Yes, because
There is no notion of correctness right now, as no formal specification exists.
My grammar is context-free. Can you please provide concrete examples, where my attempt at formalisation fails? |
@ilslv @mpkorstanje it seems to me that the discussion is becoming a little bit overheated. Let's sum it up with facts and conclusions, to push it in a constructive direction. The current grammar described in
This imposes problems for alternative implementations:
We'd really like to have implementation freedom, while preserving the ability to formally check correctness. For this purpose @ilslv proposed to add a "formal spec" grammar, which is context-free and implementation-free, but describes the actual But @mpkorstanje is against having a formal spec without an ability to verify whether implementations do follow it, as it will introduce an obvious drift between the actual spec and its implementations. Please, correct me if I'm wrong. So, from my point of view: Are there any arguments against this way? Or is there any issue that I'm missing? |
@tyranron thanks for pulling this discussion together and @ilslv thanks for all your contributions so far. I share Aslak's excitement at seeing a Cucumber in Rust! If people would like to talk about this in a more real-time context, we can be found on the https://cucumber.io/community#slack in the #committers channel. We also have a regular Zoom call on Thursdays at 7am PST (4pm CET) that you'd be more than welcome to join us for. If you hop into the Slack we can share the Zoom link in there or add you to the invite. |
@mattwynne thanks for the invitation! But I'd like to have the discussion here, mainly for 2 reasons:
If the issue isn't resolved by Thursday, I think I will be able to join you on the call. Calls tend to be more constructive, as time is limited 😅 |
@ilslv no, my request is very simple: to fix the context-sensitive grammar in such a way that the EBNF can be logically derived from it. I'm not opposed to replacing the context-sensitive grammar with an EBNF if the test suite and implementations in this repository follow suit. However, that is a significant commitment. As such, fixing the context-sensitive grammar in such a way that the EBNF can be logically derived from it should be a reasonable compromise. Third-party implementations will be able to use a "formal" grammar while this project retains the canonical grammar with closely matching implementation and tests. In the long term we can then look at improving the test suite, grammar and implementations. |
I don't understand what We have 2 grammars:
What do you mean by I still don't understand what do you actually want to |
I'm looking to retain a close similarity between the grammar, tests and implementation in this repo, while also making it possible for you to derive a grammar in EBNF form without asking you to also change the implementations and tests in this repo. Now, for our purposes we can say that grammars are equivalent if they accept the same language. So if the language accepted by Cucumber expressions is context-free, you should be able to derive the context-free grammar from the context-sensitive grammar in a structured manner (or be able to prove equivalence in another way). Currently I don't believe this is possible, because the context-sensitive grammar contains several errors and ambiguities. This inability to derive a context-free grammar from the context-sensitive one is what I'm asking you to fix. Additionally, the fixed grammar should still be context-sensitive and accept the superset of Cucumber expressions with optional parameter types, empty optionals, etc. So once fixed we will have one context-sensitive grammar that matches our implementation and tests, from which you can logically derive a context-free one. |
All in all we have different intentions. You just want to fix the existing superset grammar and, by doing so, force implementations to be suboptimal to be somehow |
I don't think it is impossible. This isn't the general case of rewriting any arbitrary grammar, but rather the case of rewriting this specific grammar. Assuming your EBNF is correct, we know a context-free form exists. But if you do think it is impossible, then the only mutually acceptable alternative is to rewrite the grammar, tests, and existing implementations. |
So the least resource-intensive, but good enough, solution would be to have a programmable way of mapping the "formal spec" AST into the current one, or vice versa, or even to have another reduced representation that both the "formal spec" AST and the current AST can be translated into for correct comparison. This will allow us to fully reuse the existing implementation and tests. |
Simply mapping from formal spec into superset grammar won't give us much info. It's almost the same as having grammar that accepts only empty string and will always map in superset. The only thing, that will bring us to being somewhat correct is to feed all implementations some input from manual tests and fuzzers and compare resulting |
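The idea of feeding the same inputs to several implementations and comparing the results can be sketched as a small differential test. Everything below is hypothetical: the two recognizers are stand-ins for independent implementations, they cover only an assumed reading of the `parameter` rule (`"{"`, any characters other than braces, `"}"`), and the exhaustive enumeration over a tiny alphabet stands in for a real fuzzer.

```rust
/// Hand-written state machine for the assumed rule `"{", name*, "}"`.
fn parameter_state_machine(input: &str) -> bool {
    let mut chars = input.chars();
    if chars.next() != Some('{') {
        return false;
    }
    let mut closed = false;
    for c in chars {
        if closed || c == '{' {
            return false; // trailing input or nested opening brace
        }
        if c == '}' {
            closed = true;
        }
    }
    closed
}

/// Independent formulation of the same rule using slice operations.
fn parameter_by_slicing(input: &str) -> bool {
    input
        .strip_prefix('{')
        .and_then(|rest| rest.strip_suffix('}'))
        .map_or(false, |name| !name.contains(|c: char| c == '{' || c == '}'))
}

/// Every string over `alphabet` with at most `max_len` characters.
fn all_inputs(alphabet: &[char], max_len: usize) -> Vec<String> {
    let mut all = vec![String::new()];
    let mut frontier = vec![String::new()];
    for _ in 0..max_len {
        let mut next = Vec::new();
        for prefix in &frontier {
            for &c in alphabet {
                let mut s = prefix.clone();
                s.push(c);
                next.push(s);
            }
        }
        all.extend(next.iter().cloned());
        frontier = next;
    }
    all
}

/// Inputs on which the two recognizers give different answers.
fn disagreements(inputs: &[String]) -> Vec<&String> {
    inputs
        .iter()
        .filter(|s| parameter_state_machine(s.as_str()) != parameter_by_slicing(s.as_str()))
        .collect()
}
```

An empty result from `disagreements` is evidence of agreement on the sampled inputs only, not a proof that two grammars accept the same language.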
I also noticed one interesting implementation detail. Whitespace is treated as some sort of special case
---
expression: three hungry/blind mice
expected_ast:
  type: EXPRESSION_NODE
  start: 0
  end: 23
  nodes:
    - type: TEXT_NODE
      start: 0
      end: 5
      token: three
    - type: TEXT_NODE
      start: 5
      end: 6
      token: " "
    - type: ALTERNATION_NODE
      start: 6
      end: 18
      nodes:
        - type: ALTERNATIVE_NODE
          start: 6
          end: 12
          nodes:
            - type: TEXT_NODE
              start: 6
              end: 12
              token: hungry
        - type: ALTERNATIVE_NODE
          start: 13
          end: 18
          nodes:
            - type: TEXT_NODE
              start: 13
              end: 18
              token: blind
    - type: TEXT_NODE
      start: 18
      end: 19
      token: " "
    - type: TEXT_NODE
      start: 19
      end: 23
      token: mice
The Rust implementation does exactly the same thing. So it looks like the existing implementations already work around the context-sensitive grammar by being context-free. This can mean that we no longer can call them |
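As a sanity check on the expected AST above, the node offsets can be verified against the expression text. The `Node` and `Kind` types below are hypothetical simplifications, not the types of any existing implementation, and the sketch assumes byte offsets, which coincide with character offsets for this ASCII-only expression.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Expression,
    Text,
    Alternation,
    Alternative,
}

#[derive(Debug)]
struct Node {
    kind: Kind,
    start: usize,
    end: usize,
    token: Option<&'static str>,
    nodes: Vec<Node>,
}

/// Leaf node whose span is derived from its start offset and token length.
fn text(start: usize, token: &'static str) -> Node {
    Node { kind: Kind::Text, start, end: start + token.len(), token: Some(token), nodes: vec![] }
}

/// Inner node with an explicit span and children.
fn branch(kind: Kind, start: usize, end: usize, nodes: Vec<Node>) -> Node {
    Node { kind, start, end, token: None, nodes }
}

/// The expected AST for `three hungry/blind mice`, transcribed from the test data.
fn three_hungry_blind_mice() -> Node {
    branch(Kind::Expression, 0, 23, vec![
        text(0, "three"),
        text(5, " "),
        branch(Kind::Alternation, 6, 18, vec![
            branch(Kind::Alternative, 6, 12, vec![text(6, "hungry")]),
            branch(Kind::Alternative, 13, 18, vec![text(13, "blind")]),
        ]),
        text(18, " "),
        text(19, "mice"),
    ])
}

/// Checks that every leaf token equals the slice it claims to cover and
/// that every child span lies inside its parent span.
fn spans_are_consistent(expr: &str, node: &Node) -> bool {
    if node.start > node.end || node.end > expr.len() {
        return false;
    }
    match node.token {
        Some(token) => expr.get(node.start..node.end) == Some(token),
        None => node.nodes.iter().all(|child| {
            child.start >= node.start
                && child.end <= node.end
                && spans_are_consistent(expr, child)
        }),
    }
}
```

Note that the whitespace before and after the alternation sits in its own `TEXT_NODE`, outside the `ALTERNATION_NODE` span (6 to 18), which is the detail the comment above points out.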
I'm not sure about the point you are making. As we've already established, we can assume the language accepted by the grammar to be context-free. The grammar however is written in context sensitive form for clarity. I'm open to updating the grammar provided the implementations match the grammar. E.g this would have to change: Lines 129 to 146 in c399877
|
Your previous argument was about not being able to change the grammar into being context-free, as current implementations are based. I, on the other hand, am saying that this argument doesn't make sense to me if the current implementations already don't follow the provided grammar. I've finally looked into the implementation, and from what I can see either the grammar is more broken than I thought, or the implementation is wrong. Lines 129 to 146 in c399877
Here we can see a call into Lines 214 to 238 in c399877
As I understand it, the algorithm is as follows:
If so, this is not how lookahead is defined at So, if I understand the algorithm correctly, we have 3 possible cases:
Which is it? |
I'd say #3 is the most accurate, but again, I don't see how this changes any of the conclusions made so far. If we change the grammar, we change the implementations and tests to match. |
@mpkorstanje this is significant, because real lookahead requires a context-sensitive grammar, while the implemented one doesn't. I believe that it's possible to formalise a pretty easy-to-understand context-free grammar which is fully compliant with the implementation. |
Can you update the implementation and grammar to reflect that? |
@mpkorstanje I'm not too comfortable with any of the languages the other implementations are using. But the more important thing is that we don't have to update the implementation right away, as the provided grammar is equivalent to the implemented one. |
Even more
cucumber-expression = ( alternation | optional | parameter | text )*
alternation = (?<=left-boundary), alternative*, ( "/" + alternative* )+
left-boundary = whitespace | "}" | ^
alternative = optional | text
optional = "(", option*, ")"
option = optional | parameter | text
parameter = "{", name*, "}"
name = whitespace | .
text = whitespace | ")" | "}" | .
is equivalent to
cucumber-expression = ( alternation | optional | parameter | text )*
alternation = alternative*, ( "/" + alternative* )+
alternative = optional | text
optional = "(", option*, ")"
option = optional | parameter | text
parameter = "{", name*, "}"
name = whitespace | .
text = whitespace | ")" | "}" | .
This happens because Final grammar which is equivalent to the existing one is just relaxed version of my original proposal. By |
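For readers unfamiliar with the `(?<=left-boundary)` notation in the first grammar, here is a minimal sketch of what such a lookbehind check amounts to. The function `at_left_boundary` is hypothetical and only illustrates the `left-boundary = whitespace | "}" | ^` rule as written; it takes no position on whether the check is redundant, which is the claim being made above.

```rust
/// Returns true when byte position `pos` in `input` is preceded by a
/// left boundary: the start of input (`^`), whitespace, or `}`.
/// Positions that are out of range or not on a character boundary
/// are treated as "not a boundary".
fn at_left_boundary(input: &str, pos: usize) -> bool {
    match input.get(..pos) {
        None => false,
        Some(before) => match before.chars().next_back() {
            None => true, // nothing precedes `pos`: start of input
            Some(c) => c.is_whitespace() || c == '}',
        },
    }
}
```

The check looks only at the single character before `pos`; it never consumes input.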
Cheers! We can use this info to update the grammar and implementations sometime in the future then. |
Fixing the grammar / parser to use the |
How do you "span" the invalid subtrees in error messages without a fully parsed AST? Example:
The parser would not fail at the first |
This one is pretty easy to tackle: If I've started to parse So in case of Also helpful is the idea of splitting errors into recoverable ones and irrecoverable ones. As the name suggests, a recoverable error allows us to continue parsing with another parser. For example in |
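The split between recoverable and irrecoverable errors can be sketched without any parser library. All names below (`ParseError`, `parameter`, `text`, `parameter_or_text`) are hypothetical and simplified; the sketch assumes that failing to see an opening `{` is recoverable, while seeing `{` without a closing `}` commits the parser and is a hard failure.

```rust
#[derive(Debug, PartialEq)]
enum ParseError {
    /// This parser does not apply here; another parser may be tried.
    Recoverable(String),
    /// The input is definitely malformed; stop parsing.
    Failure(String),
}

/// On success: the matched content and the remaining input.
type ParseResult<'a> = Result<(&'a str, &'a str), ParseError>;

/// Parses `{name}`; once `{` has been seen, a missing `}` is a hard failure.
fn parameter(input: &str) -> ParseResult<'_> {
    let body = match input.strip_prefix('{') {
        Some(body) => body,
        None => return Err(ParseError::Recoverable("not a parameter".into())),
    };
    match body.find('}') {
        Some(end) => Ok((&body[..end], &body[end + 1..])),
        None => Err(ParseError::Failure("unclosed parameter".into())),
    }
}

/// Parses a non-empty run of characters up to the next brace.
fn text(input: &str) -> ParseResult<'_> {
    let end = input
        .find(|c: char| c == '{' || c == '}')
        .unwrap_or(input.len());
    if end == 0 {
        Err(ParseError::Recoverable("not text".into()))
    } else {
        Ok((&input[..end], &input[end..]))
    }
}

/// Tries `parameter` first and falls back to `text` only on a recoverable error.
fn parameter_or_text(input: &str) -> ParseResult<'_> {
    match parameter(input) {
        Err(ParseError::Recoverable(_)) => text(input),
        other => other,
    }
}
```

The design choice is that `Failure` short-circuits the fallback, so `{int cukes` is reported as an unclosed parameter instead of being silently reinterpreted as text.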
Can you show an example of that? |
@mpkorstanje Rust syntax is quite alien-looking, but here you go |
Mmh. This error message doesn't look quite right: |
I also don't see where you would construct the |
I've united all
Actually I don't do it. I allow users to supply string wrapped in a |
So, do you want to update the grammar to exact |
So far I don't see how I can generate the error messages with the |
@mpkorstanje I've already described solution with LocatedSpan. Codesample:
let err = Expression::regex("I have {int cucumbers in my belly").expect_err();
match err {
    Error::UnescapedReservedCharacter(e) => {
        assert_eq!(*e, "{");
        assert_eq!(
            repeat(' ')
                .take(e.offset())
                .chain(once('^'))
                .chain(repeat('-').take(e.len()))
                .chain(once('^')),
            " ^-^".chars(),
        )
    }
    e => panic!("wrong err: {}", e),
}
let err = Expression::regex("I have ({int}) cucumbers in my belly").expect_err();
match err {
    Error::ParameterInOptional(e) => {
        assert_eq!(*e, "int");
        assert_eq!(
            repeat(' ')
                .take(e.offset())
                .chain(once('^'))
                .chain(repeat('-').take(e.len()))
                .chain(once('^')),
            " ^---^".chars(),
        )
    }
    e => panic!("wrong err: {}", e),
}
More code examples are available here |
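The iterator chain in the sample above builds a marker line from an offset and a length. A standalone sketch of that same construction, collected into a `String`, could look like this; the function name `underline` is hypothetical, and the offsets used in any real error come from the parser, not from this helper.

```rust
use std::iter::{once, repeat};

/// Builds a marker line: `offset` spaces, a caret, `len` dashes, a caret.
fn underline(offset: usize, len: usize) -> String {
    repeat(' ')
        .take(offset)
        .chain(once('^'))
        .chain(repeat('-').take(len))
        .chain(once('^'))
        .collect()
}
```

Printed on the line below the expression, `underline(offset, len)` visually marks the span an error refers to.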
Unfortunately that doesn't show how to compute the bounds that go into the span. |
@mpkorstanje it does exactly that.
|
So where do those values come from? |
@mpkorstanje as I mentioned before, implementation abstracts over input type: https://github.com/ilslv/cucumber-expressions-1/blob/main/src/parse.rs#L427-L435 |
@mpkorstanje the idea is that we carry the location information along while descending in a parser.
|
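Carrying location information along while descending can be sketched with a tiny span type. This is a hypothetical, simplified stand-in for a located span and not the type used by the Rust implementation: it pairs the remaining fragment with its byte offset in the original input, so that any sub-slice handed to an error still knows where it came from.

```rust
/// A fragment of the input together with its byte offset in the original string.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span<'a> {
    fragment: &'a str,
    offset: usize,
}

impl<'a> Span<'a> {
    fn new(input: &'a str) -> Self {
        Span { fragment: input, offset: 0 }
    }

    /// Keeps the first `n` bytes; the offset is unchanged.
    fn take(self, n: usize) -> Self {
        Span { fragment: &self.fragment[..n], offset: self.offset }
    }

    /// Drops the first `n` bytes and advances the offset accordingly.
    fn skip(self, n: usize) -> Self {
        Span { fragment: &self.fragment[n..], offset: self.offset + n }
    }
}

/// Descends to the first `{...}` and returns a span covering it, braces included.
fn first_parameter(input: Span<'_>) -> Option<Span<'_>> {
    let open = input.fragment.find('{')?;
    let from_open = input.skip(open);
    let close = from_open.fragment.find('}')?;
    Some(from_open.take(close + 1))
}
```

Because every `skip` updates the offset, the bounds of an error span fall out of ordinary parsing steps and never need to be recomputed from a finished AST.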
So when encountering a
Best I can find is:
Some('{') => {
    if let Ok((_, par)) = peek(parameter)(input.clone()) {
        return Error::ParameterInOptional(
            input.take(par.0.input_len() + 2),
        )
        .failure();
    }
    return Error::UnescapedReservedCharacter(input.take(1))
        .failure();
}
And it looks like |
already described here
yes, it tries to parse Can you please formulate all the questions about Rust implementation, so I can tackle them all at once in depth? Because it looks like we are walking in circles. |
Right so you are parsing invalid ASTs and rejecting them early. This rejection seems rather ad-hoc and I don't see a good reason to do it. The moment we'd add a test case with multiple AST problems you'd have to change the entire architecture. For example:
|
Most other parsers do exactly the same thing. This is not an ad-hoc solution, but the only possible way to implement a zero-cost, performant parser. This entire conversation still hasn't produced a good reason to parse the entire superset AST.
No, we can still continue parsing, collect the errors and return them as a batch. This won't require architectural changes. The only reason I've implemented it this way is to stay close to the existing implementations, which error early too. Example Also I don't see error messages as a part of the language specification. |
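Collecting errors instead of stopping at the first one can be sketched with a single pass over the expression. The function `parameters_in_optionals` is hypothetical and deliberately naive: it ignores escaping and only reports the byte offset of every `{` that appears while at least one `(` is still open.

```rust
/// Returns the byte offsets of every `{` found inside an optional `( ... )`.
/// Assumption: escapes are ignored, so this is a scan, not a real parser.
fn parameters_in_optionals(expr: &str) -> Vec<usize> {
    let mut depth = 0usize;
    let mut offsets = Vec::new();
    for (idx, c) in expr.char_indices() {
        match c {
            '(' => depth += 1,
            ')' => depth = depth.saturating_sub(1),
            '{' if depth > 0 => offsets.push(idx),
            _ => {}
        }
    }
    offsets
}
```

Each returned offset could be turned into its own error, so one call reports every misplaced parameter in the expression at once.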
Would any of you be interested in an ensemble programming session where we work on this together? I feel like it might speed up the communication a bit if we could work on it in real time, and it could be fun! I know some people prefer working asynchronously though, so no pressure, I just thought I'd suggest it. If folks are interested I can do the work to figure out our schedules and find a good time. |
Fundamentally, to keep the implementation and spec close together this means rewriting the existing implementations. This takes time. I currently don't have that sort of time to commit to this. I'm being asked to make a judgement call about using the superset of Cucumber expressions or exact set in the EBNF. I can't make this judgement call. To cut the task down to size it might even be convenient to in a first iteration make the grammar context-free. And in a second iteration remove the redundant parts of the grammar. |
@mpkorstanje if I understand correctly, you want to make every PR atomic. This means that after every merge the implementation should closely resemble the grammar. I don't consider myself comfortable enough with any other implementation language, especially taking into account the number of users. So the only help I can offer is to provide the context-free grammar in separate pieces. So I see the following roadmap:
cucumber-expression = ( alternation | optional | parameter | text )*
alternation = alternative*, ( "/" + alternative* )+
alternative = optional | text
optional = "(", option*, ")"
option = optional | parameter | text
parameter = "{", name*, "}"
name = whitespace | .
text = whitespace | ")" | "}" | .
|
I'm implementing a Cucumber Expressions parser in Rust: WIP. The grammar described in the ARCHITECTURE.md really bothers me for a couple of reasons.
alternation definition
EBNF describes context-free grammars and (correct me if I'm wrong) shouldn't have lookaheads and lookbehinds. This also means that the described grammar may have non-linear complexity.
Cucumber Expressions and I don't really see reasons to do it.
So implementing a Cucumber Expressions parser according to the docs leads to unnecessary performance overhead for no obvious reason.
Can't we just provide the exact grammar for Cucumber Expressions without the Note: section?
If I didn't miss anything, the provided grammar should be exact Cucumber Expressions