-
-
Notifications
You must be signed in to change notification settings - Fork 63
Add some more named fields to the grammar #300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e1cb592
to
b8f1afb
Compare
script/file_sizes.txt
Outdated
src/grammar.json 0.3MB 11864 | ||
src/node-types.json 0.1MB 8495 | ||
src/parser.c 58.6MB 1848207 | ||
src/scanner.c 0.0MB 29 | ||
total 49.8MB 1572006 | ||
total 59.0MB 1868595 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it rather odd that adding field names has such a big impact on the parser size. @damieng do you have some insight why this is the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxbrunsfeld may be able to answer this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any insight, I'm a user of tree-sitter but have never peeked under the hood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this significant increase is only happening if field('attributes', $.attribute_list)
is added to the grammar.
I removed these, and now the size increase feels acceptable:
--- a/script/file_sizes.txt
+++ b/script/file_sizes.txt
@@ -1,5 +1,5 @@
-src/grammar.json 0.2MB 10996
-src/node-types.json 0.1MB 7704
-src/parser.c 41.0MB 1285944
+src/grammar.json 0.2MB 11676
+src/node-types.json 0.1MB 8276
+src/parser.c 42.5MB 1338560
src/scanner.c 0.0MB 37
-total 41.3MB 1304681
+total 42.9MB 1358549
Doing the opposite and removing all --- a/script/file_sizes.txt
+++ b/script/file_sizes.txt
@@ -1,5 +1,5 @@
-src/grammar.json 0.2MB 10954
-src/node-types.json 0.1MB 7580
-src/parser.c 47.9MB 1505318
+src/grammar.json 0.2MB 10098
+src/node-types.json 0.1MB 6307
+src/parser.c 38.8MB 1221658
src/scanner.c 0.0MB 37
-total 48.3MB 1523889
+total 39.1MB 1238100 |
Doing that though will break a lot of the usefulness of the project and break GitHub semantic use no? |
Yes, sure. I'm not saying we should do this. Only highlighting that field names contribute significantly to the parser size. |
Yeah it's weird, I didn't imagine they would add much at all to the size. |
It looks like not all field names cause issues: #314. |
f4d31d5
to
3f73386
Compare
3f73386
to
c1285b6
Compare
c1285b6
to
815c511
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, a few comments.
@@ -709,7 +715,7 @@ module.exports = grammar({ | |||
'var'), | |||
|
|||
array_type: $ => seq( | |||
field('type', $._array_base_type), | |||
field('element_type', $._array_base_type), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this name better, but I don't know if breaking changes are OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know either if breaking changes are okay. At the same time, element_type
matches ElementType
, which is used by Roslyn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will GitHub Semantic know that element_type is a type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, breaking changes should be possible if there's a good case. In this case, it doesn't seem to improve anything for downstream processing, so my preference would be to stick with what we have. These kind of changes do break all queries for this node written against this grammar, after all.
|
||
nullable_type: $ => seq($._nullable_base_type, '?'), | ||
nullable_type: $ => seq(field('element_type', $._nullable_base_type), '?'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
underlying_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roslyn uses ElementType
in int?
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference would be to stick with type
, in line with other places in the grammar.
@@ -736,7 +742,7 @@ module.exports = grammar({ | |||
$.tuple_type | |||
), | |||
|
|||
pointer_type: $ => seq($._pointer_base_type, '*'), | |||
pointer_type: $ => seq(field('element_type', $._pointer_base_type), '*'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
underlying_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above, Roslyn uses ElementType
in int*
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem, I'd stick with type
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to introduce these fields, that is really useful for downstream processing of the parse trees!
I have some remarks:
-
Existing fields for expressions use
value
for the field name. I would suggest sticking with that convention instead of usingexpression
. I feel a bit bad, because when marking them I realized there are many places that needs changing, but I think being consistent is important for long-term maintainability. -
For simple lists (where all the elements are form the same production), I suggest to drop the field name. It doesn't add much, because there are not different fields to distinguish, and it makes the trees very verbose to have them.
')' | ||
), | ||
|
||
attribute_argument: $ => seq( | ||
optional(choice($.name_equals, $.name_colon)), | ||
$._expression | ||
field('expression', $._expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My impression is that some other places use value
as the field name for expressions. Could we use that here as well?
), | ||
|
||
global_attribute_list: $ => seq( | ||
'[', | ||
choice('assembly', 'module'), | ||
':', | ||
commaSep($.attribute), | ||
commaSep(field('attribute', $.attribute)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it complicate any of your use cases to drop the field here? This is only a list of attributes, so all subterms are similar. It makes the parse tree a bit heavy handed if all of them also get a field name, IMO.
@@ -289,48 +295,48 @@ module.exports = grammar({ | |||
|
|||
variable_declaration: $ => seq( | |||
field('type', $._type), | |||
commaSep1($.variable_declarator) | |||
commaSep1(field('variable_declarator', $.variable_declarator)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit: I would perhaps suggest just declarator
here, to keep things somewhat compact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in a commaSep1
, shouldn't I instead remove the field name altogether based on your second suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a slightly different case because there are also other fields here. Since there's also the type
field, having the field names helps grabbing the right subterms when querying.
On the other hand, for a production similar to
$.foo = seq('(', commaSep1(field('bar', $.bar)), ')')
all the subterms will have field bar
. Matching (foo bar:(_)@bar)
and (foo (_)@bar)
will have the same result. Adding the field name does not help in distinguishing different subterms of foo
. In those cases, I suggest to omit it.
Hope that clarifies what I mean!
')' | ||
), | ||
|
||
argument: $ => prec(1, seq( | ||
optional($.name_colon), | ||
optional(choice('ref', 'out', 'in')), | ||
choice( | ||
field('expression', choice( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, can we use value
for consistency?
']' | ||
), | ||
|
||
tuple_pattern: $ => seq( | ||
'(', | ||
commaSep1(choice($.identifier, $.discard, $.tuple_pattern)), | ||
commaSep1(choice(field('name', $.identifier), $.discard, $.tuple_pattern)), | ||
')' | ||
), | ||
|
||
argument: $ => prec(1, seq( | ||
optional($.name_colon), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a name
field, I think.
|
||
ref_type_expression: $ => seq( | ||
'__reftype', | ||
'(', | ||
$._expression, | ||
field('expression', $._expression), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value
?
)), | ||
|
||
select_clause: $ => prec.right(PREC.SELECT, seq('select', $._expression)), | ||
select_clause: $ => prec.right(PREC.SELECT, seq('select', field('expression', $._expression))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value
?
field('group_expression', $._expression), | ||
'by', | ||
$._expression | ||
field('by_expression', $._expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, *_value
optional(choice('ascending', 'descending')) | ||
), | ||
|
||
where_clause: $ => seq('where', $._expression), | ||
where_clause: $ => seq('where', field('expression', $._expression)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem
@@ -1462,7 +1468,7 @@ module.exports = grammar({ | |||
'let', | |||
$.identifier, | |||
'=', | |||
$._expression | |||
field('expression', $._expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idem
@hendrikvanantwerpen Thanks for the review.
No worries that there are many places that need changing, as you said it's important for maintainability. Before I change them, I'll raise one additional concern: I think we're being consistent with the current [A(i = 5)]
abstract void M(int j = 6); |
@tamasvajk Perhaps I judged too quickly? If this grammar explicitly tries to follow Roslyn (or even de facto did until now) I am all for aligning with them! In that case, I would even consider changing fields to align with them, if it's only a few fields that deviate. |
The only reason I followed the Roslyn grammar was to make changes in future C# easier as we can follow what rules change. (It's been pretty useful for C# 9, 10, 11) If however that's getting in the way of size/usability we should deviate. |
I agree that it is valuable to follow the Roslyn grammar of feasible. Perhaps it makes sense to call this out in the readme, and make it a bit more official? Wording along @damieng's comment above should be fine, I think |
Thanks for all the feedback on this PR. I think it might be easier if we do this step-by-step. In the first step I'm introducing only the fields that have visible impact on some use-cases, such as highlighting. So I'm leaving the fields with |
Here's another followup PR to add fields to name-like constructs: #318 |
This PR adds some more named fields to the grammar. (I couldn't add
field('attributes', $.attribute_list)
because that significantly increases the parser size.)Additionally, the PR simplifies the highlighting of types by using the fields named
type
((_ type: (identifier) @type)
) instead of defining individually for each construct where types are located ((identifier) @type
in individual constructs).