Skip to content

Add some more named fields to the grammar #300

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

tamasvajk
Copy link
Collaborator

@tamasvajk tamasvajk commented Apr 4, 2023

This PR adds some more named fields to the grammar. (I couldn't add field('attributes', $.attribute_list) because that significantly increases the parser size.)

Additionally, the PR simplifies the highlighting of types by using the fields named type ((_ type: (identifier) @type)) instead of defining individually for each construct where types are located ((identifier) @type in individual constructs).

@tamasvajk tamasvajk force-pushed the add-named-fields branch 2 times, most recently from e1cb592 to b8f1afb Compare April 5, 2023 08:44
Comment on lines 1 to 5
src/grammar.json 0.3MB 11864
src/node-types.json 0.1MB 8495
src/parser.c 58.6MB 1848207
src/scanner.c 0.0MB 29
total 49.8MB 1572006
total 59.0MB 1868595
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it rather odd that adding field names has such a big impact on the parser size. @damieng do you have some insight why this is the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxbrunsfeld may be able to answer this as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any insight, I'm a user of tree-sitter but have never peeked under the hood.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this significant increase is only happening if field('attributes', $.attribute_list) is added to the grammar.

I removed these, and now the size increase feels acceptable:

--- a/script/file_sizes.txt
+++ b/script/file_sizes.txt
@@ -1,5 +1,5 @@
-src/grammar.json       0.2MB        10996
-src/node-types.json    0.1MB         7704
-src/parser.c           41.0MB     1285944
+src/grammar.json       0.2MB        11676
+src/node-types.json    0.1MB         8276
+src/parser.c           42.5MB     1338560
 src/scanner.c          0.0MB           37
-total                  41.3MB     1304681
+total                  42.9MB     1358549

@tamasvajk
Copy link
Collaborator Author

Doing the opposite and removing all field() calls from the grammar has significant impact on the parser size:

--- a/script/file_sizes.txt
+++ b/script/file_sizes.txt
@@ -1,5 +1,5 @@
-src/grammar.json       0.2MB        10954
-src/node-types.json    0.1MB         7580
-src/parser.c           47.9MB     1505318
+src/grammar.json       0.2MB        10098
+src/node-types.json    0.1MB         6307
+src/parser.c           38.8MB     1221658
 src/scanner.c          0.0MB           37
-total                  48.3MB     1523889
+total                  39.1MB     1238100

@damieng
Copy link
Collaborator

damieng commented Jun 15, 2023

Doing that though will break a lot of the usefulness of the project and break GitHub semantic use no?

@tamasvajk
Copy link
Collaborator Author

Doing that though will break a lot of the usefulness of the project and break GitHub semantic use no?

Yes, sure. I'm not saying we should do this. Only highlighting that field names contribute significantly to the parser size.

@damieng
Copy link
Collaborator

damieng commented Jun 15, 2023

Yeah it's weird, I didn't imagine they would add much at all to the size.

@tamasvajk
Copy link
Collaborator Author

Yeah it's weird, I didn't imagine they would add much at all to the size.

It looks like not all field names cause issues: #314.

@tamasvajk tamasvajk force-pushed the add-named-fields branch 3 times, most recently from f4d31d5 to 3f73386 Compare June 16, 2023 09:10
@tamasvajk tamasvajk marked this pull request as ready for review June 16, 2023 09:19
Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a few comments.

@@ -709,7 +715,7 @@ module.exports = grammar({
'var'),

array_type: $ => seq(
field('type', $._array_base_type),
field('element_type', $._array_base_type),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this name better, but I don't know if breaking changes are OK.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know either if breaking changes are okay. At the same time, element_type matches ElementType, which is used by Roslyn.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will GitHub Semantic know that element_type is a type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, breaking changes should be possible if there's a good case. In this case, it doesn't seem to improve anything for downstream processing, so my preference would be to stick with what we have. These kind of changes do break all queries for this node written against this grammar, after all.


nullable_type: $ => seq($._nullable_base_type, '?'),
nullable_type: $ => seq(field('element_type', $._nullable_base_type), '?'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underlying_type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roslyn uses ElementType in int?.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would be to stick with type, in line with other places in the grammar.

@@ -736,7 +742,7 @@ module.exports = grammar({
$.tuple_type
),

pointer_type: $ => seq($._pointer_base_type, '*'),
pointer_type: $ => seq(field('element_type', $._pointer_base_type), '*'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underlying_type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, Roslyn uses ElementType in int*.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem, I'd stick with type.

Copy link
Contributor

@hendrikvanantwerpen hendrikvanantwerpen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to introduce these fields, that is really useful for downstream processing of the parse trees!

I have some remarks:

  • Existing fields for expressions use value for the field name. I would suggest sticking with that convention instead of using expression. I feel a bit bad, because when marking them I realized there are many places that needs changing, but I think being consistent is important for long-term maintainability.

  • For simple lists (where all the elements are form the same production), I suggest to drop the field name. It doesn't add much, because there are not different fields to distinguish, and it makes the trees very verbose to have them.

')'
),

attribute_argument: $ => seq(
optional(choice($.name_equals, $.name_colon)),
$._expression
field('expression', $._expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression is that some other places use value as the field name for expressions. Could we use that here as well?

),

global_attribute_list: $ => seq(
'[',
choice('assembly', 'module'),
':',
commaSep($.attribute),
commaSep(field('attribute', $.attribute)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it complicate any of your use cases to drop the field here? This is only a list of attributes, so all subterms are similar. It makes the parse tree a bit heavy handed if all of them also get a field name, IMO.

@@ -289,48 +295,48 @@ module.exports = grammar({

variable_declaration: $ => seq(
field('type', $._type),
commaSep1($.variable_declarator)
commaSep1(field('variable_declarator', $.variable_declarator))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: I would perhaps suggest just declarator here, to keep things somewhat compact.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in a commaSep1, shouldn't I instead remove the field name altogether based on your second suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a slightly different case because there are also other fields here. Since there's also the type field, having the field names helps grabbing the right subterms when querying.

On the other hand, for a production similar to

$.foo = seq('(', commaSep1(field('bar', $.bar)), ')')

all the subterms will have field bar. Matching (foo bar:(_)@bar) and (foo (_)@bar) will have the same result. Adding the field name does not help in distinguishing different subterms of foo. In those cases, I suggest to omit it.

Hope that clarifies what I mean!

')'
),

argument: $ => prec(1, seq(
optional($.name_colon),
optional(choice('ref', 'out', 'in')),
choice(
field('expression', choice(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, can we use value for consistency?

']'
),

tuple_pattern: $ => seq(
'(',
commaSep1(choice($.identifier, $.discard, $.tuple_pattern)),
commaSep1(choice(field('name', $.identifier), $.discard, $.tuple_pattern)),
')'
),

argument: $ => prec(1, seq(
optional($.name_colon),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be a name field, I think.


ref_type_expression: $ => seq(
'__reftype',
'(',
$._expression,
field('expression', $._expression),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value?

)),

select_clause: $ => prec.right(PREC.SELECT, seq('select', $._expression)),
select_clause: $ => prec.right(PREC.SELECT, seq('select', field('expression', $._expression))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value?

Comment on lines +1493 to +1495
field('group_expression', $._expression),
'by',
$._expression
field('by_expression', $._expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, *_value

optional(choice('ascending', 'descending'))
),

where_clause: $ => seq('where', $._expression),
where_clause: $ => seq('where', field('expression', $._expression)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem

@@ -1462,7 +1468,7 @@ module.exports = grammar({
'let',
$.identifier,
'=',
$._expression
field('expression', $._expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem

@tamasvajk
Copy link
Collaborator Author

@hendrikvanantwerpen Thanks for the review.

Existing fields for expressions use value for the field name. I would suggest sticking with that convention instead of using expression. I feel a bit bad, because when marking them I realized there are many places that needs changing, but I think being consistent is important for long-term maintainability.

No worries that there are many places that need changing, as you said it's important for maintainability. Before I change them, I'll raise one additional concern: I think we're being consistent with the current expression vs value distinctions; we're following what Roslyn is doing. (I would need to check that I fully matched Roslyn everywhere) For example Roslyn uses Value for 6 and Expression for 5 in the below example. See in the Syntax Tree view here.

[A(i = 5)]
abstract void M(int j = 6);

@hendrikvanantwerpen
Copy link
Contributor

@tamasvajk Perhaps I judged too quickly? If this grammar explicitly tries to follow Roslyn (or even de facto did until now) I am all for aligning with them!

In that case, I would even consider changing fields to align with them, if it's only a few fields that deviate.

@damieng
Copy link
Collaborator

damieng commented Jun 16, 2023

The only reason I followed the Roslyn grammar was to make changes in future C# easier as we can follow what rules change. (It's been pretty useful for C# 9, 10, 11)

If however that's getting in the way of size/usability we should deviate.

@hendrikvanantwerpen
Copy link
Contributor

I agree that it is valuable to follow the Roslyn grammar of feasible. Perhaps it makes sense to call this out in the readme, and make it a bit more official? Wording along @damieng's comment above should be fine, I think

@tamasvajk
Copy link
Collaborator Author

Thanks for all the feedback on this PR. I think it might be easier if we do this step-by-step. In the first step I'm introducing only the fields that have visible impact on some use-cases, such as highlighting. So I'm leaving the fields with field('type'. I also renamed field('element_type' to field('type', which actually results in some highlighting improvement: #316

@tamasvajk
Copy link
Collaborator Author

Here's another followup PR to add fields to name-like constructs: #318

@tamasvajk tamasvajk closed this Jun 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants