Add some more named fields to the grammar #300

tamasvajk · 2023-04-04T13:15:53Z

This PR adds some more named fields to the grammar. (I couldn't add field('attributes', $.attribute_list) because that significantly increases the parser size.)

Additionally, the PR simplifies the highlighting of types by matching on the fields named type ((_ type: (identifier) @type)) instead of defining a pattern individually for each construct where types are located ((identifier) @type in individual constructs).
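To illustrate why a single field-based pattern can replace several per-construct patterns, here is a minimal sketch in plain JavaScript. It uses a toy tree, not the tree-sitter API: the node and child helpers, the sample tree, and captureTypes are all hypothetical names introduced only for this illustration.

```javascript
// Toy model of a syntax tree. NOT the tree-sitter API: every helper here is
// a hypothetical stand-in used only to illustrate field-based matching.
const node = (kind, children = [], text = null) => ({ kind, children, text });
const child = (n, field = null) => ({ field, node: n });

// Two different constructs that both expose their type under a "type" field.
const tree = node('compilation_unit', [
  child(node('parameter', [
    child(node('identifier', [], 'Foo'), 'type'),
    child(node('identifier', [], 'x'), 'name'),
  ])),
  child(node('variable_declaration', [
    child(node('identifier', [], 'Bar'), 'type'),
    child(node('variable_declarator', [
      child(node('identifier', [], 'y')),
    ])),
  ])),
]);

// Rough analogue of the query (_ type: (identifier) @type):
// the parent kind is ignored, only the field name and child kind matter.
function captureTypes(root) {
  const found = [];
  const visit = (n) => {
    for (const c of n.children) {
      if (c.field === 'type' && c.node.kind === 'identifier') {
        found.push(c.node.text);
      }
      visit(c.node);
    }
  };
  visit(root);
  return found;
}

console.log(captureTypes(tree)); // [ 'Foo', 'Bar' ]
```

The identifiers x and y are not captured because they do not sit in a type field, which is the distinction the per-construct patterns previously had to encode one parent at a time.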

tamasvajk · 2023-04-05T08:55:36Z

script/file_sizes.txt

+src/grammar.json       0.3MB        11864
+src/node-types.json    0.1MB         8495
+src/parser.c           58.6MB     1848207
 src/scanner.c          0.0MB           29
-total                  49.8MB     1572006
+total                  59.0MB     1868595


I find it rather odd that adding field names has such a big impact on the parser size. @damieng do you have any insight into why this is the case?

@maxbrunsfeld may be able to answer this as well.

I don't have any insight, I'm a user of tree-sitter but have never peeked under the hood.

Looks like this significant increase is only happening if field('attributes', $.attribute_list) is added to the grammar.

I removed these, and now the size increase feels acceptable:

--- a/script/file_sizes.txt
+++ b/script/file_sizes.txt
@@ -1,5 +1,5 @@
-src/grammar.json       0.2MB        10996
-src/node-types.json    0.1MB         7704
-src/parser.c           41.0MB     1285944
+src/grammar.json       0.2MB        11676
+src/node-types.json    0.1MB         8276
+src/parser.c           42.5MB     1338560
 src/scanner.c          0.0MB           37
-total                  41.3MB     1304681
+total                  42.9MB     1358549

tamasvajk · 2023-06-15T10:05:03Z

Doing the opposite and removing all field() calls from the grammar has a significant impact on the parser size:

--- a/script/file_sizes.txt
+++ b/script/file_sizes.txt
@@ -1,5 +1,5 @@
-src/grammar.json       0.2MB        10954
-src/node-types.json    0.1MB         7580
-src/parser.c           47.9MB     1505318
+src/grammar.json       0.2MB        10098
+src/node-types.json    0.1MB         6307
+src/parser.c           38.8MB     1221658
 src/scanner.c          0.0MB           37
-total                  48.3MB     1523889
+total                  39.1MB     1238100
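To put the numbers from this thread in relative terms, the following sketch computes the percentage change in src/parser.c line counts for two of the measurements reported in the file_sizes.txt diffs. The measurements array and the percentChange helper are names made up for this illustration; only the line counts come from the diffs.

```javascript
// Line counts of src/parser.c, copied from the file_sizes.txt diffs in this
// thread. The labels and helper names are illustrative assumptions.
const measurements = [
  { change: 'add fields (without attributes)', before: 1285944, after: 1338560 },
  { change: 'remove all field() calls', before: 1505318, after: 1221658 },
];

// Relative change in percent; negative means the file got smaller.
const percentChange = (before, after) => ((after - before) / before) * 100;

for (const m of measurements) {
  console.log(`${m.change}: ${percentChange(m.before, m.after).toFixed(1)}%`);
}
// add fields (without attributes): 4.1%
// remove all field() calls: -18.8%
```

By this measure, the fields kept in the PR add roughly 4% to parser.c, while all existing field() calls together account for close to a fifth of it.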

damieng · 2023-06-15T10:16:26Z

Doing that though will break a lot of the usefulness of the project and break GitHub semantic use no?

tamasvajk · 2023-06-15T10:21:29Z

Doing that though will break a lot of the usefulness of the project and break GitHub semantic use no?

Yes, sure. I'm not saying we should do this. Only highlighting that field names contribute significantly to the parser size.

damieng · 2023-06-15T10:22:59Z

Yeah it's weird, I didn't imagine they would add much at all to the size.

tamasvajk · 2023-06-15T11:53:07Z

Yeah it's weird, I didn't imagine they would add much at all to the size.

It looks like not all field names cause issues: #314.

hvitved

LGTM, a few comments.

hvitved · 2023-06-16T12:00:10Z

grammar.js

@@ -709,7 +715,7 @@ module.exports = grammar({
      'var'),

    array_type: $ => seq(
-      field('type', $._array_base_type),
+      field('element_type', $._array_base_type),


I like this name better, but I don't know if breaking changes are OK.

I don't know either if breaking changes are okay. At the same time, element_type matches ElementType, which is used by Roslyn.

Will GitHub Semantic know that element_type is a type?

In general, breaking changes should be possible if there's a good case. In this case, it doesn't seem to improve anything for downstream processing, so my preference would be to stick with what we have. These kinds of changes do break all queries for this node written against this grammar, after all.

hvitved · 2023-06-16T12:00:38Z

grammar.js


-    nullable_type: $ => seq($._nullable_base_type, '?'),
+    nullable_type: $ => seq(field('element_type', $._nullable_base_type), '?'),


underlying_type?

Roslyn uses ElementType in int?.

My preference would be to stick with type, in line with other places in the grammar.

hvitved · 2023-06-16T12:00:42Z

grammar.js

@@ -736,7 +742,7 @@ module.exports = grammar({
      $.tuple_type
    ),

-    pointer_type: $ => seq($._pointer_base_type, '*'),
+    pointer_type: $ => seq(field('element_type', $._pointer_base_type), '*'),


underlying_type?

Same comment as above, Roslyn uses ElementType in int*.

Idem, I'd stick with type.

hendrikvanantwerpen

Thanks for taking the time to introduce these fields, that is really useful for downstream processing of the parse trees!

I have some remarks:

Existing fields for expressions use value for the field name. I would suggest sticking with that convention instead of using expression. I feel a bit bad, because when marking them I realized there are many places that need changing, but I think being consistent is important for long-term maintainability.
For simple lists (where all the elements are from the same production), I suggest dropping the field name. It doesn't add much, because there are no different fields to distinguish, and it makes the trees very verbose to have them.

hendrikvanantwerpen · 2023-06-16T13:49:33Z

grammar.js

      ')'
    ),

    attribute_argument: $ => seq(
      optional(choice($.name_equals, $.name_colon)),
-      $._expression
+      field('expression', $._expression)


My impression is that some other places use value as the field name for expressions. Could we use that here as well?

hendrikvanantwerpen · 2023-06-16T13:52:01Z

grammar.js

    ),

    global_attribute_list: $ => seq(
      '[',
      choice('assembly', 'module'),
      ':',
-      commaSep($.attribute),
+      commaSep(field('attribute', $.attribute)),


Would it complicate any of your use cases to drop the field here? This is only a list of attributes, so all subterms are similar. It makes the parse tree a bit heavy-handed if all of them also get a field name, IMO.

hendrikvanantwerpen · 2023-06-16T13:53:06Z

grammar.js

@@ -289,48 +295,48 @@ module.exports = grammar({

    variable_declaration: $ => seq(
      field('type', $._type),
-      commaSep1($.variable_declarator)
+      commaSep1(field('variable_declarator', $.variable_declarator))


Small nit: I would perhaps suggest just declarator here, to keep things somewhat compact.

This is in a commaSep1, shouldn't I instead remove the field name altogether based on your second suggestion?

I think this is a slightly different case because there are also other fields here. Since there's also the type field, having the field names helps grab the right subterms when querying.

On the other hand, for a production similar to

foo: $ => seq('(', commaSep1(field('bar', $.bar)), ')')

all the subterms will have field bar. Matching (foo bar:(_)@bar) and (foo (_)@bar) will have the same result. Adding the field name does not help in distinguishing different subterms of foo. In those cases, I suggest to omit it.

Hope that clarifies what I mean!
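The distinction drawn above can be sketched with a toy model in plain JavaScript. This is not the tree-sitter query engine: the child lists, byField, and allNamed are hypothetical stand-ins, and the declarator field name is the one suggested in this review rather than something already in the grammar.

```javascript
// Toy child lists as [fieldName, nodeKind] pairs. Only named children are
// modelled, since (_) in a query matches named nodes. Not the tree-sitter API.
const fooChildren = [
  ['bar', 'bar'],
  ['bar', 'bar'],
  ['bar', 'bar'],
];
const variableDeclaration = [
  ['type', 'identifier'],
  ['declarator', 'variable_declarator'],
  ['declarator', 'variable_declarator'],
];

// Rough analogue of (parent field: (_) @capture).
const byField = (children, field) => children.filter(([f]) => f === field);
// Rough analogue of (parent (_) @capture).
const allNamed = (children) => children.slice();

// foo: every child carries the same field, so the field selects nothing extra.
console.log(byField(fooChildren, 'bar').length === allNamed(fooChildren).length); // true

// variable_declaration: the field separates declarators from the type.
console.log(byField(variableDeclaration, 'declarator').length); // 2
console.log(allNamed(variableDeclaration).length); // 3
```

In the first case the field name is redundant for querying, while in the second it is what lets a query pick out the declarators without also capturing the type.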

hendrikvanantwerpen · 2023-06-16T13:54:17Z

grammar.js

      ')'
    ),

    argument: $ => prec(1, seq(
      optional($.name_colon),
      optional(choice('ref', 'out', 'in')),
-      choice(
+      field('expression', choice(


Similarly, can we use value for consistency?

hendrikvanantwerpen · 2023-06-16T13:54:41Z

grammar.js

      ']'
    ),

    tuple_pattern: $ => seq(
      '(',
-      commaSep1(choice($.identifier, $.discard, $.tuple_pattern)),
+      commaSep1(choice(field('name', $.identifier), $.discard, $.tuple_pattern)),
      ')'
    ),

    argument: $ => prec(1, seq(
      optional($.name_colon),


This could be a name field, I think.

hendrikvanantwerpen · 2023-06-16T14:19:22Z

grammar.js


    ref_type_expression: $ => seq(
      '__reftype',
      '(',
-      $._expression,
+      field('expression', $._expression),


hendrikvanantwerpen · 2023-06-16T14:19:32Z

grammar.js

    )),

-    select_clause: $ => prec.right(PREC.SELECT, seq('select', $._expression)),
+    select_clause: $ => prec.right(PREC.SELECT, seq('select', field('expression', $._expression))),


hendrikvanantwerpen · 2023-06-16T14:20:01Z

grammar.js

+      field('group_expression', $._expression),
      'by',
-      $._expression
+      field('by_expression', $._expression)


As above, *_value

hendrikvanantwerpen · 2023-06-16T14:20:14Z

grammar.js

      optional(choice('ascending', 'descending'))
    ),

-    where_clause: $ => seq('where', $._expression),
+    where_clause: $ => seq('where', field('expression', $._expression)),


hendrikvanantwerpen · 2023-06-16T14:20:19Z

grammar.js

@@ -1462,7 +1468,7 @@ module.exports = grammar({
      'let',
      $.identifier,
      '=',
-      $._expression
+      field('expression', $._expression)


tamasvajk · 2023-06-16T14:51:14Z

@hendrikvanantwerpen Thanks for the review.

Existing fields for expressions use value for the field name. I would suggest sticking with that convention instead of using expression. I feel a bit bad, because when marking them I realized there are many places that need changing, but I think being consistent is important for long-term maintainability.

No worries about the many places that need changing; as you said, it's important for maintainability. Before I change them, I'll raise one additional concern: I think we're already consistent in the current expression vs. value distinction, because we're following what Roslyn does. (I would need to check that I fully matched Roslyn everywhere.) For example, Roslyn uses Value for 6 and Expression for 5 in the example below. See the Syntax Tree view here.

[A(i = 5)]
abstract void M(int j = 6);

hendrikvanantwerpen · 2023-06-16T15:20:38Z

@tamasvajk Perhaps I judged too quickly? If this grammar explicitly tries to follow Roslyn (or even de facto did until now) I am all for aligning with them!

In that case, I would even consider changing fields to align with them, if it's only a few fields that deviate.

damieng · 2023-06-16T16:21:59Z

The only reason I followed the Roslyn grammar was to make changes in future C# easier as we can follow what rules change. (It's been pretty useful for C# 9, 10, 11)

If however that's getting in the way of size/usability we should deviate.

hendrikvanantwerpen · 2023-06-16T17:01:19Z

I agree that it is valuable to follow the Roslyn grammar if feasible. Perhaps it makes sense to call this out in the readme, and make it a bit more official? Wording along the lines of @damieng's comment above should be fine, I think.

tamasvajk · 2023-06-19T11:39:13Z

Thanks for all the feedback on this PR. I think it might be easier if we do this step-by-step. In the first step I'm introducing only the fields that have a visible impact on some use cases, such as highlighting. So I'm leaving the fields with field('type', ...). I also renamed field('element_type', ...) to field('type', ...), which actually results in some highlighting improvement: #316

tamasvajk · 2023-06-21T12:22:43Z

Here's another followup PR to add fields to name-like constructs: #318

tamasvajk force-pushed the add-named-fields branch 2 times, most recently from e1cb592 to b8f1afb Compare April 5, 2023 08:44


tamasvajk added 3 commits June 16, 2023 10:57

Add field names to the grammar

054c86e

Add generated files

ad5230a

Update file sizes

1b59c77

tamasvajk force-pushed the add-named-fields branch 3 times, most recently from f4d31d5 to 3f73386 Compare June 16, 2023 09:10

tamasvajk marked this pull request as ready for review June 16, 2023 09:19

tamasvajk force-pushed the add-named-fields branch from 3f73386 to c1285b6 Compare June 16, 2023 09:41

Simplify highlighting queries

815c511

tamasvajk force-pushed the add-named-fields branch from c1285b6 to 815c511 Compare June 16, 2023 09:44

hvitved reviewed Jun 16, 2023


hendrikvanantwerpen suggested changes Jun 16, 2023


tamasvajk closed this Jun 21, 2023

