GramVarKit is a configurable tool designed to simplify the development of toolkits for AI-oriented grammars. Built on top of tree-sitter, GramVarKit generates two parsers and two unparsers. The parsers convert source code into parse trees, while the unparsers perform grammar-aware conversions during the unparsing process. By specifying grammar differences in a user-defined configuration file, users can define how code should be transformed without manually implementing the conversion logic.
The tree-sitter CLI and its Python bindings are required.
The configuration file defines the grammar differences between the original and the new grammar. It consists of four sections: import,both,ori_to_new and new_to_ori.
import specifies a list of additional transformation rule files to be included, allowing modular rule organization.
The basic bidirectional transformation rules are contained in the both section, which will be parsed into two sets of conversion rules for conversions in both directions.
ori_to_new and new_to_ori contain unidirectional rules.
Basic transform rules can be used to describe how new grammar rules differ from their corresponding original grammar rules when only terminal symbols (represented as simple strings in the grammar file) are modified. The attribute prod_rule refers to the name of the rule, while original and new specify the differing parts.
In tree-sitter, grammar rules are written in JavaScript. For example, consider the rule for defining import_statement in Python.
Original grammar rule:
import_statement: $ => seq(
'import',
$._import_list,
)
New grammar rule:
import_statement: $ => seq(
'<import_stmt>',
$._import_list,
)
In this case, the keyword import in the original grammar is transformed to <import_stmt> in the new grammar. The corresponding transform rule is:
{"prod_rule":"import_statement","original":"import","new":"<import_stmt>"}
In some cases, keywords do not have an one-to-one correspondence. For example, the if_statement rule in Python:
if_statement: $ => seq(
'if',
field('condition', $.expression),
':',
field('consequence', $._suite),
repeat(field('alternative', $.elif_clause)),
optional(field('alternative', $.else_clause)),
)
The new grammar rule:
if_statement: $ => seq(
'<if_stmt>',
field('condition', $.expression),
field('consequence', $._suite),
repeat(field('alternative', $.elif_clause)),
optional(field('alternative', $.else_clause)),
)
Here, the keyword if is transformed to <if_stmt>, and the : is removed. The transform rule is:
{"prod_rule":"if_statement","original":["if",{"field":"condition","anchor":true},":"],"new":["<if_stmt>",{"field":"condition","anchor":true}]}
Elements in the original and new lists must appear in the same order as they do in the grammar, with no elements skipped.
We introduce anchor points to provide positional references by setting the anchor attribute to true. These symbols remain unchanged during the transformation.
The number of anchors in original and new should be the same and have one-to-one correspondence in order.
In this example, the keyword : is added after the children whose field name is condition when transforming from new grammar back to the original.
When working with elements that don't necessarily appear in the parse tree (such as those within the optional or repeat functions), rules should be written separately for each individual part. This ensures that the transformation can handle cases where those elements are absent in the parse tree.
Original rule for with_statement:
with_statement: $ => seq(
optional('async'),
'with',
$.with_clause,
':',
field('body', $._suite),
)
New rule:
with_statement: $ => seq(
optional('<async_keyword>'),
'<with_stmt>',
$.with_clause,
field('body', $._suite),
)
The problematic transformation rule:
{"prod_rule":"with_statement","original":["async","with",{"type":"$with_clause","anchor":true},":"],"new":["<async_keyword>","<with_stmt>",{"type":"$with_clause","anchor":true}]}
In the rule above, async is expected to appear directly before the keyword with. If async does not appear in the original parse tree, the transformation rule will not trigger, and the subsequent transformation from with to <with_stmt> will not take place.
Instead, we need to separate the transformation of async and the transformation of with to ensure that each transformation is handled independently. This way, the absence of async does not interfere with transforming with.
Correct transformation rules:
{"prod_rule":"with_statement","original":"async","new":"<async_keyword>"},
{"prod_rule":"with_statement","original":["with",{"type":"$with_clause","anchor":true},":"],
"new":["<with_stmt>",{"type":"$with_clause","anchor":true}]}
The symbol $ preceding $with_clause indicates that this refers to a non-terminal symbol, specifically the with_clause rule.
Expressing an element as a simple string, such as "if", is equivalent to specifying its type, e.g., {"type":"if"}. Elements can also include other attributes for more accurate positioning.
For the parameters rule, the original grammar rule is:
parameters: $ => seq(
'(',
optional($._parameters),
')',
)
The new grammar rule:
parameters: $ => seq(
optional($._parameters),
)
The transform rules are:
{"prod_rule":"parameters","original":[{"type":"(","prev_sibling":null},{"anchor":true}],"new":[{"anchor":true,"prev_sibling":null}]},
{"prod_rule":"parameters","original":[{"anchor":true},{"type":")","next_sibling":null}],"new":[{"anchor":true,"next_sibling":null}]}
By assigning the null value to attribute prev_sibling, ( is only removed if it is the first child of the parameters node. Similarly, ")" is deleted if it is the last child.
The list of all attributes available:
'type','field','prev_sibling','prev_sibling_field','next_sibling','next_sibling_field'
Based on the basic transformation rules defined in the both section, GramVarKit automatically generates two sets of conversion rules: one for original-to-new transformation and the other for new-to-original transformation.
Each generated rule may carry a content attribute, which provides the necessary symbol information for insertion or replacement operations. This content can be a simple string, a JSON object, or a list of symbols. For nonterminal symbols, fully describing its content is required since simply specifying the symbol type is insufficient for generating valid code during unparsing.
Unidirectional rules handle cases where changes in one direction do not need to be fully reversed.
For example, the original grammar rule for future_import_statement is:
future_import_statement: $ => seq(
'from',
'__future__',
'import',
choice(
$._import_list,
seq('(', $._import_list, ')'),
),
)
The new grammar rule is:
future_import_statement: $ => seq(
'<import_from_future_stmt>',
choice(
$._import_list,
// seq('(', $._import_list, ')'),
),
)
In this case, the parentheses should be removed when transforming from the original grammar to the new one. However, it is not necessary to reinsert them when transforming from the new grammar back to the original one. Therefore, the transform rules are defined in the ori_to_new section:
{"condition":{"parent_type":"future_import_statement","type":"("},"action":"delete"},
{"condition":{"parent_type":"future_import_statement","type":")"},"action":"delete"}
For more complex conversions beyond basic edits of terminal symbols, custom conversion functions can be defined in src/custom_rules.
GramVarKit also supports custom grammar modifications during new grammar generation, which can be defined in src/custom_grammar.
For example, to separate two statements using either <line_sep> or comments, the conversion rule would be:
{"condition":{"type":"$_statement","prev_sibling":"$_statement"},"action":"custom_before","content":"choice_comment_line_sep"}
actioncan becustom_beforeorcustom_after, indicating whether the function should be executed before or after the element is visited.contentspecifies the name of the custom function.
Inline rules in tree-sitter do not create nodes in the syntax tree, so if prev_sibling refers to inline rule names, it will be transformed to all possible node types that can serve as the first node of the inline rule. The same applies to next_sibling.
For rules transforming from the original grammar to the new one, prev_sibling_inline can be used as an alternative. This allows the conversion rule to also be written as:
{"condition":{"type":"$_statement","prev_sibling_inline":"$_statement"},"action":"custom_before","content":"choice_comment_line_sep"}
The function choice_comment_line_sep is defined as follows:
@writing_op
def choice_comment_line_sep(self,node,**kwargs):
if(self.comments):
self.maybe_newline(add_comment=True)
else:
self.write("<line_sep>")
- The first argument of the function,
self, refers to the unparser instance. - The
nodeparameter represents the current node being processed.
The following functions can be used to modify the regenerated code:
write(self,text)appends thetextto the generated code.write_comment(self,comment_content)writes thecomment_contentif it is at the start of a new line, otherwise records the comment.maybe_newline(self,add_comment)starts a new line in the generation process. Ifadd_commentis set toTrue, recorded comments are appended to the end of the previous line if they exist.maybe_indent(self)writes indentation at the beginning of a new line.maybe_space(self)appends a space to the code.change_indent(self,delta)changes the level of indentation. If any of the above functions are called in the custom conversion function, the annotation@writing_opshould be used.
For the function_definition rule, the original grammar is:
function_definition: $ => seq(
optional('async'),
'def',
field('name', $.identifier),
field('type_parameters', optional($.type_parameter)),
field('parameters', $.parameters),
optional(
seq(
'->',
field('return_type', $.type),
),
),
':',
field('body', $._suite),
)
The new grammar rule:
function_definition: $ => seq(
optional('<async_keyword>'),
'<def_stmt>',
field('name', $.identifier),
field('type_parameters', optional($.type_parameter)),
field('parameters', optional(alias($._parameters, $.parameters))),
optional(
seq(
'<arrow>',
field('return_type', $.type),
),
),
field('body', $._suite),
)
If the parameters of the function definition are empty, the syntax node parameters will be missing in the new parse tree and needs to be inserted.
The conversion rule for this insertion is:
{"condition":{"parent_type":"function_definition","type":"$identifier","field":"name"},"action":"custom_after","content":"empty_parameters"}
The definition of the custom conversion function empty_parameters is:
def empty_parameters(self,index,**args):
assert self.children[index].type=='identifier'
new_index=index
if(self.children[new_index+1].type=='type_parameter'): # type_parameter is optional
new_index=new_index+1
if(self.children[new_index+1].type!='parameters'):
parameter_node=TreeNode(type="parameters",is_named=True,children=[StringNode("("),StringNode(")")],field_names=[None,None],text=b'()')
self.children.insert(new_index+1,parameter_node)
self.aux_infs.insert(new_index+1,AuxInf(field_name="parameters"))
return index
self.childrenrefers to the list of children of the syntax nodefunction_definition.self.aux_infsstores information about the child node, such as its field name.- The return value of the function should point to the same element that the original argument
indexreferred to at the beginning of the function. In this example, theparametersnode is inserted after theidentifiernode, thus the position ofidentifierremains unchanged, and the sameindexis returned.
The params attribute in custom conversion rules specifies the arguments that will be passed to the custom conversion function.
For example, to modify the text of the syntax node true, the conversion rule would be:
{"condition":{"type":"$true"},"action":"custom_replace","content":"text_replace","params":{"text":"True"}}
The definition of the custom conversion function text_replace is:
def text_replace(self,index,**args):
text=args['text']
node_type=self.children[index].type
self.children[index]=TreeNode(node_type,True,list(),list(),text.encode("utf8"))
return index
$_space refers to spaces added during the unparsing process to delimit tokens. To eliminate these spaces, use the following conversion rule:
{"condition":{"type":"$_space"},"action":"delete"}
To begin using GramVarKit, first download the grammar file for the original grammar provided by tree-sitter.
To generate the rules for transforming between the original and new grammar formats, run the following command:
python parse_configuration.py --rule_path ${path_of_configuration_file}
This will generate three JSON files:
- The
*.ori_to_new.jsonfile contains rules for transforming code from the original grammar to the new grammar. - The
*.new_to_ori.jsonfile contains rules for transforming code from the new grammar to the original grammar. - The
*.customs.jsonfile contains all the custom conversion functions used in the process.
Some generated rules may be invalid:
- If a rule has an empty
condition. - If there are multiple rules with the same
conditionandaction, only one will be retained.
During postprocessing, certain conditions may be modified:
- If
prev_siblingornext_siblingreferences an inline rule name, it will be expanded to all possible node types that can be the last or first node of the inline rule, respectively. - If
new_to_oriconversion rules include atypecondition referring to an inline rule name with aninsert_beforeaction, it will be transformed similarily as above. - For
new_to_orirules, if thetypecondition refers to an inline rule name, thenext_siblingcondition will be removed.
Once the transformation rules are parsed, they will be applied to update the grammar.json file.
For complicated cases, such as those requiring custom conversion functions, the corresponding transformation rules may not be applied to the grammar automatically and will be flagged for manual handling.
The paths for the original grammar file and the new grammar file are defined in config.json. Ensure that these paths are correctly set before proceeding.
To regenerate the parser based on the updated grammar file, run the following command:
tree-sitter generate ${path_of_new_grammar_json}
After successfully generating the parser, use the following API call to build the dynamic library:
from tree_sitter import Language
Language.build_library()
To build the unparser, import the following:
from src.build_unparser import build_unparser
To transform from the original grammar to new grammar:
unparser=build_unparser(ori_to_new=True,new_to_ori=False,filepath=${path_of_ori_to_new_conversion_rule_file})
unparser.unparse(parse_tree)
parse_tree refers to the parse tree generated by tree-sitter.
To transform from the new grammar to original grammar:
unparser=build_unparser(ori_to_new=False,new_to_ori=True,filepath=${path_of_new_to_ori_conversion_rule_file})
unparser.unparse(parse_tree)