Replies: 2 comments
---
I think this is an excellent write-up of the pain involved in writing and debugging TSG files! The points that resonate with my own experience: …

**Propagators**

I am not sure propagators are the right way to solve this. Propagators are a solution to the problem of having multiple contributions to a value that we need to combine in a meaningful way. The problem we have is that we need a single thing, a node with identity, but we don't have an easy, ergonomic way to write that one thing. I feel like propagators may make the code work, but they fundamentally solve a different problem than the one we have.

**Implicitly defined syntax node variables**

The behavior of the old DSL, where syntax node variables are implicitly defined, cuts both ways. For example, rules using different variables on the same syntax node will not result in errors (all syntax node variables are implicitly there), but this won't catch mistakes in interface usage. Another example is accidental name clashes, an even bigger risk when we want to combine multiple rule sets. (Actually, this last one could be solved with namespacing, which can be done implicitly per file, although within a file it would still require explicit declarations, I think.) One problem is that this works fine if the values are graph nodes, but it breaks down for any other kind of value.

**Ideas for alternative solutions**

These are some other ideas that may or may not be part of a solution. They may overlap, or be combinable.
---
💯 Very well put.
Really, I'm not proposing that we use propagators, but rather that we look at what propagators do to solve the related problem of feedback in their setting, i.e. make updates idempotent, merging unless conflicts arise.
We currently have two different identities: …

The former currently requires us to avoid merging; removing it eliminates this barrier without robbing us of a stable, user-readable and -writable identity (a path through the syntax tree with scoped variables). It's also worth noting that nothing about this would be a barrier to providing some sort of specification which could be shared between rules. Specifications & namespacing are both valuable, but IMO both are orthogonal to this. In particular, if you want to check that a file doesn't generate attributes named …

Going for maximum bang-for-buck, I think viewing each rule as describing its inputs and outputs completely independently is a pretty good starting place, from which we can add specifications describing the intended overlap. I don't think the semantics will paint us into a corner long-term, either, as they're fairly tightly matched to the neighbourhood of nodes in the output graph, which are the perspective and province of a tsg rule.
This is in a strange sort of half-specification, half-operation space. Unfortunately, I think it complicates the semantics a great deal to have a new "kind" of rule. It also doesn't address the problem tightly, in that we're encouraging users to just match all …
IMO templates are a great use of a specification: having a type like …. I actually don't think templates will reduce the possibility of this kind of error, however, as it just swaps "where do I write …"
I've given this some thought, and while I'm convinced it's possible (you'd want to use the …), I still think a semilattice-like unification/merging operation is the way to go: …
---
Recently I've done some work to process Python sources using `tree-sitter-graph`, including both translating and writing rules which describe both name binding and program evaluation. All of this work suffered from a fundamental issue which I believe we should fix at a relatively high priority, to wit: it is very hard to write a rule in isolation. (Corollary: it is very hard to modify a rule in isolation, too.)

## Problem
A portrait in miniature of the issue is that, given a rule like this:
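A minimal sketch of such a rule (call it R1; the hypothetical `this`/`that` node types and the `foo`/`quux` names recur throughout what follows, and the exact body matters less than its shape):

```tsg
;; R1 (sketch): for each (that) inside a (this), create a graph node,
;; bind it to the scoped variable @this.foo, and connect it to the
;; @that.quux node that some other rule is expected to have defined.
(this (that) @that) @this
{
  node @this.foo
  edge @this.foo -> @that.quux
}
```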
you can't determine whether

- `@this.foo` will be duplicated,
- `@that.quux` will be defined, or
- …

strictly locally (i.e. by considering this rule alone). The first four issues stem from interference with other rules. For example, a rule like
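(a sketch; call it R2 — its only essential detail is that it, too, creates `@this.foo`):

```tsg
;; R2 (sketch): for every child of any kind inside a (this),
;; create and bind @this.foo.
(this (_)) @this
{
  node @this.foo
}
```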
already conflicts, in that every `(that)` node within a `(this)` node will trigger both rules, and both rules set the same scoped variable on the `(this)` node. Whichever of R1 and R2 comes later will be where the error is shown (absent any other rules interfering). (NB: I'm describing the strict evaluator, in part because it's the default, and in part because I'm less familiar with the lazy evaluator.)

On the other hand, a rule like
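(another sketch, call it R3, whose only essential feature is that it defines `@that.quux`):

```tsg
;; R3 (sketch): defines @that.quux for every (that).
(that) @that
{
  node @that.quux
}
```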
won't conflict with either of the above rules, but it will be necessary to place it above R1 to ensure that `@that.quux` exists to make an edge to (again, strictly evaluated and absent other interfering rules).

In a large tsg script these could be hundreds or thousands of lines apart; attribute & variable names could be overloaded, e.g. when relating to different parts of the syntax tree; and there might be specialized variations on R1 relating to similar occurrences within a `(this)` node. Further, R1 will be run for every `(that)` contained by a `(this)` (without any indication as to how many this could end up being), R2 will be run for every node of any kind contained by a `(this)`, and so on. The resolution to duplications is typically to lessen the rules' locality, e.g. by factoring out a common piece:
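(a sketch of the factored-out piece; the essential feature is that exactly one rule now creates `@this.foo`, with R1 and R2 rewritten to add only their edges and attributes):

```tsg
;; Factored-out common piece (sketch): the sole owner of @this.foo.
(this) @this
{
  node @this.foo
}
```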
This works, but is still somewhat fraught:

- if we ever want to change `(this)` to something else, it would be that much harder.
- we make a `@this.foo` node for empty `(this)` nodes now, as well. We might want the pattern to instead be `[ (this (that)), (this (_)) ] @this` (imagine that I'd written a more compelling example where this sort of thing would actually be necessary).

## Proposal
Summarizing a bit: the trouble is too many duplicates and undefineds, and not enough types (specifications). Undefineds happen mainly because we're trying to be careful of duplicates; if duplicates weren't a problem we could always automatically insert definitions. (There might be a "modulo types" bit here, but we'll come back to that.)
As discussed elsewhere (cf. #85, #93), propagators are one possible way to address duplicates. The idea is basically that in a network, possibly with feedback in multiple directions, you can stabilize by giving the value of each node as an element of a bounded semilattice, where everything starts at ⊥. So take a network describing a + b = c. If you know values for any two variables, you can determine the third. When you start, a = ⊥, b = ⊥, c = ⊥; if you set a = 1, that fact is pushed to b and c, but since they're still underconstrained nothing else changes. If you then set c = 1, that is pushed to b, which now has enough information to resolve b = 0. This in turn pushes to a and c, where the computed values are checked against the actual values using ∨ (least upper bound): a = 1, c = 1, b = 0 ⊢ a = c − b = 1; 1 ∨ 1 = 1 (idempotence ftw). If on the other hand we'd implemented addition wrong, or if the user had supplied conflicting information (e.g. a = 1, b = 1, c = 1), we'd see that by our lub returning ⊤: 1 ∨ 0 = ⊤.
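To make the merge concrete, here is a minimal sketch in plain Rust (nothing here is tree-sitter-graph API) of the flat lattice over a value type, with join (∨) as the idempotent merge:

```rust
// The flat lattice ⊥ ≤ v ≤ ⊤ over some value type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Flat<T> {
    Bottom,   // ⊥: no information yet
    Known(T), // a single known value
    Top,      // ⊤: conflicting information
}

impl<T: PartialEq> Flat<T> {
    /// Least upper bound: idempotent, commutative, associative,
    /// with ⊥ as identity and ⊤ as absorbing element.
    fn join(self, other: Self) -> Self {
        use Flat::*;
        match (self, other) {
            (Bottom, x) | (x, Bottom) => x,
            (Top, _) | (_, Top) => Top,
            (Known(a), Known(b)) => if a == b { Known(a) } else { Top },
        }
    }
}

fn main() {
    // 1 ∨ 1 = 1 (idempotence); 1 ∨ 0 = ⊤ (conflict detected).
    assert_eq!(Flat::Known(1).join(Flat::Known(1)), Flat::Known(1));
    assert_eq!(Flat::Known(1).join(Flat::Known(0)), Flat::<i32>::Top);
}
```

Join with ⊥ as identity is exactly the idempotent monoid the next paragraph asks for; drop ⊥ and you have the idempotent semigroup.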
We don't need the power of propagators here or anything like them. All we really need is an idempotent monoid, and possibly we could get by with an idempotent semigroup.
### Possibly getting by with an idempotent semigroup
There's a bunch of hand-waving there, but I'll draw your attention in particular to node-refs. They aren't handled above, but it amounts to "just" α-equivalence. Two α-equivalent graphs are equal. Probably we'd judge equality under a context Γ assigning some concretely comparable value (integers, say) to individual node refs (names)… or we'd just take it as a given that nodes have to be (at least indirectly) connected to syntax tree nodes to actually be compared (i.e. locals don't need to and can't be compared by this scheme) and use unique tree paths + variable/attr dotted.path.names as identifiers.
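As a sketch of that last scheme (stable identifiers for anchored nodes, no identity for locals; the `NodeKey` type and the child-index paths are invented for illustration, not tsg's real representation):

```rust
// A sketch of the "unique tree path + dotted name" identity scheme.
#[derive(Debug, PartialEq)]
enum NodeKey {
    /// Anchored to the syntax tree: child indices from the root,
    /// plus a dotted variable/attr path such as "foo.bar".
    Anchored { tree_path: Vec<usize>, dotted: String },
    /// A local: not reachable from a syntax node, so it has no
    /// stable cross-rule identity under this scheme.
    Local,
}

/// Locals can't be compared; anchored nodes compare by key.
fn comparable(a: &NodeKey, b: &NodeKey) -> Option<bool> {
    match (a, b) {
        (NodeKey::Local, _) | (_, NodeKey::Local) => None,
        _ => Some(a == b),
    }
}

fn main() {
    let a = NodeKey::Anchored { tree_path: vec![0, 2], dotted: "foo".to_string() };
    let b = NodeKey::Anchored { tree_path: vec![0, 2], dotted: "foo".to_string() };
    assert_eq!(comparable(&a, &b), Some(true));
    assert_eq!(comparable(&a, &NodeKey::Local), None);
}
```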
### Unification
Another, perhaps better perspective on the previous one is that it's really just unification à la Robinson. Implement it with a metacontext, implement it with union-find, whatever you like. Unlike typical unification, our syntax is a directed cyclic graph, which is I guess slightly more interesting. Still just first-order decidable unification tho.
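A hedged sketch of the union-find flavour, with invented path-style identifiers standing in for whatever tsg would actually use:

```rust
// Union-find over node identities, where "union" is the idempotent merge.
use std::collections::HashMap;

struct UnionFind {
    parent: HashMap<String, String>,
}

impl UnionFind {
    fn new() -> Self {
        UnionFind { parent: HashMap::new() }
    }

    /// Find the representative for `id`, inserting it if unseen.
    fn find(&mut self, id: &str) -> String {
        let p = self.parent.get(id).cloned().unwrap_or_else(|| id.to_string());
        if p == id {
            self.parent.insert(id.to_string(), p.clone());
            p
        } else {
            let root = self.find(&p);
            self.parent.insert(id.to_string(), root.clone()); // path compression
            root
        }
    }

    /// Merge two identities; idempotent: union(a, a) is a no-op.
    fn union(&mut self, a: &str, b: &str) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent.insert(ra, rb);
        }
    }
}

fn main() {
    let mut uf = UnionFind::new();
    // Two rules name the same output node; merging twice changes
    // nothing, which is exactly the property we want.
    uf.union("this[3].foo", "that[7].quux");
    uf.union("this[3].foo", "that[7].quux");
    assert_eq!(uf.find("this[3].foo"), uf.find("that[7].quux"));
}
```

A conflict check (the lattice's ⊤) would live in `union`, at the moment two classes carrying incompatible information meet.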
### Alternative: nondeterminism (this is a terrible idea)
Set union is an idempotent semigroup (and monoid).
Just saying.
### Knock-on effects
This eliminates dups completely. I can't see anything it would make harder, either, and in particular nothing that I actually care about doing.
If we eliminate dups we can eliminate undefs too by making nodes any time we would have complained (like the old DSL did). This doesn't prevent us from freely creating nodes whenever we choose to, it just means that we don't have to remember to.
One possible wrinkle: can we get into situations where we'd be trying to initialize a variable or attr or something and not know what type? Must everything be nullable? (😩)
So, about that: types of parts of the subgraph. A lot of the information we write down is mostly static, e.g. this node has a `pop` attr, etc. I think we could write down a type saying "`(this).foo` is a node with a `pop` attr," and then add type defs to get "`(this).foo` is a `Thing`," and, because we took the time to write it down, we don't then have to write down any of the other information associated with it: e.g. we don't have to construct a `pop` attr—saying that it's of a type which has one is enough.
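Purely to make this concrete, a hedged sketch of what such declarations might look like; none of this is real tsg syntax, and `type`, `Thing`, and `HasFooThing` are invented for illustration:

```tsg
;; Hypothetical syntax, not implemented in tsg.
;; A shape: every Thing is a graph node carrying a pop attr.
type Thing {
  attr pop
}

;; A declaration shared by rules: every (this) has a foo variable
;; holding a Thing, so rules that mention @this.foo no longer need
;; to construct the pop attr themselves.
type HasFooThing {
  foo: Thing
}
```

Attaching the shape would then be the `(this) : HasFooThing {}` form mentioned below.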
### And taking that a step further

Now I don't even need a rule to match `(this)`: every such node will still have a `foo` variable holding a graph node with the `Thing` shape. (We could always just write this as `(this) : HasFooThing {}` of course.)

Anyway, the point of that is that types are partly about verification and partly about synthesis and metaprogramming, and we stand to get a lot of the latter out of these. And we can still `let`-bind things the way we can at present, if so desired.

## Perspective
`tree-sitter-graph` is a way of mapping ASTs onto other graphs. In `semantic` we called the same task "assignment" (abbreviating "term assignment," by a very shaky analogy with "type assignment"). It was initially implemented as a bottom-up tree traversal, and then replaced with tree parsing—parsing the parse trees because we couldn't (at the time) derive a strong, precise shape for the syntax from the grammar and just copy into that, so we had to validate. There are lots of ways to walk a tree; `tree-sitter-graph`'s specialization to iterating query matches is one, but so are recursion schemes (however fancy or prosaic an encoding you wish to employ), top-down visitor-style traversals, datalog definitions, etc.

On the other hand, constructing an output graph—or taking any other action during/in response to the above traversal—is largely orthogonal. Most or all of the problems described here are only related to traversal insofar as we might have overlapping queries; apart from that, it's all to do with the graph we've built up thus far, and how we structure our rules to satisfy the constraints on the graph.
## Questions