Replies: 2 comments
---
I think this is an excellent write-up of the pain involved in writing and debugging TSG files! The points that resonate with my own experience: …

**Propagators**

I am not sure propagators are the right way to solve this. Propagators are a solution to the problem of having multiple contributions to a value that we need to combine in a meaningful way. The problem we have is that we need a single thing, a node with identity, but we don't have an easy, ergonomic way to write that one thing. I feel like propagators may make the code work, but they fundamentally solve a different problem than the one we have.

**Implicitly defined syntax node variables**

The behavior of the old DSL, where syntax node variables are implicitly defined, cuts both ways. For example, rules using different variables on the same syntax node will not result in errors (all syntax node variables are implicitly there), but this won't catch mistakes in interface usage. Another example is accidental name clashes, an even bigger risk when we want to combine multiple rule sets. (Actually, this last one could be solved with namespacing, which can be done implicitly per file, although within a file it would still require explicit declarations, I think.) One problem is that this works fine if the values are graph nodes, but it breaks down for any other kind of value.

**Ideas for alternative solutions**

These are some other ideas that may or may not be part of a solution. They may overlap, or be combinable.
---
💯 Very well put.
Really, I'm not proposing that we use propagators, but rather that we look at what propagators do to solve the related problem of feedback in their setting, i.e. make updates idempotent, merging unless conflicts arise.
We currently have two different identities: …

The former currently requires us to avoid merging; removing it eliminates this barrier without robbing us of a stable, user-readable and -writable identity (a path through the syntax tree with scoped variables). It's also worth noting that nothing about this would be a barrier to providing some sort of specification which could be shared between rules. Specifications & namespacing are both valuable, but IMO both are orthogonal to this. In particular, if you want to check that a file doesn't generate attributes named …

Going for maximum bang-for-buck, I think viewing each rule as describing its inputs and outputs completely independently is a pretty good starting place, from which we can add specifications describing the intended overlap. I don't think the semantics will paint us into a corner long-term, either, as they're fairly tightly matched to the neighbourhood of nodes in the output graph, which are the perspective and province of a tsg rule.
This is in a strange sort of half-specification, half-operation space. Unfortunately, I think it complicates the semantics a great deal to have a new "kind" of rule. It also doesn't address the problem tightly, in that we're encouraging users to just match all …
IMO templates are a great use of a specification: having a type like …. I actually don't think templates will reduce the possibility of this kind of error, however, as it just swaps "where do I write …"
I've given this some thought, and while I'm convinced it's possible (you'd want to use the …), I still think a semilattice-like unification/merging operation is the way to go: …
---
Recently I've done some work to process Python sources using `tree-sitter-graph`, including both translating and writing rules which describe both name binding and program evaluation. All of this work suffered from a fundamental issue which I believe we should fix at a relatively high priority, to wit: it is very hard to write a rule in isolation. (Corollary: it is very hard to modify a rule in isolation, too.)

## Problem
A portrait in miniature of the issue is that, given a rule like this:
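A minimal sketch of such a rule (call it R1; the hypothetical `this`/`that` node types and the `foo`/`quux` names recur throughout what follows, and the exact body matters less than its shape):

```tsg
;; R1 (sketch): for each (that) inside a (this), create a graph node,
;; bind it to the scoped variable @this.foo, and connect it to the
;; @that.quux node that some other rule is expected to have defined.
(this (that) @that) @this
{
  node @this.foo
  edge @this.foo -> @that.quux
}
```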
you can't determine whether

- `@this.foo` will be duplicated,
- `@that.quux` will be defined, or
- …

strictly locally (i.e. by considering this rule alone). The first four issues stem from interference with other rules. For example, a rule like
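(a sketch; call it R2 — its only essential detail is that it, too, creates `@this.foo`):

```tsg
;; R2 (sketch): for every child of any kind inside a (this),
;; create and bind @this.foo.
(this (_)) @this
{
  node @this.foo
}
```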
already conflicts, in that every `(that)` node within a `(this)` node will trigger both rules, and both rules set the same scoped variable on the `(this)` node. Whichever of R1 and R2 comes later will be where the error is shown (absent any other rules interfering). (NB: I'm describing the strict evaluator, in part because it's the default, and in part because I'm less familiar with the lazy evaluator.)

On the other hand, a rule like
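(another sketch, call it R3, whose only essential feature is that it defines `@that.quux`):

```tsg
;; R3 (sketch): defines @that.quux for every (that).
(that) @that
{
  node @that.quux
}
```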
won't conflict with either of the above rules, but it will be necessary to place it above R1 to ensure that `@that.quux` exists to make an edge to (again, strictly evaluated and absent other interfering rules).

In a large tsg script these could be hundreds or thousands of lines apart; attribute & variable names could be overloaded, e.g. when relating to different parts of the syntax tree; and there might be specialized variations on R1 relating to similar occurrences within a `(this)` node. Further, R1 will be run for every `(that)` contained by a `(this)` (without any indication as to how many this could end up being), R2 will be run for every node of any kind contained by a `(this)`, and so on. The resolution to duplications is typically to lessen the rules' locality, e.g. by factoring out a common piece:
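(a sketch of the factored-out piece; the essential feature is that exactly one rule now creates `@this.foo`, with R1 and R2 rewritten to add only their edges and attributes):

```tsg
;; Factored-out common piece (sketch): the sole owner of @this.foo.
(this) @this
{
  node @this.foo
}
```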
This works, but is still somewhat fraught:

- if we ever want to change `(this)` to something else, it would be that much harder.
- we make a `@this.foo` node for empty `(this)` nodes now, as well. We might want the pattern to instead be `[ (this (that)), (this (_)) ] @this` (imagine that I'd written a more compelling example where this sort of thing would actually be necessary).

## Proposal
Summarizing a bit: the trouble is too many duplicates and undefineds, and not enough types (specifications). Undefineds happen mainly because we're trying to be careful of duplicates; if duplicates weren't a problem we could always automatically insert definitions. (There might be a "modulo types" bit here, but we'll come back to that.)
As discussed elsewhere (cf. #85, #93), propagators are one possible way to address duplicates. The idea is basically that in a network, possibly with feedback in multiple directions, you can stabilize by giving the value of each node as an element of a bounded semilattice, where everything starts at ⊥. So take a network describing a + b = c. If you know values for any two variables, you can determine the third. When you start, a = ⊥, b = ⊥, c = ⊥; if you set a = 1, that fact is pushed to b and c, but since they're still underconstrained nothing else changes. If you then set c = 1, that is pushed to b, which now has enough information to resolve b = 0. This in turn pushes to a and c, where the computed values are checked against the actual values using ∨ (least upper bound): a = 1, c = 1, b = 0 ⊢ a = c − b = 1; 1 ∨ 1 = 1 (idempotence ftw). If on the other hand we'd implemented addition wrong, or if the user had supplied conflicting information (e.g. a = 1, b = 1, c = 1), we'd see that by our lub returning ⊤: 1 ∨ 0 = ⊤.
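To make the merge concrete, here is a minimal sketch in plain Rust (nothing here is tree-sitter-graph API) of the flat lattice over a value type, with join (∨) as the idempotent merge:

```rust
// The flat lattice ⊥ ≤ v ≤ ⊤ over some value type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Flat<T> {
    Bottom,   // ⊥: no information yet
    Known(T), // a single known value
    Top,      // ⊤: conflicting information
}

impl<T: PartialEq> Flat<T> {
    /// Least upper bound: idempotent, commutative, associative,
    /// with ⊥ as identity and ⊤ as absorbing element.
    fn join(self, other: Self) -> Self {
        use Flat::*;
        match (self, other) {
            (Bottom, x) | (x, Bottom) => x,
            (Top, _) | (_, Top) => Top,
            (Known(a), Known(b)) => if a == b { Known(a) } else { Top },
        }
    }
}

fn main() {
    // 1 ∨ 1 = 1 (idempotence); 1 ∨ 0 = ⊤ (conflict detected).
    assert_eq!(Flat::Known(1).join(Flat::Known(1)), Flat::Known(1));
    assert_eq!(Flat::Known(1).join(Flat::Known(0)), Flat::<i32>::Top);
}
```

Join with ⊥ as identity is exactly the idempotent monoid the next paragraph asks for; drop ⊥ and you have the idempotent semigroup.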
We don't need the power of propagators here or anything like them. All we really need is an idempotent monoid, and possibly we could get by with an idempotent semigroup.
### Possibly getting by with an idempotent semigroup
There's a bunch of hand-waving there, but I'll draw your attention in particular to node-refs. They aren't handled above, but it amounts to "just" α-equivalence. Two α-equivalent graphs are equal. Probably we'd judge equality under a context Γ assigning some concretely comparable value (integers, say) to individual node refs (names)… or we'd just take it as a given that nodes have to be (at least indirectly) connected to syntax tree nodes to actually be compared (i.e. locals don't need to and can't be compared by this scheme) and use unique tree paths + variable/attr dotted.path.names as identifiers.
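As a sketch of that last scheme (stable identifiers for anchored nodes, no identity for locals; the `NodeKey` type and the child-index paths are invented for illustration, not tsg's real representation):

```rust
// A sketch of the "unique tree path + dotted name" identity scheme.
#[derive(Debug, PartialEq)]
enum NodeKey {
    /// Anchored to the syntax tree: child indices from the root,
    /// plus a dotted variable/attr path such as "foo.bar".
    Anchored { tree_path: Vec<usize>, dotted: String },
    /// A local: not reachable from a syntax node, so it has no
    /// stable cross-rule identity under this scheme.
    Local,
}

/// Locals can't be compared; anchored nodes compare by key.
fn comparable(a: &NodeKey, b: &NodeKey) -> Option<bool> {
    match (a, b) {
        (NodeKey::Local, _) | (_, NodeKey::Local) => None,
        _ => Some(a == b),
    }
}

fn main() {
    let a = NodeKey::Anchored { tree_path: vec![0, 2], dotted: "foo".to_string() };
    let b = NodeKey::Anchored { tree_path: vec![0, 2], dotted: "foo".to_string() };
    assert_eq!(comparable(&a, &b), Some(true));
    assert_eq!(comparable(&a, &NodeKey::Local), None);
}
```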
### Unification
Another, perhaps better perspective on the previous one is that it's really just unification à la Robinson. Implement it with a metacontext, implement it with union-find, whatever you like. Unlike typical unification, our syntax is a directed cyclic graph, which is I guess slightly more interesting. Still just first-order decidable unification tho.
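A hedged sketch of the union-find flavour, with invented path-style identifiers standing in for whatever tsg would actually use:

```rust
// Union-find over node identities, where "union" is the idempotent merge.
use std::collections::HashMap;

struct UnionFind {
    parent: HashMap<String, String>,
}

impl UnionFind {
    fn new() -> Self {
        UnionFind { parent: HashMap::new() }
    }

    /// Find the representative for `id`, inserting it if unseen.
    fn find(&mut self, id: &str) -> String {
        let p = self.parent.get(id).cloned().unwrap_or_else(|| id.to_string());
        if p == id {
            self.parent.insert(id.to_string(), p.clone());
            p
        } else {
            let root = self.find(&p);
            self.parent.insert(id.to_string(), root.clone()); // path compression
            root
        }
    }

    /// Merge two identities; idempotent: union(a, a) is a no-op.
    fn union(&mut self, a: &str, b: &str) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent.insert(ra, rb);
        }
    }
}

fn main() {
    let mut uf = UnionFind::new();
    // Two rules name the same output node; merging twice changes
    // nothing, which is exactly the property we want.
    uf.union("this[3].foo", "that[7].quux");
    uf.union("this[3].foo", "that[7].quux");
    assert_eq!(uf.find("this[3].foo"), uf.find("that[7].quux"));
}
```

A conflict check (the lattice's ⊤) would live in `union`, at the moment two classes carrying incompatible information meet.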
### Alternative: nondeterminism (this is a terrible idea)
Set union is an idempotent semigroup (and monoid).
Just saying.
### Knock-on effects
This eliminates dups completely. I can't see anything it would make harder, either, and in particular nothing that I actually care about doing.
If we eliminate dups we can eliminate undefs too by making nodes any time we would have complained (like the old DSL did). This doesn't prevent us from freely creating nodes whenever we choose to, it just means that we don't have to remember to.
One possible wrinkle: can we get into situations where we'd be trying to initialize a variable or attr or something and not know what type? Must everything be nullable? (😩)
So, about that: types of parts of the subgraph. A lot of the information we write down is mostly static, e.g. this node has a `pop` attr, etc. I think we could write down a type saying "`(this).foo` is a node with a `pop` attr," and then add type defs to get "`(this).foo` is a `Thing`," and, because we took the time to write it down, we don't then have to write down any of the other information associated with it: e.g. we don't have to construct a `pop` attr—saying that it's of a type which has one is enough.
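Purely to make this concrete, a hedged sketch of what such declarations might look like; none of this is real tsg syntax, and `type`, `Thing`, and `HasFooThing` are invented for illustration:

```tsg
;; Hypothetical syntax, not implemented in tsg.
;; A shape: every Thing is a graph node carrying a pop attr.
type Thing {
  attr pop
}

;; A declaration shared by rules: every (this) has a foo variable
;; holding a Thing, so rules that mention @this.foo no longer need
;; to construct the pop attr themselves.
type HasFooThing {
  foo: Thing
}
```

Attaching the shape would then be the `(this) : HasFooThing {}` form mentioned below.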
### And taking that a step further

Now I don't even need a rule to match `(this)`: every such node will still have a `foo` variable holding a graph node with the `Thing` shape. (We could always just write this as `(this) : HasFooThing {}` of course.)

Anyway, the point of that is that types are partly about verification and partly about synthesis and metaprogramming, and we stand to get a lot of the latter out of these. And we can still `let`-bind things the way we can at present, if so desired.

## Perspective
`tree-sitter-graph` is a way of mapping ASTs onto other graphs. In `semantic` we called the same task "assignment" (abbreviating "term assignment," by a very shaky analogy with "type assignment"). It was initially implemented as a bottom-up tree traversal, and then replaced with tree parsing—parsing the parse trees because we couldn't (at the time) derive a strong, precise shape for the syntax from the grammar and just copy into that, so we had to validate. There are lots of ways to walk a tree; `tree-sitter-graph`'s specialization to iterating query matches is one, but so are recursion schemes (however fancy or prosaic an encoding you wish to employ), top-down visitor-style traversals, datalog definitions, etc.

On the other hand, constructing an output graph—or taking any other action during/in response to the above traversal—is largely orthogonal. Most or all of the problems described here are only related to traversal insofar as we might have overlapping queries; apart from that, it's all to do with the graph we've built up thus far, and how we structure our rules to satisfy the constraints on the graph.
## Questions