
Commit 6208f4b

docs: add call tree dev tutorial
1 parent a2aef6f commit 6208f4b

2 files changed

+332
-2
lines changed

docs/developer/call_trees.Rmd

+329
@@ -0,0 +1,329 @@
---
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.1'
      jupytext_version: 1.1.1
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---
14+
15+
# Call tree processing
16+
17+
18+
## What is a call tree?
19+
20+
Call trees are what `siuba` uses to take what users say they want to do and convert it into an action, such as...
21+
22+
* a SQL statement
23+
* a set of pandas operations
24+
25+
Below is an example expression, alongside a graphical representation of that expression.
26+
This graphical representation is the call tree.
27+
28+
```{python}
from siuba import _

_.hp + _.hp.rank()
```
33+
34+
One thing that often catches people by surprise with call trees is that the calls for an expression like
35+
36+
```
_.hp.rank()
```
39+
40+
are not in order from left to right, but the other way around.
41+
Looking at its tree...
42+
43+
```{python}
_.hp.rank()
```
46+
47+
It goes
48+
49+
* call `_.hp.rank`
50+
* get attribute `rank` from `_.hp`
51+
* get attribute `hp` from `_`
52+
53+
I'll call this order the entering order. It occurs when we walk down the tree depth first.
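
As a rough illustration of the entering order, the sketch below walks a hand-built version of `_.hp.rank()` depth first and records each node as it is entered. `Node`, `describe`, and `entering_order` are hypothetical stand-ins written for this example; they are not siuba's `Call` class or its API.

```python
class Node:
    """Hypothetical stand-in for a call-tree node (not siuba's Call class)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args


def describe(node):
    """Give a short, readable label for a node."""
    if node.func == "__getattr__":
        return f"get attribute {node.args[1]!r}"
    return node.func


def entering_order(node, out=None):
    """Walk the tree depth first, recording each node as we enter it."""
    out = [] if out is None else out
    out.append(describe(node))
    for arg in node.args:
        if isinstance(arg, Node):
            entering_order(arg, out)
    return out


# _.hp.rank() written out by hand; the symbol "_" is just a string here
hp = Node("__getattr__", "_", "hp")
rank = Node("__getattr__", hp, "rank")
call = Node("__call__", rank)

order = entering_order(call)
print(order)
```

Reversing `order` gives the sequence in which we read the expression: `hp`, then `rank`, then the call.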
54+
55+
Sometimes this order is useful, but often we'll want to think of the operations in reverse (i.e. closer to how we read them). In order to allow for both situations, in siuba I often use what I'll refer to as a **tree listener**.
This is a concept borrowed from the [ANTLR4](https://github.com/antlr/antlr4/blob/master/doc/listeners.md) parser generator.
57+
58+
59+
## What is a tree listener?
60+
61+
For each node (black box) on a tree, a tree listener allows you to define some custom processing by specifying enter and exit methods.
62+
63+
```{python}
_.hp + _.hp.rank()
```
66+
67+
Note that nodes like `+` and `.` in the graph above are shorthand for their Python method names, `__add__` and `__getattr__` respectively.
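
To make the enter and exit idea concrete, here is a minimal sketch of a listener. It assumes a dispatch scheme in which methods named `enter_<operation>` and `exit_<operation>` are looked up for each node, with generic fallbacks. `Node`, `ToyListener`, and `UpperAttrs` are hypothetical names for this example, and siuba's actual `CallListener` may differ in its details.

```python
class Node:
    """Hypothetical stand-in for a call-tree node (not siuba's Call class)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __repr__(self):
        return f"{self.func}({', '.join(map(repr, self.args))})"


class ToyListener:
    """Toy listener: uses enter_<func> / exit_<func> methods when they exist."""

    def enter(self, node):
        method = getattr(self, "enter_" + node.func, self.generic_enter)
        return method(node)

    def exit(self, node):
        method = getattr(self, "exit_" + node.func, self.generic_exit)
        return method(node)

    def generic_enter(self, node):
        # enter child nodes first, rebuild this node from the results, then exit it
        new_args = [self.enter(a) if isinstance(a, Node) else a for a in node.args]
        return self.exit(Node(node.func, *new_args))

    def generic_exit(self, node):
        return node


class UpperAttrs(ToyListener):
    def exit___getattr__(self, node):
        # runs after the children of this node have been processed
        obj, attr_name = node.args
        return Node("__getattr__", obj, attr_name.upper())


# _.hp + _.hp.rank() written out by hand; "_" is just a string here
hp = Node("__getattr__", "_", "hp")
tree = Node("__add__", hp, Node("__call__", Node("__getattr__", hp, "rank")))

result = UpperAttrs().enter(tree)
print(result)
```

Only `exit___getattr__` is customized here, so every other node falls through to the generic methods and is returned unchanged.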
68+
69+
70+
### Simple exit method on a tree listener
71+
72+
Below is an example tree listener that strips out a `__getattr__` operation from a call.
73+
74+
```{python}
from siuba.siu import CallListener, Call, BinaryOp, strip_symbolic
from siuba import _

class AttrStripper(CallListener):
    def __init__(self, rm_attr):
        self.rm_attr = rm_attr

    def exit___getattr__(self, node):
        obj, attr_name = node.args
        if attr_name in self.rm_attr:
            return obj

        return node


attr_strip = AttrStripper({'hp'})

call = strip_symbolic(_.hp + _.hp.rank())

print(call)
print(attr_strip.enter(call))
```
97+
98+
### Simple enter method on a tree listener
99+
100+
```{python}
class AttrStopper(AttrStripper):
    def enter___getattr__(self, node):
        obj, attr_name = node.args
        if attr_name == "stop":
            # don't enter child nodes
            return self.exit(node)

        # use generic entering method on this node (and its children)
        return self.generic_enter(node)


attr_stopper = AttrStopper({'hp'})

call = strip_symbolic(_.hp + _.stop.hp + _.hp.stop)

print(call)
print(attr_stopper.enter(call))
```
119+
120+
121+
122+
123+
### Use enter to "look back on the python execution timeline"
124+
125+
In general, it's better to use enter when you need **information about operations that Python would execute earlier in time**.
126+
127+
Some useful cases include
128+
129+
* stopping further processing (by not entering child nodes)
130+
* modifying a child node, prior to entering (i.e. starting processing)
131+
132+
For example, suppose we want to treat method calls in a special way. In `_.rank()`, we first enter the `__call__` node. Moreover, we can look up whether it is actually a method call from this node, using the rule...
133+
134+
* if a call node is operating on a get attribute, then it is a method call
135+
136+
This is shown below...
137+
138+
```{python}
_.dt.year
```

```{python}
_.dt.year()
```
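
The method-call rule can be checked on hand-built versions of the two trees above. This is only a sketch: `Node` is a hypothetical stand-in for a call-tree node, and `is_method_call` is a helper name invented for this example rather than part of siuba.

```python
class Node:
    """Hypothetical stand-in for a call-tree node (not siuba's Call class)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args


def is_op(node, opname):
    """True if node is a tree node performing the named operation."""
    return isinstance(node, Node) and node.func == opname


def is_method_call(node):
    """A call node operating on a get-attribute node is a method call."""
    return is_op(node, "__call__") and is_op(node.args[0], "__getattr__")


# _.dt.year written out by hand; "_" is just a string here
dt = Node("__getattr__", "_", "dt")
year_attr = Node("__getattr__", dt, "year")

# _.dt.year() wraps the same attribute node in a call node
year_call = Node("__call__", year_attr)

attr_result = is_method_call(year_attr)
call_result = is_method_call(year_call)
print(attr_result, call_result)
```

The check only needs the call node and its first child, which is why it can be made as soon as the `__call__` node is entered.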
145+
146+
```{python}
# want to remove dt
# also want to treat an attribute after dt as a call
# _.a.dt.year
#
# if we cut dt out in the exit, can't know year is attribute

# TODO: shouldn't need to import BinaryOp, but it helps with formatting
#       maybe need factory function?
def is_op(node, opname):
    if isinstance(node, Call) and node.func == opname:
        return True

    return False

class MethodMaker(CallListener):
    def enter___getattr__(self, node):
        obj, attr_name = node.args

        print("Entering attribute: ", attr_name)

        # is to the right of another attribute call
        # e.g. _.<left_attr>.<attr_name>
        if is_op(obj, "__getattr__"):
            left_obj, left_attr = obj.args

            print(" Detected attribute chain: ", left_attr, attr_name)

            # if the left attr is dt, treat this like a method call
            # e.g. _.dt.year
            if left_attr == "dt":
                # manually enter child nodes, now that we have all the information
                # we need about them
                args, kwargs = node.map_subcalls(self.enter)
                new_obj = node.__class__("__getattr__", *args, **kwargs)
                # since it follows dt, put inside a call op
                method_call = Call("__call__", new_obj)
                return self.exit(method_call)

        # otherwise, use default behavior
        return self.generic_enter(node)

    def exit___getattr__(self, node):
        obj, attr_name = node.args

        print("Exiting attribute: ", attr_name)

        return node


method_maker = MethodMaker()

call = strip_symbolic(_.dt.year)

print("Call: ", call)
method_maker.enter(call)
```
203+
204+
One limitation of this approach is that if we have an expression like...
205+
206+
```
_.dt.year()
```
209+
210+
we'll still convert the `year` attribute to a call, causing us to call it twice!
211+
212+
```{python}
method_maker.enter(strip_symbolic(_.dt.year()))
```
215+
216+
To get around this, we can extend `MethodMaker` to check whether an attribute has been converted to a call.
217+
218+
```{python}
class MethodMaker2(MethodMaker):
    def enter___call__(self, node):
        # needs to use an enter call, since need to know
        #   * what child was before entering

        obj = node.args[0]
        # don't want to return multiple calls,
        # e.g. _.dt.year() shouldn't produce _.dt.year()()
        if is_op(obj, "__getattr__"):
            args, kwargs = node.map_subcalls(self.enter)

            new_obj, *func_args = args

            # getattr transformed itself into a call node, but we're already
            # calling, so peel off the call node it produced...
            if is_op(new_obj, "__call__"):
                new_call = Call("__call__", new_obj.args[0], *func_args, **kwargs)
                return self.exit(new_call)

        return self.generic_enter(node)

    def exit___call__(self, node):
        obj = node.args[0]
        if is_op(obj, "__getattr__"):
            left_obj, left_attr = obj.args
            print("Exiting method call: ", left_attr)

        return node



method_maker2 = MethodMaker2()

method_maker2.enter(strip_symbolic(_.dt.year()))
```
254+
255+
<!-- #region -->
256+
Keep in mind that an enter method for an operator can do whatever an exit method for that operator could (and more!). However, there is an important caveat to keep in mind: it usually requires more code, since it also needs to enter child nodes.
257+
258+
We can think of the order of enter and exit operations as a big sandwich, where exit is the last step an enter "block" takes. So if the exit doesn't handle things, the enter can.
259+
260+
```
_.hp + _.hp.rank()

enter +(_.hp, _.hp.rank())
    enter .(_, "hp")
    exit
    enter __call__(_.hp.rank)
        enter .(_.hp, "rank")
            enter .(_, "hp")
            exit
        exit
    exit
exit

```
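
A trace like the one above can be produced with a short recursive walk. The sketch below is an illustration only: `Node` and `trace` are hypothetical names for this example, and it prints method names such as `__add__` rather than the `+` and `.` shorthand.

```python
class Node:
    """Hypothetical stand-in for a call-tree node (not siuba's Call class)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args


def trace(node, depth=0, lines=None):
    """Record an enter line, then the children's lines, then an exit line."""
    lines = [] if lines is None else lines
    pad = "    " * depth
    lines.append(f"{pad}enter {node.func}")
    for arg in node.args:
        if isinstance(arg, Node):
            trace(arg, depth + 1, lines)
    # exit is the last step of each enter "block"
    lines.append(f"{pad}exit {node.func}")
    return lines


# _.hp + _.hp.rank() written out by hand; "_" is just a string here
hp = Node("__getattr__", "_", "hp")
tree = Node("__add__", hp, Node("__call__", Node("__getattr__", hp, "rank")))

lines = trace(tree)
print("\n".join(lines))
```

Each node contributes exactly one enter line and one exit line, and everything that happens to its children sits between the two.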
275+
276+
277+
In this sense exit is best for actions that can happen after all other processing for a node has happened.
278+
<!-- #endregion -->
279+
280+
### Use exit for simple actions after child nodes are processed
281+
282+
To show where an exit is useful, let's take the extra step of cutting out `dt` attributes.
283+
To do this, we can override our current getattr exit method (which is only a print statement right now).
284+
285+
```{python}
class MethodMaker3(MethodMaker2):
    def exit___getattr__(self, node):
        obj, attr_name = node.args

        print("Exiting attribute: ", attr_name)

        if attr_name == "dt":
            # cut out the dt node
            return obj

        return node

method_maker3 = MethodMaker3()
```
300+
301+
```{python}
# before
method_maker2.enter(call)
```

```{python}
# after
method_maker3.enter(call)
```
310+
311+
Finally, it's worth asking what will happen with the following call...
312+
313+
```{python}
call3 = strip_symbolic(_.dt.dt())
call3
```

```{python}
method_maker3.enter(call3)
```
321+
322+
<!-- #region -->
323+
Notice that there are two dt attributes, and they were both entered, but only one exited.
324+
325+
326+
Why is this?
327+
To find the answer, you need to look at the `enter___getattr__` method of the original MethodMaker class.
328+
More specifically, why doesn't it exit the node it's processing when it creates a new Call node?
329+
<!-- #endregion -->

docs/developer/index.rst

+3-2
@@ -1,8 +1,9 @@
-Developers
-==========
+Developer docs
+==============

 .. toctree::
    :maxdepth: 2

+   call_trees.Rmd
    sql-translators.ipynb
