Commit 8e4bc17

Merge pull request #140 from machow/fast-mutate
Fast mutate
2 parents aca05d1 + 697f74b commit 8e4bc17

23 files changed, 2015 additions and 72 deletions

.gitignore

+4
@@ -1,5 +1,9 @@
 # Vim Swapfiles
 .*.swp
+.*.swo
+
+# Temporary folder for experimenting with things
+tmp
 
 # Byte-compiled / optimized / DLL files
 __pycache__/

Makefile

+1-1
@@ -11,7 +11,7 @@ test:
 	pytest --dbs="sqlite,postgresql" siuba/
 
 test-travis:
-	py.test --nbval $(filter-out %postgres.ipynb, $(NOTEBOOK_TESTS))
+	#py.test --nbval $(filter-out %postgres.ipynb, $(NOTEBOOK_TESTS))
 	pytest --dbs="sqlite,postgresql" siuba/
 
 examples/%.ipynb:

docs/developer/index.rst

+1
@@ -6,4 +6,5 @@ Developer docs
 
    call_trees.Rmd
    sql-translators.ipynb
+   pandas-group-ops.Rmd
 
docs/developer/pandas-group-ops.Rmd

+117
@@ -0,0 +1,117 @@
---
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.1'
      jupytext_version: 1.2.4
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

# Optimized grouped pandas operations

```{python}
import numpy as np
import pandas as pd

np.random.seed(123)
students = pd.DataFrame({
    'student_id': np.repeat(np.arange(2000), 10),
    'course_id': np.random.randint(1, 20, 20000),
    'score': np.random.randint(1, 100, 20000)
})

g_students = students.groupby('student_id')
```

## Problem: combining grouped operations is slow

If you just need to make a single calculation, then pandas methods are very fast. For example, take the code below, which calculates the minimum score for each student.

```{python}
# %%timeit
g_students.score.min()
```

This took very little time (less than a millisecond, which is 1 thousandth of a second!).

However, now suppose you wanted to do something more complex. Let's say you wanted to get the rows corresponding to each student's minimum score. In pandas, there are two ways to do this:

* using `transform` with a lambda
* using both the `students` and `g_students` data frames

These are shown below.

```{python}
# %%timeit
is_student_min = g_students.score.transform(lambda x: x == x.min())
df_min1 = students[is_student_min]
```

```{python}
# %%timeit
is_student_min = students.score == g_students.score.transform('min')
df_min2 = students[is_student_min]
```

Note that while the first one could be expressed using only the grouped data (`g_students`), it took over a second to run!

On the other hand, while the second was fairly quick, it required juggling two forms of the data (`students` and `g_students`).

Siuba attempts to optimize these operations to be quick AND require less data juggling.

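As a rough check of the comparison above, the sketch below runs both pandas approaches on a smaller, made-up data set and confirms that they select the same rows. The variable names `mask_lambda`, `mask_builtin` and `same_rows` are illustrative only, and no timings are claimed here; the difference between the two approaches is in cost, not in the result.

```python
import numpy as np
import pandas as pd

# Smaller, made-up data than in the document, so the sketch runs quickly.
rng = np.random.RandomState(123)
students = pd.DataFrame({
    "student_id": np.repeat(np.arange(200), 10),
    "score": rng.randint(1, 100, 2000),
})
g_students = students.groupby("student_id")

# Approach 1: pandas calls the Python lambda once per group.
mask_lambda = g_students.score.transform(lambda x: x == x.min()).astype(bool)

# Approach 2: the built-in "min" is computed per group and broadcast
# back to the original rows, then compared outside the groupby.
mask_builtin = students.score == g_students.score.transform("min")

df_min1 = students[mask_lambda]
df_min2 = students[mask_builtin]

# Both approaches select the same rows; only the cost differs.
same_rows = df_min1.equals(df_min2)
print(same_rows)
```

The second approach avoids calling Python code per group, which is presumably why it runs so much faster on the 2000-student data above.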

## Siuba filtering is succinct AND performant

```{python}
from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize
from siuba import _
```

```{python}
# %%timeit
df_min3 = fast_filter(g_students, _.score == _.score.min())
```

```{python}
# %%timeit
fast_mutate(students, is_low_score = _.score == _.score.min())
```

```{python}
# %%timeit
fast_summarize(g_students, lowest_percent = _.score.min() / 100.)
```

## How do the optimizations work?

Siuba replaces important parts of the call tree--like `==` and `min()`--with functions that take a grouped series and return a grouped series. Because it then becomes grouped series all the way down, these operations are nicely composable.

```{python}
_.score == _.score.min()
```


After the expressions are executed, the verb in charge handles the output. For example, `fast_filter` uses the result (usually a boolean Series) to keep only rows where the result is True.

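The "grouped in, grouped out" idea can be sketched with plain pandas. This is not siuba's implementation: `Grouped`, `g_min`, `g_eq` and `sketch_filter` are hypothetical names invented for this illustration, and a small tuple of row values plus group keys stands in for a real grouped series.

```python
from typing import NamedTuple

import pandas as pd


class Grouped(NamedTuple):
    """Hypothetical stand-in for a grouped series: row values plus group keys."""
    obj: pd.Series
    keys: pd.Series


def g_min(g: Grouped) -> Grouped:
    # Per-group minimum, broadcast back to one value per row.
    return Grouped(g.obj.groupby(g.keys).transform("min"), g.keys)


def g_eq(left: Grouped, right: Grouped) -> Grouped:
    # Elementwise comparison; the grouping is carried along unchanged.
    return Grouped(left.obj == right.obj, left.keys)


def sketch_filter(df: pd.DataFrame, result: Grouped) -> pd.DataFrame:
    # The "verb" decides what to do with the result: keep rows where it is True.
    return df[result.obj]


df = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "score": [50, 70, 90, 80],
})
score = Grouped(df["score"], df["student_id"])

# Mirrors `_.score == _.score.min()`: grouped in, grouped out at every step.
is_min = g_eq(score, g_min(score))
kept = sketch_filter(df, is_min)
print(kept)
```

Because every helper accepts and returns the same kind of object, the helpers can be nested in any order, and only the final verb needs to know how to turn the result into rows.
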
The example below shows how siuba replaces the "mean" method.

```{python}
from siuba.experimental.pd_groups.translate import method_agg_op

f_mean = method_agg_op('mean', False, None)

# result is a subclass of SeriesGroupBy
res_agg = f_mean(g_students.score)

print(res_agg)
print(res_agg.obj.head())
```

## Defining custom grouped operations

TODO: coming soon

docs/developer/sql-translators.ipynb

+2-3
@@ -194,7 +194,6 @@
 "\n",
 "call_shaper = CallTreeLocal(\n",
 " local_funcs,\n",
-" rm_attr = ('str', 'dt'),\n",
 " call_sub_attr = ('dt',)\n",
 " )"
 ]
@@ -352,7 +351,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.7"
+"version": "3.6.8"
 },
 "toc": {
 "base_numbering": 1,
@@ -369,5 +368,5 @@
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }
