|
| 1 | +--- |
| 2 | +jupyter: |
| 3 | + jupytext: |
| 4 | + text_representation: |
| 5 | + extension: .Rmd |
| 6 | + format_name: rmarkdown |
| 7 | + format_version: '1.1' |
| 8 | + jupytext_version: 1.2.4 |
| 9 | + kernelspec: |
| 10 | + display_name: Python 3 |
| 11 | + language: python |
| 12 | + name: python3 |
| 13 | +--- |
| 14 | + |
| 15 | +# Optimized grouped pandas operations |
| 16 | + |
| 17 | +```{python} |
| 18 | +import numpy as np |
| 19 | +import pandas as pd |
| 20 | +
|
| 21 | +np.random.seed(123) |
| 22 | +students = pd.DataFrame({ |
| 23 | + 'student_id': np.repeat(np.arange(2000), 10), |
| 24 | + 'course_id': np.random.randint(1, 20, 20000), |
| 25 | + 'score': np.random.randint(1, 100, 20000) |
| 26 | +}) |
| 27 | +
|
| 28 | +g_students = students.groupby('student_id') |
| 29 | +``` |
| 30 | + |
| 31 | +## Problem: combining grouped operations is slow |
| 32 | + |
| 33 | +If you just need to make a single calculation, then pandas methods are very fast. For example, take the code below, which calculates the minimum score for each student. |
| 34 | + |
| 35 | +```{python} |
| 36 | +# %%timeit |
| 37 | +g_students.score.min() |
| 38 | +``` |
| 39 | + |
| 40 | +This took very little time (less than a millisecond, which is 1 thousandth of a second!). |
| 41 | + |
| 42 | +However, now suppose you want to do something more complex, such as getting the rows corresponding to each student's minimum score. In pandas, there are two ways to do this: |
| 43 | + |
| 44 | +* by using `transform` with a lambda on the grouped data |
| 45 | +* by using both the `students` and `g_students` data frames. |
| 46 | + |
| 47 | +These are shown below. |
| 48 | + |
| 49 | +```{python} |
| 50 | +# %%timeit |
| 51 | +is_student_min = g_students.score.transform(lambda x: x == x.min()) |
| 52 | +df_min1 = students[is_student_min] |
| 53 | +``` |
| 54 | + |
| 55 | +```{python} |
| 56 | +# %%timeit |
| 57 | +is_student_min = students.score == g_students.score.transform('min') |
| 58 | +df_min2 = students[is_student_min] |
| 59 | +``` |
| 60 | + |
| 61 | +Note that while the first approach can be expressed using only the grouped data (`g_students`), it took over a second to run! |
| 62 | + |
| 63 | +The second approach, on the other hand, was fairly quick, but it required juggling two forms of the data. |
| 64 | + |
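As a sanity check on the two pandas approaches above, the sketch below rebuilds the same data, times each approach once with `time.perf_counter`, and confirms that both select the same rows. The names `mask_lambda`, `mask_builtin`, and `masks_agree` are illustrative only, and a single timed run is much noisier than `%%timeit`.

```python
import time

import numpy as np
import pandas as pd

np.random.seed(123)
students = pd.DataFrame({
    'student_id': np.repeat(np.arange(2000), 10),
    'course_id': np.random.randint(1, 20, 20000),
    'score': np.random.randint(1, 100, 20000)
})
g_students = students.groupby('student_id')

# Approach 1: a Python lambda is called once per student
start = time.perf_counter()
mask_lambda = g_students.score.transform(lambda x: x == x.min())
t_lambda = time.perf_counter() - start

# Approach 2: built-in 'min' transform, then compare against the ungrouped column
start = time.perf_counter()
mask_builtin = students.score == g_students.score.transform('min')
t_builtin = time.perf_counter() - start

# Both masks should flag exactly the same rows
masks_agree = bool((mask_lambda == mask_builtin).all())
n_rows = int(mask_builtin.sum())

print(f"lambda transform:   {t_lambda:.4f} s")
print(f"built-in transform: {t_builtin:.4f} s")
print(f"same rows selected: {masks_agree} ({n_rows} rows)")
```

Since each student has at least one row equal to their own minimum, `n_rows` is never smaller than the number of students; ties can push it higher.
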
| 65 | +Siuba attempts to optimize these operations to be quick AND require less data juggling. |
| 66 | + |
| 67 | + |
| 68 | +## Siuba filtering is succinct AND performant |
| 69 | + |
| 70 | +```{python} |
| 71 | +from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize |
| 72 | +from siuba import _ |
| 73 | +``` |
| 74 | + |
| 75 | +```{python} |
| 76 | +# %%timeit |
| 77 | +df_min3 = fast_filter(g_students, _.score == _.score.min()) |
| 78 | +``` |
| 79 | + |
| 80 | +```{python} |
| 81 | +# %%timeit |
| 82 | +fast_mutate(g_students, is_low_score = _.score == _.score.min()) |
| 83 | +``` |
| 84 | + |
| 85 | +```{python} |
| 86 | +# %%timeit |
| 87 | +fast_summarize(g_students, lowest_percent = _.score.min() / 100.) |
| 88 | +``` |
| 89 | + |
| 90 | +## How do the optimizations work? |
| 91 | + |
| 92 | +Siuba replaces important parts of the call tree, like `==` and `min()`, with functions that take a grouped series and return a grouped series. Because it is then grouped series all the way down, these operations compose nicely. |
| 93 | + |
| 94 | +```{python} |
| 95 | +_.score == _.score.min() |
| 96 | +``` |
| 97 | + |
| 98 | + |
| 99 | +After the expressions are executed, the verb in charge handles the output. For example, `fast_filter` uses the result (usually a boolean Series) to keep only rows where the result is True. |
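The "grouped series in, grouped series out" idea can be sketched in plain pandas. This is not siuba's implementation: the helpers `regroup`, `grouped_min`, and `grouped_eq` and the `toy` data frame are hypothetical names made up for illustration, and the sketch assumes only public pandas methods (`transform`, `ngroup`, and the `.obj` attribute of a grouped series).

```python
import pandas as pd

# Hypothetical toy data, much smaller than the `students` frame
toy = pd.DataFrame({
    'student_id': [0, 0, 1, 1, 1],
    'score': [70, 55, 90, 90, 95],
})


def regroup(grouped, values):
    """Group a row-aligned Series the same way `grouped` is grouped."""
    # ngroup() labels every row with the number of the group it belongs to
    return values.groupby(grouped.ngroup())


def grouped_min(grouped):
    """Grouped series in, grouped series out: each row holds its group's minimum."""
    return regroup(grouped, grouped.transform('min'))


def grouped_eq(left, right):
    """Compare two grouped series row by row, keeping the grouping of `left`."""
    return regroup(left, left.obj == right.obj)


g_score = toy.groupby('student_id').score

# Plays the role of `_.score == _.score.min()`
is_min = grouped_eq(g_score, grouped_min(g_score))

# Plays the role of the verb: keep only the rows where the result is True
toy_min = toy[is_min.obj]
print(toy_min)
```

Because every helper returns a grouped series, the output of one can be passed straight into the next, and only the final step needs to touch the original data frame.
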
| 100 | + |
| 101 | +The example below shows how siuba replaces the `mean` method. |
| 102 | + |
| 103 | +```{python} |
| 104 | +from siuba.experimental.pd_groups.translate import method_agg_op |
| 105 | +
|
| 106 | +f_mean = method_agg_op('mean', False, None) |
| 107 | +
|
| 108 | +# result is a subclass of SeriesGroupBy |
| 109 | +res_agg = f_mean(g_students.score) |
| 110 | +
|
| 111 | +print(res_agg) |
| 112 | +print(res_agg.obj.head()) |
| 113 | +``` |
| 114 | + |
| 115 | +## Defining custom grouped operations |
| 116 | + |
| 117 | +TODO: coming soon |