User defined functions for cuDF tutorial #1
Comments
I like the first example. Is the advantage of using a kernel, rather than something like the following, just performance?
from numpy import cos, sin, arcsin as asin, sqrt, pi
def haversine_distance(df):
    """Haversine distance formula taken from Michael Dunn's StackOverflow post:
    https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points
    """
    x_1 = df['lat1']
    y_1 = df['lon1']
    x_2 = df['lat2']
    y_2 = df['lon2']
    # Convert degrees to radians
    x_1 = pi/180 * x_1
    y_1 = pi/180 * y_1
    x_2 = pi/180 * x_2
    y_2 = pi/180 * y_2
    dlon = y_2 - y_1
    dlat = x_2 - x_1
    a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers
    return c * r
haversine_distance(df)
(I'm also unclear why the kernel only works with [...])
We could also show an example where the high-level dataframe syntax doesn't work because the dataframe methods don't vectorize nicely. For instance, a function that uses [...]
The rolling window example is a little more confusing. It took me a while to understand why there are two kernel functions, one for the rolling window and one to fill in the gaps. Maybe we just need to explain it well, but I wonder whether we should go with something simpler there. If it took me a while to understand what was going on, I expect the same will be true for most tutorial attendees. Is it typical with apply_chunks to have to write a separate kernel to fill in gaps?
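For comparison with the column-wise function above, here is a minimal sketch of the same haversine formula written in row-wise kernel form: the function receives whole columns plus a preallocated output and loops over the rows itself. The name `haversine_kernel`, the `out` column, and the sample coordinates are illustrative assumptions, not code from the notebook, and the sketch runs on plain Python lists so that it needs no GPU.

```python
import math

def haversine_kernel(lat1, lon1, lat2, lon2, out):
    """Row-wise haversine distance in kilometers.

    Inputs are column-like sequences of degrees; results are written
    into the preallocated `out` sequence, one value per row.
    """
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(lat1, lon1, lat2, lon2)):
        # Convert degrees to radians
        x_1 = math.radians(x_1)
        y_1 = math.radians(y_1)
        x_2 = math.radians(x_2)
        y_2 = math.radians(y_2)
        dlon = y_2 - y_1
        dlat = x_2 - x_1
        a = (math.sin(dlat / 2) ** 2
             + math.cos(x_1) * math.cos(x_2) * math.sin(dlon / 2) ** 2)
        out[i] = 2 * 6371 * math.asin(math.sqrt(a))  # 6371 km Earth radius

# Hypothetical sample rows: identical points, then a quarter of the equator.
lat1 = [10.0, 0.0]
lon1 = [20.0, 0.0]
lat2 = [10.0, 0.0]
lon2 = [20.0, 90.0]
out = [0.0] * len(lat1)
haversine_kernel(lat1, lon1, lat2, lon2, out)
```

As far as I know, the cuDF API of this era would take such a function as `df.apply_rows(haversine_kernel, incols=['lat1', 'lon1', 'lat2', 'lon2'], outcols=dict(out=np.float64), kwargs=dict())`, with numba compiling the loop body for the GPU; treat that call as an assumption to check against the installed cuDF version.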
I've added a cuDF UDF notebook based on this, with some exercises. https://github.com/Quansight/scipy-2019-rapids-tutorial/blob/master/cudf/cuDF%20UDF.ipynb Let me know what you think.
@asmeurer noted that at least some of our introductory cuDF content doesn't have any kernel functions. We can use content from here if we want to discuss UDFs.
Or, for less trivial examples, we can use the example of "exact" lat/long distance function UDFs comparing pandas and cuDF. With a numba.cuda Vincenty distance function + df.apply_rows, we were able to do in < 1 second what would take pandas 2.5 hours with df.apply (obviously, this isn't a completely fair comparison since you could use standard numba, but the point still stands). Any non-trivial UDF at scale gets a huge speedup vs pandas.
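To illustrate why a Vincenty-style distance counts as a non-trivial UDF, here is a minimal pure-Python sketch of Vincenty's inverse formula on the WGS-84 ellipsoid. It is not the code used in the comparison above; the function name, tolerance, and iteration cap are assumptions. The point it shows is that each row needs its own data-dependent convergence loop, which does not reduce to simple column arithmetic.

```python
import math

A_AXIS = 6378137.0              # WGS-84 semi-major axis in meters
F = 1 / 298.257223563           # WGS-84 flattening
B_AXIS = (1 - F) * A_AXIS       # semi-minor axis in meters

def vincenty_distance(lat1, lon1, lat2, lon2, max_iter=200, tol=1e-12):
    """Ellipsoidal distance in meters between two points given in degrees.

    Returns NaN if the iteration does not converge (nearly antipodal points).
    """
    u1 = math.atan((1 - F) * math.tan(math.radians(lat1)))
    u2 = math.atan((1 - F) * math.tan(math.radians(lat2)))
    big_l = math.radians(lon2 - lon1)
    sin_u1, cos_u1 = math.sin(u1), math.cos(u1)
    sin_u2, cos_u2 = math.sin(u2), math.cos(u2)

    lam = big_l
    for _ in range(max_iter):
        sin_lam, cos_lam = math.sin(lam), math.cos(lam)
        sin_sigma = math.sqrt((cos_u2 * sin_lam) ** 2
                              + (cos_u1 * sin_u2 - sin_u1 * cos_u2 * cos_lam) ** 2)
        if sin_sigma == 0.0:
            return 0.0  # coincident points
        cos_sigma = sin_u1 * sin_u2 + cos_u1 * cos_u2 * cos_lam
        sigma = math.atan2(sin_sigma, cos_sigma)
        sin_alpha = cos_u1 * cos_u2 * sin_lam / sin_sigma
        cos_sq_alpha = 1 - sin_alpha ** 2
        if cos_sq_alpha != 0.0:
            cos_2sm = cos_sigma - 2 * sin_u1 * sin_u2 / cos_sq_alpha
        else:
            cos_2sm = 0.0  # equatorial line
        c = F / 16 * cos_sq_alpha * (4 + F * (4 - 3 * cos_sq_alpha))
        lam_new = big_l + (1 - c) * F * sin_alpha * (
            sigma + c * sin_sigma * (cos_2sm + c * cos_sigma * (-1 + 2 * cos_2sm ** 2)))
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    else:
        return float("nan")

    u_sq = cos_sq_alpha * (A_AXIS ** 2 - B_AXIS ** 2) / B_AXIS ** 2
    big_a = 1 + u_sq / 16384 * (4096 + u_sq * (-768 + u_sq * (320 - 175 * u_sq)))
    big_b = u_sq / 1024 * (256 + u_sq * (-128 + u_sq * (74 - 47 * u_sq)))
    delta_sigma = big_b * sin_sigma * (
        cos_2sm + big_b / 4 * (cos_sigma * (-1 + 2 * cos_2sm ** 2)
                               - big_b / 6 * cos_2sm * (-3 + 4 * sin_sigma ** 2)
                               * (-3 + 4 * cos_2sm ** 2)))
    return B_AXIS * big_a * (sigma - delta_sigma)

# One degree of longitude along the equator, in meters.
equator_degree = vincenty_distance(0.0, 0.0, 0.0, 1.0)
```

With pandas, a function like this would typically be called once per row through `df.apply(..., axis=1)`. A GPU version would need the same loop rewritten as a numba.cuda-compatible kernel, and some constructs here (such as the `for`/`else`) may need adapting for that.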