User defined functions for cuDF tutorial #1

Open
beckernick opened this issue Jun 10, 2019 · 2 comments

Comments

@beckernick
Collaborator

beckernick commented Jun 10, 2019

@asmeurer noted that at least some of our introductory cuDF content doesn't have any kernel functions. We can use content from here if we want to discuss UDFs.

Or, for a less trivial example, we can compare "exact" lat/long distance UDFs in pandas and cuDF. With a numba.cuda Vincenty distance function and df.apply_rows, we were able to do in under 1 second what would take pandas 2.5 hours with df.apply. This isn't a completely fair comparison, since you could use standard numba, but the point still stands: any non-trivial UDF at scale gets a huge speedup over pandas.

@beckernick changed the title from "User defined functions for" to "User defined functions for cuDF tutorial" on Jun 10, 2019
@asmeurer
Contributor

I like the first example. But is performance the only advantage of using a kernel rather than something like this?

from numpy import cos, sin, arcsin as asin, sqrt, pi

def haversine_distance(df):
    """Haversine distance formula taken from Michael Dunn's StackOverflow post:
    https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points
    """
    x_1 = df['lat1']
    y_1 = df['lon1']
    x_2 = df['lat2']
    y_2 = df['lon2']

    x_1 = pi/180 * x_1
    y_1 = pi/180 * y_1
    x_2 = pi/180 * x_2
    y_2 = pi/180 * y_2

    dlon = y_2 - y_1
    dlat = x_2 - x_1
    a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2

    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers

    return c * r

haversine_distance(df)

(I'm also unclear why the kernel only works with functions from the math module. With numpy functions I get an error from numba.)
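For comparison, here is a minimal sketch of the same formula written in the row-wise loop shape that df.apply_rows expects. Inside the loop every value is a single scalar rather than a column, which is presumably why the scalar math functions fit there; that is an assumption, not a statement about what numba supports. The kernel name and the output column name are made up for illustration. The sketch runs on plain Python lists so it can be checked without a GPU, and the cuDF call in the trailing comment is untested here.

```python
import math


def haversine_kernel(lat1, lon1, lat2, lon2, out):
    """Row-wise haversine distance in km, written as an apply_rows-style loop."""
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(lat1, lon1, lat2, lon2)):
        # Each value here is one scalar, so the scalar math functions apply.
        x_1 = math.pi / 180 * x_1
        y_1 = math.pi / 180 * y_1
        x_2 = math.pi / 180 * x_2
        y_2 = math.pi / 180 * y_2

        dlon = y_2 - y_1
        dlat = x_2 - x_1
        a = math.sin(dlat / 2) ** 2 + math.cos(x_1) * math.cos(x_2) * math.sin(dlon / 2) ** 2

        out[i] = 2 * math.asin(math.sqrt(a)) * 6371  # Earth radius in kilometers


# CPU-only check on plain lists: identical points, then antipodal points.
lat1, lon1 = [0.0, 0.0], [0.0, 0.0]
lat2, lon2 = [0.0, 0.0], [0.0, 180.0]
out = [0.0, 0.0]
haversine_kernel(lat1, lon1, lat2, lon2, out)

# With cuDF the same function would be handed to apply_rows (not run here):
# df = df.apply_rows(haversine_kernel,
#                    incols=['lat1', 'lon1', 'lat2', 'lon2'],
#                    outcols=dict(out=np.float64),
#                    kwargs=dict())
```

The second point pair is half the Earth's circumference apart, so the result should be close to pi * 6371 km.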

We could also show an example where the high-level dataframe syntax doesn't work because the function doesn't vectorize nicely with the dataframe methods, for instance a function that uses an if statement and has to be applied row by row. (Side question: in most cases it seems cuDF could generate the kernel function automatically and accept a more direct function, similar to pandas.DataFrame.apply. Are there plans to implement that?)
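As a sketch of the kind of function meant here: a branch on each row's value cannot be written as a plain if on a whole column (in pandas, the truth value of a Series is ambiguous and raises an error), but it is straightforward in the row-wise loop form. The function name, the distance bands, and the cutoff parameters below are all hypothetical, and the check runs on plain Python lists rather than on a GPU.

```python
def distance_band_kernel(dist_km, out, short_cutoff, long_cutoff):
    """Label each distance as 0 (short), 1 (medium), or 2 (long)."""
    for i, d in enumerate(dist_km):
        # A per-row branch: each d is a scalar, so an ordinary if works.
        if d < short_cutoff:
            out[i] = 0
        elif d < long_cutoff:
            out[i] = 1
        else:
            out[i] = 2


# CPU-only check with hypothetical cutoffs of 1 km and 100 km.
dist_km = [0.5, 10.0, 500.0]
band = [-1, -1, -1]
distance_band_kernel(dist_km, band, short_cutoff=1.0, long_cutoff=100.0)

# With cuDF, the cutoffs would be passed as scalar kwargs (not run here):
# df = df.apply_rows(distance_band_kernel,
#                    incols=['dist_km'],
#                    outcols=dict(out=np.int32),
#                    kwargs=dict(short_cutoff=1.0, long_cutoff=100.0))
```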

The rolling window example is a little more confusing. It took me a little bit to understand why there are two kernel functions, one for the rolling window and one to fill in the gaps. Maybe we just need to do a good job of explaining it, but I'm wondering if we shouldn't go with something simpler there. If it took me a bit to understand what was going on, I expect it will be the same for most tutorial attendees. Is it pretty typical with apply_chunks to have to write a separate kernel to fill in gaps?
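One plausible reading of why a second pass appears, sketched on the CPU with plain lists: if each chunk is processed independently, a rolling window cannot reach back past the start of its chunk, so the first window - 1 positions of every chunk are left without a value and something has to fill them afterwards. This is a hypothetical illustration with made-up function names, not the kernels from the linked example, which may work differently.

```python
def rolling_mean_chunk(values, out, window):
    """Rolling mean within one chunk; positions without a full window stay None."""
    for i in range(len(values)):
        if i + 1 >= window:
            out[i] = sum(values[i + 1 - window:i + 1]) / window


def apply_in_chunks(values, chunk_size, window):
    """Run the rolling mean on each chunk independently and concatenate results."""
    result = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        out = [None] * len(chunk)
        rolling_mean_chunk(chunk, out, window)
        result.extend(out)
    return result


def fill_gaps(out, fill_value):
    """Second pass: replace the positions the first pass could not compute."""
    for i in range(len(out)):
        if out[i] is None:
            out[i] = fill_value


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
partial = apply_in_chunks(data, chunk_size=3, window=2)
filled = list(partial)
fill_gaps(filled, 0.0)
```

With a chunk size of 3 and a window of 2, the first position of each chunk is a gap before the fill pass.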

@asmeurer
Contributor

I've added a cudf UDF notebook based on this, with some exercises. https://github.com/Quansight/scipy-2019-rapids-tutorial/blob/master/cudf/cuDF%20UDF.ipynb

Let me know what you think.
