Accelerating pandas with GPU: Sort the count of rows grouped on columns #191
Accelerating pandas with GPU: Sort the count of rows grouped on columns
Add %load_ext cudf.pandas before importing pandas to speed up operations using the GPU:

%load_ext cudf.pandas
import pandas as pd
import random
# Define the species and flower color categories
species_categories = ['setosa', 'versicolor', 'virginica']
flower_color_categories = ['red', 'yellow', 'green']

# Define the sepal length range based on typical iris flower measurements
sepal_length_range = (4.0, 8.0)

# Create data for 1,000,000 samples
n = 1000000
data = {
'sepal_length': [random.uniform(*sepal_length_range) for _ in range(n)],
'flower_color': [random.choice(flower_color_categories) for _ in range(n)],
'species': [random.choice(species_categories) for _ in range(n)]
}
df = pd.DataFrame(data)
df.groupby(['species','flower_color']).size().sort_values(ascending=False)
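To gauge the speedup, you can time the same operation with and without the extension loaded (restarting the kernel in between). A minimal sketch using the standard-library timer; the numbers will vary with your GPU and data:

import time

start = time.perf_counter()
# Same groupby + sort as above, timed end to end
counts = df.groupby(['species', 'flower_color']).size().sort_values(ascending=False)
print(f"groupby + sort took {time.perf_counter() - start:.3f}s")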
Accelerating pandas with GPU: Merging / Joining DataFrames
Add %load_ext cudf.pandas before importing pandas to speed up operations using the GPU:

%load_ext cudf.pandas
import pandas as pd
import numpy as np
# Define the number of rows
num_rows = 1000000
states = ["NY", "NJ", "CA", "TX"]
violations = ["Double Parking", "Expired Meter", "No Parking", "Fire Hydrant",
"Bus Stop"]
vehicle_types = ["SUBN", "SDN"]
# Generate random data for Dataset 1
data1 = {
"Registration State": np.random.choice(states, size=num_rows),
"Ticket Number": np.random.randint(1000000000, 9999999999, size=num_rows)
}
# Generate random data for Dataset 2
data2 = {
"Ticket Number": np.random.choice(data1['Ticket Number'], size=num_rows), # Reusing ticket numbers to ensure matches
"Violation Description": np.random.choice(violations, size=num_rows)
}
# Create DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Perform an inner join on 'Ticket Number'
merged_df = pd.merge(df1, df2, on="Ticket Number", how="inner")
# Display some of the joined data
print(merged_df.head())
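Since every key in df2 is sampled from df1's ticket numbers, the inner join keeps each row of df2 at least once, and any ticket number that happens to repeat in df1 multiplies its matches. A quick sanity check on the join cardinality, using only the DataFrames defined above:

# The result has at least len(df2) rows; duplicate keys in df1 can inflate it further
print(len(df1), len(df2), len(merged_df))
print("distinct matched tickets:", merged_df["Ticket Number"].nunique())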
Accelerating pandas with GPU: Groupby aggregate on timeseries data
Add %load_ext cudf.pandas before importing pandas to speed up operations using the GPU:

%load_ext cudf.pandas
import pandas as pd
import numpy as np
# Define the number of rows
num_rows = 1000000
# Define the possible values
states = ["NY", "NJ", "CA", "TX"]
violations = ["Double Parking", "Expired Meter", "No Parking", "Fire Hydrant", "Bus Stop"]
vehicle_types = ["SUBN", "SDN"]
start_date = "2022-01-01"
end_date = "2022-12-31"
# Create a date range
dates = pd.date_range(start=start_date, end=end_date, freq='D')
# Generate random data
data = {
"Registration State": np.random.choice(states, size=num_rows),
"Violation Description": np.random.choice(violations, size=num_rows),
"Vehicle Body Type": np.random.choice(vehicle_types, size=num_rows),
"Issue Date": np.random.choice(dates, size=num_rows),
"Ticket Number": np.random.randint(1000000000, 9999999999, size=num_rows)
}
# Create a DataFrame
df = pd.DataFrame(data)
# Add an issue_weekday column based on the "Issue Date"
weekday_names = {
0: "Monday",
1: "Tuesday",
2: "Wednesday",
3: "Thursday",
4: "Friday",
5: "Saturday",
6: "Sunday",
}
df["issue_weekday"] = df["Issue Date"].dt.weekday.map(weekday_names)
# Group by issue_weekday and count the tickets issued on each weekday
df.groupby(["issue_weekday"])["Ticket Number"].count().sort_values()
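As an aside, pandas can produce weekday names directly with dt.day_name(), which replaces the manual mapping dictionary (operations that aren't GPU-accelerated fall back to CPU transparently under cudf.pandas). A minimal equivalent sketch:

# dt.day_name() yields the weekday name for each date, so no mapping dict is needed
df["issue_weekday"] = df["Issue Date"].dt.day_name()
df.groupby("issue_weekday")["Ticket Number"].count().sort_values()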
Accelerating pandas with GPU: Count of values and GroupBy
Add %load_ext cudf.pandas before importing pandas to speed up operations using the GPU:

%load_ext cudf.pandas
import pandas as pd
import numpy as np
Randomly generated dataset of parking violations:
# Define the number of rows
num_rows = 1000000
states = ["NY", "NJ", "CA", "TX"]
violations = ["Double Parking", "Expired Meter", "No Parking",
"Fire Hydrant", "Bus Stop"]
vehicle_types = ["SUBN", "SDN"]
# Create a date range
start_date = "2022-01-01"
end_date = "2022-12-31"
dates = pd.date_range(start=start_date, end=end_date, freq='D')
# Generate random data
data = {
"Registration State": np.random.choice(states, size=num_rows),
"Violation Description": np.random.choice(violations, size=num_rows),
"Vehicle Body Type": np.random.choice(vehicle_types, size=num_rows),
"Issue Date": np.random.choice(dates, size=num_rows),
"Ticket Number": np.random.randint(1000000000, 9999999999, size=num_rows)
}
# Create a DataFrame
df = pd.DataFrame(data)
Which parking violation is most commonly committed by vehicles from various U.S. states?
(df[["Registration State", "Violation Description"]] # get only these two columns
.value_counts() # get the count of offences per state and per type of offence
.groupby("Registration State") # group by state
.head(1) # get the first row in each group (the type of offence with the largest count)
.sort_index() # sort by state name
.reset_index()
)
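The same question can be answered with a crosstab: build the state-by-violation count table, then take the most frequent column per row. A minimal alternative sketch:

# Cross-tabulate counts, then pick the most frequent violation per state
counts = pd.crosstab(df["Registration State"], df["Violation Description"])
print(counts.idxmax(axis=1))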
Accelerating pandas with GPU: Rolling Window Average
%load_ext cudf.pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Randomly generated dataset of parking violations:
# Define the number of rows
num_rows = 1000000
states = ["NY", "NJ", "CA", "TX"]
violations = ["Double Parking", "Expired Meter", "No Parking",
"Fire Hydrant", "Bus Stop"]
vehicle_types = ["SUBN", "SDN"]
# Create a date range
start_date = "2022-01-01"
end_date = "2022-12-31"
dates = pd.date_range(start=start_date, end=end_date, freq='D')
# Generate random data
data = {
"Registration State": np.random.choice(states, size=num_rows),
"Violation Description": np.random.choice(violations, size=num_rows),
"Vehicle Body Type": np.random.choice(vehicle_types, size=num_rows),
"Issue Date": np.random.choice(dates, size=num_rows),
"Ticket Number": np.random.randint(1000000000, 9999999999, size=num_rows)
}
# Create a DataFrame
df = pd.DataFrame(data)
How do parking violations change from day to day, segmented by vehicle type? Averaged using a 7-day rolling mean.
daily_counts = df.groupby(['Issue Date', 'Vehicle Body Type']
).size().unstack(fill_value=0)
# Calculate a 7-day rolling mean of daily violations for each vehicle type
rolling_means = daily_counts.rolling(window=7).mean()
# Plot the last 100 days of rolling means for each vehicle type
rolling_means.tail(100).plot(figsize=(14, 7),
title="7-Day Rolling Average of Parking Violations by Vehicle Type")
plt.ylabel("Average Number of Violations")
plt.xlabel("Date")
plt.show()
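Note that with window=7 the first six rows of rolling_means are NaN because the window isn't full yet. If values are wanted from the first day, min_periods lets the window start partially filled; a small variation:

# min_periods=1 emits a mean as soon as the window holds at least one observation
rolling_means = daily_counts.rolling(window=7, min_periods=1).mean()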