Binary file added .DS_Store
Binary file not shown.
60 changes: 60 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
# Pittsburgh Port Authority On-time Performance (OTP)

## Application Link: https://share.streamlit.io/jayantsravan/assignment-2-jayantsravan/main/application.py

## 1. Goals of the project

A city's public transportation system is a testament to how well the city is planned. Pittsburgh is famous for its efficient, well-connected, and punctual public transportation system. As an intrigued visitor to this city, I wanted to use data to understand (and let users visualize) how well this claim holds.

**The goal of this project is to give the user an understanding of how the On-time Performance of Pittsburgh Port Authority transit vehicles is distributed over time and across locations in the city.** A transit vehicle is said to be on time if it reaches a given destination within one minute of its scheduled time. The questions asked are:

a. How is the OTP distributed geographically in the city?

b. How has the OTP varied in the past?

Both questions can be explored at the desired granularity using filters.
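
To make the on-time definition above concrete, here is a minimal sketch of how an OTP figure could be computed from raw schedule deviations. The function name, the sample values, and the choice to count early arrivals the same way as late ones are assumptions for illustration; the datasets used by this project already ship a precomputed `on_time_percent` column.

```python
from typing import Iterable

# README definition: on time means within one minute of the scheduled time.
ON_TIME_TOLERANCE_MIN = 1.0


def on_time_performance(deviations_min: Iterable[float],
                        tolerance: float = ON_TIME_TOLERANCE_MIN) -> float:
    """Return the fraction of arrivals within `tolerance` minutes of schedule.

    Assumption: early and late arrivals are treated symmetrically.
    """
    deviations = list(deviations_min)
    if not deviations:
        raise ValueError("need at least one arrival")
    on_time = sum(1 for d in deviations if abs(d) <= tolerance)
    return on_time / len(deviations)


# Hypothetical deviations in minutes (negative = early, positive = late).
sample = [0.0, 0.5, -0.8, 2.5, 6.0]
otp = on_time_performance(sample)
```

Three of the five sample arrivals fall inside the one-minute window, so the sketch reports an OTP of 0.6.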

To my surprise, the data suggests that Pittsburgh public transit is remarkably punctual (at least towards the center of the city), and that it has been consistently so over the past 3 years and across the garages.

I have also built a 'Will my transit be on time?' feature into the application. I hope it helps users stay informed about possible delays.

## 2. Design choices

### a. Geographical distribution

I went with a pydeck interactive map that uses a hexagon layer to depict OTP. A hexagon layer was chosen because it uses the third dimension of the map to encode a bounded numerical value. It is easy to interpret and allows panning and zooming. A column layer marks the garage locations for reference.

Pydeck heatmaps were a possible alternative. They encode numerical data in color and make the map look cleaner. However, pydeck's heatmaps were slow to respond to interactions like zooming and panning. They are also less engaging than a hexagon layer because they lack a third dimension.
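
The idea behind the binned map can be sketched without any mapping library: group stops into spatial cells and average OTP per cell, so that each cell's height reflects punctuality and not how many stops it contains. The sketch below uses square cells as a simplified stand-in for hexagons; the function name, the cell size, and the sample stops are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def bin_mean_otp(stops, cell_deg=0.005):
    """Group (latitude, longitude, otp) tuples into square cells.

    Returns a dict mapping a cell index to the mean OTP of the stops in it.
    Square cells are a simplification; a hexagon layer bins into hexagons.
    """
    cells = defaultdict(list)
    for lat, lon, otp in stops:
        key = (int(lat // cell_deg), int(lon // cell_deg))
        cells[key].append(otp)
    return {key: mean(values) for key, values in cells.items()}


# Hypothetical stops: two close together, one several kilometres away.
sample_stops = [
    (40.4439, -79.9508, 0.70),
    (40.4441, -79.9506, 0.80),
    (40.5006, -80.0220, 0.60),
]
cell_means = bin_mean_otp(sample_stops)
```

The two neighbouring stops land in the same cell and are averaged, while the distant stop gets a cell of its own.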

### b. Time distribution

I went with the line chart offered by streamlit. This was an easy decision: the data is a time series, and for a general audience nothing depicts trends in a time series as well as a line chart.

Seaborn/Matplotlib line charts were a possible alternative, but without design changes they are not as polished as streamlit's integrated line charts.
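
As a rough sketch of the data shape such a line chart expects, the long-format OTP records can be pivoted into one column per series, indexed by month. The column names (`month_start`, `day_type`, `on_time_percent`) follow the dataset as used in `application.py`; the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows in the same long format as the monthly OTP dataset.
records = pd.DataFrame({
    "month_start": ["2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01", "2021-02-01"],
    "day_type": ["WEEKDAY", "SAT.", "WEEKDAY", "WEEKDAY", "SAT."],
    "on_time_percent": [0.70, 0.80, 0.60, 0.70, 0.75],
})
records["month_start"] = pd.to_datetime(records["month_start"])

# One row per month, one column per day type, mean OTP in each cell.
wide = records.pivot_table(
    index="month_start",
    columns="day_type",
    values="on_time_percent",
    aggfunc="mean",
)

# In a Streamlit script, the wide frame can be drawn directly:
# st.line_chart(wide)
```

Each column of `wide` becomes one line in the chart, and the datetime index becomes the x-axis.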

### c. Will my transit be on time?

Credit goes to streamlit's metric component. It is a well-thought-out offering that looks good and is useful in many scenarios. It inspired this module of my application, so it was the obvious choice for this encoding.
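
A minimal sketch of the numbers behind such a metric: the route's mean OTP as the headline value, and its relative difference from the mean over all routes as the delta. The helper name and the sample values are assumptions for illustration, and the sketch rounds where a plain `int()` conversion would truncate.

```python
def relative_delta_percent(route_mean: float, overall_mean: float) -> float:
    """Percentage by which a route's mean OTP differs from the overall mean."""
    if overall_mean == 0:
        raise ValueError("overall mean OTP must be non-zero")
    return (route_mean - overall_mean) / overall_mean * 100


# Illustrative values, not taken from the dataset.
route_mean = 0.72
overall_mean = 0.60

value_label = f"{round(route_mean * 100)}%"
delta_label = f"{round(relative_delta_percent(route_mean, overall_mean))}%"

# In a Streamlit script these would feed the metric component:
# st.metric(label="Average OTP of route X", value=value_label, delta=delta_label)
```

A positive delta is rendered as an improvement and a negative one as a decline, which matches the "better or worse than average" reading.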

## 3. Development Process

This is a solo project. I spent approximately 15 hours on this project.

Much of the time was spent learning the streamlit framework by experimenting with the demo code provided by streamlit. I also spent a significant amount of time discussing with friends and relatives which datasets could support an interesting question. Fortunately, I stumbled across these datasets while looking for data about the city I live in.

### Here is a split of the time taken in each task:

a. Learning streamlit and experimentation - 3 hours

b. Exploring the datasets and selecting one (two, in my case) - 4 hours

c. Experimenting with pydeck - 2 hours

d. Writing the application - 5 hours

e. Figuring out deployment - 1 hour

### Other interesting questions I was considering:

a. Is there a correlation between the income of a particular location in Pittsburgh and the proportion of people of a particular ethnicity living there? I explored census datasets extensively for this.

b. Is there a correlation between the OTP of the transit system in a region and the income of the people living there?

223 changes: 223 additions & 0 deletions application.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,223 @@
import numpy as np
import streamlit as st
import pandas as pd
import pydeck as pdk
from datetime import datetime
from dateutil.relativedelta import relativedelta

@st.cache(allow_output_mutation=True)
def getDatasets():
    # Download both WPRDC datasets once; the result is cached across reruns.
    url = "https://data.wprdc.org/datastore/dump/00eb9600-69b5-4f11-b20a-8c8ddd8cfe7a"
    otp_dataset = pd.read_csv(url)
    url = "https://data.wprdc.org/dataset/ece64ad3-05eb-46dd-ba38-c83b5373812f/resource/3f40b94b-4ac4-48f1-8c61-8439d2d2f420/download/wprdc_stop_data.csv"
    stops_dataset = pd.read_csv(url)

    return otp_dataset, stops_dataset

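def filterDatasets(dataframe, column, values):
    # A list keeps the rows whose column value is in the list.
    if isinstance(values, list):
        return dataframe[dataframe[column].isin(values)]
    # A datetime keeps the rows dated strictly after it.
    if isinstance(values, datetime):
        return dataframe[pd.to_datetime(dataframe[column]) > values]
    raise TypeError(f"Unsupported filter type: {type(values).__name__}")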

with st.spinner('Bringing you awesome...'):
    otp_dataset, stops_dataset = getDatasets()

# introduction page
def introduction(otp_dataset, stops_dataset):
    st.title('Pittsburgh Port Authority On-time Performance (OTP)')

    st.header('Introduction')
    "A city's public transportation system is a testament to how well the city is planned. Pittsburgh is famous for its efficient, well-connected, and punctual public transportation system. As an intrigued visitor to this city, I wanted to use data to understand (and let users visualize) how well this claim holds."

    "The goal of this project is to give the user an understanding of how the On-time Performance of Pittsburgh Port Authority transit vehicles is distributed over time and across locations in the city. A transit vehicle is said to be on time if it reaches a given destination within one minute of its scheduled time."

    "To my surprise, the data suggests that Pittsburgh public transit is remarkably punctual (at least towards the center of the city), and that it has been consistently so over the past 3 years and across the garages."

    st.header('Will my transit be on time?')
    "I have also built a 'Will my transit be on time?' feature into the application (look in navigation). I hope it helps users stay informed about possible delays."

    st.header('Data used')
    "All the data used is obtained from the Western Pennsylvania Regional Data Center (WPRDC). Specifically, I used the 'Monthly OTP by Route' and 'Monthly_Updating_Bus_Stop_Usage' datasets."
    st.markdown('Monthly OTP by Route - https://data.wprdc.org/dataset/port-authority-monthly-average-on-time-performance-by-route/resource/00eb9600-69b5-4f11-b20a-8c8ddd8cfe7a', unsafe_allow_html=True)
    st.markdown('Monthly_Updating_Bus_Stop_Usage - https://data.wprdc.org/dataset/port-authority-transit-stop-usage/resource/3f40b94b-4ac4-48f1-8c61-8439d2d2f420', unsafe_allow_html=True)

# geographic distribution page
def geographic_distribution(otp_dataset, stops_dataset):
    st.title('Geographic distribution of OTP')

    st.header('Filter the data')

    # Filter the otp data (keep rows with a positive OTP)
    otp_dataset_filtered = otp_dataset[otp_dataset['on_time_percent']>0]

    garages = ['Ross', 'Collier', 'East Liberty', 'East Liberty/West Mifflin']
    garageFilter = st.multiselect('1. Garage', garages)
    if not garageFilter:
        garageFilter = garages
    otp_dataset_filtered = filterDatasets(otp_dataset_filtered, 'current_garage', garageFilter)

    dayFilter = st.multiselect('2. Day of Week',['WEEKDAY', 'SAT.', 'SUN.'])
    if not dayFilter:
        dayFilter = ['WEEKDAY', 'SAT.', 'SUN.']
    # Copy before adding columns so pandas does not warn about writing to a slice.
    otp_dataset_filtered = filterDatasets(otp_dataset_filtered, 'day_type', dayFilter).copy()
    # A WEEKDAY row stands for five days; SAT. and SUN. rows stand for one day each.
    otp_dataset_filtered['day_weight'] = np.where(otp_dataset_filtered['day_type'] == 'WEEKDAY', 5, 1)
    otp_dataset_filtered['otp(weighted)'] = otp_dataset_filtered['day_weight'] * otp_dataset_filtered['on_time_percent']

    # timeDuration = st.selectbox('3. How recent data do you want to consider?', ('All time', 'Last 9 months', 'Last 12 months', 'Last 18 months'))
    # if timeDuration != 'All time':
    #     months = int(timeDuration.split(' ')[1])
    #     otp_dataset_filtered = filterDatasets(otp_dataset_filtered, 'month_start', datetime.now() - relativedelta(months=months))

    # filter the stops data
    stops = stops_dataset.drop_duplicates('stop_name')[['stop_name', 'latitude', 'longitude', 'routes_ser']].copy()
    # routes_ser is a comma-separated list of the routes serving each stop
    stops['routes'] = stops['routes_ser'].str.split(',')
    exploded_stops = stops.explode('routes').set_index('routes').drop(columns = ['routes_ser'])

    # weight the rows of otp dataset (for weekdays vs weekends)
    a = otp_dataset_filtered[['route','otp(weighted)', 'day_weight']].groupby(by = ['route']).sum().reset_index()
    a['on_time_percent'] = a['otp(weighted)']/a['day_weight']
    a = a.set_index('route')
    a = a.drop(columns = ['otp(weighted)', 'day_weight'])

    # combine datasets: one row per (route, stop), then average OTP per stop
    merged_df = a.join(exploded_stops)
    otp_by_stop = merged_df.groupby(by=['stop_name', 'latitude', 'longitude']).mean().reset_index()

    # plot the otp by stop dataset
    st.header('The distribution')
    elevation = 10
    garage_locations = pd.DataFrame(
        [
            ['Ross', 40.500564, -80.021977, elevation],
            ['East Liberty', 40.457482, -79.914569, elevation],
            ['Collier', 40.367203, -80.101355, elevation],
            ['West Mifflin', 40.362506, -79.931848, elevation],
        ],
        columns = ['Garage', 'latitude', 'longitude', 'elevation']
    )
    st.write(pdk.Deck(
            map_style = "mapbox://styles/mapbox/dark-v9",
            map_provider="mapbox",
            # Do not commit access tokens to the repository. The secret name
            # 'mapbox_api_key' is an assumption; define it in the app's Streamlit secrets.
            api_keys={'mapbox': st.secrets['mapbox_api_key']},
            initial_view_state = pdk.ViewState(
                latitude = 40.443903,
                longitude = -79.950834,
                zoom = 9.5,
                pitch = 40
            ),
            layers = [
                pdk.Layer(
                    "HexagonLayer",
                    data = otp_by_stop,
                    pickable = True,
                    extruded = True,
                    get_position = ['longitude', 'latitude'],
                    # HexagonLayer takes its height from get_elevation_weight; it has no
                    # get_weight or cell_size property, so without this the height is the stop count.
                    get_elevation_weight = "on_time_percent",
                    elevation_aggregation = pdk.types.String("MEAN"),
                    elevation_scale = 10,
                    radius = 200,
                    opacity = 0.7
                ),
                pdk.Layer(
                    "ColumnLayer",
                    data = garage_locations,
                    pickable = True,
                    extruded = True,
                    get_position = ['longitude', 'latitude'],
                    get_weight = "elevation",
                    radius = 500,
                    elevation_scale = 15
                )
            ]
        )
    )
    # Caption lists the garages actually selected, not every available garage.
    f"**Fig 1:** *Geographic Distribution of OTP in Pittsburgh city during {', '.join(dayFilter)} for {', '.join(garageFilter)} garages (represented as black bars).*"

    showDistributionData = st.checkbox("Show Data")
    if showDistributionData:
        otp_by_stop = otp_by_stop.set_index('stop_name')
        otp_by_stop

# time distribution page
def time_distribution(otp_dataset, stops_dataset):
    st.title('Time distribution of OTP')

    st.header('Filter the data')

    otp_dataset = otp_dataset[otp_dataset['on_time_percent']>0]
    otp_dataset_filtered = otp_dataset.copy()
    otp_dataset_filtered['month_start'] = pd.to_datetime(otp_dataset_filtered['month_start'])

    # Filter the data and sort based on perspectives
    perspectives = ['Routes', 'Weekday/Weekends', 'Garages']
    perspective = st.selectbox("Pick your perspective", perspectives)

    # create pandas dataframe with one column per selected value, indexed by month
    def create_and_render_chart(otp_dataset_filtered, column, values):
        list_of_series = []
        # drop missing values; list.remove(np.nan) is unreliable because NaN != NaN
        values = [value for value in values if pd.notna(value)]
        for value in values:
            a = otp_dataset_filtered[otp_dataset_filtered[column] == value][['on_time_percent', 'month_start']].groupby(by = 'month_start').mean().rename(columns = {'on_time_percent':value})
            list_of_series.append(a)
        day_wise_data = pd.concat(list_of_series, axis = 1)
        st.header('The time distribution')
        st.line_chart(day_wise_data)
        # str() guards against non-string values such as numeric route ids
        f"**Fig 2:** *Time Distribution of OTP in Pittsburgh city for {', '.join(map(str, values))}.*"
        showData = st.checkbox("Show Data")
        if showData:
            day_wise_data

    # act based on the choice made
    if perspective == perspectives[0]:
        column_values = [value for value in otp_dataset_filtered['route'].unique() if pd.notna(value)]
        routes = st.multiselect('Select 1-5 routes', column_values)
        route_len = len(routes)
        if route_len<1 or route_len>5:
            # error message
            st.warning(f"Select between 1 and 5 routes. {route_len} routes are currently selected.")
        else:
            create_and_render_chart(otp_dataset_filtered, 'route', routes)
    elif perspective == perspectives[1]:
        column_values = otp_dataset_filtered['day_type'].unique()
        create_and_render_chart(otp_dataset_filtered, 'day_type', column_values)
    elif perspective == perspectives[2]:
        column_values = otp_dataset_filtered['current_garage'].unique()
        create_and_render_chart(otp_dataset_filtered, 'current_garage', column_values)

# bus on time
def bus_on_time(otp_dataset, stops_dataset):
    st.title('Will my transit be on time?')
    otp_dataset = otp_dataset[otp_dataset['on_time_percent']>0]
    otp_dataset_filtered = otp_dataset.copy()
    otp_dataset_filtered['month_start'] = pd.to_datetime(otp_dataset_filtered['month_start'])
    globalOtpMean = otp_dataset_filtered['on_time_percent'].mean()
    column_values = [value for value in otp_dataset_filtered['route'].unique() if pd.notna(value)]
    route = st.selectbox('Select your route', column_values)
    otp_by_route = otp_dataset_filtered[otp_dataset_filtered['route'] == route]
    meanOtp = otp_by_route['on_time_percent'].mean()
    # relative difference (in percent) between this route and the mean over all routes
    betterBy = (meanOtp - globalOtpMean)/globalOtpMean * 100
    # round() avoids the downward bias of int() truncation
    st.metric(label=f"Average OTP of route {route}", value=f"{round(meanOtp * 100)}%", delta=f"{round(betterBy)}%")
    "* the delta shown above is in comparison to the average OTP of all transits"

def source():
    st.write("This application was built as a part of my assignment for Interactive Data Science (05-839) course at Carnegie Mellon University.")
    "For any queries or suggestions, email me at [email protected]"
    st.markdown("Link to the github repository for this project - https://github.com/JayantSravan/assignment-2-JayantSravan", unsafe_allow_html=True)

# Sidebar and navigation
st.sidebar.title('Navigation')
pages = ['Introduction', 'Geographic distribution of OTP', 'Time distribution of OTP', 'Will my transit be on time?', 'Source']
page = st.sidebar.radio('Go to:', pages)
if page == pages[0]:
    introduction(otp_dataset, stops_dataset)
elif page == pages[1]:
    geographic_distribution(otp_dataset, stops_dataset)
elif page == pages[2]:
    time_distribution(otp_dataset, stops_dataset)
elif page == pages[3]:
    bus_on_time(otp_dataset, stops_dataset)
elif page == pages[4]:
    source()
4 changes: 4 additions & 0 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
numpy
pandas
pydeck
streamlit