Data pipeline for analysis of postsecondary education data. Data is sourced from the College Scorecard API via a Python-based ETL process orchestrated with Airflow.
This project seeks to use publicly available postsecondary education data from the College Scorecard API to investigate the relationships between demographics and tuition costs. Some other potential research questions that could be explored using these data:
- Are secular institutions more racially diverse than religious institutions?
- What are the historical trends in enrollment at male-only/female-only institutions?
- In what regions or communities in the United States are for-profit institutions most common?
DAG tasks, in order of execution:
- Extract data from the U.S. Department of Education College Scorecard API
- Serialize data as JSON to `/data/raw/` in the project directory
- Upload the raw file to the AWS S3 raw bucket
- Transform data with `pandas` and serialize the cleaned CSV file to `/data/clean/`
- Upload the clean file to the AWS S3 clean bucket
- Load the clean data into the AWS RDS instance (a rough sketch of this DAG follows the list)
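For orientation only, here is a minimal sketch of how these five tasks might be chained with `PythonOperator`s in `dags/dag.py`. All callable names, file paths, and bucket names are placeholders assumed for illustration, not the repo's actual code.

```python
# Hypothetical sketch of the DAG structure described above (Airflow 2.x).
# Callable names, paths, and bucket names are placeholders; the real
# extraction/transformation logic lives in dags/dag_functions/.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_json() -> None:
    """Placeholder for dag_functions/extract.py: pull from the API, write JSON to /data/raw/."""


def transform_to_csv() -> None:
    """Placeholder for dag_functions/transform.py: clean with pandas, write CSV to /data/clean/."""


def upload_to_s3(local_path: str, bucket: str) -> None:
    """Placeholder: push a local file to S3 (e.g. with boto3 or Airflow's S3Hook)."""


def load_to_rds(csv_path: str) -> None:
    """Placeholder: load the cleaned CSV into the RDS instance."""


with DAG(
    dag_id="college_scorecard_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_to_json)
    upload_raw = PythonOperator(
        task_id="upload_raw",
        python_callable=upload_to_s3,
        op_kwargs={"local_path": "data/raw/scorecard.json", "bucket": "raw-bucket"},
    )
    transform = PythonOperator(task_id="transform", python_callable=transform_to_csv)
    upload_clean = PythonOperator(
        task_id="upload_clean",
        python_callable=upload_to_s3,
        op_kwargs={"local_path": "data/clean/scorecard.csv", "bucket": "clean-bucket"},
    )
    load = PythonOperator(
        task_id="load_to_rds",
        python_callable=load_to_rds,
        op_kwargs={"csv_path": "data/clean/scorecard.csv"},
    )

    # Mirrors the task order listed above
    extract >> upload_raw >> transform >> upload_clean >> load
```

Chaining with `>>` mirrors the execution order listed above; Airflow only runs a downstream task after its upstream task succeeds.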
```
├── dags
│   ├── dag.py                <- Airflow DAG
│   └── dag_functions
│       ├── extract.py        <- API extraction function
│       └── transform.py      <- data processing/cleaning function
├── data
│   ├── raw                   <- raw data pull from College Scorecard API
│   └── clean                 <- processed data in CSV format
├── db_build
│   ├── create_tables.SQL     <- create table shells
│   └── create_views.SQL      <- create table views
├── dashboard.py              <- Plotly dashboard app
├── LICENSE                   <- MIT license
├── README.md                 <- Top-level project README
└── docker-compose.yaml       <- Docker-Compose file w/ Airflow config
```
⚠️ Note: This project is no longer maintained and is potentially unstable, so running it is not advised at this time. Some preliminary instructions for Docker and Airflow configuration are provided below.
In order to execute the DAG, you'll need to store some information in a `.env` file in the top-level project directory. Be sure to add `.env` to the project's `.gitignore` file before publishing anything to GitHub!
It should look something like this:
```
API_KEY=[insert College Scorecard API key here]
AWS_ACCESS_KEY_ID=[insert AWS Access Key ID here]
AWS_SECRET_ACCESS_KEY=[insert AWS Secret Access Key here]
AIRFLOW_UID=501
```
No need to change `AIRFLOW_UID` - this sets the user ID that the Airflow containers run as. Refer to the official Airflow docs for more information.
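To illustrate how these values get used, here is a minimal sketch (assuming `python-dotenv` and `requests`) of an extraction call that reads `API_KEY` from `.env` and queries the College Scorecard API. The field list and output path are assumptions for illustration, not the repo's exact query.

```python
# Hedged sketch: reading .env values and pulling one page from the
# College Scorecard API. Fields and paths are illustrative only.
import json
import os

import requests
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads API_KEY, AWS keys, etc. from the .env file

API_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"

params = {
    "api_key": os.environ["API_KEY"],
    "fields": "school.name,school.state,latest.cost.tuition.in_state",  # example fields
    "per_page": 100,
    "page": 0,
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

# Serialize the raw payload to /data/raw/ as described in the DAG task list
with open("data/raw/scorecard_page_0.json", "w") as f:
    json.dump(response.json(), f)
```

The AWS keys would be consumed the same way: `boto3`, for example, picks up `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from the environment automatically.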
- Install Docker and Docker-Compose first if you don't already have them.
- Point a terminal at your project directory and execute the command below. This will download `docker-compose.yaml`:

```
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.3/docker-compose.yaml'
```

- Make sure you've set up your `.env` file properly. Initialize three folders in your top-level directory: `/dags/`, `/logs/`, and `/plugins/`.
- With the Docker application running on your computer, execute `docker-compose up airflow-init` in the terminal. This will initialize the Airflow instance and create an admin login with username and password both set to "airflow" by default.
- Finally, execute `docker-compose up`. This runs everything specified in `docker-compose.yaml`. You can check the health of your containers by opening a new terminal in the same directory and executing `docker ps`. You should now be able to open your web browser, go to `localhost:8080`, and log in to the Airflow web client.
Execute `docker-compose down --volumes --rmi all` to stop and delete all running containers, remove the volumes holding database data, and delete the downloaded images.
Docs:
- College Scorecard Data Documentation
- Apache Airflow Documentation
- Docker Documentation
- Requests Documentation (Python Library)
Helpful articles/videos:
- Docker for Data Science - A Step by Step Guide
- Airflow DAG: Coding your first DAG for Beginners
- Airflow Tutorial for Beginners - Full Course
Architecture inspo:
Major shout-out to Amanda Jayapurna for designing the cover image!