Data pipeline for analysis of postsecondary education data. Data is sourced from the College Scorecard API via a Python-based ETL process orchestrated with Airflow.
This project seeks to use publicly available postsecondary education data from the College Scorecard API to investigate the relationships between demographics and tuition costs. Some other potential research questions that could be explored using these data:
- Are secular institutions more racially diverse than religious institutions?
- What are the historical trends in enrollment at male-only/female-only institutions?
- In what regions or communities in the United States are for-profit institutions most common?
DAG tasks, in order of execution:
- Extract data from the U.S. Department of Education College Scorecard API
- Serialize data as JSON to `/data/raw/` in the project directory
- Upload the raw file to the AWS S3 raw bucket
- Transform data with `pandas` and serialize the cleaned CSV file to `/data/clean/`
- Upload the clean file to the AWS S3 clean bucket
- Load the clean data into the AWS RDS instance (a rough sketch of this DAG follows the list)
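For orientation only, here is a minimal sketch of how these five tasks might be chained with `PythonOperator`s in `dags/dag.py`. All callable names, file paths, and bucket names are placeholders assumed for illustration, not the repo's actual code.

```python
# Hypothetical sketch of the DAG structure described above (Airflow 2.x).
# Callable names, paths, and bucket names are placeholders; the real
# extraction/transformation logic lives in dags/dag_functions/.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_json() -> None:
    """Placeholder for dag_functions/extract.py: pull from the API, write JSON to /data/raw/."""


def transform_to_csv() -> None:
    """Placeholder for dag_functions/transform.py: clean with pandas, write CSV to /data/clean/."""


def upload_to_s3(local_path: str, bucket: str) -> None:
    """Placeholder: push a local file to S3 (e.g. with boto3 or Airflow's S3Hook)."""


def load_to_rds(csv_path: str) -> None:
    """Placeholder: load the cleaned CSV into the RDS instance."""


with DAG(
    dag_id="college_scorecard_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_to_json)
    upload_raw = PythonOperator(
        task_id="upload_raw",
        python_callable=upload_to_s3,
        op_kwargs={"local_path": "data/raw/scorecard.json", "bucket": "raw-bucket"},
    )
    transform = PythonOperator(task_id="transform", python_callable=transform_to_csv)
    upload_clean = PythonOperator(
        task_id="upload_clean",
        python_callable=upload_to_s3,
        op_kwargs={"local_path": "data/clean/scorecard.csv", "bucket": "clean-bucket"},
    )
    load = PythonOperator(
        task_id="load_to_rds",
        python_callable=load_to_rds,
        op_kwargs={"csv_path": "data/clean/scorecard.csv"},
    )

    # Mirrors the task order listed above
    extract >> upload_raw >> transform >> upload_clean >> load
```

Chaining with `>>` mirrors the execution order listed above; Airflow only runs a downstream task after its upstream task succeeds.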
```
├── dags
│   ├── dag.py                <- Airflow DAG
│   └── dag_functions
│       ├── extract.py        <- API extraction function
│       └── transform.py      <- data processing/cleaning function
├── data
│   ├── raw                   <- raw data pull from College Scorecard API
│   └── clean                 <- processed data in CSV format
├── db_build
│   ├── create_tables.SQL     <- create table shells
│   └── create_views.SQL      <- create table views
├── dashboard.py              <- Plotly dashboard app
├── LICENSE                   <- MIT license
├── README.md                 <- Top-level project README
└── docker-compose.yaml       <- Docker-Compose file w/ Airflow config
```
⚠️ Note: This project is no longer maintained and is potentially unstable, so running it is not advised at this time. Some preliminary instructions for Docker and Airflow configuration are provided below.
In order to execute the DAG, you'll need to store some information in a `.env` file in the top-level project directory. Be sure to add `.env` to the project's `.gitignore` file before publishing anything to GitHub!
It should look something like this:
```
API_KEY=[insert College Scorecard API key here]
AWS_ACCESS_KEY_ID=[insert AWS Access Key ID here]
AWS_SECRET_ACCESS_KEY=[insert AWS Secret Access Key here]
AIRFLOW_UID=501
```
No need to change `AIRFLOW_UID` - this sets the user ID that the Airflow containers run as. Refer to the official Airflow docs for more information.
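To illustrate how these values get used, here is a minimal sketch (assuming `python-dotenv` and `requests`) of an extraction call that reads `API_KEY` from `.env` and queries the College Scorecard API. The field list and output path are assumptions for illustration, not the repo's exact query.

```python
# Hedged sketch: reading .env values and pulling one page from the
# College Scorecard API. Fields and paths are illustrative only.
import json
import os

import requests
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads API_KEY, AWS keys, etc. from the .env file

API_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"

params = {
    "api_key": os.environ["API_KEY"],
    "fields": "school.name,school.state,latest.cost.tuition.in_state",  # example fields
    "per_page": 100,
    "page": 0,
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

# Serialize the raw payload to /data/raw/ as described in the DAG task list
with open("data/raw/scorecard_page_0.json", "w") as f:
    json.dump(response.json(), f)
```

The AWS keys would be consumed the same way: `boto3`, for example, picks up `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from the environment automatically.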
- Install Docker and Docker-Compose first if you don't already have them.
- Point a terminal at your project directory and execute the command below. This will download `docker-compose.yaml`:

```
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.3/docker-compose.yaml'
```

- Make sure you've set up your `.env` file properly. Initialize three folders in your top-level directory: `/dags/`, `/logs/`, and `/plugins/`.
- With the Docker application running on your computer, execute `docker-compose up airflow-init` in the terminal. This will initialize the Airflow instance and create an admin login with username and password both set to "airflow" by default.
- Finally, execute `docker-compose up`. This runs everything specified in `docker-compose.yaml`. You can check the health of your containers by opening a new terminal in the same directory and executing `docker ps`. You should now be able to open your web browser, go to `localhost:8080`, and log in to the Airflow web client.
Execute `docker-compose down --volumes --rmi all` to stop and delete all running containers, remove the volumes holding database data, and delete the downloaded images.
Docs:
- College Scorecard Data Documentation
- Apache Airflow Documentation
- Docker Documentation
- Requests Documentation (Python Library)
Helpful articles/videos:
- Docker for Data Science - A Step by Step Guide
- Airflow DAG: Coding your first DAG for Beginners
- Airflow Tutorial for Beginners - Full Course
Architecture inspo:
Major shout-out to Amanda Jayapurna for designing the cover image!