ETL pipelines and data warehouse construction and orchestration for the DS4A Data Engineering capstone project.
For this section of the capstone project, I have used the following resources and tools to build our data pipelines and set up our data warehouse:
- Python 3
- Prefect
- AWS S3
- AWS CloudFormation Stacks
- Databricks Workspaces + Data Warehouse
- Dask (on the Saturn Cloud Data Science platform)
- Prefect account
- Saturn Cloud account
- AWS account
- Databricks account
The code in this example uses Prefect for orchestration (figuring out what to do, and in what order) and a Dask cluster for execution (actually doing those things).
While Prefect is our orchestration tool of choice and our Dask cluster has been configured to execute those orchestration tasks, we still have to set up our system environment to be able to run the scripts that define those tasks, as shown below:
Create a new conda virtual environment:
$ conda create -n data_pipeline_env python=3.9
And activate it:
$ conda activate data_pipeline_env
Then, install the project dependencies:
$ pip install -r requirements.txt
And be sure to set the environment variable that holds our "read-safe" credentials for our Saturn Cloud account (replace the placeholder value with your own username):
$ export SATURN_USERNAME='SATURN_USERNAME'
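Scripts that need this value can then read it from the environment at runtime; a minimal sketch of that lookup:

import os

# Read the Saturn Cloud username exported above (raises KeyError if it is not set)
SATURN_USERNAME = os.environ["SATURN_USERNAME"]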
Finally, we authenticate with Prefect Cloud using our API key:
$ prefect auth login --key PREFECT_API_KEY
Since Prefect organizes flows in association with projects, I created a project called "data-pipeline-warehouse-t23"
within my account, and initialized the Prefect client within the Saturn setup script in preparation for the upcoming tasks:
import prefect
from prefect_saturn import PrefectCloudIntegration

PREFECT_CLOUD_PROJECT_NAME = "data-pipeline-warehouse-t23"
client = prefect.Client()
client.create_project(project_name=PREFECT_CLOUD_PROJECT_NAME)
integration = PrefectCloudIntegration(prefect_cloud_project_name=PREFECT_CLOUD_PROJECT_NAME)
Prefect organizes groups of tasks into "flows" (workflows), which are defined programmatically as Python functions. The following is a breakdown of each task we have defined in the scripts below, covering the extraction, transformation, and loading of the data flowing from the source APIs through our pipeline (a minimal flow sketch follows the list):
extract
: Retrieve the data needed from the respective API
transform
: Transform the returned API response into a dataframe that matches our warehouse table schemas
load
: Load the data into CSVs locally, and into S3 buckets that are synchronized to our Databricks Workspace + Warehouse cluster
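As a minimal sketch of how these three tasks fit together in a Prefect flow (the API endpoint, response shape, and output path here are illustrative placeholders, not the actual pipeline code):

import pandas as pd
import requests
from prefect import Flow, task

@task
def extract(api_url: str) -> dict:
    # Pull the raw JSON payload from the source API
    response = requests.get(api_url)
    response.raise_for_status()
    return response.json()

@task
def transform(payload: dict) -> pd.DataFrame:
    # Flatten the API response into a dataframe matching the warehouse table schema
    return pd.DataFrame(payload["data"])

@task
def load(df: pd.DataFrame, csv_path: str) -> None:
    # Write the dataframe out locally; the S3 / Databricks sync happens downstream
    df.to_csv(csv_path, index=False)

with Flow("treasury-data-flow") as flow:
    raw = extract("https://example.com/api/treasury")  # placeholder endpoint
    df = transform(raw)
    load(df, "data/treasury-data/treasury.csv")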
The integration sections that are commented out instantiate the flows in the Saturn <> Prefect integration, and then hand scheduling off to Prefect Cloud for execution under our project.
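Uncommented, those sections amount to roughly the following, using the integration object created earlier (flow refers to the flow defined in each script):

# Attach Saturn's Dask cluster configuration to the flow, then register it with Prefect Cloud
flow = integration.register_flow_with_saturn(flow=flow)
flow.register(project_name=PREFECT_CLOUD_PROJECT_NAME)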
For the sake of demonstration, you can also create some local directories to house the resultant CSV files:
$ mkdir -p data/treasury-data
$ mkdir -p data/natl-poverty-data
$ mkdir -p data/small-area-poverty-data
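The load step writes into these directories first and then mirrors the same files to S3; a minimal sketch of that step, assuming boto3 and a placeholder bucket name:

import boto3
import pandas as pd

def load_to_s3(df: pd.DataFrame, local_path: str, bucket: str, key: str) -> None:
    # Write the dataframe into the local data/ directory...
    df.to_csv(local_path, index=False)
    # ...then upload the same file to the S3 bucket synced with our Databricks workspace
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)

# Example usage (bucket and key are placeholders):
# load_to_s3(df, "data/treasury-data/treasury.csv", "example-capstone-bucket", "treasury-data/treasury.csv")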
When the Prefect flows are successfully created and scheduled in the cloud (via the commented code for each script above), the output will look like this:
In the Prefect Cloud dashboard, the flows will appear like so:
Our S3 bucket is linked to our CloudFormation stack, which houses our Databricks Workspace and Warehouse:
From the Databricks Workspace, every table in the warehouse can be queried via SQL from a single unified dashboard:
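For example, from a notebook attached to that cluster, any of the synced tables can be queried directly (the table name below is illustrative):

# Inside a Databricks notebook, where `spark` and `display` are provided by the runtime
poverty_df = spark.sql("SELECT * FROM small_area_poverty_data LIMIT 10")
display(poverty_df)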