Presented by
- Jhonatan Steven Morales Hernandez: [email protected]
- Carol Dayana Varela Cortez: [email protected]
- Manuel Alejandro Gruezo [email protected]
This project integrates various technologies to create a complete data pipeline, from data source collection to real-time visualization. It starts with the selection of two data sources: one from a dataset and another from an external API. The data is processed through an Apache Airflow DAG, which handles the ETL tasks and stores the resulting dimensional model in a database. The project also uses Apache Kafka for real-time streaming of metrics from the fact table, which are visualized in an interactive dashboard built with Power BI. To ensure real-time data flow, a Python application acts as a Kafka consumer.
Additionally, the entire project is dockerized to ensure consistent deployment across environments, with all dependencies encapsulated in containers. This makes it easy to run the system on any machine with Docker, streamlining both the development and deployment processes.
This project is developed within a virtual machine running Ubuntu.
In this project, we will work with two main datasets:
-
π Cardio Train Dataset: This dataset contains information on various health indicators of individuals, such as age, gender, height, weight, blood pressure, cholesterol levels, and more. The goal is to explore these data points to identify potential correlations and patterns that could be useful in predicting cardiovascular diseases.
-
β οΈ Cause Of Deaths: This dataset provides information on causes of death at a global level, broken down by country and year. It records a wide range of diseases and conditions, such as meningitis, respiratory diseases, and various forms of cancer, among others. However, the main focus is on mortality associated with cardiovascular diseases.
The dataset is composed of three types of input features:
- Objective: Factual information.
- Examination: Results from medical examinations.
- Subjective: Information provided by the patient.
Feature | Variable Type | Variable | Value Type |
---|---|---|---|
Age | Objective Feature | age | int (days) |
Height | Objective Feature | height | int (cm) |
Weight | Objective Feature | weight | float (kg) |
Gender | Objective Feature | gender | categorical code |
Systolic Blood Pressure | Examination Feature | ap_hi | int |
Diastolic Blood Pressure | Examination Feature | ap_lo | int |
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
Smoking | Subjective Feature | smoke | binary |
Alcohol Intake | Subjective Feature | alco | binary |
Physical Activity | Subjective Feature | active | binary |
Cardiovascular Disease | Target Variable | cardio | binary |
All dataset values were collected at the time of the medical examination.
This dataset originally had 32 features, but the ones selected for this project are:
Feature | Variable | Value Type |
---|---|---|
Country/Territory | country | String (Name of the Country/Territory) |
Code | code | String (Country/Territory Code) |
Year | year | int (Year of the Incident) |
Cardiovascular Diseases | cardio_diseases | int (No. of People died from Cardiovascular Diseases) |
The main goal of this project is to integrate various technologies and tools to create a real-time data processing and visualization solution. The specific objectives are:
-
Integrate multiple data sources ππ: Select and combine data from a dataset and an external API to build a dimensional model.
-
Design and execute an ETL pipeline with Airflow πβοΈ: Create an Apache Airflow DAG to manage the extract, transform, and load (ETL) tasks, storing the resulting dimensional model in a database.
-
Implement a real-time data streaming system ππ‘: Use Apache Kafka to stream real-time metrics from the fact table of the dimensional model and enable continuous data ingestion.
-
Develop an interactive dashboard ππ: Build a real-time dashboard using visualization tools like Power BI or Looker Studio to display the data processed through the ETL pipeline in Airflow.
-
Create a Python app to consume real-time data ππ»: Develop a Python application to act as a Kafka consumer, receiving and processing the streamed metrics in real time.
-
Visualize data in real time β±οΈπ : Connect the interactive dashboard with the Python app and Kafka consumer to provide a real-time data visualization of the streamed metrics.
-
Dockerize the project π³π§: Containerize the entire project using Docker to ensure portability and ease of deployment across different environments.
-
dags: This folder includes DAG files necessary for orchestrating ETL workflows in Apache Airflow. It contains
__init__.py
,dags.py
, andetl.py
, which are essential for defining tasks and scheduling. -
data: This folder holds the raw data files used in the project, including
cardio_train.csv
,cause_of_deaths.csv
, andowid.csv
. These files serve as the input data for analysis and model training. -
logs: A folder containing log files that help in monitoring the workflow processes and identifying issues during execution. The
.gitkeep
file ensures that the folder is tracked by Git even when empty. -
notebooks: This directory contains Jupyter notebooks used for exploratory data analysis (EDA) and other key processes:
- EDA/002_EDA_dataset.ipynb β Initial EDA for understanding the dataset.
- 005_EDA_API.ipynb β EDA related to API data sources.
- Great Expectations (GX)/006_great_expectations.ipynb β Notebook for data validation using Great Expectations.
- 004_API.ipynb β Notebook for merging data obtained from the API.
- database_process/001_DataSetup.ipynb β Notebook for setting up the database.
- 003_database_processed.ipynb β Notebook documenting the data processing stage.
-
gx: This folder is dedicated to Great Expectations for data validation. It contains:
- expectations β JSON files like
cardio_train_expectations.json
,cause_of_deaths_expectations.json
, andowid_expectations.json
define the validation rules. - plugins/custom_data_docs/styles/data_docs_custom_styles.css β Custom styles for the data documentation.
- .ge_store_backend_id and great_expectations.yml β Configuration files for managing expectations and storing metadata.
- expectations β JSON files like
-
src: Contains core code for database interaction, models, and data validation:
- database β Code for database connection and table creation, including
dbconnection.py
andcreateTable.py
. - gx_utils β Contains
validation.py
for running data validations. - model β Holds
models.py
and related files defining the data model structure. - streaming β Code for data streaming operations, including
data_to_powerbi.py
andkafka_utils.py
. - transform β Code for data transformation, with scripts like
DimensionalModels.py
andTransformData.py
.
- database β Code for database connection and table creation, including
-
Dashboard-streaming.mp4 / Dashboard.pdf: Visual materials showcasing the dashboard and how data is presented using streaming updates.
-
Dockerfile / Dockerfile.jupyter: Docker configuration files for setting up the environment, including Jupyter for development.
-
Documentation.pdf: Comprehensive documentation for the project, outlining methodologies and results.
-
example_env: A template environment file to guide users in setting up their environment variables.
-
main.py: The main script that acts as an entry point for the project.
-
requirements.txt: Lists all the dependencies needed to run the project.
-
docker-compose.yml: Configuration file for orchestrating multi-container Docker applications.
- Install Python : Python Downloads
- Install Power BI : Install Power BI Desktop
- Install Docker
-
Update your system
Update the package list and install available updates:sudo apt update && sudo apt upgrade -y
-
Install certificates and data transfer tool
Install the necessary certificates and curl for data transfer:sudo apt-get install ca-certificates curl
-
Create a secure directory for APT repository keys
Create a directory to store the repository keys:sudo install -m 0755 -d /etc/apt/keyrings
-
Download and save the Docker GPG key to the system
Download the Docker GPG key and save it in the created directory:sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
-
Grant read permissions to all users for the Docker GPG key
Allow all users to read the GPG key:sudo chmod a+r /etc/apt/keyrings/docker.asc
-
Add the Docker repository and update the package list
Add the Docker repository to the APT sources and update the package list:echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null sudo apt-get update
-
Install Docker Engine, CLI, Containerd, Buildx, and Compose plugins
Install Docker and its necessary components:sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
-
Clone this repository:
git clone https://github.com/alej0909/ETL-1.git cd ETL-1
-
βοΈ Configuration of the .env File
- Create a .env File
Create a.env
file with the following configuration (you can refer to theexample_env
file):# Airflow Configuration AIRFLOW_UID=50000 AIRFLOW_GID=50000 # PostgreSQL Configuration PGUSER=airflow PGPASSWD=airflow PGDIALECT=postgresql PGHOST=postgres PGPORT=5432 PGDB=airflow # Kafka Configuration KAFKA_BROKER=kafka:9092 # Airflow Admin User AIRFLOW_ADMIN_USER=admin AIRFLOW_ADMIN_PASSWORD=admin [email protected] # Power BI API URL (for data streaming) POWERBI_API=<your_power_bi_api_url> # Working Directory for Airflow WORK_DIR=/opt/airflow
This command changes the ownership of the ./logs
directory to ensure Docker has proper access to the log files.
sudo chown -R 50000:50000 ./logs
Run the following command to create and start the necessary Docker containers:
docker compose up -d
- Open your web browser and navigate to http://localhost:8888.
- Locate and open the notebook
003_database_processed.ipynb
located in thedatabase_process
folder. - Run all the cells in the notebook to process the data and prepare the database.
- Open a new terminal or command prompt window.
- Ensure that the streaming process is actively listening by running any necessary commands or scripts as specified in your project documentation.
python main.py
- Once the Jupyter Notebook is executed and the containers are running, navigate to http://localhost:8080 to access the Airflow web interface.
- Verify that the DAGs are running as expected and monitor the data processing workflow.
- Wait for Airflow to complete the data processing tasks. You can monitor the progress and logs directly in the Airflow interface.
- Once completed, you can proceed to validate the outputs or perform further analysis as needed.
-
Start Power BI Desktop on your Windows machine.
-
Get Data:
- On the home screen, click "Get Data."
-
Select PostgreSQL:
- In the "Get Data" window, choose "PostgreSQL Database" and click "Connect."
-
Configure the Connection:
-
Authentication:
-
Load Data:
- Click "Connect" and if the connection is successful, you will see the available tables in your database. Select the tables you want to import and click "Load."
- Once your data is loaded into Power BI, you can start creating visualizations. Drag and drop fields from your tables onto the report canvas to create charts, tables, and other visual elements.
- Customize the layout and design of your dashboard. Add filters, slicers, and interactive elements to make your dashboard informative and user-friendly.
- Save your Power BI file and, if desired, publish it to the Power BI service for sharing and collaboration.
Congratulations! You have successfully created a dashboard in Power BI using data from a PostgreSQL database.