Welcome to this exciting project! Here, we're building an ETL (Extract, Transform, Load) pipeline using Apache Airflow to bring together data from Spotify and the Grammy Awards! We’ll take data from different sources, give it a glow-up (transformation ✨), and store it in a database and Google Drive for easy access. A cool dashboard in Power BI will help visualize all that data! 🚀
- ETL Pipeline: Extracts data from a CSV file and a PostgreSQL database, transforms and merges it, then saves the result as a CSV on Google Drive and loads it back into the database.
- Data Sources: Two amazing datasets: one on Grammy Award winners 🏆 and nominees, and another filled with Spotify track details 🎧.
- Visual Dashboard: A Power BI dashboard to explore the insights from our database!
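At a high level, the pipeline boils down to extract, transform (merge), and load. Here is a minimal, self-contained sketch of that flow using pandas with toy in-memory data — the real pipeline reads spotify_dataset.csv and queries the Grammy table in PostgreSQL, so column names and join keys here are illustrative assumptions:

```python
import pandas as pd

# Extract: toy stand-ins for the Spotify CSV and the Grammy table in PostgreSQL.
spotify = pd.DataFrame({
    "track_name": ["Bad Guy", "Lose Yourself"],
    "artists": ["Billie Eilish", "Eminem"],
    "popularity": [95, 90],
})
grammys = pd.DataFrame({
    "nominee": ["Bad Guy", "Hello"],
    "artist": ["Billie Eilish", "Adele"],
    "winner": [True, True],
})

# Transform: normalize the join keys and merge the two sources on track title.
spotify["key"] = spotify["track_name"].str.lower()
grammys["key"] = grammys["nominee"].str.lower()
merged = spotify.merge(grammys, on="key", how="left")

# Load: in the real pipeline this result goes to PostgreSQL
# and to a CSV file uploaded to Google Drive.
print(merged[["track_name", "winner"]])
```

A left merge keeps every Spotify track, with Grammy columns filled in only where a nomination matches.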
dags/
- dag.py: Schedules and manages tasks in Airflow.
- etl.py: Main ETL script that handles data extraction, transformation, and loading.
data/
- Raw data files live here:
- spotify_dataset.csv: All about Spotify tracks.
- the_grammy_awards.csv: Grammy Awards historical data.
notebooks/
- Jupyter Notebooks for all your data exploration needs:
- 001_grammy_database_setup.ipynb: Setting up the database.
- 002_EDA_grammy_awards.ipynb: Exploring Grammy Awards data.
- 003_EDA_spotify.ipynb: Exploring Spotify tracks.
src/
- Core code lives here:
- database_connection/
- dbconnection.py: Handles database connections.
- model/
- models.py: Defines data models for easy database interaction.
- transforms/
- transform.py: Contains the magic transformations applied to the data.
Install Python 🐍
- Make sure you have Python 3 installed. Download it from the official Python site.
Install Power BI 📊
- Get the latest version from Microsoft Power BI.
Install VirtualBox 📦
- Download it from the VirtualBox website.
Install PostgreSQL 🐘
- Install it from the PostgreSQL website.
Create a VM:
- Open VirtualBox, hit "New," and name your VM (e.g., "Ubuntu-VM"). Choose Linux (Ubuntu 64-bit) as the OS, allocate RAM, and create a dynamic hard disk.
Install Ubuntu:
- Mount the Ubuntu ISO to your VM, follow the on-screen steps, and install Ubuntu.
Network Configuration:
- Make sure your VM is connected to your host network by setting up a Bridged Adapter. This allows communication between the VM and host.
Install PostgreSQL on Ubuntu:
- Configure PostgreSQL to accept remote connections (e.g., from your host machine) by editing the postgresql.conf and pg_hba.conf files.
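For example, the relevant settings might look like the following — this is a sketch, and the subnet shown is an assumption you should replace with your own network's range:

```
# postgresql.conf — listen on all network interfaces, not just localhost
listen_addresses = '*'

# pg_hba.conf — allow password logins from the host network (example subnet)
host    all    all    192.168.1.0/24    md5
```

Restart PostgreSQL afterwards (sudo systemctl restart postgresql) so the changes take effect.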
Update Ubuntu:
sudo apt-get update && sudo apt-get upgrade
Install Python & Git:
sudo apt-get install python3 python3-pip git
Clone the Project:
git clone https://github.com/alej0909/Workshop-2.git
cd Workshop-2
Set Up a Virtual Environment:
python3 -m venv venv
source venv/bin/activate
Install Requirements:
pip install -r requirements.txt
Create a .env file with your database details:
PGDIALECT=postgres
PGUSER=your_user
PGPASSWD=your_password
PGHOST=your_host
PGPORT=your_port
PGDB=your_database
WORK_DIR=/your/project/root
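These variables are typically read at runtime to assemble a database connection URL (the project presumably loads them with python-dotenv before connecting; dbconnection.py may differ in detail). A stdlib-only sketch with placeholder fallbacks:

```python
import os

def build_db_url() -> str:
    """Assemble a SQLAlchemy-style URL from the .env variables.

    Format assumed: DIALECT://USER:PASSWORD@HOST:PORT/DB
    """
    dialect = os.getenv("PGDIALECT", "postgresql")
    user = os.getenv("PGUSER", "your_user")
    passwd = os.getenv("PGPASSWD", "your_password")
    host = os.getenv("PGHOST", "localhost")
    port = os.getenv("PGPORT", "5432")
    db = os.getenv("PGDB", "your_database")
    return f"{dialect}://{user}:{passwd}@{host}:{port}/{db}"

print(build_db_url())
```

Keeping credentials in .env (and out of version control) means the same code runs unchanged on your machine and on the VM.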
Want to upload files to Google Drive? Follow these steps:
- Create a Google Cloud Account and set up a project.
- Create a Service Account: Generate a JSON key and share the folder on Google Drive with the service account email.
- Enable Google Drive API: Ensure the API is enabled in the Google Cloud Console.
You’ll now be able to upload files from your project to Google Drive! 🌍
Initialize Airflow:
mkdir ~/airflow
export AIRFLOW_HOME=~/airflow
airflow db init
Start Airflow:
airflow standalone
Access the Airflow Web Interface:
- Open your browser and go to http://<VM_IP>:8080 and log in with the default credentials:
- Username: airflow
- Password: airflow
- Note: recent versions of airflow standalone generate a random admin password instead; if the defaults don't work, check the console output or the standalone_admin_password.txt file in your AIRFLOW_HOME.
- Open Power BI and connect to your PostgreSQL database.
- Select the tables you want to visualize.
- Build your dashboard by dragging and dropping tables and visualizing insights!
Now, you’re all set to rock this data project! 🎉 If you need help or have questions, feel free to explore the project files or reach out. Happy coding and analyzing!