This project sets up an end-to-end data pipeline that extracts historical data from an Azure SQL Database, transforms it, and loads it into a structured format in Azure Synapse Analytics, using Azure Data Lake Storage (ADLS) and Delta Tables for efficient storage and querying.
The architecture diagram shows how data flows through the components. The pipeline is built with the following technologies:
- Python
- Azure SQL Database
- T-SQL (Transact-SQL)
- Azure Synapse Analytics
- Azure Data Lake Storage (ADLS)
- Azure Logic App
- Azure Notebook
- PySpark
- Delta Tables
For detailed descriptions of the pipeline's activities, see Pipeline Activities.

The pipeline organizes data into the following stages (a medallion architecture):
- Bronze Layer: Raw, unprocessed data ingested table-by-table from the Azure SQL Database. Each table is stored in Parquet format in Azure Data Lake Storage (ADLS) for further processing.
- Silver Layer: Cleaned and transformed data stored as Delta Tables for optimized querying and performance.
- Gold Layer: The final, optimized dataset containing dimension and fact tables, designed for high-performance analytics and reporting.
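As an illustration of the Bronze-to-Silver promotion, the sketch below reads a raw Parquet table from ADLS, applies an example cleaning step, and writes it back as a Delta Table. The `abfss://` container layout, storage account placeholder, and `SalesLT/Customer` table name are illustrative assumptions, not this project's actual configuration.

```python
def bronze_to_silver(bronze_path: str, silver_path: str) -> None:
    """Promote one Bronze-layer Parquet table to a Silver-layer Delta Table (sketch)."""
    # PySpark is imported inside the function so the module loads without a Spark runtime.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

    df = spark.read.parquet(bronze_path)  # raw Parquet written by the ingestion step

    # Example cleaning step: render every timestamp column as a yyyy-MM-dd string.
    for name, dtype in df.dtypes:
        if dtype == "timestamp":
            df = df.withColumn(name, F.date_format(F.col(name), "yyyy-MM-dd"))

    # Overwrite the Silver copy as a Delta Table for optimized querying.
    df.write.format("delta").mode("overwrite").save(silver_path)


# Example ADLS paths — storage account and table names are placeholders:
BRONZE = "abfss://bronze@<storage_account>.dfs.core.windows.net/SalesLT/Customer/"
SILVER = "abfss://silver@<storage_account>.dfs.core.windows.net/SalesLT/Customer/"
```

In a Synapse notebook the same logic runs against the workspace's built-in Spark session, with one call per ingested table.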