ETL Cloud Architecture

The pipeline is built on Azure, combining non-relational storage in Azure Data Lake, Spark processing in Azure Databricks, and Azure Synapse Analytics as the analytical data warehouse.

  • Extraction: File extraction is automated with the Selenium Python library and a headless Chrome driver running on an Azure Windows VM, and the extracted files are saved to Azure Data Lake (see the extraction sketch below).
  • Transformation: After files are extracted, transformations are performed with PySpark (the Python API for Spark) in Azure Databricks, and the processed data is written back to Azure Data Lake as Parquet files (see the transformation sketch below).
  • Orchestration: Azure Data Factory orchestrates the pipeline, loading both historical and delta data into Azure Synapse Analytics (formerly Azure SQL Data Warehouse) (see the trigger sketch below).
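
To make the extraction step concrete, here is a minimal sketch of a headless Chrome download followed by an upload to Azure Data Lake Storage Gen2. The source URL, link text, storage account, container, and file paths are hypothetical placeholders, not taken from this project.

```python
# Extraction sketch: headless Chrome downloads a file, which is then
# uploaded to Azure Data Lake Storage Gen2. All names below are
# hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from azure.storage.filedatalake import DataLakeServiceClient

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/reports")                 # hypothetical source page
driver.find_element(By.LINK_TEXT, "Download CSV").click() # hypothetical download link
driver.quit()

# Upload the downloaded file to the lake (account and key are placeholders).
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)
fs = service.get_file_system_client("raw")                # hypothetical container
with open("report.csv", "rb") as f:
    fs.get_file_client("landing/report.csv").upload_data(f, overwrite=True)
```

On the Azure Windows VM, a script like this would typically run on a schedule (for example, via Task Scheduler).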
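
The transformation step could look like the following PySpark sketch for a Databricks notebook, assuming CSV input in the lake and a couple of illustrative cleanup rules. The storage paths and column names (id, load_date) are hypothetical; in Databricks, the SparkSession is already provided as spark.

```python
# Transformation sketch: read raw CSVs from the lake, apply example
# cleanup, and write the result back as Parquet. Paths and columns
# are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-transform").getOrCreate()

raw = spark.read.csv(
    "abfss://raw@<account>.dfs.core.windows.net/landing/",  # hypothetical path
    header=True,
    inferSchema=True,
)

# Example cleanup: drop duplicates, parse a date column, filter null keys.
clean = (
    raw.dropDuplicates()
       .withColumn("load_date", F.to_date(F.col("load_date"), "yyyy-MM-dd"))
       .filter(F.col("id").isNotNull())
)

# Persist the processed data back to the lake as Parquet.
clean.write.mode("overwrite").parquet(
    "abfss://processed@<account>.dfs.core.windows.net/curated/"
)
```

Writing Parquet here keeps the curated layer columnar and compressed, which suits downstream loads into Synapse.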
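
Pipeline authoring itself happens in the Data Factory UI or as pipeline JSON, but a run can also be triggered programmatically. Below is a hedged sketch using the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and pipeline names, as well as the loadType parameter, are hypothetical.

```python
# Trigger sketch: start a Data Factory pipeline run from Python.
# All resource names and parameters are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",      # placeholder
)

run = client.pipelines.create_run(
    resource_group_name="rg-etl",             # hypothetical resource group
    factory_name="adf-etl",                   # hypothetical factory
    pipeline_name="load-synapse",             # hypothetical pipeline
    parameters={"loadType": "delta"},         # e.g. historical vs. delta load
)
print(f"Started pipeline run: {run.run_id}")
```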
