- This is an ongoing project to build a data pipeline that ingests data from its source and loads it into a data warehouse
- An end-to-end big data pipeline that processes the AWS Open Data Common Crawl corpus using PySpark and Spark SQL to build a scalable data warehouse for advanced analytics
- The diagram and concept sections provide a more detailed view of the project. The project structure will evolve over time as new requirements are considered.
- Distributed ETL with PySpark on AWS (see the ETL sketch after this list)
- Large-scale web data cleaning & transformation
- Spark SQL–based warehousing (Parquet / Delta / Redshift)
- Data quality validation & schema enforcement (see the quality-check sketch after this list)
- ML-ready datasets for segmentation, trend analysis, and predictive modeling
- Scalable architecture for deep learning & AI applications
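
Below is a minimal sketch of the ETL flow described above, assuming pre-parsed crawl records have already been landed as JSON on S3. The bucket paths and column names (`url`, `text`, `language`, `crawl_date`) are placeholders for illustration, not the project's actual schema:

```python
# Minimal PySpark ETL sketch: read raw crawl records, clean them,
# and write partitioned Parquet for the warehouse layer.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("common-crawl-etl")
    .getOrCreate()
)

# Hypothetical input: pre-parsed crawl records landed as JSON on S3.
raw = spark.read.json("s3a://my-bucket/common-crawl/raw/")  # placeholder path

# Basic cleaning: drop records without a URL, normalize whitespace,
# and keep only English-language pages (column names are assumptions).
cleaned = (
    raw
    .filter(F.col("url").isNotNull())
    .withColumn("text", F.regexp_replace(F.col("text"), r"\s+", " "))
    .filter(F.col("language") == "en")
)

# Write partitioned Parquet so downstream Spark SQL queries can prune by crawl date.
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("crawl_date")
    .parquet("s3a://my-bucket/common-crawl/warehouse/pages/")
)
```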
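And a sketch of the schema-enforcement and data-quality step. The schema fields and the 1% failure threshold are illustrative assumptions, not project requirements:

```python
# Schema enforcement and a simple data-quality gate for the warehouse tables.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("cc-quality-checks").getOrCreate()

# Enforce an explicit schema at read time instead of relying on inference.
page_schema = StructType([
    StructField("url", StringType(), nullable=False),
    StructField("text", StringType(), nullable=True),
    StructField("language", StringType(), nullable=True),
    StructField("fetched_at", TimestampType(), nullable=True),
])

pages = spark.read.schema(page_schema).parquet(
    "s3a://my-bucket/common-crawl/warehouse/pages/"  # placeholder path
)

# Simple quality gate: fail the job if too many rows are missing key fields.
total = pages.count()
bad = pages.filter(
    F.col("url").isNull()
    | F.col("text").isNull()
    | (F.length("text") < 1)
).count()
if total > 0 and bad / total > 0.01:  # 1% threshold is an assumption
    raise ValueError(f"Data quality check failed: {bad}/{total} bad rows")
```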
```
├── .github/
│   └── workflows/
│       └── download_aws_data.yml
├── dags/
├── docker/
│   ├── airflow.sh
│   ├── docker-compose.yml
│   └── instructions.txt
├── docs/
├── notebooks/
├── scripts/
│   └── download_aws_crawl_data/
├── src/
│   ├── examples/
│   └── my_app/
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt
```
