sagarlimbu0/AWS-Common-Crawl-Data-Pipeline-Analytics-with-PySpark-SQL

Architecture of the Data pipeline and Web-Scale Analytics Platform using AWS Common Crawl + PySpark

  • This is an ongoing project to build a data pipeline that ingests data from the source and stores it in a data warehouse.
  • It is an end-to-end big data pipeline that processes the AWS Open Data Common Crawl corpus using PySpark and Spark SQL to build a scalable data warehouse for advanced analytics.
  • The architecture and concept diagrams give a more detailed picture of the project. The project structure will evolve over time as new requirements are considered.
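The ingestion step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual download script: it assumes the public Common Crawl HTTPS endpoint and the per-crawl `warc.paths.gz` listing layout, and the crawl ID used in the usage note, along with the helper names `paths_listing_url` and `parse_paths`, are hypothetical choices made for the example.

```python
import gzip

# Assumption: public HTTPS endpoint that fronts the Common Crawl bucket.
BASE_URL = "https://data.commoncrawl.org/"


def paths_listing_url(crawl_id, kind="warc"):
    """Build the URL of the gzipped file that lists every object of one crawl."""
    return f"{BASE_URL}crawl-data/{crawl_id}/{kind}.paths.gz"


def parse_paths(gz_bytes, limit=None):
    """Turn the bytes of a *.paths.gz listing into full download URLs.

    The listing holds one relative object key per line; blank lines are skipped.
    """
    lines = gzip.decompress(gz_bytes).decode("utf-8").splitlines()
    urls = [BASE_URL + line.strip() for line in lines if line.strip()]
    return urls if limit is None else urls[:limit]
```

A caller would fetch `paths_listing_url("CC-MAIN-2024-10")` with any HTTP client, pass the response body to `parse_paths`, and hand the resulting URLs to the download job. Keeping the parsing separate from the network call makes the step easy to test offline.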

Key Highlights

  • Distributed ETL with PySpark on AWS
  • Large-scale web data cleaning & transformation
  • Spark SQL–based warehousing (Parquet / Delta / Redshift)
  • Data quality validation & schema enforcement
  • ML-ready datasets for segmentation, trend analysis, and predictive modeling
  • Scalable architecture for deep learning & AI applications
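To make the cleaning and schema-enforcement highlights concrete, here is a small sketch in plain Python. The field names in `REQUIRED_FIELDS` and both function names are hypothetical, since the project does not publish its schema; in the pipeline, logic like this would more likely run inside a PySpark transformation than in a local loop.

```python
# Hypothetical record schema: field name -> required Python type.
REQUIRED_FIELDS = {"url": str, "status": int, "text": str}


def validate_record(record):
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


def clean_record(record):
    """Collapse whitespace in the text field; return None for unusable records."""
    if validate_record(record):
        return None
    text = " ".join(record["text"].split())
    if not text:
        return None
    # Return a copy so the caller's record is left untouched.
    return {**record, "text": text}
```

Returning the list of problems, rather than raising on the first one, lets a data quality report count each kind of failure across a whole crawl segment.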

Project Architecture

Project Concept

Project Template

├── .github/
│   ├── workflows/
│   ├── download_aws_data.yml
├── dags/
├── docker/
│   ├── airflow.sh
│   ├── docker-compose.yml
│   ├── instructions.txt
├── docs/
├── notebooks/
├── scripts/
│   ├── download_aws_crawl_data/
├── src
│   ├── examples/
│   ├── my_app/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt

About

This project leverages the AWS-hosted Common Crawl open dataset to build a scalable big data pipeline for web-scale data processing and analytics. Using PySpark for distributed transformation and cleaning, and Spark SQL for structured querying and warehousing, the platform converts raw web crawl data into analytics-ready datasets.
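As an illustration of how PySpark transformation and Spark SQL querying might fit together, the sketch below derives a domain column from page URLs and aggregates it with a SQL query. The column names, the `pages` view, and the functions `extract_domain` and `top_domains` are assumptions made for this example, not the project's actual code. It runs Spark in local mode, so it needs `pyspark` and a Java runtime installed.

```python
from urllib.parse import urlsplit

# Hypothetical analytics query over a temporary view named "pages".
TOP_DOMAINS_SQL = """
SELECT domain, COUNT(*) AS page_count
FROM pages
GROUP BY domain
ORDER BY page_count DESC, domain
"""


def extract_domain(url):
    """Return the lowercased host of a URL, or None when it has no host."""
    return urlsplit(url).hostname


def top_domains(rows):
    """Count pages per domain for an iterable of (url, content_length) tuples."""
    # Imported lazily so extract_domain stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[*]").appName("cc-sketch").getOrCreate()
    try:
        df = spark.createDataFrame(rows, schema="url string, content_length long")
        domain_udf = udf(extract_domain, StringType())
        # Transformation step: add the domain column and drop rows without one.
        pages = df.withColumn("domain", domain_udf("url")).dropna(subset=["domain"])
        # Warehousing step: expose the cleaned data to Spark SQL.
        pages.createOrReplaceTempView("pages")
        result = spark.sql(TOP_DOMAINS_SQL).collect()
        return [(row["domain"], row["page_count"]) for row in result]
    finally:
        spark.stop()
```

In a real deployment the cleaned DataFrame would be written to Parquet or another warehouse format instead of being collected to the driver; collecting is used here only to keep the example small.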
