GitHub - sagarlimbu0/AWS-Common-Crawl-Data-Pipeline-Analytics-with-PySpark-SQL: This project leverages the AWS-hosted Common Crawl open dataset to build a scalable big data pipeline for web-scale data processing and analytics. Using PySpark for distributed transformation and cleaning, and Spark SQL for structured querying and warehousing, the platform converts raw web crawl data into analytics-ready datasets.

Architecture of the Data pipeline and Web-Scale Analytics Platform using AWS Common Crawl + PySpark

This is an ongoing project to build a data pipeline that ingests data from the source and stores it in a data warehouse.
It is an end-to-end big data pipeline that processes the AWS Open Data Common Crawl corpus using PySpark and Spark SQL to build a scalable data warehouse for advanced analytics.
The diagram and concept notes give a more detailed understanding of the project. The project structure will evolve over time as new requirements are considered.
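
As a rough sketch of the ingestion step, the snippet below builds the URL of a crawl's gzipped path listing and turns that listing into download URLs. It assumes the public HTTPS endpoint data.commoncrawl.org and the crawl-data/<crawl-id>/warc.paths.gz layout published by Common Crawl. The function names and the crawl ID used in the comments are illustrative and are not taken from this repository's scripts.

```python
import gzip
from typing import List, Optional

# Public HTTPS endpoint that fronts the Common Crawl S3 bucket.
BASE_URL = "https://data.commoncrawl.org/"


def paths_listing_url(crawl_id: str, kind: str = "warc") -> str:
    """Return the URL of the gzipped path listing for one crawl.

    crawl_id is a value such as "CC-MAIN-2024-10" (example only).
    kind selects the file family, e.g. "warc", "wat" or "wet".
    """
    return f"{BASE_URL}crawl-data/{crawl_id}/{kind}.paths.gz"


def parse_paths_listing(gz_bytes: bytes) -> List[str]:
    """Decompress a *.paths.gz payload into object paths, skipping blank lines."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def to_download_urls(paths: List[str], limit: Optional[int] = None) -> List[str]:
    """Prefix each object path with the base URL, optionally keeping only the first `limit`."""
    selected = paths if limit is None else paths[:limit]
    return [BASE_URL + path for path in selected]
```

In practice the listing bytes would come from an HTTP GET on paths_listing_url(...), and a small limit keeps a first run cheap, since a single crawl lists tens of thousands of files.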

Key Highlights

Distributed ETL with PySpark on AWS
Large-scale web data cleaning & transformation
Spark SQL–based warehousing (Parquet / Delta / Redshift)
Data quality validation & schema enforcement
ML-ready datasets for segmentation, trend analysis, and predictive modeling
Scalable architecture for deep learning & AI applications
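
To make the cleaning, validation and schema-enforcement highlights concrete, here is a minimal sketch under stated assumptions. The column names (url, host, status, content_length), the validation rules and the helper names are all hypothetical and do not come from this repository. The record-level logic is plain Python so it can be unit-tested without a cluster; build_clean_dataframe shows one way the same rules could be applied with PySpark, which is imported lazily and must be installed separately.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical target schema: column name -> required Python type.
SCHEMA = {"url": str, "host": str, "status": int, "content_length": int}


def clean_record(raw: dict) -> Optional[dict]:
    """Normalize one raw crawl record; return None if it fails validation."""
    url = str(raw.get("url") or "").strip()
    try:
        host = (urlsplit(url).hostname or "").lower()
        status = int(raw.get("status"))
        length = int(raw.get("content_length") or 0)
    except (TypeError, ValueError):
        return None  # unparseable URL or non-numeric field
    if not host:
        return None  # quality rule: the URL must contain a host
    if not 100 <= status <= 599 or length < 0:
        return None  # quality rule: plausible HTTP status, non-negative size
    return {"url": url, "host": host, "status": status, "content_length": length}


def conforms(record: dict) -> bool:
    """Check that a record has exactly the schema's columns with the right types."""
    return record.keys() == SCHEMA.keys() and all(
        isinstance(record[name], kind) for name, kind in SCHEMA.items()
    )


def build_clean_dataframe(spark, raw_rows):
    """Apply clean_record on Spark and enforce an explicit schema (needs pyspark)."""
    from pyspark.sql.types import (
        IntegerType, LongType, StringType, StructField, StructType,
    )

    spark_schema = StructType([
        StructField("url", StringType(), False),
        StructField("host", StringType(), False),
        StructField("status", IntegerType(), False),
        StructField("content_length", LongType(), False),
    ])
    cleaned = (
        spark.sparkContext.parallelize(raw_rows)
        .map(clean_record)
        .filter(lambda rec: rec is not None)
        .map(lambda rec: (rec["url"], rec["host"], rec["status"], rec["content_length"]))
    )
    return spark.createDataFrame(cleaned, schema=spark_schema)
```

The resulting DataFrame could then be written with df.write.parquet(...) or registered with df.createOrReplaceTempView(...) for Spark SQL queries. Keeping the rules in a plain function is a design choice that lets the same checks run in tests and on the cluster.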

Project Architecture

Project Template

├── .github/
│   ├── workflows/
│   ├── download_aws_data.yml
├── dags/
├── docker/
│   ├── airflow.sh
│   ├── docker-compose.yml
│   ├── instructions.txt
├── docs/
├── notebooks/
├── scripts/
│   ├── download_aws_crawl_data/
├── src
│   ├── examples/
│   ├── my_app/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt

Folders and files

.github/workflows
.ipynb_checkpoints
Data_Model
data
docker
docs
logs
notebooks
scripts
src
.DS_Store
.gitignore
01_Project Strucute.docx
02_example_spark_jobs.docx
CommonCrawl_SQL_Interview_Questions.pdf
DockerCompose.rtf
Dockerfile
LICENSE
README.md
compare_two_scripts.rtf
docker-compose.yaml
download_filenames.txt
instructions.rtf
requirements.txt
