A demonstration project showcasing data processing and email cleaning using PySpark with Docker Compose orchestration.
- Email Data Cleaning: Advanced email validation and cleaning using PySpark
- Docker Compose Setup: Complete containerized Spark cluster environment
- Email Primitives: Reusable components for email processing
- Jupyter Integration: Interactive notebooks for data exploration
datacompose-demo/
├── docker-compose.yaml # Docker Compose configuration for Spark cluster
├── Dockerfile.dev # Development container configuration
├── datacompose.json # DataCompose configuration
├── notebooks/ # Jupyter notebooks for data processing
│ └── email_cleaning.py # Email cleaning implementation
└── build/ # Python modules
└── clean_emails/ # Email cleaning modules
└── email_primitives.py # Email processing primitives
- Docker and Docker Compose
- Python 3.8+
- Git
1. Clone the repository:
   - `git clone https://github.com/datacompose/datacompose-demo.git`
   - `cd datacompose-demo`
2. Start the Spark cluster:
   - `docker-compose -f docker-compose.yaml up --build -d`
3. Access Jupyter Notebook:
   - Open http://localhost:8888 in your browser
   - Navigate to the `notebooks` directory
4. Run the email cleaning demo:
   - Open `email_cleaning.py` or `email_cleaning.ipynb` in Jupyter
   - Execute the cells to see email validation and cleaning in action (a minimal session-setup sketch follows these steps)
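Inside the notebook, the first step is to create a Spark session against the cluster. A minimal sketch, assuming the master is reachable under the compose service name `spark-master` on the default port 7077 (check `docker-compose.yaml` for the actual values):

```python
from pyspark.sql import SparkSession

# "spark-master" and port 7077 are assumptions based on a typical
# Docker Compose Spark setup; adjust to match docker-compose.yaml.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("datacompose-demo")
    .getOrCreate()
)
print(spark.version)  # quick sanity check that the session is up
```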
The demo includes sophisticated email processing capabilities (see the sketch after this list):
- Validation: Detect invalid email formats
- Typo Correction: Fix common domain typos (gmail, yahoo, hotmail)
- Normalization: Handle Gmail dots and plus addressing
- Duplicate Detection: Identify duplicate emails across providers
- Data Quality: Generate quality scores and statistics
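A minimal sketch of these transformations using plain PySpark column functions; the actual demo relies on the primitives in `build/clean_emails/email_primitives.py`, whose API may differ:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("email-cleaning-sketch").getOrCreate()

# Toy input; the demo notebook reads its own data instead.
df = spark.createDataFrame(
    [("Alice", " Alice@gmali.com"), ("Bob", "b.o.b+promo@gmail.com"), ("Eve", "not-an-email")],
    ["name", "email"],
)

# A deliberately simple validation pattern; production validation is stricter.
EMAIL_RE = r"^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$"

cleaned = (
    df.withColumn("email", F.lower(F.trim("email")))
      # Validation: flag rows that do not match the pattern
      .withColumn("is_valid", F.col("email").rlike(EMAIL_RE))
      # Typo correction: fix one common domain misspelling
      .withColumn("email", F.regexp_replace("email", r"@gmali\.com$", "@gmail.com"))
      # Normalization: drop Gmail plus-addressing and dots in the local part
      .withColumn("local", F.split("email", "@").getItem(0))
      .withColumn("domain", F.split("email", "@").getItem(1))
      .withColumn(
          "local",
          F.when(
              F.col("domain") == "gmail.com",
              F.regexp_replace(F.regexp_replace("local", r"\+.*$", ""), r"\.", ""),
          ).otherwise(F.col("local")),
      )
      .withColumn("canonical_email", F.concat_ws("@", "local", "domain"))
)
cleaned.show(truncate=False)
```

With a canonical form in hand, duplicate detection reduces to grouping on `canonical_email`, and a quality score can be aggregated from the `is_valid` flag.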
The `docker-compose.yaml` file configures the following services (an illustrative sketch follows the list):
- Spark Master node
- Spark Worker nodes
- Jupyter Notebook server
- Shared volume for data persistence
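The repository ships its own `docker-compose.yaml`; the sketch below only illustrates the general shape of such a setup. Image names, service names, ports, and paths are assumptions, not the file's actual contents:

```yaml
services:
  spark-master:
    image: bitnami/spark:latest          # assumed image
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"                      # Spark Master web UI
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    ports:
      - "8081:8081"                      # Spark Worker web UI
  jupyter:
    build:
      context: .
      dockerfile: Dockerfile.dev         # the repo's dev container
    ports:
      - "8888:8888"                      # Jupyter Notebook
    volumes:
      - ./notebooks:/home/jovyan/work    # assumed mount path for persistence
```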
- Spark Master: http://localhost:8080
- Spark Worker: http://localhost:8081
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - feel free to use this demo for learning and development.
For questions or issues, please open an issue in the GitHub repository.