A demonstration project showcasing data processing and email cleaning using PySpark with Docker Compose orchestration.
- Email Data Cleaning: Advanced email validation and cleaning using PySpark
- Docker Compose Setup: Complete containerized Spark cluster environment
- Email Primitives: Reusable components for email processing
- Jupyter Integration: Interactive notebooks for data exploration
datacompose-demo/
├── docker-compose.yaml # Docker Compose configuration for Spark cluster
├── Dockerfile.dev # Development container configuration
├── datacompose.json # DataCompose configuration
├── notebooks/ # Jupyter notebooks for data processing
│ └── email_cleaning.py # Email cleaning implementation
└── build/ # Python modules
└── clean_emails/ # Email cleaning modules
└── email_primitives.py # Email processing primitives
- Docker and Docker Compose
- Python 3.8+
- Git
1. Clone the repository:
   - `git clone https://github.com/datacompose/datacompose-demo.git`
   - `cd datacompose-demo`
2. Start the Spark cluster:
   - `docker-compose -f docker-compose.yaml up --build -d`
3. Access Jupyter Notebook:
   - Open http://localhost:8888 in your browser
   - Navigate to the `notebooks` directory
4. Run the email cleaning demo:
   - Open `email_cleaning.py` or `email_cleaning.ipynb` in Jupyter
   - Execute the cells to see email validation and cleaning in action (a minimal session-setup sketch follows these steps)
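Inside the notebook, the first step is to create a Spark session against the cluster. A minimal sketch, assuming the master is reachable under the compose service name `spark-master` on the default port 7077 (check `docker-compose.yaml` for the actual values):

```python
from pyspark.sql import SparkSession

# "spark-master" and port 7077 are assumptions based on a typical
# Docker Compose Spark setup; adjust to match docker-compose.yaml.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("datacompose-demo")
    .getOrCreate()
)
print(spark.version)  # quick sanity check that the session is up
```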
The demo includes sophisticated email processing capabilities (see the sketch after this list):
- Validation: Detect invalid email formats
- Typo Correction: Fix common domain typos (gmail, yahoo, hotmail)
- Normalization: Handle Gmail dots and plus addressing
- Duplicate Detection: Identify duplicate emails across providers
- Data Quality: Generate quality scores and statistics
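A minimal sketch of these transformations using plain PySpark column functions; the actual demo relies on the primitives in `build/clean_emails/email_primitives.py`, whose API may differ:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("email-cleaning-sketch").getOrCreate()

# Toy input; the demo notebook reads its own data instead.
df = spark.createDataFrame(
    [("Alice", " Alice@gmali.com"), ("Bob", "b.o.b+promo@gmail.com"), ("Eve", "not-an-email")],
    ["name", "email"],
)

# A deliberately simple validation pattern; production validation is stricter.
EMAIL_RE = r"^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$"

cleaned = (
    df.withColumn("email", F.lower(F.trim("email")))
      # Validation: flag rows that do not match the pattern
      .withColumn("is_valid", F.col("email").rlike(EMAIL_RE))
      # Typo correction: fix one common domain misspelling
      .withColumn("email", F.regexp_replace("email", r"@gmali\.com$", "@gmail.com"))
      # Normalization: drop Gmail plus-addressing and dots in the local part
      .withColumn("local", F.split("email", "@").getItem(0))
      .withColumn("domain", F.split("email", "@").getItem(1))
      .withColumn(
          "local",
          F.when(
              F.col("domain") == "gmail.com",
              F.regexp_replace(F.regexp_replace("local", r"\+.*$", ""), r"\.", ""),
          ).otherwise(F.col("local")),
      )
      .withColumn("canonical_email", F.concat_ws("@", "local", "domain"))
)
cleaned.show(truncate=False)
```

With a canonical form in hand, duplicate detection reduces to grouping on `canonical_email`, and a quality score can be aggregated from the `is_valid` flag.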
The `docker-compose.yaml` file configures the following services (an illustrative sketch follows the list):
- Spark Master node
- Spark Worker nodes
- Jupyter Notebook server
- Shared volume for data persistence
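The repository ships its own `docker-compose.yaml`; the sketch below only illustrates the general shape of such a setup. Image names, service names, ports, and paths are assumptions, not the file's actual contents:

```yaml
services:
  spark-master:
    image: bitnami/spark:latest          # assumed image
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"                      # Spark Master web UI
  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    ports:
      - "8081:8081"                      # Spark Worker web UI
  jupyter:
    build:
      context: .
      dockerfile: Dockerfile.dev         # the repo's dev container
    ports:
      - "8888:8888"                      # Jupyter Notebook
    volumes:
      - ./notebooks:/home/jovyan/work    # assumed mount path for persistence
```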
- Spark Master: http://localhost:8080
- Spark Worker: http://localhost:8081
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - feel free to use this demo for learning and development.
For questions or issues, please open an issue in the GitHub repository.