This project is an ML-based data cleaning system that processes, cleans, and enhances a dataset (`zero_waste.csv`). The dataset has four columns: `text`, `hashtags`, `place_country_code`, and `Developed / Developing`. The pipeline performs data preprocessing, anomaly detection, ML-based insight generation, and visualization in a modular architecture.
- Data Cleaning: Removal of invalid symbols, URLs, case inconsistencies, duplicates, and standardization of text, hashtags, country codes, and development status.
- ML-based Anomaly Detection & Clustering: Using Isolation Forest and MiniBatchKMeans algorithms.
- Sentiment Analysis & Topic Classification: Leveraging TextBlob for sentiment analysis.
- Advanced Data Visualizations: Including pre/post-cleaning comparisons, anomaly visualizations, clustering insights, and performance metrics.
The dataset columns and their common error patterns:
- text: Contains social media posts about zero waste initiatives (common errors: URLs, special characters, emojis)
- hashtags: Contains hashtags associated with posts (common errors: inconsistent formatting, duplicates)
- place_country_code: ISO country codes (common errors: inconsistent casing, invalid codes)
- Developed / Developing: Development status of countries (common errors: inconsistent naming)
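The schema and the error patterns above can be spot-checked before cleaning. A minimal sketch using an inline sample: the column names match the dataset, but the rows and the check names are hypothetical.

```python
import csv
import io

# Hypothetical sample mimicking zero_waste.csv's four columns.
SAMPLE = """text,hashtags,place_country_code,Developed / Developing
"Loving the #zerowaste life! https://t.co/abc","#ZeroWaste #zerowaste",us,developed
"Recycling tips for beginners","#recycle",GB,Developed
"""

EXPECTED_COLUMNS = ["text", "hashtags", "place_country_code", "Developed / Developing"]

reader = csv.DictReader(io.StringIO(SAMPLE))
assert reader.fieldnames == EXPECTED_COLUMNS, "unexpected schema"

rows = list(reader)
# Count the error patterns described above: URLs in text, lowercase
# country codes, inconsistent development-status casing.
issues = {
    "url_in_text": sum("http" in r["text"] for r in rows),
    "lowercase_country": sum(not r["place_country_code"].isupper() for r in rows),
    "status_not_titlecase": sum(
        r["Developed / Developing"] not in ("Developed", "Developing") for r in rows
    ),
}
```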
```
Zero-Waste-Data-Cleaning-Pipeline/
├── data/
│   └── zero_waste.csv                      # Raw dataset
├── output/
│   ├── cleaned_data/                       # Processed dataset outputs
│   │   ├── cleaned_data_new.csv            # Final cleaned dataset
│   │   ├── data_summary.txt                # Dataset statistics and summary
│   │   └── ml_results.csv                  # Machine learning results
│   ├── models/                             # Saved ML models
│   └── visualization/                      # Generated visualizations
│       ├── missing_values_comparison.png   # Missing values before/after
│       ├── text_length_distribution.png    # Text length distribution
│       ├── text_length_comparison_kde.png  # Text length comparison
│       ├── hashtag_count_distribution.png  # Hashtag count distribution
│       ├── word_frequency.png              # Word frequency analysis
│       ├── word_cloud.png                  # Word cloud visualization
│       ├── top_hashtags.png                # Top hashtags network
│       ├── correlation_matrix.png          # Feature correlations
│       ├── boxplot_comparison.png          # Before/after box plots
│       ├── data_quality_radar.png          # Quality metrics radar chart
│       ├── clustering_2d.png               # Clustering results
│       ├── anomaly_detection.png           # Anomaly detection
│       └── sentiment_distribution.png      # Sentiment analysis
├── src/
│   ├── data_processing/                    # Data cleaning modules
│   │   ├── __init__.py
│   │   ├── data_cleaning.py                # Data cleaning functions
│   │   └── text_cleaning.py                # Text cleaning functions
│   ├── ml/                                 # ML components
│   │   ├── __init__.py
│   │   ├── ml_models.py                    # ML models
│   │   └── sentiment_analysis.py           # Sentiment analysis functions
│   ├── utils/                              # Utility functions
│   │   ├── __init__.py
│   │   └── utils.py                        # Helper functions
│   ├── visualization/                      # Visualization modules
│   │   ├── __init__.py
│   │   ├── base.py                         # Base visualization class
│   │   ├── comparative.py                  # Comparative visualizations
│   │   ├── config.py                       # Visualization configuration
│   │   └── visualization.py                # Visualization functions
│   └── main.py                             # Main execution script
├── requirements.txt                        # Project dependencies
└── README.md                               # Project documentation
```
The pipeline follows a modular architecture:
- Data Loading & Initial Analysis: Loads data and performs initial analysis
- Data Cleaning: Applies specialized cleaning functions to each column
- Feature Generation: Creates numeric features for ML algorithms
- Machine Learning: Applies clustering, anomaly detection, and sentiment analysis
- Visualization: Generates comprehensive visualizations of the data and results
- Output Generation: Saves cleaned data and results to output files
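The six stages above can be sketched as a simple orchestration. The function below is illustrative only; it is not the actual API of `src/main.py`.

```python
# Illustrative pipeline skeleton; the stage comments mirror the list
# above, but the signature and logic are hypothetical.
def run_pipeline(rows):
    stats = {"input_rows": len(rows)}              # Initial Analysis
    rows = [r for r in rows if r.get("text")]      # Data Cleaning (drop empty text)
    feats = [{"text_len": len(r["text"])} for r in rows]  # Feature Generation
    # ML, visualization, and output stages would consume `feats` here.
    stats["output_rows"] = len(rows)               # Output Generation
    return rows, feats, stats

rows, feats, stats = run_pipeline([{"text": "go zero waste"}, {"text": ""}])
```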
This pipeline is designed for data scientists and researchers working with social media data related to zero waste initiatives. It helps clean and prepare data for further analysis, identify patterns and anomalies, and generate insights through visualizations.
System requirements:
- Python: 3.7 or higher
- RAM: Minimum 8GB (16GB recommended for large datasets)
- Storage: Minimum 1GB free space
- OS: Windows, macOS, or Linux
- Clone the repository:

  ```shell
  git clone https://github.com/ReenaBharath/Data-Cleaning-using-ML-V1.git
  cd Data-Cleaning-using-ML-V1
  ```

- Set the working directory (in the Dockerfile):

  ```dockerfile
  WORKDIR /app
  ```

- Install C libraries and a compiler:

  ```dockerfile
  RUN apt-get update
  RUN apt-get install -y build-essential
  ```

- Install dependencies:

  ```dockerfile
  COPY requirements.txt requirements.txt
  RUN pip install -U pip setuptools wheel
  RUN pip install -r requirements.txt
  ```

- Copy the working files:

  ```dockerfile
  COPY . .
  ```
You can run the application using Docker or directly with Python:
- Build the Docker image:

  ```shell
  docker compose build
  ```

- Run the container:

  ```shell
  docker compose up --watch
  ```

Alternatively, run directly with Python:

- Run the main script:

  ```shell
  python src/main.py
  ```

- View the results in the `output` directory.
The project includes a comprehensive visualization framework with the following specifications:
- Resolution: 2560x1440 pixels
- DPI: 300 for print, 72-96 for digital
- Color: 24-bit true color
- Format: JPEG/PNG
- Layout: 50px margins, minimal white space
- Typography: Sans-serif (Arial/Helvetica), min 10pt
- Color Scheme: Max 4-5 colors, colorblind-friendly
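In matplotlib-style sizing, the pixel targets above follow from figure size in inches times DPI; a quick sketch of the arithmetic (the helper name is illustrative):

```python
# The same 2560x1440-pixel canvas corresponds to different physical
# sizes depending on DPI (pixels = inches * dpi).
TARGET_PX = (2560, 1440)

def figsize_inches(px, dpi):
    """Hypothetical helper: convert a pixel target to inches at a DPI."""
    return tuple(round(p / dpi, 2) for p in px)

print_size = figsize_inches(TARGET_PX, 300)   # 300 DPI for print
digital_size = figsize_inches(TARGET_PX, 96)  # 72-96 DPI for digital
```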
- Comparative Analysis
  - Pre- vs. post-cleaning distributions
  - Error reduction charts
  - Data quality metrics (radar charts)
  - Side-by-side box plots
- ML Component Visualizations
  - Anomaly detection plots (2D scatter)
  - Clustering results
  - Sentiment distribution charts
- Column-Specific Visualizations
  - Text length distributions
  - Hashtag networks
  - Country code distribution
  - Development status composition
- Performance Metrics
  - Processing time metrics
  - Data quality improvements

All visualizations follow these design principles:

- Clear readability without zooming
- Logical information flow
- Consistent design language
- Non-overlapping elements
The data cleaning pipeline processes each column with specialized cleaning functions:
For the `text` column:
- Removes URLs, special characters, and emojis
- Normalizes whitespace and case
- Removes non-ASCII characters
- Standardizes formatting
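The text-cleaning rules above can be sketched with a few regular expressions. This is an illustrative cleaner, not the project's actual function in `text_cleaning.py`; note that `#` is deliberately kept so hashtags embedded in posts survive.

```python
import re

def clean_text(text: str) -> str:
    """Illustrative text cleaner following the rules listed above."""
    text = re.sub(r"https?://\S+", "", text, flags=re.I)   # strip URLs
    text = text.encode("ascii", "ignore").decode()         # drop emojis / non-ASCII
    text = re.sub(r"[^a-zA-Z0-9\s#]", "", text)            # strip special characters
    text = re.sub(r"\s+", " ", text).strip()               # normalize whitespace
    return text.lower()                                    # normalize case

cleaned = clean_text("Check THIS out!! 🌱 https://t.co/xyz  #ZeroWaste")
# → "check this out #zerowaste"
```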
For the `hashtags` column:
- Removes duplicate hashtags
- Standardizes formatting (removes # symbol, lowercase)
- Counts hashtags for feature generation
- Handles missing values
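A minimal sketch of the hashtag rules above; the helper name and return shape are assumptions, not the project's API.

```python
def clean_hashtags(raw):
    """Illustrative: dedupe and standardize a raw hashtag field,
    returning the cleaned tags plus a count feature."""
    if not raw:                      # handle missing values
        return [], 0
    tags = [t.lstrip("#").lower() for t in raw.split()]  # drop '#', lowercase
    seen, unique = set(), []
    for t in tags:                   # order-preserving de-duplication
        if t and t not in seen:
            seen.add(t)
            unique.append(t)
    return unique, len(unique)       # count feeds feature generation

tags, n = clean_hashtags("#ZeroWaste #zerowaste #Recycle")
# → (["zerowaste", "recycle"], 2)
```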
For the `place_country_code` column:
- Standardizes to ISO 3166-1 alpha-2 format
- Corrects common misspellings and variations
- Validates against a reference list of country codes
- Handles missing values
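A sketch of the country-code steps above. The reference set and alias map here are tiny stand-ins; the real pipeline would validate against a full ISO 3166-1 alpha-2 list.

```python
# Hypothetical reference data for illustration only.
VALID_CODES = {"US", "GB", "IN", "DE", "FR"}
ALIASES = {"UK": "GB"}  # common variation -> ISO code

def clean_country_code(code):
    """Illustrative standardizer for the steps listed above."""
    if not code or not str(code).strip():
        return None                                   # handle missing values
    code = str(code).strip().upper()                  # fix inconsistent casing
    code = ALIASES.get(code, code)                    # correct known variations
    return code if code in VALID_CODES else None      # validate against reference
```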
For the `Developed / Developing` column:
- Standardizes to "Developed" or "Developing"
- Corrects inconsistent naming
- Maps countries to their development status
- Handles missing values
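The status rules above reduce to normalizing the label and falling back to a country-to-status map for missing or invalid values. The map below is a small hypothetical sample, not real reference data.

```python
# Hypothetical country -> development status map for illustration.
STATUS_BY_COUNTRY = {"US": "Developed", "GB": "Developed", "IN": "Developing"}

def clean_status(value, country_code=None):
    """Illustrative cleaner for the Developed / Developing column."""
    if value:
        v = str(value).strip().lower()
        if v in ("developed", "developing"):       # fix inconsistent naming
            return v.capitalize()
    # Missing or invalid: map the country to its development status.
    return STATUS_BY_COUNTRY.get(country_code)
```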
The pipeline includes several ML components:
Clustering:
- Uses MiniBatchKMeans for efficient clustering
- Identifies 5 distinct clusters in the data
- Visualizes clusters in 2D space
- Provides cluster statistics and insights
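A minimal sketch of the clustering step with scikit-learn's `MiniBatchKMeans`; the toy features below stand in for the pipeline's generated numeric features.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Toy 2-D features (e.g. text length, hashtag count) drawn as five
# well-separated blobs, standing in for the generated features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=i * 10, scale=1.0, size=(20, 2)) for i in range(5)])

# The pipeline identifies 5 clusters; MiniBatchKMeans trades a little
# accuracy for speed on large datasets.
model = MiniBatchKMeans(n_clusters=5, random_state=42, n_init=10)
labels = model.fit_predict(X)
n_clusters_found = len(set(labels))
```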
Anomaly detection:
- Uses Isolation Forest for anomaly detection
- Identifies approximately 5% of data points as anomalies
- Visualizes anomalies in 2D space
- Provides anomaly statistics and insights
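The ~5% anomaly rate above corresponds to the `contamination` parameter of scikit-learn's `IsolationForest`; a sketch on stand-in features:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # stand-in numeric features

# contamination=0.05 flags roughly 5% of points, as in the pipeline.
iso = IsolationForest(contamination=0.05, random_state=0)
pred = iso.fit_predict(X)              # -1 = anomaly, 1 = normal
anomaly_rate = (pred == -1).mean()
```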
Sentiment analysis:
- Uses TextBlob for sentiment analysis
- Classifies text as positive, negative, or neutral
- Visualizes sentiment distribution
- Provides sentiment statistics and insights
The pipeline tracks several performance metrics:
- Processing Time: Total and per-component processing time
- Memory Usage: Memory consumption during processing
- Data Quality Improvement: Before/after comparison of data quality
- Cleaning Effectiveness: Percentage of values modified in each column
- Anomaly Detection Rate: Percentage of data points identified as anomalies
- Clustering Quality: Silhouette score and other clustering metrics
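The silhouette score mentioned above is available as scikit-learn's `silhouette_score`; a sketch on stand-in clustered data:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

# Three well-separated blobs standing in for the pipeline's features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=i * 10, size=(30, 2)) for i in range(3)])

labels = MiniBatchKMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X)
score = silhouette_score(X, labels)  # in [-1, 1]; higher = tighter, better-separated clusters
```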
- Visualization Framework Enhancements:
  - Added a comprehensive visualization framework with 13+ visualization types
  - Implemented pre/post-cleaning comparisons
  - Added data quality radar charts
  - Enhanced text length and hashtag visualizations
- Data Cleaning Improvements:
  - Fixed hashtag count calculation
  - Improved text cleaning to ensure proper output format
  - Enhanced country code standardization
  - Fixed development status cleaning
- ML Component Upgrades:
  - Improved clustering visualization
  - Enhanced anomaly detection
  - Upgraded sentiment analysis
- Code Quality Enhancements:
  - Applied the DRY (Don't Repeat Yourself) principle to eliminate code duplication
  - Improved function documentation and comments
  - Enhanced code readability and maintainability
  - Fixed path handling in the visualization module
- Reena Bharath