Part of my data science portfolio: a machine learning system for binary classification of SMS messages.
This project develops a spam detection system using ML techniques. The current focus is on establishing strong baseline models and evaluation metrics.
This project builds on prior work in text analysis (e.g., Word Cloud Visualization, Travel Blog Analysis) and classification (e.g., SME Closure Prediction). The aim is to build a solid foundation, and a clear understanding of the core challenges in classification, before moving on to more sophisticated techniques.
- Python 3.9.13
- Data Processing: Pandas, NumPy
- Machine Learning: Scikit-learn
- NLP: NLTK, WordCloud
- Data Visualization: Matplotlib, Seaborn
/sms-spam-classifier
│
├── README.md # Project overview and documentation
├── LICENSE # Project license file
├── requirements.txt # Python dependencies
├── vs_code_setup.md # VS Code configuration guide
├── notebooks/ # Jupyter notebooks for analysis
├── src/ # Source code directory
│ ├── data/ # Data storage and processing
│ ├── models/ # Machine learning models
│ └── utils/ # Utility functions and helpers
├── tests/ # Unit tests
└── docs/ # Project documentation
- Implemented initial baseline models using different approaches:
  - Count Vectorizer + Logistic Regression
  - TF-IDF + Random Forest
- Enhanced Exploratory Data Analysis (EDA) focusing on:
  - Message length distribution analysis
  - Text feature analysis (word count, special characters, capitals ratio, etc.)
  - Word frequency visualization and word clouds
- Basic text preprocessing and model evaluation completed
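
The two baseline combinations listed above can be sketched with scikit-learn pipelines. This is a minimal illustration, not the project's actual code: the toy messages, the split ratio, the hyperparameters, and the names `baselines` and `predictions` are all assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy messages for illustration only; these are NOT rows from the real dataset.
messages = [
    "WIN a FREE prize now, call 0800 123 456",
    "URGENT! You have won cash, claim today",
    "Free entry to our weekly prize draw, text WIN",
    "Claim your free ringtone now, reply YES",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home",
    "Can you send me the notes from class?",
    "Happy birthday! See you tonight",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Stratified split so both classes appear in the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, stratify=labels, random_state=42
)

# Each baseline pairs a text vectorizer with a classifier (assumed settings).
baselines = {
    "count_logreg": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=1000)
    ),
    "tfidf_rf": make_pipeline(
        TfidfVectorizer(),
        RandomForestClassifier(n_estimators=100, random_state=42),
    ),
}

predictions = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name)
    # zero_division=0 avoids warnings when a class is never predicted.
    print(classification_report(y_test, predictions[name], zero_division=0))
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is fitted on the training split only, which keeps the evaluation free of leakage from the test messages. Scores on a toy set this small are not meaningful; the sketch only shows the shape of the workflow.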
- Model Performance Improvement
- Code Structure Enhancement
- Further EDA and Feature Engineering
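
The text features mentioned earlier (word count, special characters, capitals ratio) could be computed roughly as follows. This is a sketch using only the standard library; the function name `text_features`, the feature names, and the choice of `string.punctuation` as the definition of "special characters" are assumptions, not the project's actual definitions.

```python
import string


def text_features(message: str) -> dict:
    """Compute simple per-message features for EDA (hypothetical helper)."""
    letters = [ch for ch in message if ch.isalpha()]
    capitals = [ch for ch in letters if ch.isupper()]
    return {
        "char_count": len(message),
        "word_count": len(message.split()),
        # Assumption: "special characters" means ASCII punctuation.
        "special_char_count": sum(ch in string.punctuation for ch in message),
        # Share of alphabetic characters that are uppercase; 0.0 if no letters.
        "capitals_ratio": len(capitals) / len(letters) if letters else 0.0,
    }


example = text_features("WIN a FREE prize!!!")
print(example)
```

With Pandas, the same helper could be applied to a message column via `df["message"].apply(text_features)`, assuming a column of that name exists.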
- Using the UCI SMS Spam Collection Dataset from Kaggle
- Binary classification: spam vs ham (non-spam) messages
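
Loading the dataset into a labelled DataFrame might look like the sketch below. It assumes the Kaggle CSV stores the label in a column named `v1` and the message text in `v2`, and that the file is Latin-1 encoded; check these against the downloaded file. The helper name `load_sms_data` and the output column names are hypothetical, and the inline sample stands in for the real file.

```python
import io

import pandas as pd


def load_sms_data(source) -> pd.DataFrame:
    """Read the SMS CSV and add a numeric target column (hypothetical helper)."""
    # Assumption: label column is "v1", text column is "v2", Latin-1 encoding.
    df = pd.read_csv(source, encoding="latin-1", usecols=["v1", "v2"])
    df = df.rename(columns={"v1": "label", "v2": "message"})
    # Encode the binary target: spam -> 1, ham -> 0.
    df["is_spam"] = (df["label"] == "spam").astype(int)
    return df


# Tiny in-memory stand-in for the real file, for illustration only.
sample_csv = io.StringIO(
    "v1,v2\n"
    "ham,See you at lunch\n"
    'spam,"WIN cash now, call today"\n'
)
df = load_sms_data(sample_csv)
print(df["label"].value_counts())
```

In practice `source` would be the path to the downloaded CSV rather than an in-memory buffer.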
- Create virtual environment
python -m venv spam_detector_env
- Activate virtual environment
# Windows
spam_detector_env\Scripts\activate
# Mac/Linux
source spam_detector_env/bin/activate
- Install dependencies
pip install numpy pandas scikit-learn nltk wordcloud matplotlib seaborn jupyter
This project is part of my journey to become a data scientist who solves real-world problems through data-driven solutions.