Text classification on the 20 Newsgroups dataset using TF-IDF features and classical ML models (Logistic Regression, Naive Bayes, Linear SVM).
Uses a 4-class subset of the 20 Newsgroups corpus (built into scikit-learn): sci.med, sci.space, rec.sport.baseball, talk.politics.guns
- ~3,500 samples total, split 70/15/15 (train/val/test) with stratification
- Headers, footers, and quotes removed to prevent metadata leakage
- Preprocessing: lowercase, remove near-empty docs, TF-IDF vectorization (fit on train only to avoid data leakage)
- Model selection: compare Logistic Regression, Multinomial NB, Linear SVM on validation macro-F1
- Evaluation: confusion matrix, per-class precision/recall, concrete error analysis with feature attribution
- Explainability: t-SNE embeddings, confidence calibration, per-class feature importance, model comparison radar chart, and feature overlap analysis
- Interactive Dashboard: Streamlit web UI for live classification, metrics exploration, and feature analysis
- All random seeds fixed (SEED=42, numpy, sklearn, PYTHONHASHSEED)
- Deterministic train/val/test split via stratify + random_state
- Experiment logs saved as JSON in outputs/logs/
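The pipeline described above (fixed seeds, a stratified 70/15/15 split, TF-IDF fitted on the training split only, model selection by validation macro-F1) can be sketched roughly as follows. This is a minimal illustration, not the code in src/train.py: the toy corpus, the variable names, and the mapping of the CLI flags to `max_features=10000` and `ngram_range=(1, 2)` are assumptions.

```python
import os
import random

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

SEED = 42
# PYTHONHASHSEED only changes hashing when set before the interpreter starts;
# setting it here mainly passes it on to subprocesses.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)

# Toy stand-in corpus (hypothetical); the project loads 20 Newsgroups instead.
topics = {
    "sci.space": "nasa orbit rocket satellite launch moon",
    "sci.med": "doctor patient disease treatment symptom drug",
    "rec.sport.baseball": "pitcher inning batter homerun league team",
    "talk.politics.guns": "firearm law rights weapon ban congress",
}
rng = random.Random(SEED)
texts, labels = [], []
for label, vocab in topics.items():
    words = vocab.split()
    for _ in range(40):
        texts.append(" ".join(rng.choices(words, k=8)))
        labels.append(label)

# 70/15/15: hold out 30% first, then cut the held-out part in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=SEED
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=SEED
)

# Fit the vectorizer on the training split only, then reuse it unchanged.
vectorizer = TfidfVectorizer(lowercase=True, max_features=10000, ngram_range=(1, 2))
Xtr = vectorizer.fit_transform(X_train)
Xva = vectorizer.transform(X_val)

candidates = {
    "logreg": LogisticRegression(C=1.0, max_iter=1000, random_state=SEED),
    "nb": MultinomialNB(),
    "svm": LinearSVC(C=1.0, random_state=SEED),
}
val_f1 = {}
for name, model in candidates.items():
    model.fit(Xtr, y_train)
    val_f1[name] = f1_score(y_val, model.predict(Xva), average="macro")

# The test split stays untouched until a single model has been chosen.
best_name = max(val_f1, key=val_f1.get)
print(val_f1, "->", best_name)
```

Calling `fit_transform` on the training texts and only `transform` on validation and test texts is what keeps vocabulary and IDF statistics from leaking across splits.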
pip install -r requirements.txt
# Train (compares 3 models, saves best checkpoint)
python -m src.train --lr 1.0 --max_features 10000 --ngram_max 2
# Evaluate on test set (confusion matrix + learning curve + error analysis)
python -m src.evaluate
# Predict on custom text
python -m src.predict "NASA launched a new satellite into orbit"
# Predict with built-in examples (one per category)
python -m src.predict
# Interactive prediction mode
python -m src.predict --interactive
# Generate explainability dashboard (all 5 visualisations)
python -m src.explainability
# Generate only the t-SNE embedding plot
python -m src.explainability --tsne_only
# Launch interactive web dashboard
streamlit run src/dashboard.py
# Run unit tests
python -m pytest tests/ -v
newsgroup-text-classifier/
├── src/ # Source code
│ ├── __init__.py
│ ├── preprocess.py # Data loading, cleaning, TF-IDF vectorization
│ ├── train.py # Training loop with model comparison
│ ├── evaluate.py # Test evaluation, confusion matrix, learning curve, error analysis
│ ├── predict.py # Interactive prediction CLI with confidence scores
│ ├── explainability.py # Model interpretability dashboard (t-SNE, calibration, radar chart)
│ └── dashboard.py # Interactive Streamlit web dashboard
├── tests/ # Unit tests
│ ├── test_data.py # Tests for data pipeline integrity
│ ├── test_predict.py # Tests for prediction module
│ └── test_explainability.py # Tests for explainability module
├── outputs/ # Generated artifacts
│ ├── checkpoints/ # Saved model checkpoints (.joblib)
│ ├── logs/ # Experiment logs (JSON)
│ └── figures/ # Plots (confusion matrix, learning curve, explainability)
├── requirements.txt # Python dependencies (pip)
├── README.md
└── .gitignore
Best model: Multinomial Naive Bayes (selected by validation macro-F1)
| Metric | Validation | Test |
|---|---|---|
| Accuracy | 93.56% | 93.79% |
| Macro F1 | 0.9352 | 0.9378 |
Train-test accuracy gap: 4.1% (no significant overfitting).
Top confusion pair: sci.space → sci.med (9 errors) — both categories share medical/scientific vocabulary.
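For readers who want to see how figures such as macro-F1 and the top confusion pair are derived from predictions, here is a small dependency-free sketch. The helper names `top_confusions` and `macro_f1` and the sample labels are hypothetical; the project itself presumably relies on scikit-learn's metrics.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, not the project's actual predictions.
y_true = ["sci.space", "sci.space", "sci.med", "sci.med", "sci.space"]
y_pred = ["sci.med", "sci.space", "sci.med", "sci.med", "sci.med"]

print(top_confusions(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

Because macro-F1 weights every class equally, a weak class pulls the score down even when overall accuracy looks high.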
$ python -m src.predict "NASA launched a new satellite into orbit around Mars"
Input: nasa launched a new satellite into orbit around mars
Prediction: sci.space
Confidence: 97.2%
Class probabilities:
sci.space 0.972 #############################
sci.med 0.016
rec.sport.baseball 0.007
talk.politics.guns 0.005
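A text bar chart like the one above can be produced with a few lines of plain Python. The sketch below is an illustration only: `format_probabilities` is a hypothetical helper, and the bar width of 30 characters is an assumption rather than the value used in src/predict.py.

```python
def format_probabilities(probs, width=30):
    """Render {label: probability} as aligned rows with a '#' bar, highest first."""
    name_width = max(len(name) for name in probs)
    rows = []
    for name, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * int(p * width)  # truncation: small probabilities get no bar
        rows.append(f"{name:<{name_width}} {p:.3f} {bar}".rstrip())
    return rows


probs = {
    "sci.med": 0.016,
    "sci.space": 0.972,
    "talk.politics.guns": 0.005,
    "rec.sport.baseball": 0.007,
}
rows = format_probabilities(probs)
for row in rows:
    print(row)
```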
Launch the Streamlit dashboard for a browser-based experience:
streamlit run src/dashboard.py
The dashboard includes four tabs:
| Tab | Description |
|---|---|
| 🔮 Classify Text | Type or paste any text for real-time classification with confidence bars and top contributing features |
| 📈 Model Metrics | Test-set accuracy, F1, interactive confusion matrix (Plotly), classification report, and per-class accuracy |
| 🧩 Feature Explorer | Browse the top discriminative words/phrases for each category with adjustable depth |
| 📊 Dataset | Class distribution donut chart, split sizes, and browsable sample documents |
Run python -m src.explainability to generate the full interpretability suite.
Per-class feature importance: shows the top discriminative words for each category, revealing what the model actually learns.
t-SNE embeddings: projects 10,000-dimensional TF-IDF vectors onto 2-D, revealing how documents naturally cluster by topic.
Confidence calibration: a reliability diagram comparing the model's predicted probability against actual accuracy, which indicates whether the model's confidence can be trusted.
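The numbers behind a reliability diagram come from binning predictions by confidence and comparing each bin's mean confidence with its accuracy. The sketch below assumes ten equal-width bins and adds expected calibration error (ECE) as a one-number summary. Both helper names are hypothetical, and ECE is a common companion metric rather than something the project is stated to report.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns (mean_confidence, accuracy, count) for every non-empty bin.
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # confidence sum, hits, count
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        sums[idx][0] += conf
        sums[idx][1] += int(hit)
        sums[idx][2] += 1
    return [(s / n, h / n, n) for s, h, n in sums if n]


def expected_calibration_error(bins):
    """Count-weighted mean gap between accuracy and confidence across bins."""
    total = sum(n for _, _, n in bins)
    return sum(n / total * abs(acc - conf) for conf, acc, n in bins)


# Hypothetical top-class confidences and whether each prediction was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]

bins = reliability_bins(confidences, correct)
ece = expected_calibration_error(bins)
print(bins)
print(round(ece, 3))
```

A well-calibrated model has points close to the diagonal of the diagram, which corresponds to an ECE near zero.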
Model comparison radar chart: compares Logistic Regression, Naive Bayes, and Linear SVM across accuracy, F1, precision, recall, and speed simultaneously.
Feature overlap analysis: shows vocabulary similarity between categories. High overlap helps explain why certain category pairs are confused more often.
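One simple way to quantify vocabulary similarity between categories is the Jaccard index over each category's top features. The sketch below assumes that measure; the source does not say which similarity the project uses, and the feature lists are invented for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(top_features):
    """Map each unordered category pair to the Jaccard overlap of its features."""
    return {
        (c1, c2): jaccard(top_features[c1], top_features[c2])
        for c1, c2 in combinations(sorted(top_features), 2)
    }


# Hypothetical top features per category, not taken from the trained model.
top_features = {
    "sci.med": ["study", "research", "patients", "disease"],
    "sci.space": ["study", "research", "orbit", "nasa"],
    "rec.sport.baseball": ["game", "team", "pitcher", "season"],
}

overlap = pairwise_overlap(top_features)
for pair, score in sorted(overlap.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```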
- Python 3.10+
- pip (see requirements.txt)
- CPU only (no GPU required)





