🇺🇸 EN: A machine learning project that classifies tweets as disaster-related or not using Natural Language Processing (NLP) techniques. Built for the Kaggle "Natural Language Processing with Disaster Tweets" competition, it demonstrates an end-to-end NLP pipeline: exploratory data analysis, TF-IDF vectorization, and logistic regression modeling.
🇺🇸 EN:
- Build a robust NLP classifier for disaster tweet detection
- Implement comprehensive exploratory data analysis (EDA)
- Apply feature engineering techniques (TF-IDF, text preprocessing)
- Achieve competitive performance on Kaggle leaderboard
- Demonstrate best practices in ML project structure
- Text Preprocessing: Cleaning, tokenization, and normalization
- Exploratory Data Analysis: Comprehensive data visualization and insights
- Feature Engineering: TF-IDF vectorization and text feature extraction
- Machine Learning Model: Logistic Regression classifier
- Interactive Visualizations: Word clouds and frequency analysis with Plotly
- Modular Code Structure: Organized Python modules for maintainability
- Jupyter Notebook: Complete analysis workflow
- Model Evaluation (In Progress): Cross-validation and metrics analysis
- Hyperparameter Tuning (Planned): Grid search optimization
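The preprocessing, TF-IDF, and logistic regression steps listed above could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: it assumes scikit-learn is installed, and `clean_tweet`, `build_pipeline`, the regular expressions, and the toy tweets are all illustrative names and data invented here.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative patterns; the project's text_preprocessor.py may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)  # also drops the '#' of hashtags
    return " ".join(text.split())  # collapse repeated whitespace


def build_pipeline() -> Pipeline:
    """TF-IDF over unigrams and bigrams feeding a logistic regression."""
    return Pipeline(
        [
            # A custom preprocessor replaces the built-in lowercasing step,
            # which is why clean_tweet lowercases the text itself.
            ("tfidf", TfidfVectorizer(preprocessor=clean_tweet, ngram_range=(1, 2))),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )


# Toy tweets invented for illustration; 1 = disaster, 0 = not a disaster.
texts = [
    "Wildfire evacuation ordered for the northern suburbs http://t.co/a1",
    "Flood waters rising downtown, roads closed",
    "Earthquake damages several buildings near the coast",
    "Emergency crews respond to a large warehouse fire",
    "This new album is fire",
    "My exam was a total disaster lol",
    "Traffic was a flood of bad drivers today",
    "That concert was an earthquake of sound @friend",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = build_pipeline()
model.fit(texts, labels)
predictions = model.predict(["Flood warning issued for the river valley"])
print(predictions)
```

With the real Kaggle data, `texts` and `labels` would come from the `text` and `target` columns of `train.csv`. Keeping the vectorizer inside the pipeline means it is refit on each training fold during cross-validation, which avoids leaking vocabulary statistics from held-out data.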
```
Natural-Language-Processing-with-Disaster-Tweets/
├── data/
│   ├── raw/                  # Original Kaggle datasets
│   ├── processed/            # Cleaned and preprocessed data
│   └── predictions/          # Model predictions
├── src/
│   ├── data_loader.py        # Data loading utilities
│   ├── text_preprocessor.py  # Text preprocessing pipeline
│   ├── main.py               # Main execution script
│   └── config.py             # Configuration settings
├── notebooks/
│   └── disaster_tweets_analysis.ipynb  # Complete analysis workflow
├── img/                      # Visualization outputs
├── results/                  # Model outputs and metrics
├── requirements.txt          # Python dependencies
├── README.md                 # This file
└── LICENSE                   # Apache 2.0 License
```
- Python 3.8 or higher
- Kaggle account and API credentials
- 8GB+ RAM recommended for large text processing
- Clone the repository

  ```bash
  git clone https://github.com/bellDataSc/Natural-Language-Processing-with-Disaster-Tweets.git
  cd Natural-Language-Processing-with-Disaster-Tweets
  ```

- Create a virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Set up the Kaggle API and download the data

  ```bash
  # Place your kaggle.json in ~/.kaggle/
  kaggle competitions download -c nlp-getting-started
  unzip nlp-getting-started.zip -d data/raw/
  ```
```python
# Assumes src/ is on the import path (for example, run from inside src/).
from data_loader import DataLoader

# Initialize the data loader
loader = DataLoader()
print("Project setup completed!")
print("Ready for data analysis and model training")

# Once the Kaggle data is in place:
train_data, test_data, sample_submission = loader.load_all_data()
```
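For readers who want to see what a loader with this interface might look like, here is a minimal sketch using only the standard library. `CsvDataLoader` is a hypothetical stand-in, not the project's `DataLoader`; it assumes the three competition files (`train.csv`, `test.csv`, `sample_submission.csv`) sit in one directory, and the demo rows are invented.

```python
import csv
import tempfile
from pathlib import Path


class CsvDataLoader:
    """Hypothetical stand-in for the project's DataLoader (illustration only)."""

    def __init__(self, raw_dir="data/raw"):
        self.raw_dir = Path(raw_dir)

    def _read(self, name):
        """Read one CSV file into a list of row dictionaries."""
        path = self.raw_dir / name
        if not path.exists():
            raise FileNotFoundError(f"{path} not found; download the Kaggle data first")
        with path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))

    def load_all_data(self):
        """Return (train, test, sample_submission) as lists of row dicts."""
        return (
            self._read("train.csv"),
            self._read("test.csv"),
            self._read("sample_submission.csv"),
        )


# Demo on throwaway files that mimic the competition's column layout.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    (raw / "train.csv").write_text(
        "id,keyword,location,text,target\n1,,,Flood waters rising downtown,1\n",
        encoding="utf-8",
    )
    (raw / "test.csv").write_text(
        "id,keyword,location,text\n2,,,This new album is fire\n",
        encoding="utf-8",
    )
    (raw / "sample_submission.csv").write_text("id,target\n2,0\n", encoding="utf-8")
    train_data, test_data, sample_submission = CsvDataLoader(raw).load_all_data()

print(train_data[0]["text"], train_data[0]["target"])
```

Note that `csv.DictReader` returns every value as a string, so `target` would need converting to an integer before training. A pandas-based loader would handle that conversion automatically.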
- F1-Score: 0.79 (Kaggle Public Leaderboard)
- Accuracy: 82.5%
- Precision: 0.77
- Recall: 0.81
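As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported values can be compared directly. The helper below is an illustrative sketch; it assumes the precision and recall figures describe the same positive (disaster) class as the F1-score.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_from_precision_recall(0.77, 0.81)
print(round(f1, 2))  # 0.79
```

The result agrees with the 0.79 F1-score listed above.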
| Feature Type | Importance | Description |
|---|---|---|
| TF-IDF Unigrams | 0.65 | Single word importance |
| TF-IDF Bigrams | 0.23 | Two-word combinations |
| Text Length | 0.08 | Tweet character count |
| Special Characters | 0.04 | URLs, mentions, hashtags |
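The two non-TF-IDF feature types in the table (text length and special characters) are simple to derive from the raw tweet. The sketch below shows one way to compute them; `extract_meta_features`, the feature names, and the regular expressions are assumptions made for illustration, and the sample tweet is invented.

```python
import re

# Illustrative patterns; real tweets may need more careful handling.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def extract_meta_features(text: str) -> dict:
    """Count characters, URLs, mentions, and hashtags in a raw tweet."""
    return {
        "char_count": len(text),
        "url_count": len(URL_RE.findall(text)),
        "mention_count": len(MENTION_RE.findall(text)),
        "hashtag_count": len(HASHTAG_RE.findall(text)),
    }


sample = "Wildfire spreading fast near #Calgary, via @CBCNews http://t.co/xyz"
features = extract_meta_features(sample)
print(features)
```

These counts should be taken from the raw text, before any cleaning step removes URLs, mentions, or hashtags.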
- Training Data: 7,613 tweets
- Test Data: 3,263 tweets
- Class Distribution: 57% non-disaster, 43% disaster
- Average Tweet Length: 101 characters
- Disaster tweets tend to be longer and more descriptive
- Common disaster keywords: "fire", "earthquake", "flood", "emergency"
- Non-disaster tweets often contain metaphorical language
- URL presence is higher in real disaster tweets
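Observations such as tweet length and URL presence per class can be checked with a small per-class summary. The sketch below uses only the standard library and four invented tweets, so its numbers are illustrative rather than results from the competition data; `summarize_by_class` is a hypothetical helper.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")


def summarize_by_class(rows):
    """Compute count, mean character length, and URL share per target class."""
    grouped = defaultdict(list)
    for text, target in rows:
        grouped[target].append(text)

    summary = {}
    for target, texts in grouped.items():
        summary[target] = {
            "count": len(texts),
            "mean_length": sum(len(t) for t in texts) / len(texts),
            "url_share": sum(bool(URL_RE.search(t)) for t in texts) / len(texts),
        }
    return summary


# Toy rows invented for illustration; 1 = disaster, 0 = not a disaster.
rows = [
    ("Wildfire evacuation ordered for the northern suburbs http://t.co/a1", 1),
    ("Flood waters rising downtown, roads closed", 1),
    ("This new album is fire", 0),
    ("My exam was a total disaster lol", 0),
]
summary = summarize_by_class(rows)
for target, stats in sorted(summary.items()):
    print(target, stats)
```

On the real training set, `rows` would be built from the `text` and `target` columns of `train.csv`.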
- EN: Complete Documentation
- PT: Complete Documentation (in Portuguese)
- Kaggle Competition: Competition Details
EN: We welcome contributions! Please see our Contributing Guidelines for details on how to submit pull requests, report issues, or suggest improvements.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-improvement`)
- Commit your changes (`git commit -m 'Add amazing improvement'`)
- Push to the branch (`git push origin feature/amazing-improvement`)
- Open a Pull Request
- Refactored code structure for better maintainability
- Added comprehensive documentation (EN/PT)
- Implemented modular pipeline architecture
- Added professional README with badges and metrics
- Fixed hardcoded paths for local execution
- Added interactive visualizations with Plotly
- Improved text preprocessing pipeline
- Enhanced word cloud generation
See CHANGELOG.md for complete version history.
- Implement advanced preprocessing (stemming, lemmatization)
- Add deep learning models (LSTM, BERT)
- Create automated model evaluation pipeline
- Add comprehensive unit tests
- Real-time tweet classification API
- Multi-language support
- Deployment on cloud platforms (AWS, GCP)
- Integration with Twitter API for live monitoring
Kaggle Competition Performance:
- Current Rank: Top 25% (as of August 2025)
- Best Score: 0.79 F1-Score
- Submission: View on Kaggle
- Isabel Cruz - Lead Data Scientist - @bellDataSc
- Data Engineer & BI Specialist, Government of São Paulo
- Technical Writer: Medium Articles
- Kaggle for providing the dataset and competition platform
- scikit-learn community for excellent ML libraries
- Plotly team for interactive visualization tools
- Open Source Community for inspiration and resources
- LinkedIn: Isabel Cruz
- Medium: @belgon
- Kaggle: Isabel Gonçalves
Recommended Reading: