Natural Language Processing with Disaster Tweets

Overview | Visão Geral

🇺🇸 EN: A machine learning project that uses Natural Language Processing techniques to classify tweets as disaster-related or not. Built for the Kaggle "Natural Language Processing with Disaster Tweets" competition, it implements an end-to-end NLP pipeline: exploratory data analysis, TF-IDF vectorization, and a logistic regression model.

🇧🇷 PT: Um projeto de machine learning que classifica tweets do Twitter como relacionados a desastres ou não, usando técnicas de Processamento de Linguagem Natural. Desenvolvido para a competição Kaggle "Natural Language Processing with Disaster Tweets", este projeto demonstra implementação completa de pipeline NLP com vetorização TF-IDF, análise exploratória de dados e modelagem com regressão logística.

Objectives | Objetivos

🇺🇸 EN:

Build a robust NLP classifier for disaster tweet detection
Implement comprehensive exploratory data analysis (EDA)
Apply feature engineering techniques (TF-IDF, text preprocessing)
Achieve competitive performance on the Kaggle leaderboard
Demonstrate best practices in ML project structure

🇧🇷 PT:

Construir um classificador NLP robusto para detecção de tweets de desastre
Implementar análise exploratória de dados (EDA) abrangente
Aplicar técnicas de engenharia de features (TF-IDF, pré-processamento de texto)
Alcançar performance competitiva no leaderboard Kaggle
Demonstrar melhores práticas em estrutura de projetos ML

Key Features | Principais Funcionalidades

Text Preprocessing: Cleaning, tokenization, and normalization
Exploratory Data Analysis: Comprehensive data visualization and insights
Feature Engineering: TF-IDF vectorization and text feature extraction
Machine Learning Model: Logistic Regression classifier
Interactive Visualizations: Word clouds and frequency analysis with Plotly
Modular Code Structure: Organized Python modules for maintainability
Jupyter Notebook: Complete analysis workflow
Model Evaluation (In Progress): Cross-validation and metrics analysis
Hyperparameter Tuning (Planned): Grid search optimization
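
The TF-IDF and logistic regression features listed above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the project's actual code: the toy tweets, the `build_pipeline` helper, and the `ngram_range`, `max_iter`, and `cv` settings are all assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); real input would come from the Kaggle files.
texts = [
    "Forest fire near the village, residents evacuated",
    "Earthquake damages buildings downtown",
    "Flood warning issued for the river valley",
    "Emergency crews respond to wildfire",
    "I love this song, it is fire",
    "My exam was a total disaster lol",
    "Traffic is flooding my feed with memes",
    "This new phone is the bomb",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = disaster, 0 = not a disaster


def build_pipeline():
    """TF-IDF over unigrams and bigrams, followed by logistic regression."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )


model = build_pipeline()

# Two-fold cross-validated F1, then a fit on all of the toy data.
scores = cross_val_score(model, texts, labels, cv=2, scoring="f1")
model.fit(texts, labels)
predictions = model.predict(["Wildfire spreading, emergency evacuation"])
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is refit inside each cross-validation fold, so the held-out fold does not leak into the features.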

Tech Stack | Stack Tecnológico

Languages & Core Libraries

Data Science & ML

Visualization

NLP Specific

Platform

Project Architecture | Arquitetura do Projeto

Natural-Language-Processing-with-Disaster-Tweets/
├── data/
│   ├── raw/                  # Original Kaggle datasets
│   ├── processed/            # Cleaned and preprocessed data
│   └── predictions/          # Model predictions
├── src/
│   ├── data_loader.py        # Data loading utilities
│   ├── text_preprocessor.py  # Text preprocessing pipeline
│   ├── main.py               # Main execution script
│   └── config.py             # Configuration settings
├── notebooks/
│   └── disaster_tweets_analysis.ipynb  # Complete analysis workflow
├── img/                      # Visualization outputs
├── results/                  # Model outputs and metrics
├── requirements.txt          # Python dependencies
├── README.md                 # This file
└── LICENSE                   # Apache 2.0 License

Getting Started | Começando

Prerequisites | Pré-requisitos

Python 3.8 or higher
Kaggle account and API credentials
8 GB+ of RAM recommended for processing large text datasets

Installation | Instalação

Clone the repository | Clone o repositório

git clone https://github.com/bellDataSc/Natural-Language-Processing-with-Disaster-Tweets.git
cd Natural-Language-Processing-with-Disaster-Tweets

Create virtual environment | Crie um ambiente virtual

python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

Install dependencies | Instale as dependências

pip install -r requirements.txt

Set up Kaggle API | Configure a API do Kaggle

# Place your kaggle.json in ~/.kaggle/
kaggle competitions download -c nlp-getting-started
unzip nlp-getting-started.zip -d data/raw/

Quick Start | Início Rápido

from data_loader import DataLoader

# Initialize data loader
loader = DataLoader()
print("Project setup completed!")
print("Ready for data analysis and model training")

# When you have Kaggle data:
train_data, test_data, sample_submission = loader.load_all_data()
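
For readers who want to see what a loader of this shape might do, here is a hedged sketch. It is not the project's `DataLoader`: the class name `SketchDataLoader`, the `raw_dir` parameter, and the error handling are hypothetical, and the sketch assumes the competition archive unpacks to `train.csv`, `test.csv`, and `sample_submission.csv` under `data/raw/`.

```python
from pathlib import Path

import pandas as pd


class SketchDataLoader:
    """Hypothetical stand-in for the project's DataLoader."""

    # Assumed file names from the unzipped competition archive.
    FILE_NAMES = ("train.csv", "test.csv", "sample_submission.csv")

    def __init__(self, raw_dir="data/raw"):
        self.raw_dir = Path(raw_dir)

    def load_all_data(self):
        """Return (train, test, sample_submission) as pandas DataFrames."""
        missing = [
            name for name in self.FILE_NAMES
            if not (self.raw_dir / name).exists()
        ]
        if missing:
            # Fail early with a readable message instead of a pandas traceback.
            raise FileNotFoundError(
                f"Missing files in {self.raw_dir}: {missing}"
            )
        return tuple(pd.read_csv(self.raw_dir / name) for name in self.FILE_NAMES)
```

Calling `SketchDataLoader().load_all_data()` before the Kaggle files are downloaded raises `FileNotFoundError` that lists the missing files.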

Model Performance | Performance do Modelo

Current Results | Resultados Atuais

F1-Score: 0.79 (Kaggle Public Leaderboard)
Accuracy: 82.5%
Precision: 0.77
Recall: 0.81
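
As a quick consistency check on these figures, F1 is the harmonic mean of precision and recall. The sketch below assumes the precision and recall above describe the same predictions as the F1 score, which the results list does not state; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 2 * 0.77 * 0.81 / (0.77 + 0.81) = 1.2474 / 1.58
f1 = f1_from(0.77, 0.81)
print(round(f1, 2))  # 0.79
```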

Features Impact | Impacto das Features

Feature Type	Importance	Description
TF-IDF Unigrams	0.65	Single word importance
TF-IDF Bigrams	0.23	Two-word combinations
Text Length	0.08	Tweet character count
Special Characters	0.04	URLs, mentions, hashtags
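
The two non-TF-IDF rows in the table (text length and special characters) could be computed along these lines. This is an illustrative sketch rather than the project's feature code: the regular expressions, the `handcrafted_features` helper, and the feature names are assumptions.

```python
import re

# Simple patterns for the "special character" features (assumed definitions).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def handcrafted_features(tweet):
    """Return character length plus counts of URLs, mentions, and hashtags."""
    return {
        "text_length": len(tweet),
        "url_count": len(URL_RE.findall(tweet)),
        "mention_count": len(MENTION_RE.findall(tweet)),
        "hashtag_count": len(HASHTAG_RE.findall(tweet)),
    }


example = handcrafted_features("Fire near #LaRonge http://t.co/abc via @user")
```

Counts like these can be stacked next to the TF-IDF matrix as extra numeric columns before the classifier is fit.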

Exploratory Data Analysis | Análise Exploratória

Dataset Overview

Training Data: 7,613 tweets
Test Data: 3,263 tweets
Class Distribution: 57% non-disaster, 43% disaster
Average Tweet Length: 101 characters

Key Insights | Principais Insights

Disaster tweets tend to be longer and more descriptive
Common disaster keywords: "fire", "earthquake", "flood", "emergency"
Non-disaster tweets often contain metaphorical language
URL presence is higher in real disaster tweets
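
The length and URL observations above can be checked with a short pandas summary. The sketch assumes the training data has a `text` column and a `target` column (1 = disaster); the toy rows and the `summarize_by_class` helper are hypothetical and only illustrate the computation.

```python
import pandas as pd


def summarize_by_class(df):
    """Mean tweet length and share of tweets containing a URL, per class."""
    enriched = df.assign(
        length=df["text"].str.len(),
        # Cast to float so the group mean is a plain proportion.
        has_url=df["text"].str.contains(r"https?://", regex=True).astype(float),
    )
    return enriched.groupby("target")[["length", "has_url"]].mean()


# Toy rows standing in for the real training data.
toy = pd.DataFrame(
    {
        "text": [
            "Flood in town http://t.co/x",
            "Earthquake hits city centre, many buildings damaged",
            "lol this is fire",
            "so bored",
        ],
        "target": [1, 1, 0, 0],
    }
)
summary = summarize_by_class(toy)
```

On the real training set, comparing the two rows of `summary` shows whether disaster tweets are in fact longer and more likely to carry a URL.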

Documentation | Documentação

EN: Complete Documentation
PT: Documentação Completa
Kaggle Competition: Competition Details

Contributing | Contribuindo

EN: We welcome contributions! Please see our Contributing Guidelines for details on how to submit pull requests, report issues, or suggest improvements.

PT: Contribuições são bem-vindas! Consulte nosso Guia de Contribuição para detalhes sobre como enviar pull requests, reportar problemas ou sugerir melhorias.

Development Process | Processo de Desenvolvimento

Fork the repository
Create a feature branch (git checkout -b feature/amazing-improvement)
Commit your changes (git commit -m 'Add amazing improvement')
Push to the branch (git push origin feature/amazing-improvement)
Open a Pull Request

Changelog | Log de Mudanças

Version 1.2.0 (2025-08-12)

Refactored code structure for better maintainability
Added comprehensive documentation (EN/PT)
Implemented modular pipeline architecture
Added professional README with badges and metrics
Fixed hardcoded paths for local execution

Version 1.1.0 (2025-05-28)

Added interactive visualizations with Plotly
Improved text preprocessing pipeline
Enhanced word cloud generation

See CHANGELOG.md for complete version history.

Roadmap | Roadmap

Short Term | Curto Prazo

Implement advanced preprocessing (stemming, lemmatization)
Add deep learning models (LSTM, BERT)
Create automated model evaluation pipeline
Add comprehensive unit tests

Long Term | Longo Prazo

Real-time tweet classification API
Multi-language support
Deployment on cloud platforms (AWS, GCP)
Integration with Twitter API for live monitoring

Competition Results | Resultados da Competição

Kaggle Competition Performance:

Current Rank: Top 25% (as of August 2025)
Best Score: 0.79 F1-Score
Submission: View on Kaggle

Authors & Contributors | Autores e Contribuidores

Isabel Cruz - Lead Data Scientist - @bellDataSc
- Data Engineer & BI Specialist, Government of São Paulo
- Technical Writer: Medium Articles

Acknowledgments | Agradecimentos

Kaggle for providing the dataset and competition platform
scikit-learn community for excellent ML libraries
Plotly team for interactive visualization tools
Open Source Community for inspiration and resources

Support & Contact | Suporte e Contato

Email: [email protected]
LinkedIn: Isabel Cruz
Medium: @belgon
Kaggle: Isabel Gonçalves

Learning Resources | Recursos de Aprendizado

Recommended Reading:

Made with ☕ by Isabel Cruz

"Transforming text data into actionable insights, one tweet at a time"
