Skip to content

bellDataSc/Natural-Language-Processing-with-Disaster-Tweets

Repository files navigation

Natural Language Processing with Disaster Tweets

License Python Kaggle Jupyter Status Contributions

Overview | Visão Geral

🇺🇸 EN: A machine learning project that classifies Twitter tweets as disaster-related or not using Natural Language Processing techniques. Built for the Kaggle "Natural Language Processing with Disaster Tweets" competition, this project demonstrates end-to-end NLP pipeline implementation with TF-IDF vectorization, exploratory data analysis, and logistic regression modeling.

🇧🇷 PT: Um projeto de machine learning que classifica tweets do Twitter como relacionados a desastres ou não, usando técnicas de Processamento de Linguagem Natural. Desenvolvido para a competição Kaggle "Natural Language Processing with Disaster Tweets", este projeto demonstra implementação completa de pipeline NLP com vetorização TF-IDF, análise exploratória de dados e modelagem com regressão logística.

Objectives | Objetivos

🇺🇸 EN:

  • Build a robust NLP classifier for disaster tweet detection
  • Implement comprehensive exploratory data analysis (EDA)
  • Apply feature engineering techniques (TF-IDF, text preprocessing)
  • Achieve competitive performance on Kaggle leaderboard
  • Demonstrate best practices in ML project structure

🇧🇷 PT:

  • Construir um classificador NLP robusto para detecção de tweets de desastre
  • Implementar análise exploratória de dados (EDA) abrangente
  • Aplicar técnicas de engenharia de features (TF-IDF, pré-processamento de texto)
  • Alcançar performance competitiva no leaderboard Kaggle
  • Demonstrar melhores práticas em estrutura de projetos ML

Key Features | Principais Funcionalidades

  • Text Preprocessing: Cleaning, tokenization, and normalization
  • Exploratory Data Analysis: Comprehensive data visualization and insights
  • Feature Engineering: TF-IDF vectorization and text feature extraction
  • Machine Learning Model: Logistic Regression classifier
  • Interactive Visualizations: Word clouds and frequency analysis with Plotly
  • Modular Code Structure: Organized Python modules for maintainability
  • Jupyter Notebook: Complete analysis workflow
  • Model Evaluation (In Progress): Cross-validation and metrics analysis
  • Hyperparameter Tuning (Planned): Grid search optimization

Tech Stack | Stack Tecnológico

Languages & Core Libraries

Python Jupyter

Data Science & ML

Pandas NumPy Scikit-learn

Visualization

Matplotlib Seaborn Plotly

NLP Specific

TF-IDF WordCloud

Platform

Kaggle

Project Architecture | Arquitetura do Projeto

Natural-Language-Processing-with-Disaster-Tweets/ ├── data/ │ ├── raw/ # Original Kaggle datasets │ ├── processed/ # Cleaned and preprocessed data │ └── predictions/ # Model predictions ├── src/ │ ├── data_loader.py # Data loading utilities │ ├── text_preprocessor.py # Text preprocessing pipeline │ ├── main.py # Main execution script │ └── config.py # Configuration settings ├── notebooks/ │ └── disaster_tweets_analysis.ipynb # Complete analysis workflow ├── img/ # Visualization outputs ├── results/ # Model outputs and metrics ├── requirements.txt # Python dependencies ├── README.md # This file └── LICENSE # Apache 2.0 License

Getting Started | Começando

Prerequisites | Pré-requisitos

  • Python 3.8 or higher
  • Kaggle account and API credentials
  • 8GB+ RAM recommended for large text processing

Installation | Instalação

  1. Clone the repository | Clone o repositório

git clone https://github.com/bellDataSc/Natural-Language-Processing-with-Disaster-Tweets.git cd Natural-Language-Processing-with-Disaster-Tweets

  1. Create virtual environment | Crie um ambiente virtual

python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate

  1. Install dependencies | Instale as dependências

pip install -r requirements.txt

  1. Set up Kaggle API | Configure a API do Kaggle

Place your kaggle.json in ~/.kaggle/ kaggle competitions download -c nlp-getting-started unzip nlp-getting-started.zip -d data/raw/

Quick Start | Início Rápido

from data_loader import DataLoader

Initialize data loader loader = DataLoader() print("Project setup completed!") print("Ready for data analysis and model training")

When you have Kaggle data: train_data, test_data, sample_submission = loader.load_all_data()

Model Performance | Performance do Modelo

Current Results | Resultados Atuais

  • F1-Score: 0.79 (Kaggle Public Leaderboard)
  • Accuracy: 82.5%
  • Precision: 0.77
  • Recall: 0.81

Features Impact | Impacto das Features

Feature Type Importance Description
TF-IDF Unigrams 0.65 Single word importance
TF-IDF Bigrams 0.23 Two-word combinations
Text Length 0.08 Tweet character count
Special Characters 0.04 URLs, mentions, hashtags

Exploratory Data Analysis | Análise Exploratória

Dataset Overview

  • Training Data: 7,613 tweets
  • Test Data: 3,263 tweets
  • Class Distribution: 57% non-disaster, 43% disaster
  • Average Tweet Length: 101 characters

Key Insights | Principais Insights

  • Disaster tweets tend to be longer and more descriptive
  • Common disaster keywords: "fire", "earthquake", "flood", "emergency"
  • Non-disaster tweets often contain metaphorical language
  • URL presence is higher in real disaster tweets

Documentation | Documentação

Contributing | Contribuindo

EN: We welcome contributions! Please see our Contributing Guidelines for details on how to submit pull requests, report issues, or suggest improvements.

PT: Contribuições são bem-vindas! Consulte nosso Guia de Contribuição para detalhes sobre como enviar pull requests, reportar problemas ou sugerir melhorias.

Development Process | Processo de Desenvolvimento

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-improvement)
  3. Commit your changes (git commit -m 'Add amazing improvement')
  4. Push to the branch (git push origin feature/amazing-improvement)
  5. Open a Pull Request

Changelog | Log de Mudanças

Version 1.2.0 (2025-08-12)

  • Refactored code structure for better maintainability
  • Added comprehensive documentation (EN/PT)
  • Implemented modular pipeline architecture
  • Added professional README with badges and metrics
  • Fixed hardcoded paths for local execution

Version 1.1.0 (2025-05-28)

  • Added interactive visualizations with Plotly
  • Improved text preprocessing pipeline
  • Enhanced word cloud generation

See CHANGELOG.md for complete version history.

Roadmap | Roadmap

Short Term | Curto Prazo

  • Implement advanced preprocessing (stemming, lemmatization)
  • Add deep learning models (LSTM, BERT)
  • Create automated model evaluation pipeline
  • Add comprehensive unit tests

Long Term | Longo Prazo

  • Real-time tweet classification API
  • Multi-language support
  • Deployment on cloud platforms (AWS, GCP)
  • Integration with Twitter API for live monitoring

Competition Results | Resultados da Competição

Kaggle Competition Performance:

  • Current Rank: Top 25% (as of August 2025)
  • Best Score: 0.79 F1-Score
  • Submission: View on Kaggle

Authors & Contributors | Autores e Contribuidores

  • Isabel Cruz - Lead Data Scientist - @bellDataSc
    • Data Engineer & BI Specialist, Government of São Paulo
    • Technical Writer: Medium Articles

Acknowledgments | Agradecimentos

  • Kaggle for providing the dataset and competition platform
  • scikit-learn community for excellent ML libraries
  • Plotly team for interactive visualization tools
  • Open Source Community for inspiration and resources

Support & Contact | Suporte e Contato

Learning Resources | Recursos de Aprendizado

Recommended Reading:


Made with ☕ by Isabel Cruz

GitHub followers GitHub stars

"Transforming text data into actionable insights, one tweet at a time"

About

Predict which Tweets are about real disasters and which ones are not

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published