AlexLopezGomez/taxi-prediction


🚕 NYC Taxi Demand Predictor

A real-time machine learning system for predicting taxi demand in New York City


🎯 Problem Statement

Predicting taxi demand helps optimize fleet management, reduce wait times, and improve urban transportation efficiency. This project predicts hourly taxi demand across NYC taxi zones using historical ride data and machine learning.

🚀 Solution Overview

This end-to-end ML system:

  • Predicts hourly taxi demand for each NYC taxi zone
  • Processes 3+ years of NYC taxi ride data (2022-2025)
  • Delivers real-time predictions through interactive dashboards
  • Monitors model performance with comprehensive metrics
  • Scales automatically using feature stores and model registry

🏗️ System Architecture

NYC Taxi Data → Feature Engineering → ML Model → Predictions → Dashboard
      ↓                  ↓                ↓           ↓            ↓
  Raw Rides      Time Series Data     LightGBM  Feature Store  Streamlit

Key Components:

  • Data Pipeline: Automated ETL processing of NYC taxi data
  • Feature Store: Hopsworks-powered feature management
  • ML Pipeline: LightGBM with hyperparameter optimization
  • Model Registry: Versioned model deployment
  • Monitoring: Real-time performance tracking
  • Frontend: Interactive prediction dashboard

🔧 Tech Stack

  • ML Framework: LightGBM, XGBoost, Scikit-learn
  • Feature Store: Hopsworks
  • Frontend: Streamlit, Plotly, PyDeck
  • Data Processing: Pandas, NumPy, GeoPandas
  • Hyperparameter Tuning: Optuna
  • Dependency management and automation: Poetry, Make
  • Monitoring: Custom metrics tracking

📊 Live Applications

  • Real-time taxi demand predictions
  • Interactive NYC map visualization
  • Top 10 locations with highest predicted demand
  • Historical time-series analysis
  • Model performance metrics (MAE)
  • Hour-by-hour accuracy analysis
  • Location-specific performance tracking
  • Prediction vs actual comparison
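
Two of the views listed above, the top-10 ranking and the hour-by-hour accuracy breakdown, come down to simple aggregations. The sketch below shows one way to compute them with plain Python. The function names, the dictionary and tuple inputs, and the sample zone IDs are all assumptions made for illustration; the dashboards in this repository may compute these differently.

```python
from collections import defaultdict


def top_locations(predicted_demand, n=10):
    """Return the n (location_id, demand) pairs with the highest predicted demand."""
    ranked = sorted(predicted_demand.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def mae_by_hour(records):
    """Compute the mean absolute error per pickup hour.

    records: iterable of (hour, predicted, actual) tuples.
    """
    abs_errors = defaultdict(list)
    for hour, predicted, actual in records:
        abs_errors[hour].append(abs(predicted - actual))
    return {hour: sum(errs) / len(errs) for hour, errs in sorted(abs_errors.items())}


# Hypothetical predictions for three zone IDs and errors for two hours of the day.
print(top_locations({132: 410.0, 237: 388.5, 4: 12.0}, n=2))
# [(132, 410.0), (237, 388.5)]
print(mae_by_hour([(8, 100, 90), (8, 50, 60), (9, 30, 33)]))
# {8: 10.0, 9: 3.0}
```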

🚦 Getting Started

Prerequisites

  • Python 3.9+
  • Poetry (for dependency management)
  • Hopsworks account (for feature store)

Installation

  1. Clone the repository

    git clone https://github.com/AlexLopezGomez/taxi-prediction.git
    cd taxi-prediction
  2. Install dependencies

    make init
  3. Set up environment variables. Create a .env file in the project root:

    HOPSWORKS_PROJECT_NAME=your_project_name
    HOPSWORKS_API_KEY=your_api_key
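
For reference, the two variables above could be read from Python roughly as follows. This is a minimal standard-library sketch; the helper names `load_env_file` and `get_hopsworks_settings` are hypothetical, and the project may load its configuration differently (for example in src/config.py).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hopsworks_settings(path=".env"):
    """Return (project_name, api_key); real environment variables take precedence."""
    file_values = load_env_file(path)
    settings = []
    for name in ("HOPSWORKS_PROJECT_NAME", "HOPSWORKS_API_KEY"):
        value = os.environ.get(name) or file_values.get(name)
        if not value:
            raise RuntimeError(f"{name} is not set; add it to {path}")
        settings.append(value)
    return tuple(settings)
```

Failing early with a clear message when a variable is missing tends to be easier to debug than a later authentication error from the feature store.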
    

Usage

🚨 First Time Setup (Required)

Before running any other commands, you must populate the feature store with historical data:

# ⚠️ REQUIRED: Backfill historical data (run once)
make backfill

Why is this needed? The ML model requires 28 days (672 hours) of historical data for training and inference. This is standard practice in time-series ML systems.
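
To make the 28-day requirement concrete, the sketch below slices one zone's hourly ride counts into training examples, each using 672 consecutive hours as features and the following hour as the target. The function name and the plain-list representation are assumptions for illustration, not the project's actual feature code.

```python
N_FEATURES = 24 * 28  # 672 hourly values, i.e. 28 days of history


def make_windows(hourly_rides, n_features=N_FEATURES, step=1):
    """Slice one zone's hourly ride counts into (features, target) pairs.

    Each example uses n_features consecutive hours as inputs and the
    hour immediately after them as the target.
    """
    features, targets = [], []
    last_start = len(hourly_rides) - n_features - 1
    for start in range(0, last_start + 1, step):
        end = start + n_features
        features.append(hourly_rides[start:end])
        targets.append(hourly_rides[end])
    return features, targets


# With exactly 672 hours there is no "next hour" to use as a target,
# so no training example can be built; one more hour yields one example.
X, y = make_windows(list(range(672)))
print(len(X))  # 0
X, y = make_windows(list(range(673)))
print(len(X), len(X[0]), y[0])  # 1 672 672
```

Under these assumptions, a series shorter than the window produces an empty training set, which is why the backfill has to run before training.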

🔄 Regular Workflow

After the initial backfill, use these commands for regular operations:

# Generate features and store in feature store
make features

# Train the model
make training

# Generate predictions
make inference

# Launch prediction dashboard
make frontend-app

# Launch monitoring dashboard
make monitoring-app
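
If you want to run the three pipeline steps in one go, a small Python wrapper is one option. The sketch below is an assumption about how that could look, not something shipped in this repository: `run_pipeline` and its `runner` parameter are hypothetical names, and the target order is taken from the commands above.

```python
import subprocess

# Order taken from the workflow above; `make backfill` is assumed to have run already.
REGULAR_TARGETS = ("features", "training", "inference")


def run_pipeline(targets=REGULAR_TARGETS, runner=subprocess.run):
    """Run each make target in order, stopping at the first failure.

    With check=True, subprocess.run raises CalledProcessError when a
    target exits with a non-zero status, so later steps are skipped.
    """
    completed = []
    for target in targets:
        runner(["make", target], check=True)
        completed.append(target)
    return completed
```

Calling `run_pipeline()` from the project root runs the real targets. The `runner` argument exists only so the function can be exercised without invoking make.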

📋 Complete Setup Checklist

  1. ✅ make init - Install dependencies
  2. ✅ Configure .env file
  3. ✅ make backfill - Essential first step
  4. ✅ make features - Update with recent data
  5. ✅ make training - Train the model
  6. ✅ make inference - Generate predictions

🔍 Project Structure

├── data/                     # Raw and processed data
├── models/                   # Trained model artifacts
├── notebooks/                # Jupyter notebooks for analysis
├── scripts/                  # Pipeline scripts
├── src/                      # Source code
│   ├── config.py             # Configuration settings
│   ├── feature_store_api.py  # Feature store interactions
│   ├── inference.py          # Prediction logic
│   ├── fronted.py            # Main dashboard
│   └── fronted_monitoring.py # Monitoring dashboard
├── tests/                    # Test files
├── Makefile                  # Automation commands
└── pyproject.toml            # Dependencies

🧠 Model Performance

  • Algorithm: LightGBM with feature engineering
  • Features: 24 × 28 = 672 hourly historical features
  • Target: Mean absolute error (MAE) at or below the 30.0 threshold
  • Optimization: Optuna-powered hyperparameter tuning
  • Validation: Time-series cross-validation
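
Time-series cross-validation differs from ordinary k-fold in that every validation fold must come after the data it was trained on. The sketch below shows one common variant, an expanding window, together with a simple acceptance check against the 30.0 MAE threshold. The exact splitting scheme and how the threshold is applied in this project are not stated here, so both functions are illustrative assumptions. scikit-learn's `TimeSeriesSplit` provides a comparable splitter.

```python
MAX_MAE = 30.0  # threshold quoted in the list above


def expanding_window_splits(n_samples, n_folds=3, test_size=None):
    """Yield (train_indices, test_indices) with every test fold after its training data."""
    if test_size is None:
        test_size = n_samples // (n_folds + 1)
    if test_size < 1 or n_samples - n_folds * test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = n_samples - (n_folds - fold) * test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


def meets_threshold(fold_maes, max_mae=MAX_MAE):
    """Accept a model only if its average validation MAE is within the threshold."""
    return sum(fold_maes) / len(fold_maes) <= max_mae


for train, test in expanding_window_splits(n_samples=8, n_folds=3):
    print(train, test)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
```

Because the training indices always precede the test indices, no fold evaluates the model on hours it could only have learned about from the future.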

🔮 Key Features

  • Real-time Predictions: Hourly demand forecasting
  • Geospatial Visualization: Interactive NYC taxi zone maps
  • Time Series Analysis: Historical demand patterns
  • Model Monitoring: Performance tracking and alerts
  • Automated Pipelines: End-to-end ML workflow
  • Scalable Architecture: Feature store and model registry

🎓 Learning Outcomes

This project demonstrates:

  • End-to-end ML system design
  • Feature engineering for time series
  • Model deployment and monitoring
  • Real-time dashboard development
  • MLOps best practices
  • Geospatial data visualization

🔧 Troubleshooting

Common Issues

❌ Empty training set error

# Solution: Run backfill first
make backfill

❌ Inference fails with data errors

# Solution: Ensure backfill has been executed
make backfill
make features

❌ Missing historical data

  • Cause: Backfill step was skipped
  • Solution: Run make backfill once before any other pipeline command

Important Notes

  • One-time operation: make backfill only needs to be run once unless you want to refresh all historical data
  • Data dependencies: All ML operations (training, inference) depend on the historical data populated by backfill

🙏 Acknowledgments

This project is based on the excellent tutorial by Pau Labarta Bajo. Special thanks for providing the foundational architecture and approach for this taxi demand prediction system.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

