A real-time machine learning system for predicting taxi demand in New York City
Taxi demand prediction is crucial for optimizing fleet management, reducing wait times, and improving urban transportation efficiency. This project solves the challenge of predicting hourly taxi demand across different NYC zones using historical ride data and machine learning.
This end-to-end ML system:
- Predicts hourly taxi demand for each NYC taxi zone
- Processes 3+ years of NYC taxi ride data (2022-2025)
- Delivers real-time predictions through interactive dashboards
- Monitors model performance with comprehensive metrics
- Scales automatically using a feature store and a model registry
NYC Taxi Data → Feature Engineering → ML Model → Predictions → Dashboard
      ↓                  ↓               ↓            ↓            ↓
  Raw Rides      Time Series Data    LightGBM   Feature Store  Streamlit
- Data Pipeline: Automated ETL processing of NYC taxi data
- Feature Store: Hopsworks-powered feature management
- ML Pipeline: LightGBM with hyperparameter optimization
- Model Registry: Versioned model deployment
- Monitoring: Real-time performance tracking
- Frontend: Interactive prediction dashboard
- ML Framework: LightGBM, XGBoost, Scikit-learn
- Feature Store: Hopsworks
- Frontend: Streamlit, Plotly, PyDeck
- Data Processing: Pandas, NumPy, GeoPandas
- Hyperparameter Tuning: Optuna
- Dependency Management and Automation: Poetry, Make
- Monitoring: Custom metrics tracking
Prediction Dashboard
- Real-time taxi demand predictions
- Interactive NYC map visualization
- Top 10 locations with highest predicted demand
- Historical time-series analysis
Monitoring Dashboard
- Model performance metrics (MAE)
- Hour-by-hour accuracy analysis
- Location-specific performance tracking
- Prediction vs actual comparison
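The monitoring metrics listed above can be sketched in a few lines of plain Python. This is an illustration only, not the code behind the monitoring dashboard: the function names `mae` and `mae_by_hour` and the `(hour, actual, predicted)` record layout are assumptions made for the example.

```python
from collections import defaultdict


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mae_by_hour(records):
    """Group absolute errors by hour and average them.

    `records` is an iterable of (hour, actual, predicted) tuples;
    this layout is hypothetical, chosen only for the sketch.
    """
    errors = defaultdict(list)
    for hour, actual, predicted in records:
        errors[hour].append(abs(actual - predicted))
    return {hour: sum(errs) / len(errs) for hour, errs in sorted(errors.items())}


if __name__ == "__main__":
    print(mae([10, 20], [12, 17]))
    print(mae_by_hour([(8, 10, 12), (8, 20, 17), (9, 5, 5)]))
```

The same grouping idea applies to location-specific tracking by keying on a zone identifier instead of the hour.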
- Python 3.9+
- Poetry (for dependency management)
- Hopsworks account (for feature store)
- Clone the repository
  git clone https://github.com/yourusername/taxi-demand-predictor.git
  cd taxi-demand-predictor
- Install dependencies
  make init
- Set up environment variables
  Create a .env file in the project root:
  HOPSWORKS_PROJECT_NAME=your_project_name
  HOPSWORKS_API_KEY=your_api_key
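As a rough sketch of how these two variables might be consumed, the snippet below validates them before any feature store call is made. It assumes the values from `.env` have already been loaded into the process environment (for example by the shell or by a tool such as python-dotenv); `load_hopsworks_settings` is a hypothetical helper and is not taken from this repository's `src/config.py`.

```python
import os

# Variable names come from the .env example in this README.
REQUIRED_VARS = ("HOPSWORKS_PROJECT_NAME", "HOPSWORKS_API_KEY")


def load_hopsworks_settings(env=None):
    """Return the required settings, failing early if any is missing.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    # Demo with an explicit mapping instead of the real environment.
    demo_env = {
        "HOPSWORKS_PROJECT_NAME": "your_project_name",
        "HOPSWORKS_API_KEY": "your_api_key",
    }
    print(load_hopsworks_settings(demo_env)["HOPSWORKS_PROJECT_NAME"])
```

Failing at startup with a clear message tends to be easier to debug than an authentication error raised later inside a pipeline step.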
Before running any other commands, you must populate the feature store with historical data:
# REQUIRED: Backfill historical data (run once)
make backfill
Why is this needed? The ML model requires 28 days (672 hours) of historical data for training and inference, because each prediction is built from the previous 672 hourly values. Without that history in the feature store, the later steps have nothing to work with.
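The 672-hour requirement can be illustrated with a sliding window over one zone's hourly ride counts. This is a minimal sketch under stated assumptions, not the project's feature pipeline: `make_windows` is a hypothetical name, and the real pipeline may shape its features differently.

```python
N_FEATURES = 24 * 28  # 672 hourly values, as stated in this README


def make_windows(hourly_rides, n_features=N_FEATURES):
    """Turn one zone's hourly ride counts into (features, targets) rows.

    Each row uses `n_features` consecutive hours as inputs and the
    hour that follows them as the target.
    """
    features, targets = [], []
    for start in range(len(hourly_rides) - n_features):
        end = start + n_features
        features.append(hourly_rides[start:end])
        targets.append(hourly_rides[end])
    return features, targets


if __name__ == "__main__":
    # 675 hours of synthetic data yield only 3 training rows.
    X, y = make_windows(list(range(675)))
    print(len(X), len(X[0]))
    # Exactly 672 hours (or fewer) yield no training rows at all.
    print(make_windows(list(range(672))))
```

Under this framing, a series with 672 hours or fewer produces no training rows, which is one plausible way an empty training set can arise when the backfill is skipped.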
After the initial backfill, use these commands for regular operations:
# Generate features and store in feature store
make features
# Train the model
make training
# Generate predictions
make inference
# Launch prediction dashboard
make frontend-app
# Launch monitoring dashboard
make monitoring-app
- make init: Install dependencies
- Configure the .env file
- make backfill: Essential first step
- make features: Update with recent data
- make training: Train the model
- make inference: Generate predictions
├── data/                      # Raw and processed data
├── models/                    # Trained model artifacts
├── notebooks/                 # Jupyter notebooks for analysis
├── scripts/                   # Pipeline scripts
├── src/                       # Source code
│   ├── config.py              # Configuration settings
│   ├── feature_store_api.py   # Feature store interactions
│   ├── inference.py           # Prediction logic
│   ├── fronted.py             # Main dashboard
│   └── fronted_monitoring.py  # Monitoring dashboard
├── tests/                     # Test files
├── Makefile                   # Automation commands
└── pyproject.toml             # Dependencies
- Algorithm: LightGBM with feature engineering
- Features: 24×28 = 672 hourly historical features
- Target: MAE of at most 30.0 (maximum acceptable error threshold)
- Optimization: Optuna-powered hyperparameter tuning
- Validation: Time-series cross-validation
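Time-series cross-validation differs from ordinary k-fold in that every training fold must precede its test fold in time. The sketch below shows one common variant, an expanding window, in plain Python. It is not this project's validation code; `expanding_window_splits` and its parameters are hypothetical, and scikit-learn's `TimeSeriesSplit` offers a comparable, more complete splitter.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs in chronological order.

    The training window grows with each split, and every test index is
    later than every training index, so no future data leaks into training.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in expanding_window_splits(10, n_splits=3, test_size=2):
        print(train_idx, test_idx)
```

Averaging the MAE over such folds gives an estimate that can be compared against the 30.0 threshold listed above.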
- Real-time Predictions: Hourly demand forecasting
- Geospatial Visualization: Interactive NYC taxi zone maps
- Time Series Analysis: Historical demand patterns
- Model Monitoring: Performance tracking and alerts
- Automated Pipelines: End-to-end ML workflow
- Scalable Architecture: Feature store and model registry
This project demonstrates:
- End-to-end ML system design
- Feature engineering for time series
- Model deployment and monitoring
- Real-time dashboard development
- MLOps best practices
- Geospatial data visualization
Problem: Empty training set error
# Solution: Run backfill first
make backfill
Problem: Inference fails with data errors
# Solution: Ensure backfill has been executed
make backfill
make features
Problem: Missing historical data
- Cause: Backfill step was skipped
- Solution: Always run make backfill before other commands
- One-time operation: make backfill only needs to be run once unless you want to refresh all historical data
- Data dependencies: All ML operations (training, inference) depend on the historical data populated by backfill
This project is based on the excellent tutorial by Pau Labarta Bajo. Special thanks for providing the foundational architecture and approach for this taxi demand prediction system.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.