Machine learning model for predicting optimal nightly prices for Airbnb listings in New York City using property features and market data.
This project implements an end-to-end machine learning pipeline that predicts Airbnb listing prices, reaching an R² of 0.972 on the test set. The system includes data processing, model training, evaluation, and deployment via both a web dashboard and a REST API.
https://airbnb-pricing-etutc5pereohmu5bbtxdeo.streamlit.app/#evaluation-metrics
- High Accuracy: R² of 0.972 on the test set
- Production-Ready: Includes both dashboard and API interfaces
- Robust Pipeline: Automated data cleaning, feature engineering, and model training
- Interactive Dashboard: Streamlit-based UI for easy price predictions
- RESTful API: FastAPI service for programmatic access
- Comprehensive Evaluation: Detailed metrics and visualizations
| Metric | Value |
|---|---|
| R² Score | 0.972 |
| RMSE | $49.19 |
| MAE | $3.69 |
| Median % Error | 0.00% |
| Test Set Size | 4,219 listings |
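For reference, the metrics in the table can be computed from a list of actual and predicted prices. The following is a minimal sketch using only the Python standard library. `regression_metrics` is a hypothetical helper, not necessarily how `src/evaluate.py` computes them, and "Median % Error" is assumed here to mean the median absolute percentage error.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, MAE and median absolute percentage error.

    Assumes all true prices are non-zero and not all identical.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = statistics.fmean(y_true)

    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / len(errors)),
        "mae": statistics.fmean(abs(e) for e in errors),
        "median_pct_error": statistics.median(
            abs(e) / t * 100.0 for e, t in zip(errors, y_true)
        ),
    }


# Toy prices for illustration only, not project data.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 300, 420])
print(metrics)
```

Because RMSE squares each error before averaging, it is dominated by large misses. An RMSE well above the MAE, as in the table, points to a small number of listings with large errors.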
airbnb-pricing/
├── data/ # Data storage
│ ├── raw/ # Original datasets
│ └── processed/ # Cleaned data
├── models/ # Trained models
├── src/ # Source code
│ ├── ingest.py # Data loading
│ ├── clean.py # Data cleaning
│ ├── features.py # Feature engineering
│ ├── train.py # Model training
│ ├── evaluate.py # Model evaluation
│ └── predict.py # Inference
├── app/ # Streamlit dashboard
├── api/ # FastAPI service
└── tests/ # Unit tests
- Clone the repository
git clone <repository-url>
cd airbnb-pricing
- Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Download data
  - Visit Inside Airbnb
  - Download NYC listings.csv.gz
  - Place in data/raw/ directory
source .venv/bin/activate
cd src
python clean.py # Clean data
python train.py # Train model
python evaluate.py # Evaluate performance
Or use the automated pipeline:
./run_pipeline.sh
Start the dashboard:
./start_streamlit.sh
Open http://localhost:8501 in your browser.
Start the API:
./start_api.sh
API documentation is available at http://localhost:8000/docs
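Once the API is running, it can also be called from Python. This sketch uses only the standard library and assumes the `/predict` endpoint at `http://localhost:8000` used in the curl example below. `build_predict_request` and `request_prediction` are hypothetical helper names, and the response is returned as parsed JSON because its schema is not documented here.

```python
import json
import urllib.request

# URL taken from the curl example in this README; adjust if the API runs elsewhere.
API_URL = "http://localhost:8000/predict"


def build_predict_request(listing, url=API_URL):
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps(listing).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_prediction(listing, url=API_URL, timeout=10):
    """Send the request and return the decoded JSON response as-is."""
    request = build_predict_request(listing, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires ./start_api.sh to be running):
# result = request_prediction({"accommodates": 2, "bedrooms": 1, "bathrooms": 1.0,
#                              "room_type": "Entire home/apt",
#                              "latitude": 40.7580, "longitude": -73.9855})
```

Separating request construction from sending keeps the payload easy to inspect without a live server.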
from src.predict import predict_from_dict
listing = {
'accommodates': 4,
'bedrooms': 2,
'bathrooms': 2.0,
'room_type': 'Entire home/apt',
'latitude': 40.7580,
'longitude': -73.9855,
'number_of_reviews': 50,
'amenities': 'WiFi,Kitchen,Heating',
'neighbourhood_cleansed': 'Manhattan',
'property_type': 'Apartment',
'bed_type': 'Real Bed',
'cancellation_policy': 'moderate'
}
price = predict_from_dict(listing)
print(f"Predicted price: ${price:.2f}/night")
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{
"accommodates": 2,
"bedrooms": 1,
"bathrooms": 1.0,
"room_type": "Entire home/apt",
"latitude": 40.7580,
"longitude": -73.9855
}'
- Model: Random Forest Regressor
- Preprocessing: StandardScaler for numeric features, OneHotEncoder for categorical
- Hyperparameter Tuning: RandomizedSearchCV with 5-fold cross-validation
- Training Data: 21,094 NYC listings
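The components listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not the contents of `src/train.py`: the column lists are borrowed from the prediction example, the search space is illustrative, `build_search` is a hypothetical name, and the data is synthetic so the block runs without the Inside Airbnb file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature subset; the project's real feature list may differ.
NUMERIC = ["accommodates", "bedrooms", "bathrooms", "latitude", "longitude"]
CATEGORICAL = ["room_type"]


def build_search(n_iter=3, random_state=0):
    """Scaler + one-hot encoder + random forest, wrapped in a randomized search."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        # Ignore categories unseen during training instead of raising at inference.
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestRegressor(random_state=random_state)),
    ])
    # Illustrative search space, not the project's actual one.
    param_distributions = {
        "model__n_estimators": [20, 50, 100],
        "model__max_depth": [None, 5, 10],
        "model__min_samples_leaf": [1, 2, 5],
    }
    return RandomizedSearchCV(
        pipeline,
        param_distributions,
        n_iter=n_iter,
        cv=5,  # 5-fold cross-validation
        scoring="neg_root_mean_squared_error",
        random_state=random_state,
    )


# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, n),
    "bedrooms": rng.integers(0, 5, n),
    "bathrooms": rng.choice([1.0, 1.5, 2.0], n),
    "latitude": rng.uniform(40.5, 40.9, n),
    "longitude": rng.uniform(-74.2, -73.7, n),
    "room_type": rng.choice(["Entire home/apt", "Private room"], n),
})
y = 40 * X["accommodates"] + 25 * X["bedrooms"] + rng.normal(0, 10, n)

search = build_search().fit(X, y)
print(search.best_params_)
```

Keeping preprocessing inside the pipeline means the scaler and encoder are refit on each training fold, so cross-validation scores are not influenced by validation rows. Scaling has little effect on a random forest's splits, so the StandardScaler step matters mainly if the estimator is swapped for a scale-sensitive one.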
Input Features:
- Property details (bedrooms, bathrooms, accommodates)
- Location (latitude, longitude, neighborhood)
- Policies (minimum nights, cancellation policy)
- Reviews and ratings
Engineered Features:
- Distance to city center
- Amenities count
- Temporal features (days since last review, host tenure)
- Interaction features (capacity × bedrooms, reviews × rating)
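The engineered features above might be derived along these lines. This is a standard-library sketch, not the code in `src/features.py`: the city-center coordinates (Times Square), the use of the haversine formula, the field names `last_review`, `host_since`, and `review_scores_rating`, and the output names are all assumptions made for illustration.

```python
import math
from datetime import date

# Assumption: Times Square stands in for the "city center".
CITY_CENTER = (40.7580, -73.9855)
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def engineer_features(listing, today):
    """Derive the engineered features from one raw listing dict."""
    # Amenities are assumed to arrive as a comma-separated string.
    amenities = [a.strip() for a in listing.get("amenities", "").split(",") if a.strip()]
    return {
        "distance_to_center_km": haversine_km(
            listing["latitude"], listing["longitude"], *CITY_CENTER
        ),
        "amenities_count": len(amenities),
        "days_since_last_review": (today - listing["last_review"]).days,
        "host_tenure_days": (today - listing["host_since"]).days,
        "capacity_x_bedrooms": listing["accommodates"] * listing["bedrooms"],
        "reviews_x_rating": listing["number_of_reviews"] * listing["review_scores_rating"],
    }


# Hypothetical listing for illustration only.
example = {
    "latitude": 40.7580, "longitude": -73.9855,
    "amenities": "WiFi,Kitchen,Heating",
    "accommodates": 4, "bedrooms": 2,
    "number_of_reviews": 50, "review_scores_rating": 4.8,
    "last_review": date(2024, 1, 1), "host_since": date(2020, 1, 1),
}
print(engineer_features(example, today=date(2024, 1, 31)))
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the temporal features reproducible between training and inference.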
Core requirements:
- Python 3.8+
- pandas
- scikit-learn
- numpy
- matplotlib
- joblib
Web services:
- streamlit (dashboard)
- fastapi, uvicorn (API)
See requirements.txt for complete list.
The evaluation covers:
- Performance metrics (RMSE, MAE, R²)
- Residual analysis
- Error distribution visualizations
- Comparison against baseline models
Run evaluation to generate detailed reports:
python src/evaluate.py
Results are saved to models/residuals.png
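To illustrate what a baseline comparison and residual analysis involve, here is a minimal standard-library sketch. The choice of baseline (always predicting the median training price) and the helper name `compare_to_baseline` are assumptions; the baselines actually used by `src/evaluate.py` are not specified here.

```python
import statistics


def mae(y_true, y_pred):
    """Mean absolute error."""
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def compare_to_baseline(y_train, y_test, model_pred):
    """Compare model MAE with a constant baseline predicting the training median."""
    baseline_value = statistics.median(y_train)
    baseline_pred = [baseline_value] * len(y_test)

    model_mae = mae(y_test, model_pred)
    baseline_mae = mae(y_test, baseline_pred)
    return {
        "model_mae": model_mae,
        "baseline_mae": baseline_mae,
        # Fraction of the baseline error removed by the model.
        "improvement": 1.0 - model_mae / baseline_mae,
        # Residual = actual - predicted; positive means the model under-predicted.
        "residuals": [t - p for t, p in zip(y_test, model_pred)],
    }


# Toy prices for illustration only, not project data.
report = compare_to_baseline(
    y_train=[80, 120, 150, 200, 400],
    y_test=[100, 150, 300],
    model_pred=[110, 140, 280],
)
print(report)
```

Plotting the residuals against the predicted price is a common way to check whether errors grow for higher-priced listings.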
- MODEL_DOCUMENTATION.md: Complete model specifications and usage guide
- QUICKSTART.md: Quick reference for common commands
- PROJECT_SUMMARY.md: Detailed project architecture
Run the tests:
pytest tests/
Run with Docker:
docker-compose up
- Model size: 137 MB
- Inference time: <100ms
- Memory usage: ~500MB
- Recommended: Retrain quarterly for NYC market
- Trained specifically for NYC market
- Best performance for standard listings ($50-$500/night)
- Static model (requires retraining to capture seasonal trends)
- May underperform on luxury or unique properties
Data provided by Inside Airbnb, a mission-driven project providing data about Airbnb's impact on residential communities.
- Fork the repository
- Create a feature branch (git checkout -b feature/improvement)
- Commit changes (git commit -am 'Add improvement')
- Push to branch (git push origin feature/improvement)
- Create Pull Request
This project uses publicly available data from Inside Airbnb. Review their data policies before commercial use.
- Inside Airbnb for providing open data
- NYC Open Data for geographic information
- Scikit-learn team for machine learning tools
For questions or issues, please open an issue on GitHub or refer to the documentation files.
Note: This is an educational/research project. Always verify predictions with current market data before making business decisions.