Real Estate Consulting Proposal

Flatiron Mod 4 Project

Overview

This project, a consulting proposal for a real estate investment firm, uses Time Series Analysis to present the top 5 zip codes for the firm to invest in in the US.

ReadMe Navigation

Repository Navigation
Business Understanding
Data Understanding
Predictive Analysis
1. Model Performance
2. Prediction
Conclusion
1. Recommendations
2. Areas for Growth
Project Info

Repository Navigation

DATA:
SRC:
FIGURES:
- Visualizations used/created throughout the project.
MODELS:
- Pickled files storing data relevant to the model creation.
NOTEBOOKS:
PRESENTATION:
- PDF
- Powerpoint

Business Understanding

The business context for this project is relatively straightforward - select 5 top zip codes for investment within the United States, based on a dataset given from Zillow.

GOAL: Predict top 5 zip code in United States for real estate firm to invest in.

To further clarify this goal, we use a risk-adjusted Return on Investment (ROI) as our metric for "top":

ROI: Predicted Price Increase / Initial Investment
Risk-adjustment: After some exploratory data analysis (EDA), we define risk-adjusted by limiting the range of zip codes to those without large ranges of error in the model prediction.

Data Understanding

The dataset for this project came from Zillow, with the following data from 1996 through 2018:

RegionID (Zip Code)
RegionName (Zillow-defined code)
City
State
Metro (Metropolitan Area)
CountyName
Size Rank (Ordinal City size categories)
Monthly Median Prices

An overview of the dataset:

import pandas as pd
df = pd.read_csv('data/raw/zillow_data.csv')
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	RegionID	RegionName	City	State	Metro	CountyName	SizeRank	1996-04	1996-05	1996-06	...	2017-07	2017-08	2017-09	2017-10	2017-11	2017-12	2018-01	2018-02	2018-03	2018-04
0	84654	60657	Chicago	IL	Chicago	Cook	1	334200.0	335400.0	336500.0	...	1005500	1007500	1007800	1009600	1013300	1018700	1024400	1030700	1033800	1030600
1	90668	75070	McKinney	TX	Dallas-Fort Worth	Collin	2	235700.0	236900.0	236700.0	...	308000	310000	312500	314100	315000	316600	318100	319600	321100	321800
2	91982	77494	Katy	TX	Houston	Harris	3	210400.0	212200.0	212200.0	...	321000	320600	320200	320400	320800	321200	321200	323000	326900	329900
3	84616	60614	Chicago	IL	Chicago	Cook	4	498100.0	500900.0	503100.0	...	1289800	1287700	1287400	1291500	1296600	1299000	1302700	1306400	1308500	1307000
4	93144	79936	El Paso	TX	El Paso	El Paso	5	77300.0	77300.0	77300.0	...	119100	119400	120000	120300	120300	120300	120300	120500	121000	121500

5 rows × 272 columns

The nationwide median price looked like this:

Data Analysis

Given the monthly price measurements, we used Time Series Analysis to gain a better understanding for the patterns in the data that would impact our predictions.
Additionally, we restrict our model construction to data after the 2008 recession, since the variation due to the anomaly is already known, and our model will perform better without having to account for such a huge fluctuation.
Finally, given computational resource restrictions, we limit our dataset to the 60 zip codes with the largest ROI in a 5-year rolling window.

Predictive Analysis

Time Series Analysis

By using time-series analysis, we can use time-linked historical data to discover which areas have the highest ROI, which is a time-delta linked indicator. This particular method enables median price value prediction.

In order to use this model, we must eliminate noise in the data in order to make accurate predictions, based on the assumptions of the model:

Stationarity - Ensures that the distribution of data does not change over time
Seasonality - Adjusting for regular fluctuations based on a fixed interval (e.g., higher prices in spring, lower prices in fall)
Autocorrelation - Adjusts for covariance of time-series with itself, based on its variables.
Trend - Adjusts for long-term trends such as overall increase, overall decrease.

The goal of these limitations is to remove the time-based impacts upon the median price to discover any potential underlying patterns in the data.

Performance Metric

Model Selection

Conclusions

Recommendations:

Higher Median Investment
Lower Median Investment

Areas for Growth:

Include Adjacent Data Resources

With additional time, we would incorporate other ways to measure "top 5" and ROI, including usin other sources like:

Zillow Rent Data
Competitor Median Prices
AirBnB rental increases/returns

Improve Model

The model was significantly limited due to time and resource constraints. With more of each, we could:

Fit model to all zip codes, not just limited set
Consider grouping zip codes and areas into better tiers
Train model on nationwide data for the ability to "drill-down" more precisely

Extend Time Frame for Analysis

With more data, we could predict longer term trends instead of just the limited 5-year period we selected.

Project Info

Contributors: Alexander Newton, Jake Miller Brooks

Languages : Python

Tools/IDE : Git, Command Line (Windows), Anaconda, Jupyter Notebook / Jupyter Lab, Google Slides

Libraries : numpy, pandas, matplotlib, seaborn, scikit-learn, statsmodels

Duration : August 2020 Last Update: 08.24.2020

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.ipynb_checkpoints		.ipynb_checkpoints
data		data
figures		figures
models		models
notebooks		notebooks
presentation		presentation
src		src
temp		temp
README.ipynb		README.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Real Estate Consulting Proposal

Overview

ReadMe Navigation

Repository Navigation

Business Understanding

Data Understanding

Predictive Analysis

Time Series Analysis

Performance Metric

Model Selection

Conclusions

Recommendations:

Areas for Growth:

Include Adjacent Data Resources

Improve Model

Extend Time Frame for Analysis

Project Info

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

anewt225/mod_4_project

Folders and files

Latest commit

History

Repository files navigation

Real Estate Consulting Proposal

Overview

ReadMe Navigation

Repository Navigation

Business Understanding

Data Understanding

Predictive Analysis

Time Series Analysis

Performance Metric

Model Selection

Conclusions

Recommendations:

Areas for Growth:

Include Adjacent Data Resources

Improve Model

Extend Time Frame for Analysis

Project Info

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages