Flatiron Mod 4 Project
This project, a consulting proposal for a real estate investment firm, uses Time Series Analysis to present the top 5 zip codes for the firm to invest in in the US.
-
- Model Performance
- Prediction
-
- Recommendations
- Areas for Growth
-
- Visualizations used/created throughout the project.
-
- Pickled files storing data relevant to the model creation.
The business context for this project is relatively straightforward - select 5 top zip codes for investment within the United States, based on a dataset given from Zillow.
GOAL: Predict top 5 zip code in United States for real estate firm to invest in.
To further clarify this goal, we use a risk-adjusted Return on Investment (ROI) as our metric for "top":
- ROI: Predicted Price Increase / Initial Investment
- Risk-adjustment: After some exploratory data analysis (EDA), we define risk-adjusted by limiting the range of zip codes to those without large ranges of error in the model prediction.
The dataset for this project came from Zillow, with the following data from 1996 through 2018:
- RegionID (Zip Code)
- RegionName (Zillow-defined code)
- City
- State
- Metro (Metropolitan Area)
- CountyName
- Size Rank (Ordinal City size categories)
- Monthly Median Prices
An overview of the dataset:
import pandas as pd
df = pd.read_csv('data/raw/zillow_data.csv')
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
RegionID | RegionName | City | State | Metro | CountyName | SizeRank | 1996-04 | 1996-05 | 1996-06 | ... | 2017-07 | 2017-08 | 2017-09 | 2017-10 | 2017-11 | 2017-12 | 2018-01 | 2018-02 | 2018-03 | 2018-04 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 84654 | 60657 | Chicago | IL | Chicago | Cook | 1 | 334200.0 | 335400.0 | 336500.0 | ... | 1005500 | 1007500 | 1007800 | 1009600 | 1013300 | 1018700 | 1024400 | 1030700 | 1033800 | 1030600 |
1 | 90668 | 75070 | McKinney | TX | Dallas-Fort Worth | Collin | 2 | 235700.0 | 236900.0 | 236700.0 | ... | 308000 | 310000 | 312500 | 314100 | 315000 | 316600 | 318100 | 319600 | 321100 | 321800 |
2 | 91982 | 77494 | Katy | TX | Houston | Harris | 3 | 210400.0 | 212200.0 | 212200.0 | ... | 321000 | 320600 | 320200 | 320400 | 320800 | 321200 | 321200 | 323000 | 326900 | 329900 |
3 | 84616 | 60614 | Chicago | IL | Chicago | Cook | 4 | 498100.0 | 500900.0 | 503100.0 | ... | 1289800 | 1287700 | 1287400 | 1291500 | 1296600 | 1299000 | 1302700 | 1306400 | 1308500 | 1307000 |
4 | 93144 | 79936 | El Paso | TX | El Paso | El Paso | 5 | 77300.0 | 77300.0 | 77300.0 | ... | 119100 | 119400 | 120000 | 120300 | 120300 | 120300 | 120300 | 120500 | 121000 | 121500 |
5 rows × 272 columns
The nationwide median price looked like this:
Data Analysis
-
Given the monthly price measurements, we used Time Series Analysis to gain a better understanding for the patterns in the data that would impact our predictions.
-
Additionally, we restrict our model construction to data after the 2008 recession, since the variation due to the anomaly is already known, and our model will perform better without having to account for such a huge fluctuation.
-
Finally, given computational resource restrictions, we limit our dataset to the 60 zip codes with the largest ROI in a 5-year rolling window.
By using time-series analysis, we can use time-linked historical data to discover which areas have the highest ROI, which is a time-delta linked indicator. This particular method enables median price value prediction.
In order to use this model, we must eliminate noise in the data in order to make accurate predictions, based on the assumptions of the model:
-
Stationarity - Ensures that the distribution of data does not change over time
-
Seasonality - Adjusting for regular fluctuations based on a fixed interval (e.g., higher prices in spring, lower prices in fall)
-
Autocorrelation - Adjusts for covariance of time-series with itself, based on its variables.
-
Trend - Adjusts for long-term trends such as overall increase, overall decrease.
The goal of these limitations is to remove the time-based impacts upon the median price to discover any potential underlying patterns in the data.
-
Higher Median Investment
-
Lower Median Investment
With additional time, we would incorporate other ways to measure "top 5" and ROI, including usin other sources like:
- Zillow Rent Data
- Competitor Median Prices
- AirBnB rental increases/returns
The model was significantly limited due to time and resource constraints. With more of each, we could:
- Fit model to all zip codes, not just limited set
- Consider grouping zip codes and areas into better tiers
- Train model on nationwide data for the ability to "drill-down" more precisely
With more data, we could predict longer term trends instead of just the limited 5-year period we selected.
Contributors: Alexander Newton, Jake Miller Brooks
Languages : Python
Tools/IDE : Git, Command Line (Windows), Anaconda, Jupyter Notebook / Jupyter Lab, Google Slides
Libraries : numpy, pandas, matplotlib, seaborn, scikit-learn, statsmodels
Duration : August 2020 Last Update: 08.24.2020