Trading today with tomorrow's signals
The goal of this project was to predict the closing stock price of the fictional company Waystar Royco (WAYA US) from July 30, 2021, to September 10, 2021. Historical stock price data from August 14, 2015, to July 29, 2021, was provided, including opening price, high price, low price, closing price, and trading volume for each day.
The task was to use regression and time series modeling techniques to make predictions, compare the models, and determine which is best suited for this type of stock price forecasting. Note that Waystar Royco is a fictional company, so external factors beyond the provided data should not be considered.
Original Data for the project: https://www.kaggle.com/competitions/ue19cs312-assignment/data
- Data exploration showed no null values or significant outliers. Features were highly correlated.
- Tried PCA for dimensionality reduction, but two components already explained all of the variability, so the reduction was not needed.
- Scaled and normalized data before modeling.
- Regression models tried: Linear Regression, Ridge, Lasso, Kernel Ridge, KNN. Linear Regression performed best.
- Time series models tried: ARIMA, SARIMAX, Holt-Winters. SARIMAX gave the lowest RMSE of the three.
- The best SARIMAX configuration used order (2,1,1) and seasonal order (2,1,1) with a seasonal period of m=52.
- Regression beat time series overall in terms of performance metrics.
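The scaling and PCA check described in the list above can be sketched as follows. This is a minimal illustration, not the notebook's code: it uses plain NumPy rather than whatever library the project used, the data is synthetic, and the helper name `explained_variance_ratio` and the column construction are assumptions made for the example.

```python
import numpy as np

def explained_variance_ratio(X):
    """Standardize each column, then return every principal component's share of variance."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Squared singular values of the standardized matrix are proportional
    # to the variance captured by each principal component.
    s = np.linalg.svd(Z, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# Synthetic stand-in for train.csv (assumption): three tightly correlated
# price columns built from one random walk, plus an unrelated volume column.
rng = np.random.default_rng(0)
base = 100 + np.cumsum(rng.normal(0, 1, 500))
open_ = base + rng.normal(0, 0.2, 500)
high = base + np.abs(rng.normal(0, 0.5, 500))
low = base - np.abs(rng.normal(0, 0.5, 500))
volume = rng.normal(1e6, 1e5, 500)

X = np.column_stack([open_, high, low, volume])
ratios = explained_variance_ratio(X)
print(np.round(ratios, 3))
```

When the price columns move together like this, the first two components carry almost all of the variance, which matches the pattern the exploration reported.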
The linear regression model was simpler, avoided overfitting the seasonal patterns, and handled the fluctuations better than time series models. It had an RMSE of around 1.45 on the test set while SARIMAX achieved a much higher RMSE.
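For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A small sketch follows; the function name `rmse` and the numbers are illustrative and are not the project's results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up closing prices and predictions, for illustration only.
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 103.0, 105.0]
score = rmse(actual, predicted)
print(round(score, 3))  # 1.225
```

Because RMSE is in the same units as the closing price, a value of around 1.45 means the predictions were off by roughly that many price units on a typical day.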
Even after tuning SARIMAX, the regression model was more robust. This indicates that classical regression is well-suited for this type of stock price forecasting problem, where a linear combination of the open, high, low prices and volume provides a good fit.
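A linear fit of the close on open, high, low and volume can be sketched with ordinary least squares. This is an illustration under stated assumptions, not the notebook's implementation: the data is synthetic, the close is generated as a hypothetical linear mix of the prices, the split is a simple chronological 80/20, and `fit_linear` and `predict` are names invented for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, w1, ..., wk]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted intercept and weights to a feature matrix."""
    return coef[0] + X @ coef[1:]

# Synthetic data (assumption): the close is a noisy linear mix of the prices.
rng = np.random.default_rng(42)
n = 300
open_ = 100 + np.cumsum(rng.normal(0, 1, n))
high = open_ + np.abs(rng.normal(0, 1, n))
low = open_ - np.abs(rng.normal(0, 1, n))
volume = rng.normal(1e6, 1e5, n)
close = 0.2 * open_ + 0.4 * high + 0.4 * low + rng.normal(0, 0.3, n)
X = np.column_stack([open_, high, low, volume])

# Chronological split: earlier rows train the model, later rows test it.
split = int(0.8 * n)
mu, sigma = X[:split].mean(axis=0), X[:split].std(axis=0)
X_train = (X[:split] - mu) / sigma   # scale using training statistics only
X_test = (X[split:] - mu) / sigma

coef = fit_linear(X_train, close[:split])
pred = predict(coef, X_test)
test_rmse = float(np.sqrt(np.mean((close[split:] - pred) ** 2)))
print(round(test_rmse, 3))
```

Scaling with statistics from the training rows only keeps information from the test period out of the fit.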
Time series models should not be ruled out, though, as they can capture seasonality and cyclic trends. With more complex data or longer time spans, SARIMAX may start to outperform regression. Overall, the analysis shows regression as the better approach for now, but both should be considered depending on the structure of the stock data.
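For readers unfamiliar with the notation, the SARIMAX specification reported above, (2,1,1)(2,1,1) with m=52, can be written out in backshift form. This assumes no exogenous regressors were included, which the write-up does not state. Here $B$ is the backshift operator ($B y_t = y_{t-1}$), $y_t$ is the closing price, and $\varepsilon_t$ is white noise. Sign conventions for the moving-average terms vary between libraries.

$$
\left(1 - \phi_1 B - \phi_2 B^{2}\right)\left(1 - \Phi_1 B^{52} - \Phi_2 B^{104}\right)(1 - B)\left(1 - B^{52}\right) y_t
= \left(1 + \theta_1 B\right)\left(1 + \Theta_1 B^{52}\right)\varepsilon_t
$$

The factors $(1 - B)$ and $(1 - B^{52})$ are the regular and seasonal differences (d=1, D=1). The two autoregressive polynomials correspond to p=2 and P=2, and the two moving-average factors to q=1 and Q=1.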
    └── Stock-Price-Forecasting-and-Model-Comparison/
        ├── README.md
        ├── data
        │   ├── test.csv
        │   └── train.csv
        ├── gridsearch_results.txt
        ├── notebooks
        │   ├── gridsearch_cv_script.ipynb
        │   └── submission.ipynb
        ├── requirements.txt
        └── submission.csv