As part of our ongoing efforts to improve the performance and accuracy of our predictive models, we need to evaluate a variety of machine learning and deep learning algorithms. Here's a list of models we should consider testing:
Linear Regression
Rationale: A good baseline model. If our data has a linear relationship, this model might perform surprisingly well.
Implementation: Scikit-learn or Statsmodels can be used (see the sketch below).
Potential Challenges: Assumes a linear relationship between predictors and the target variable.
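A minimal sketch of the scikit-learn route; the random data is a placeholder standing in for our real features and target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data (swap in our actual feature matrix and target).
X = np.random.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * np.random.randn(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```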
Random Forest
Rationale: An ensemble method that can capture complex non-linear relationships in the data.
Implementation: Scikit-learn offers a straightforward implementation (sketched below).
Potential Challenges: Might overfit on very noisy data. Hyperparameter tuning is essential.
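Since hyperparameter tuning is flagged as essential, here is a rough sketch pairing RandomForestRegressor with a small grid search; the grid values and placeholder data are assumptions, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Placeholder data; replace with our real features and target.
X = np.random.rand(200, 5)
y = np.random.rand(200)

# A deliberately small starting grid; widen it once we see what matters.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("Best params:", search.best_params_)
```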
LSTM (Long Short-Term Memory)
Rationale: Given its capacity to remember patterns over long sequences, an LSTM might be especially useful if our data has temporal or time-series components.
Implementation: Consider using TensorFlow/Keras or PyTorch; a Keras sketch follows below.
Potential Challenges: LSTMs can be computationally intensive and might require more time to train.
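A minimal Keras sketch; the sequence length, feature count, and layer sizes are assumptions to revisit once our data shape is settled:

```python
import numpy as np
import tensorflow as tf

# Toy sequences shaped (samples, timesteps, features); shapes are placeholders.
X = np.random.rand(200, 30, 1).astype("float32")
y = X.sum(axis=1)  # placeholder target

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 1)),
    tf.keras.layers.LSTM(32),   # remembers patterns across the sequence
    tf.keras.layers.Dense(1),   # single-value regression head
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```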
ARIMA (AutoRegressive Integrated Moving Average)
Rationale: ARIMA is a classical time-series forecasting method. It is designed to capture autoregressive patterns (the relationship between an observation and a number of lagged observations) and moving-average patterns (the relationship between an observation and the residual errors of a moving-average model applied to lagged observations). The integrated component refers to differencing the data to make the time series stationary.
Implementation: statsmodels in Python provides an ARIMA implementation (statsmodels.tsa.arima.model.ARIMA), and pmdarima's auto_arima can automate order selection; a sketch follows the challenges below.
Potential Challenges:
Overfitting: Models with too many parameters might fit the training data exceptionally well but perform poorly on new data.
Stationarity Assumption: ARIMA requires the data to be stationary. Even after differencing, some series might remain non-stationary.
Seasonality: ARIMA doesn't handle seasonality by default. If the data has a clear seasonal pattern, SARIMA (Seasonal ARIMA) might be more appropriate.
Noise: If the time series is too noisy, ARIMA might not be the best choice. Pre-processing or smoothing might be necessary.
Computational Time: For large datasets or when performing grid search for hyperparameters, ARIMA can be computationally intensive.
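A minimal statsmodels sketch; the synthetic series and the (1, 1, 1) order are illustrative only, since in practice the order would come from ACF/PACF inspection or a search such as auto_arima:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic random-walk series; replace with our actual time series.
idx = pd.date_range("2020-01-01", periods=100, freq="MS")
series = pd.Series(np.cumsum(np.random.randn(100)), index=idx)

# order=(p, d, q): AR lags, degree of differencing, MA lags.
fitted = ARIMA(series, order=(1, 1, 1)).fit()
print(fitted.forecast(steps=12))
```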
1D CNN (Convolutional Neural Network)
Rationale: 1D CNNs are highly effective for sequence data with local structure, such as time series or natural language. They can extract local features and learn hierarchies of patterns, which can be crucial for understanding complex sequence data.
Implementation: TensorFlow and Keras offer user-friendly APIs for building 1D convolutional networks; define Conv1D layers with filter counts and kernel sizes suited to the sequence data (sketched after the challenges below).
Potential Challenges:
Data Requirements: CNNs usually require a large amount of data to generalize well.
Hyperparameter Tuning: Choosing the right architecture, kernel sizes, and the number of filters can be crucial and might require extensive experimentation.
Overfitting: Without adequate regularization, CNNs can overfit to the training data, especially when the amount of training data is limited.
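A rough Keras sketch; the window length, filter counts, and kernel sizes are placeholders that would need the experimentation noted above, and dropout stands in for the regularization concern:

```python
import numpy as np
import tensorflow as tf

# Toy data shaped (samples, timesteps, channels); all shapes are assumptions.
X = np.random.rand(200, 64, 1).astype("float32")
y = np.random.rand(200, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 1)),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),  # local features
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.3),  # regularization against overfitting
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```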
Prophet
Rationale: Developed by Facebook, Prophet is designed for forecasting with daily observations that display patterns on different time scales. It works well with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
Implementation: It can be implemented using the prophet package (formerly published as fbprophet). The implementation is quite straightforward; it primarily requires a dataframe with two columns: ds (date) and y (value to predict). A sketch follows the challenges below.
Potential Challenges:
Seasonality Assumption: Prophet might not perform well if the data doesn’t exhibit any clear seasonality.
Hyperparameter Tuning: While Prophet performs well out of the box, tuning seasonality and holiday parameters can sometimes be tricky and require domain knowledge.
Uncertainty Intervals: The uncertainty intervals provided by Prophet are often too wide to be useful in practice, and might require additional post-processing or calibration.
Scalability: For very large datasets, Prophet can be computationally intensive and might require considerable time to train.
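A minimal sketch with the prophet package; the dates and values are placeholders:

```python
import pandas as pd
from prophet import Prophet  # package was renamed from fbprophet

# Prophet expects exactly these two column names: ds (date) and y (value).
df = pd.DataFrame({
    "ds": pd.date_range("2022-01-01", periods=365, freq="D"),
    "y": range(365),  # placeholder values
})

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)  # extend 30 days past the data
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```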
ETS (Exponential Smoothing State Space Model)
Rationale: ETS is a widely used forecasting method for time-series data, particularly effective for data with trend and seasonality. It applies exponential smoothing to the components of the time series (Error, Trend, Seasonality), hence the acronym ETS.
Implementation: Libraries like statsmodels in Python offer implementations of ETS models (see the sketch after the challenges below).
Potential Challenges:
Parameter Selection: Choosing the right error, trend, and seasonality components and their forms (additive or multiplicative) can significantly affect performance.
Stationarity and Seasonality: While ETS handles seasonality, it may not perform well with non-stationary data without appropriate differencing.
Forecast Confidence: ETS models might produce overly optimistic confidence intervals.
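A minimal statsmodels sketch; the additive error/trend/seasonality choices and the 12-period seasonality are assumptions to check against our data:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

# Synthetic monthly series with a mild trend; replace with our real series.
idx = pd.date_range("2019-01-01", periods=60, freq="MS")
series = pd.Series(np.linspace(10.0, 20.0, 60) + np.random.randn(60), index=idx)

# error/trend/seasonal can each be additive ("add") or multiplicative ("mul").
model = ETSModel(series, error="add", trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit(disp=False)
print(fit.forecast(steps=12))
```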
CatBoost
Rationale: CatBoost is a high-performance gradient-boosting library, particularly optimized for categorical data. It handles categorical features automatically and is known for its robustness and efficiency.
Implementation: CatBoost ships as its own standalone Python package (catboost) and is easy to integrate; a sketch follows the challenges below.
Potential Challenges:
Overfitting: Like other boosting methods, it can overfit if not tuned properly.
Parameter Tuning: Optimal performance requires tuning of parameters like learning rate, depth of trees, etc.
Computational Resources: Can be resource-intensive, especially with large datasets and complex models.
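A minimal sketch of CatBoostRegressor; the toy frame with one categorical column is only there to illustrate the native cat_features handling:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor

# Toy frame with one numeric and one categorical column; layout is an assumption.
df = pd.DataFrame({
    "value": np.random.rand(200),
    "region": np.random.choice(["north", "south"], size=200),
})
y = np.random.rand(200)

model = CatBoostRegressor(iterations=200, learning_rate=0.1, depth=6, verbose=0)
model.fit(df, y, cat_features=["region"])  # categoricals handled natively
print(model.predict(df.head()))
```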
AutoGluon
Rationale: AutoGluon automates machine learning tasks, making it easier to achieve strong predictive performance with minimal user intervention. It is particularly useful for those who may not have extensive machine learning expertise.
Implementation: AutoGluon offers a high-level API in Python, streamlining the machine learning pipeline, including automatic model selection and hyperparameter tuning (see the sketch after the challenges below).
Potential Challenges:
Customization Limitations: While it's great for quick results, there may be limitations in fine-tuning and customizing models for specific tasks.
Resource Intensity: AutoGluon can be resource-intensive, as it evaluates multiple models.
Understanding Results: The automated nature might obscure the understanding of why certain models work better than others in some cases.
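A minimal sketch of the high-level API; the file names and the "target" label column are assumptions, not our actual dataset:

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# Hypothetical files and label column; adjust to our dataset.
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# fit() searches over models and hyperparameters within the time budget.
predictor = TabularPredictor(label="target").fit(train_data, time_limit=600)

predictions = predictor.predict(test_data)
print(predictor.leaderboard(test_data))  # compare the models it trained
```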
SVR (Support Vector Regression)
Rationale: SVR applies the principles of Support Vector Machines (SVMs) to regression tasks. It is effective for both linear and non-linear regression and is known for its robustness, particularly in high-dimensional spaces.
Implementation: Can be implemented using libraries such as Scikit-learn in Python (sketched after the challenges below).
Potential Challenges:
Kernel Choice: Selecting the right kernel function (linear, polynomial, RBF, etc.) is crucial and can be challenging.
Parameter Tuning: Parameters like C (regularization), gamma (for non-linear SVR), and epsilon (margin of tolerance) require careful tuning.
Scalability: SVR can be computationally intensive, especially for large datasets.
Sparse Data Handling: SVR might not perform well with very sparse data.
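A minimal scikit-learn sketch; SVR is sensitive to feature scale, so it is paired with a scaler here, and the RBF kernel plus the C and epsilon values are starting points rather than tuned choices:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data; replace with our real features and target.
X = np.random.rand(200, 4)
y = np.random.rand(200)

# Scaling first matters: SVR is sensitive to feature magnitudes.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X, y)
print(model.predict(X[:5]))
```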
All models implemented so far only take in data from location A. We need to add the location data. We cannot simply fit each model again on the new data, because refitting overwrites the earlier fit as if it had never happened. This link can give more insight: https://chat.openai.com/share/afad6b86-77fa-417a-ae02-89b0fb95cf13
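One possible direction (a sketch only; the file names and columns are hypothetical): combine all locations into one training set and encode the location as a feature, so a single fit sees all the data instead of refitting per location and discarding the earlier fit:

```python
import pandas as pd

# Hypothetical per-location files and column names.
df_a = pd.read_csv("location_a.csv")
df_b = pd.read_csv("location_b.csv")
df_a["location"] = "A"
df_b["location"] = "B"

# One combined frame, with location one-hot encoded as a model input,
# so one model is trained once over data from every location.
combined = pd.concat([df_a, df_b], ignore_index=True)
combined = pd.get_dummies(combined, columns=["location"])
```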
Tracking list of the models above, with linked issues where available:
Linear Regression
Random Forest
Gradient Boosting #13
LSTM (Long Short-Term Memory) #10
ARIMA (AutoRegressive Integrated Moving Average)
1D CNN (Convolutional neural network) #17
Prophet #16
ETS (Exponential Smoothing State Space Model)
Catboost
AutoGluon
SVR