Add baseline multiseries regressor #4246
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main   #4246    +/-  ##
=======================================
- Coverage    99.7%   99.7%   -0.0%
=======================================
  Files         349     351      +2
  Lines       38413   38497     +84
=======================================
+ Hits        38293   38376     +83
- Misses        120     121      +1
Good work 😸
for t in self.statistically_significant_lags:
    lagged_features[self.target_colname_prefix.format(t)] = y.shift(t)
if isinstance(y, pd.DataFrame):
    lagged_features.update(self._delay_df(y, y.columns))
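A minimal standalone sketch of the lagging behavior in the snippet above, assuming pandas; the data and the `target_delay_{t}` naming here are illustrative, not evalml's exact prefix:

```python
import pandas as pd

# For a single series, each statistically significant lag t becomes a
# shifted copy of the target (values example is hypothetical).
y = pd.Series([1.0, 2.0, 3.0, 4.0], name="target")
significant_lags = [1, 2]  # stand-in for self.statistically_significant_lags

lagged_features = {}
for t in significant_lags:
    lagged_features[f"target_delay_{t}"] = y.shift(t)

lagged = pd.DataFrame(lagged_features)
print(lagged["target_delay_1"].tolist())  # [nan, 1.0, 2.0, 3.0]
```

For a multiseries target (a DataFrame), the same shift is applied per column, which is what the `_delay_df` branch handles.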
thought: should we just run self._encode_y_while_preserving_index(y)
even though we won't expect categorical columns just yet?
🤔 interesting point, will we ever expect categorical columns? We're only supporting regression problems for multiseries
potentially sometime in the distant future! Just thought it would make it one step easier for whoever implements that 😄 it'll be a no-op anyways right now
I like the idea in theory, but I'm worried about increasing runtime with checking if y
contains any categorical columns. We'd have to do so in all cases, which feels wasteful when we know we won't be dealing with it.
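A hedged sketch of the check being debated, for concreteness. The target data here is hypothetical; the point is that a dtype scan would run on every fit even though multiseries targets are numeric-only today:

```python
import pandas as pd

# Hypothetical multiseries target: one numeric column per series.
y = pd.DataFrame({
    "target_a": [1.0, 2.0, 3.0],
    "target_b": [4.0, 5.0, 6.0],
})

# One way to detect categorical columns; cheap per call, but an
# unconditional cost on a path where we know the answer is False.
has_categorical = not y.select_dtypes(include="category").empty
print(has_categorical)  # False
```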
A few questions for my clarification + a suggestion but once answered LGTM
evalml/tests/conftest.py (outdated)
@@ -830,6 +830,22 @@ def X_y_regression():
    return X, y


@pytest.fixture
def X_y_multiseries_regression():
I think I'm going to utilize this for my VARMAX testing. In that case, does it make sense to have it align more with the inputs we expect for a multiseries regressor (e.g. a series_id column, and column names with series_id value suffixes)?
Also, since this is a time series pipeline, should we be extending ts_data() instead? If we do decide to keep this, we should also make it a time series dataset by adding a datetime column and changing the name to identify it as a time series dataset.
My bad, I forgot we have the multiseries_ts_data_stacked and multiseries_ts_data_unstacked fixtures. In that case, is there a reason why the test_multiseries_baseline_regressor.py test cases use those mocks?
Good callout that this should more closely match the actual expected input - I wrote these before I wrote the stacking and unstacking functions and never updated it 😅 - same with why these tests don't use multiseries_ts_data_unstacked
. I'll refactor the test fixtures a bit to be consolidated and up to date.
if isinstance(y, pd.DataFrame):
    self.statistically_significant_lags = [self.start_delay]
So for the multiseries case, do we not try to find the significant lags? Does this just use all or none of the lags?
We don't need to find the significant lags because we're not actually doing feature engineering here, just getting the properly lagged column that our baseline regressor relies on. By setting the lags that we calculate to be just self.start_delay
, we only compute the one we know we need.
Might want to add an explicit comment here for the case we're splitting on...e.g. if y is a dataframe, we expect it to be multiseries.
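A hedged sketch of the branch under discussion, with the suggested clarifying comment inline. The function and the `find_significant_lags` hook are hypothetical; in the PR this logic lives inside the featurizer's fit path:

```python
import pandas as pd

def compute_lags(y, start_delay, find_significant_lags):
    # If y is a DataFrame, we expect it to be a multiseries target.
    # The baseline only needs the single lag it predicts from, so we
    # skip the lag-significance search entirely.
    if isinstance(y, pd.DataFrame):
        return [start_delay]
    # Single series: run the full search (hypothetical hook).
    return find_significant_lags(y)

y_multi = pd.DataFrame({"series_0": [1, 2], "series_1": [3, 4]})
print(compute_lags(y_multi, start_delay=3, find_significant_lags=None))  # [3]
```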
Args:
    X (pd.DataFrame): Data of shape [n_samples, n_features].

Returns:
    pd.Series: Predicted values.
I'm pretty sure this is correct, but just to double check: we're returning the predictions with the predicted values stacked, right (i.e. in series form and not as a dataframe)?
Lol, this is a copypasta fail. We're returning the unstacked dataframe here
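For readers outside the thread, a small sketch of the stacked vs. unstacked distinction being discussed. Column names are illustrative and this uses plain pandas, not evalml's own stacking helpers:

```python
import pandas as pd

# Unstacked: one column per target series (what predict returns here).
unstacked = pd.DataFrame(
    {"target_a": [1, 2], "target_b": [3, 4]},
    index=pd.RangeIndex(2, name="time"),
)

# Stacked: a single long Series, interleaving the series values per
# timestamp; the docstring's "pd.Series" would describe this form.
stacked = unstacked.stack()
print(stacked.tolist())  # [1, 3, 2, 4]
```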
evalml/pipelines/components/estimators/regressors/multiseries_time_series_baseline_regressor.py
Just a few nits, great work
evalml/pipelines/components/estimators/regressors/multiseries_time_series_baseline_regressor.py
@property
def feature_importance(self):
    """Returns importance associated with each feature.

    Since baseline estimators do not use input features to calculate predictions, returns an array of zeroes.

    Returns:
        np.ndarray (float): An array of zeroes.
    """
    importance = np.array([0] * self._num_features)
    return importance
Probably another nit, but if you're calling out all Baseline Estimators...is it worth putting together a story to add a BaselineEstimator class to the inheritance chain and have them all inherit this prop def?
Filed #4255
if categorical_columns and col_name in categorical_columns:
    col = X_categorical[col_name]
    for t in self.statistically_significant_lags:
        lagged_features[f"{col_name}_delay_{t}"] = col.shift(t)
We're not going to be doing any external matching on this name format, right? If so, I think we might want to establish a pattern of making this string format like a module level thing or accessible via the class
Good call - adjusting!
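A sketch of the pattern being suggested: hoist the delay-column name format into a shared constant so any external code that matches on it has a single source of truth. The constant and helper names below are hypothetical:

```python
# Module-level (or class-level) constant instead of an inline f-string,
# so consumers can build or parse delay-column names consistently.
DELAY_COL_FORMAT = "{col_name}_delay_{t}"

def delay_column_name(col_name, t):
    """Build the delayed-feature column name for column `col_name`, lag `t`."""
    return DELAY_COL_FORMAT.format(col_name=col_name, t=t)

print(delay_column_name("feature", 2))  # feature_delay_2
```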
Closes #4241
Implementation assumes one column per target series, which is not the final state of the input. Changes will probably have to be made once stacking/unstacking functions are implemented and this is integrated into search.
Implementation also assumes integer indices.
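To make the two stated assumptions concrete, a sketch of the interim input format: one column per target series, with a plain integer index. The series names here are illustrative:

```python
import pandas as pd

# One column per target series; the default RangeIndex satisfies the
# integer-index assumption noted above.
y_unstacked = pd.DataFrame({
    "series_0": [1.0, 2.0, 3.0],
    "series_1": [4.0, 5.0, 6.0],
})

print(list(y_unstacked.columns))   # ['series_0', 'series_1']
print(y_unstacked.index.tolist())  # [0, 1, 2]
```

Once the stacking/unstacking utilities land, this per-column layout would presumably be produced from (and converted back to) a single stacked series during search.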