[ENH] Experimental PR for using LightningDataModule for TimeSeriesDataset (version -A) #1770
base: main
Conversation
Hi @fkiraly, @agobbifbk, I have also created a basic example notebook: https://colab.research.google.com/drive/1FvLlmEOgm3D3JgNFVeAtwPk4cXagJ0CY?usp=sharing Please have a look at it. |
This is brilliant! I think we are nearly there now.
|
I have added the inference vignette to the nb as well |
It seems a good starting point. I will review as soon as possible; meanwhile I see some breaking points:

```python
if self.data_future is not None:
    if self.group:
        future_mask = self.data_future.groupby(self.group).groups[group_id]
        future_data = self.data_future.loc[future_mask]
    else:
        future_data = self.data_future
    ## we can discuss if it is needed
    result.update(
        {
            "t_f": torch.tensor(future_data[self.time].values),  ## care, in case of timestamp this will not work
            "x_f": torch.tensor(future_data[self.known].values),
        }
    )
```

I see you define the scaler but there is no

```python
if isinstance(target_normalizer, str) and target_normalizer.lower() == "auto":
    self.target_normalizer = RobustScaler()
else:
    self.target_normalizer = target_normalizer
```

This step may not work if not all the keys are defined:

```python
x = {
    "encoder_cat": data["features"]["categorical"][encoder_indices],
    "encoder_cont": data["features"]["continuous"][encoder_indices],
    "decoder_cat": data["features"]["categorical"][decoder_indices],
    "decoder_cont": data["features"]["continuous"][decoder_indices],
    "encoder_lengths": torch.tensor(enc_length),
    "decoder_lengths": torch.tensor(pred_length),
    "decoder_target_lengths": torch.tensor(pred_length),
    "groups": data["group"],
    "encoder_time_idx": torch.arange(enc_length),
    "decoder_time_idx": torch.arange(enc_length, enc_length + pred_length),
    "target_scale": target_scale,
}
```

I will go deeper on that in the next hours! |
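The timestamp caveat flagged above ("in case of timestamp this will not work") could be handled by converting datetime columns to integers before building the tensor, since `torch.tensor()` cannot consume `datetime64` values directly. A minimal sketch (the helper name `time_to_tensor` is hypothetical, not part of the PR):

```python
import numpy as np
import pandas as pd
import torch

def time_to_tensor(values: pd.Series) -> torch.Tensor:
    """Convert a time column to a tensor, handling datetime columns.

    torch.tensor() raises on datetime64 data, so timestamps are first
    mapped to int64 nanosecond counts (a common, lossless encoding).
    """
    if pd.api.types.is_datetime64_any_dtype(values):
        values = values.astype("int64")  # nanoseconds since epoch
    return torch.tensor(np.asarray(values))

# usage: a datetime time index becomes an int64 tensor
future_data = pd.DataFrame(
    {"time": pd.date_range("2024-01-01", periods=3, freq="D")}
)
t_f = time_to_tensor(future_data["time"])
```

Plain integer time indices pass through unchanged, so the same helper works for both cases.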
Thanks for the review @agobbifbk !
Yes, actually, I had some doubts about the scalers, transforms, and normalizers, like where we should define them and perform the |
The suggested tests - if robust - would probably reveal such issues (and prevent them from being introduced) |
Hi @fkiraly, @agobbifbk, I have added some tests; will they suffice, or do we need to add something else as well? |
This is brilliant! I think the new tests cover input/output type assertions.
What one might want to add are tests for business logic, for instance, are train/validation sets non-overlapping and contiguous (if they are meant to be, in the design).
But overall, this covers the key logic in the class.
The next stage I would go for is model integration, in a separate PR stacked on top of this. Another option would be automation sugar, e.g., recognizing categorical columns by their dtypes or similar, but I would not go there before a successful end-to-end proof-of-concept for the prototype design.
(also, why are readthedocs failing? is this a soft dep isolation issue?) |
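The business-logic tests suggested above (non-overlapping splits) could look like the following sketch. Here `split_indices` is a hypothetical stand-in mirroring the PR's random-permutation split, not the PR's actual API:

```python
import torch

def split_indices(total: int, train_frac: float, val_frac: float, seed: int = 0):
    """Hypothetical helper mirroring the random-permutation split under test."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(total, generator=g)
    n_train = int(train_frac * total)
    n_val = int(val_frac * total)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

def test_splits_non_overlapping_and_exhaustive():
    train, val, test = split_indices(100, 0.7, 0.15)
    train_s = set(train.tolist())
    val_s = set(val.tolist())
    test_s = set(test.tolist())
    # the three splits must not share any series index
    assert train_s.isdisjoint(val_s)
    assert train_s.isdisjoint(test_s)
    assert val_s.isdisjoint(test_s)
    # together they must cover every series exactly once
    assert len(train_s | val_s | test_s) == 100

test_splits_non_overlapping_and_exhaustive()
```

A contiguity check (e.g., that validation windows follow training windows in time) would need the temporal split discussed below rather than a random permutation.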
Sorry, but these days I'm finishing a project deliverable and didn't have much time to go through it; I will do it between tomorrow and Monday morning. There is something strange to me:

```python
def setup(self, stage: Optional[str] = None):
    total_series = len(self.time_series_dataset)
    self._split_indices = torch.randperm(total_series)
    self._train_size = int(self.train_val_test_split[0] * total_series)
    self._val_size = int(self.train_val_test_split[1] * total_series)

def __len__(self) -> int:
    """Return number of time series in the dataset."""
    return len(self._group_ids)
```

When you split time series you usually do it temporally (ranges of time) or with non-overlapping contiguous elements, and usually it is done for each of the groups. Happy to discuss in the next meeting! |
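The temporal, per-group split described above could look like this sketch (a hypothetical helper, assuming a long-format DataFrame with group and time columns; not the PR's implementation):

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, group_col: str, time_col: str,
                   train_frac: float = 0.7, val_frac: float = 0.15):
    """Split each group's series into temporally contiguous ranges.

    For every group, the earliest train_frac of time steps goes to train,
    the next val_frac to validation, and the remainder to test, so splits
    are contiguous and non-overlapping within each group.
    """
    parts = {"train": [], "val": [], "test": []}
    for _, g in df.sort_values(time_col).groupby(group_col):
        n = len(g)
        n_train = int(train_frac * n)
        n_val = int(val_frac * n)
        parts["train"].append(g.iloc[:n_train])
        parts["val"].append(g.iloc[n_train:n_train + n_val])
        parts["test"].append(g.iloc[n_train + n_val:])
    return {k: pd.concat(v, ignore_index=True) for k, v in parts.items()}

# usage: two groups of 10 time steps each
df = pd.DataFrame({"g": ["a"] * 10 + ["b"] * 10, "t": list(range(10)) * 2})
splits = temporal_split(df, "g", "t")
```

Within each group, every validation timestamp then comes strictly after every training timestamp, which is the property the random `torch.randperm` split above does not guarantee.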
Is there any soft dep issue in ptf as well? I mean I am just using the basic stuff... |
That is also what I thought, but it says |
Codecov Report

Attention: Patch coverage is

```
@@ Coverage Diff @@
##           main    #1770   +/- ##
=======================================
  Coverage      ?   86.85%
=======================================
  Files         ?       47
  Lines         ?     5484
  Branches      ?        0
=======================================
  Hits          ?     4763
  Misses        ?      721
  Partials      ?        0
```

View full report in Codecov by Sentry.
|
Hello, I am relatively new to pytorch-forecasting and to time series forecasting in general. However, I would like to suggest a small improvement that I found useful from my previous experience in vision with pytorch-lightning.

Proposal

I suggest exposing the batch sampler argument in train_dataloader:

pytorch-forecasting/pytorch_forecasting/data/data_modules.py Lines 331 to 338 in 5489462

I think this addition is relatively easy. Maybe there are other useful arguments to expose apart from this.

Motivation

The motivation is to enable the use of custom batch samplers, which are sometimes necessary for specific loss functions (such as triplet loss) that require particular batch structures. Currently a RandomSampler is implicitly created due to `shuffle=True`. However, due to my lack of experience with time series data, I am not sure whether such an addition has real impact, because such types of losses are less relevant to forecasting, or are only suitable for representation learning / model pre-training, etc.

My use case

In my case, for example, to address this missing functionality I implemented a quick and dirty patch on the existing base class (VisionDataModule) so that I could use any of its existing subclasses. Otherwise, redundant subclassing just to expose the sampler argument would have been required. |
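The proposal above could be sketched as follows. This is a hypothetical minimal datamodule (`ToyDataModule` is an assumed name, not pytorch-forecasting's actual API); it only illustrates forwarding a user-supplied sampler to the `DataLoader`:

```python
from typing import Optional

import torch
from torch.utils.data import DataLoader, Dataset, Sampler

class ToyDataModule:
    """Sketch: expose a `sampler` argument on a datamodule's train_dataloader.

    When a sampler is given, `shuffle` must be left off, since DataLoader
    forbids combining the two; with no sampler and shuffle=True, DataLoader
    creates a RandomSampler implicitly (the current behavior).
    """

    def __init__(self, dataset: Dataset, batch_size: int = 4,
                 sampler: Optional[Sampler] = None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.sampler = sampler

    def train_dataloader(self) -> DataLoader:
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            sampler=self.sampler,          # custom sampler takes precedence
            shuffle=self.sampler is None,  # else fall back to random shuffling
        )

# usage: force deterministic, ordered batches via SequentialSampler
from torch.utils.data import TensorDataset, SequentialSampler

ds = TensorDataset(torch.arange(8).float())
dm = ToyDataModule(ds, batch_size=4, sampler=SequentialSampler(ds))
batches = [b[0].tolist() for b in dm.train_dataloader()]
```

A `batch_sampler` argument could be exposed the same way, with the caveat that it is mutually exclusive with `batch_size`, `shuffle`, and `sampler` in the `DataLoader` API.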
Hello @ggalan87, thank you for the suggestion. We are indeed thinking about extending the set of exposed arguments! The sampler will indeed be something useful to control, but the loss function can also be part of the rework! Do you have any suggestions (keeping in mind that we need to keep it as abstract as possible for compatibility between different data sources and model layers)? |
Tries to implement the D2 layer (with minimal functionality) using LightningDataModule.
Fixes #1766