
[ENH] Experimental PR for using LightningDataModule for TimeSeriesDataset (version -A) #1770


Draft · wants to merge 8 commits into main

Conversation

@phoeenniixx commented Feb 16, 2025

Tries to implement the D2 layer (with minimal functionality) using LightningDataModule.

Fixes #1766

@phoeenniixx (Author)

Hi @fkiraly, @agobbifbk,
Although I have tried to make it as realistic as possible, there may be some logical issues; please let me know if you find any.

I also have created a basic example notebook: https://colab.research.google.com/drive/1FvLlmEOgm3D3JgNFVeAtwPk4cXagJ0CY?usp=sharing

Please have a look at it.

@fkiraly (Collaborator) commented Feb 16, 2025

This is brilliant! I think we are nearly there now.
Could you also try the following?

  • adding an inference vignette, i.e., how to apply the model once trained, possibly on different data (a sketch follows below)
  • trying to interface one of the current ptf models, to see how difficult it is in terms of adaptation (this could be a separate PR or notebook)
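For context, a minimal sketch of what such an inference vignette could look like. TimeSeriesDataModule, its constructor arguments, and MyForecastingModel are hypothetical placeholder names, not this PR's actual API; only the Trainer calls are standard Lightning:

    import lightning.pytorch as pl

    # Hypothetical names throughout; only the Trainer calls are real Lightning API.
    datamodule = TimeSeriesDataModule(train_df, time="time", target="value", group=["id"])
    model = MyForecastingModel()  # any LightningModule-based forecaster

    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, datamodule=datamodule)

    # Inference on different data: build a fresh datamodule over unseen series
    # and reuse the already-fitted model.
    new_dm = TimeSeriesDataModule(new_df, time="time", target="value", group=["id"])
    predictions = trainer.predict(model, datamodule=new_dm)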

@fkiraly added the labels "enhancement (New feature or request)" and "API design (API design & software architecture)" on Feb 16, 2025
@phoeenniixx (Author)

I have added the inference vignette to the notebook as well.

@agobbifbk

It seems like a good starting point. I will review as soon as possible; meanwhile, I see some breaking points:

    if self.data_future is not None:
        if self.group:
            future_mask = self.data_future.groupby(self.group).groups[group_id]
            future_data = self.data_future.loc[future_mask]
        else:
            future_data = self.data_future
        ## we can discuss if it is needed
        result.update(
            {
                "t_f": torch.tensor(future_data[self.time].values),  ## care, in case of timestamp this will not work
                "x_f": torch.tensor(future_data[self.known].values),
            }
        )
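Regarding the timestamp caveat flagged in the comment above, one possible workaround, sketched with an illustrative helper name: convert datetime values to a numeric representation before wrapping them in a tensor.

    import numpy as np
    import torch

    def to_time_tensor(values):
        """Sketch: build a tensor from a time column, handling datetimes.

        datetime64 values become int64 nanoseconds since the epoch;
        plain numeric time indices pass through unchanged.
        """
        values = np.asarray(values)
        if np.issubdtype(values.dtype, np.datetime64):
            values = values.astype("datetime64[ns]").astype(np.int64)
        return torch.tensor(values)

    # e.g. result["t_f"] = to_time_tensor(future_data[self.time].values)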

I see you define the scaler, but there is no fit or fit_transform call:

    if isinstance(target_normalizer, str) and target_normalizer.lower() == "auto":
        self.target_normalizer = RobustScaler()
    else:
        self.target_normalizer = target_normalizer
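One way this could be handled, as a rough sketch only (the placement inside setup() and the train_data / sample_data names are assumptions, not the PR's actual design): fit the normalizer once on the training slice of the target, then only transform when building samples.

    from sklearn.preprocessing import RobustScaler

    # Sketch (placement assumed): fit once on the training target values,
    # which sklearn expects as a 2D array.
    self.target_normalizer = RobustScaler()
    train_target = train_data[self.target].to_numpy().reshape(-1, 1)
    self.target_normalizer.fit(train_target)

    # Later, when building a sample, only transform with the fitted scaler:
    scaled_target = self.target_normalizer.transform(
        sample_data[self.target].to_numpy().reshape(-1, 1)
    )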

This step may not work if not all the keys are defined:

    x = {
        "encoder_cat": data["features"]["categorical"][encoder_indices],
        "encoder_cont": data["features"]["continuous"][encoder_indices],
        "decoder_cat": data["features"]["categorical"][decoder_indices],
        "decoder_cont": data["features"]["continuous"][decoder_indices],
        "encoder_lengths": torch.tensor(enc_length),
        "decoder_lengths": torch.tensor(pred_length),
        "decoder_target_lengths": torch.tensor(pred_length),
        "groups": data["group"],
        "encoder_time_idx": torch.arange(enc_length),
        "decoder_time_idx": torch.arange(enc_length, enc_length + pred_length),
        "target_scale": target_scale,
    }
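A defensive variant, sketched under the assumption that the feature dict may lack the "categorical" or "continuous" blocks: substitute empty tensors so all keys always exist.

    import torch

    # Sketch: fall back to empty tensors when a feature block is absent,
    # so downstream models can rely on every key being present.
    feats = data["features"]
    empty = torch.zeros((enc_length + pred_length, 0))
    categorical = feats.get("categorical", empty)
    continuous = feats.get("continuous", empty)

    x = {
        "encoder_cat": categorical[encoder_indices],
        "encoder_cont": continuous[encoder_indices],
        "decoder_cat": categorical[decoder_indices],
        "decoder_cont": continuous[decoder_indices],
        # ... remaining keys unchanged from the snippet above
    }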

I will go deeper into it in the next few hours!

@phoeenniixx (Author) commented Feb 17, 2025

Thanks for the review, @agobbifbk!

I see you define the scaler, but there is no fit or fit_transform call.
This step may not work if not all the keys are defined.

Yes, actually, I had some doubts about the scalers, transforms, and normalizers: where should we define them and perform the fit or fit_transform?
For now, I just wanted to see whether this approach worked; afterwards we can make the preprocessing more robust and complete.
I just wanted confirmation that this is the direction we want to move in; then we can start making the output as close as possible to the existing TimeSeriesDataset.

@fkiraly (Collaborator) commented Feb 17, 2025

It seems like a good starting point. I will review as soon as possible; meanwhile, I see some breaking points.

The suggested tests, if robust, would probably reveal such issues (and prevent them from being introduced).

@phoeenniixx (Author)

Hi @fkiraly, @agobbifbk, I have added some tests; will they suffice, or do we need to add something else as well?

@fkiraly (Collaborator) left a comment


This is brilliant! I think the new tests cover input/output type assertions.

What one might want to add are tests for business logic, for instance: are the train/validation sets non-overlapping and contiguous (if they are meant to be, in the design)? A sketch of such a test follows below.

But overall, this covers the key logic in the class.
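For instance, a hedged sketch of such a test, assuming the data module exposes the _split_indices, _train_size, and _val_size attributes shown in the setup() snippet quoted later in this thread, and that a datamodule pytest fixture exists:

    def test_train_val_splits_disjoint(datamodule):
        """Sketch: train and validation index sets should not overlap."""
        datamodule.setup()
        idx = datamodule._split_indices  # assumed attribute
        train_idx = set(idx[: datamodule._train_size].tolist())
        val_end = datamodule._train_size + datamodule._val_size
        val_idx = set(idx[datamodule._train_size : val_end].tolist())
        assert train_idx.isdisjoint(val_idx)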

The next stage I would go for is model integration, in a separate PR stacked on top of this one. Another option would be automation sugar, e.g., recognizing categorical columns by their dtypes or similar, but I would not go there before a successful end-to-end proof of concept for the prototype design.

@fkiraly (Collaborator) commented Feb 20, 2025

(also, why are readthedocs failing? is this a soft dep isolation issue?)

@agobbifbk

Hi @fkiraly, @agobbifbk, I have added some tests; will they suffice, or do we need to add something else as well?

Sorry, but these days I am finishing a project deliverable and did not have much time to go through it; I will do it between tomorrow and Monday morning.

There is something that seems strange to me:

    def setup(self, stage: Optional[str] = None):
        total_series = len(self.time_series_dataset)
        self._split_indices = torch.randperm(total_series)

        self._train_size = int(self.train_val_test_split[0] * total_series)
        self._val_size = int(self.train_val_test_split[1] * total_series)

len(self.time_series_dataset) is the number of time series you have (aka groups):

    def __len__(self) -> int:
        """Return number of time series in the dataset."""
        return len(self._group_ids)

When you split time series, you usually do it temporally (over ranges of time) or with non-overlapping contiguous elements, and usually this is done within each of the groups.
Here you are, for example, putting groups 0 and 1 in train and group 2 in test. This is something you can definitely do, nothing wrong with it, but it is the hardest task you can imagine (and you also need to remember that you cannot use the group index as a variable in the model, since there are groups not seen during training). I would expect something different that is more intuitive to me, but it could be difficult to implement in this framework. My notebook here https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST_v2.ipynb contains a first attempt at solving it for the case where we do not precompute the training dataset.
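For comparison, a rough sketch of the per-group temporal split described above, splitting each series along time instead of assigning whole groups to train/test (the dataframe and column names are illustrative):

    import pandas as pd

    def temporal_split(df, time_col="time", group_col="group", train_frac=0.7):
        """Sketch: split each group along time, keeping the earliest
        train_frac of its observations for training and the rest for testing."""
        train_parts, test_parts = [], []
        for _, g in df.sort_values(time_col).groupby(group_col):
            cut = int(len(g) * train_frac)
            train_parts.append(g.iloc[:cut])
            test_parts.append(g.iloc[cut:])
        return pd.concat(train_parts), pd.concat(test_parts)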

Happy to discuss in the next meeting!

@phoeenniixx (Author)

(also, why are readthedocs failing? is this a soft dep isolation issue?)

Is there any soft dep issue in ptf as well? I mean, I am just using the basic stuff... sklearn, torch, etc.; those are core dependencies here, no?

@fkiraly (Collaborator) commented Feb 22, 2025

Is there any soft dep issue in ptf as well? I mean, I am just using the basic stuff... sklearn, torch, etc.; those are core dependencies here, no?

That is also what I thought, but the message reports a sphinx extension error. It seems to be related to nbsphinx.

@phoeenniixx changed the title from "[ENH] Experimental PR for using LightningDataModule for TimeSeriesDataset" to "[ENH] Experimental PR for using LightningDataModule for TimeSeriesDataset (version -A)" on Mar 20, 2025
codecov bot commented Apr 4, 2025

Codecov Report

Attention: Patch coverage is 89.61749% with 19 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@e87230b).

Files with missing lines                     Patch %    Lines
pytorch_forecasting/data/data_modules.py    90.22%     13 Missing ⚠️
pytorch_forecasting/data/timeseries.py      88.00%     6 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1770   +/-   ##
=======================================
  Coverage        ?   86.85%           
=======================================
  Files           ?       47           
  Lines           ?     5484           
  Branches        ?        0           
=======================================
  Hits            ?     4763           
  Misses          ?      721           
  Partials        ?        0           
Flag     Coverage Δ
cpu      86.85% <89.61%> (?)
pytest   86.85% <89.61%> (?)

Flags with carried forward coverage won't be shown.

@ggalan87 commented Apr 22, 2025

Hello, I am relatively new to pytorch-forecasting and to time series forecasting in general. However, I would like to suggest a small improvement that I found useful in my previous experience in vision with pytorch-lightning.

Proposal

I suggest exposing the batch sampler argument in train_dataloader:

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=True,
            collate_fn=self.collate_fn,
        )

I think this addition is relatively easy; a sketch follows below. Maybe there are other useful arguments to expose apart from this one.
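A minimal sketch of how this could look, assuming the data module grows a train_sampler argument (the name is an assumption); note that DataLoader does not allow shuffle=True together with a custom sampler:

    from torch.utils.data import DataLoader

    def train_dataloader(self):
        # Sketch: use a user-supplied sampler when given, otherwise fall back
        # to shuffling; DataLoader forbids shuffle=True with a custom sampler.
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            sampler=self.train_sampler,
            shuffle=self.train_sampler is None,
            collate_fn=self.collate_fn,
        )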

Motivation

The motivation is to enable the use of custom batch samplers, which are sometimes necessary for specific loss functions, such as triplet loss, that require particular batch structures. Currently, a RandomSampler is implicitly created due to shuffle=True in the DataLoader.

However, due to my lack of experience with time series data, I am not sure whether such an addition has real impact, because such losses may be less relevant to forecasting, or only suitable for representation learning, model pre-training, etc.

My use-case

In my case, for example, to address this missing functionality I implemented a quick-and-dirty patch on the existing base class (VisionDataModule) so that I could use any existing subclass. Otherwise, redundant subclassing just to expose the sampler argument would have been required.

@agobbifbk

Hello @ggalan87, thank you for the suggestion. We are indeed thinking of extending the set of arguments to expose! The sampler will indeed be something useful to control, but the loss function can also be part of the rework! Do you have any suggestions (keeping in mind that we need to keep it as abstract as possible, for compatibility between different data sources and model layers)?
THX

Labels
API design (API design & software architecture) · enhancement (New feature or request)
Development

Successfully merging this pull request may close these issues.

[ENH] rework TimeSeriesDataSet using LightningDataModule - experimental
4 participants