
[ENH] Experimental PR for using LightningDataModule for TimeSeriesDataset (version -A) #1770


Draft · wants to merge 8 commits into main

Conversation

@phoeenniixx commented Feb 16, 2025

Tries to implement the D2 layer (with minimal functionality) using LightningDataModule.

Fixes #1766

@phoeenniixx (Author)

Hi @fkiraly, @agobbifbk,
Although I have tried to make it as realistic as possible, there may be some logical issues; please let me know if you find any.

I also have created a basic example notebook: https://colab.research.google.com/drive/1FvLlmEOgm3D3JgNFVeAtwPk4cXagJ0CY?usp=sharing

Please have a look at it.

@fkiraly (Collaborator) commented Feb 16, 2025

This is brilliant! I think we are nearly there now.
Could you also try the following?

  • adding an inference vignette, i.e., how to apply the model once trained, possibly on different data (a sketch follows below)
  • trying to interface one of the current ptf models, to see how difficult it is in terms of adaptation (this could be a separate PR or notebook)
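For context, a minimal sketch of what such an inference vignette could look like. TimeSeriesDataModule, its constructor arguments, and MyForecastingModel are hypothetical placeholder names, not this PR's actual API; only the Trainer calls are standard Lightning:

    import lightning.pytorch as pl

    # Hypothetical names throughout; only the Trainer calls are real Lightning API.
    datamodule = TimeSeriesDataModule(train_df, time="time", target="value", group=["id"])
    model = MyForecastingModel()  # any LightningModule-based forecaster

    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(model, datamodule=datamodule)

    # Inference on different data: build a fresh datamodule over unseen series
    # and reuse the already-fitted model.
    new_dm = TimeSeriesDataModule(new_df, time="time", target="value", group=["id"])
    predictions = trainer.predict(model, datamodule=new_dm)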

@fkiraly added the labels "enhancement (New feature or request)" and "API design (API design & software architecture)" on Feb 16, 2025
@phoeenniixx (Author)

I have added the inference vignette to the notebook as well.

@agobbifbk

It seems like a good starting point. I will review as soon as possible; meanwhile, I see some breaking points:

    if self.data_future is not None:
        if self.group:
            future_mask = self.data_future.groupby(self.group).groups[group_id]
            future_data = self.data_future.loc[future_mask]
        else:
            future_data = self.data_future
        ## we can discuss if it is needed
        result.update(
            {
                "t_f": torch.tensor(future_data[self.time].values),  ## care, in case of timestamp this will not work
                "x_f": torch.tensor(future_data[self.known].values),
            }
        )
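Regarding the timestamp caveat flagged in the comment above, one possible workaround, sketched with an illustrative helper name: convert datetime values to a numeric representation before wrapping them in a tensor.

    import numpy as np
    import torch

    def to_time_tensor(values):
        """Sketch: build a tensor from a time column, handling datetimes.

        datetime64 values become int64 nanoseconds since the epoch;
        plain numeric time indices pass through unchanged.
        """
        values = np.asarray(values)
        if np.issubdtype(values.dtype, np.datetime64):
            values = values.astype("datetime64[ns]").astype(np.int64)
        return torch.tensor(values)

    # e.g. result["t_f"] = to_time_tensor(future_data[self.time].values)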

I see you define the scaler, but there is no fit or fit_transform call:

    if isinstance(target_normalizer, str) and target_normalizer.lower() == "auto":
        self.target_normalizer = RobustScaler()
    else:
        self.target_normalizer = target_normalizer
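One way this could be handled, as a rough sketch only (the placement inside setup() and the train_data / sample_data names are assumptions, not the PR's actual design): fit the normalizer once on the training slice of the target, then only transform when building samples.

    from sklearn.preprocessing import RobustScaler

    # Sketch (placement assumed): fit once on the training target values,
    # which sklearn expects as a 2D array.
    self.target_normalizer = RobustScaler()
    train_target = train_data[self.target].to_numpy().reshape(-1, 1)
    self.target_normalizer.fit(train_target)

    # Later, when building a sample, only transform with the fitted scaler:
    scaled_target = self.target_normalizer.transform(
        sample_data[self.target].to_numpy().reshape(-1, 1)
    )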

This step may not work if not all the keys are defined:

    x = {
        "encoder_cat": data["features"]["categorical"][encoder_indices],
        "encoder_cont": data["features"]["continuous"][encoder_indices],
        "decoder_cat": data["features"]["categorical"][decoder_indices],
        "decoder_cont": data["features"]["continuous"][decoder_indices],
        "encoder_lengths": torch.tensor(enc_length),
        "decoder_lengths": torch.tensor(pred_length),
        "decoder_target_lengths": torch.tensor(pred_length),
        "groups": data["group"],
        "encoder_time_idx": torch.arange(enc_length),
        "decoder_time_idx": torch.arange(enc_length, enc_length + pred_length),
        "target_scale": target_scale,
    }
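A defensive variant, sketched under the assumption that the feature dict may lack the "categorical" or "continuous" blocks: substitute empty tensors so all keys always exist.

    import torch

    # Sketch: fall back to empty tensors when a feature block is absent,
    # so downstream models can rely on every key being present.
    feats = data["features"]
    empty = torch.zeros((enc_length + pred_length, 0))
    categorical = feats.get("categorical", empty)
    continuous = feats.get("continuous", empty)

    x = {
        "encoder_cat": categorical[encoder_indices],
        "encoder_cont": continuous[encoder_indices],
        "decoder_cat": categorical[decoder_indices],
        "decoder_cont": continuous[decoder_indices],
        # ... remaining keys unchanged from the snippet above
    }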

I will go deeper into it in the next few hours!

@phoeenniixx (Author) commented Feb 17, 2025

Thanks for the review, @agobbifbk!

I see you define the scaler, but there is no fit or fit_transform call.
This step may not work if not all the keys are defined.

Yes, actually, I had some doubts about the scalers, transforms, and normalizers: where should we define them and perform the fit or fit_transform?
For now, I just wanted to see whether this approach worked; afterwards we can make the preprocessing more robust and complete.
I just wanted confirmation that this is the direction we want to move in; then we can start making the output as close as possible to the existing TimeSeriesDataset.

@fkiraly (Collaborator) commented Feb 17, 2025

It seems like a good starting point. I will review as soon as possible; meanwhile, I see some breaking points.

The suggested tests, if robust, would probably reveal such issues (and prevent them from being introduced).

@phoeenniixx (Author)

Hi @fkiraly, @agobbifbk, I have added some tests; will they suffice, or do we need to add something else as well?

@fkiraly (Collaborator) left a comment


This is brilliant! I think the new tests cover input/output type assertions.

What one might want to add are tests for business logic, for instance: are the train/validation sets non-overlapping and contiguous (if they are meant to be, in the design)? A sketch of such a test follows below.

But overall, this covers the key logic in the class.
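For instance, a hedged sketch of such a test, assuming the data module exposes the _split_indices, _train_size, and _val_size attributes shown in the setup() snippet quoted later in this thread, and that a datamodule pytest fixture exists:

    def test_train_val_splits_disjoint(datamodule):
        """Sketch: train and validation index sets should not overlap."""
        datamodule.setup()
        idx = datamodule._split_indices  # assumed attribute
        train_idx = set(idx[: datamodule._train_size].tolist())
        val_end = datamodule._train_size + datamodule._val_size
        val_idx = set(idx[datamodule._train_size : val_end].tolist())
        assert train_idx.isdisjoint(val_idx)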

The next stage I would go for is model integration, in a separate PR stacked on top of this one. Another option would be automation sugar, e.g., recognizing categorical columns by their dtypes or similar, but I would not go there before a successful end-to-end proof of concept for the prototype design.

@fkiraly (Collaborator) commented Feb 20, 2025

(also, why are readthedocs failing? is this a soft dep isolation issue?)

@agobbifbk

Hi @fkiraly, @agobbifbk, I have added some tests; will they suffice, or do we need to add something else as well?

Sorry, but these days I am finishing a project deliverable and did not have much time to go through it; I will do it between tomorrow and Monday morning.

There is something that seems strange to me:

    def setup(self, stage: Optional[str] = None):
        total_series = len(self.time_series_dataset)
        self._split_indices = torch.randperm(total_series)

        self._train_size = int(self.train_val_test_split[0] * total_series)
        self._val_size = int(self.train_val_test_split[1] * total_series)

len(self.time_series_dataset) is the number of time series you have (aka groups):

    def __len__(self) -> int:
        """Return number of time series in the dataset."""
        return len(self._group_ids)

When you split time series, you usually do it temporally (over ranges of time) or with non-overlapping contiguous elements, and usually this is done within each of the groups.
Here you are, for example, putting groups 0 and 1 in train and group 2 in test. This is something you can definitely do, nothing wrong with it, but it is the hardest task you can imagine (and you also need to remember that you cannot use the group index as a variable in the model, since there are groups not seen during training). I would expect something different that is more intuitive to me, but it could be difficult to implement in this framework. My notebook here https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST_v2.ipynb contains a first attempt at solving it for the case where we do not precompute the training dataset.
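For comparison, a rough sketch of the per-group temporal split described above, splitting each series along time instead of assigning whole groups to train/test (the dataframe and column names are illustrative):

    import pandas as pd

    def temporal_split(df, time_col="time", group_col="group", train_frac=0.7):
        """Sketch: split each group along time, keeping the earliest
        train_frac of its observations for training and the rest for testing."""
        train_parts, test_parts = [], []
        for _, g in df.sort_values(time_col).groupby(group_col):
            cut = int(len(g) * train_frac)
            train_parts.append(g.iloc[:cut])
            test_parts.append(g.iloc[cut:])
        return pd.concat(train_parts), pd.concat(test_parts)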

Happy to discuss in the next meeting!

@phoeenniixx (Author)

(also, why are readthedocs failing? is this a soft dep isolation issue?)

Is there any soft dep issue in ptf as well? I mean, I am just using the basic stuff... sklearn, torch, etc.; those are core dependencies here, no?

@fkiraly (Collaborator) commented Feb 22, 2025

Is there any soft dep issue in ptf as well? I mean, I am just using the basic stuff... sklearn, torch, etc.; those are core dependencies here, no?

That is also what I thought, but the message reports a sphinx extension error. It seems to be related to nbsphinx.

@phoeenniixx changed the title from "[ENH] Experimental PR for using LightningDataModule for TimeSeriesDataset" to "[ENH] Experimental PR for using LightningDataModule for TimeSeriesDataset (version -A)" on Mar 20, 2025
codecov bot commented Apr 4, 2025

Codecov Report

Attention: Patch coverage is 89.61749% with 19 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@e87230b).

Files with missing lines                     Patch %    Lines
pytorch_forecasting/data/data_modules.py    90.22%     13 Missing ⚠️
pytorch_forecasting/data/timeseries.py      88.00%     6 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1770   +/-   ##
=======================================
  Coverage        ?   86.85%           
=======================================
  Files           ?       47           
  Lines           ?     5484           
  Branches        ?        0           
=======================================
  Hits            ?     4763           
  Misses          ?      721           
  Partials        ?        0           
Flag     Coverage Δ
cpu      86.85% <89.61%> (?)
pytest   86.85% <89.61%> (?)

Flags with carried forward coverage won't be shown.

@ggalan87 commented Apr 22, 2025

Hello, I am relatively new to pytorch-forecasting and to time series forecasting in general. However, I would like to suggest a small improvement that I found useful in my previous experience in vision with pytorch-lightning.

Proposal

I suggest exposing the batch sampler argument in train_dataloader:

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            shuffle=True,
            collate_fn=self.collate_fn,
        )

I think this addition is relatively easy; a sketch follows below. Maybe there are other useful arguments to expose apart from this one.
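A minimal sketch of how this could look, assuming the data module grows a train_sampler argument (the name is an assumption); note that DataLoader does not allow shuffle=True together with a custom sampler:

    from torch.utils.data import DataLoader

    def train_dataloader(self):
        # Sketch: use a user-supplied sampler when given, otherwise fall back
        # to shuffling; DataLoader forbids shuffle=True with a custom sampler.
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            sampler=self.train_sampler,
            shuffle=self.train_sampler is None,
            collate_fn=self.collate_fn,
        )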

Motivation

The motivation is to enable the use of custom batch samplers, which are sometimes necessary for specific loss functions, such as triplet loss, that require particular batch structures. Currently, a RandomSampler is implicitly created due to shuffle=True in the DataLoader.

However, due to my lack of experience with time series data, I am not sure whether such an addition has real impact, because such losses may be less relevant to forecasting, or only suitable for representation learning, model pre-training, etc.

My use-case

In my case, for example, to address this missing functionality I implemented a quick-and-dirty patch on the existing base class (VisionDataModule) so that I could use any existing subclass. Otherwise, redundant subclassing just to expose the sampler argument would have been required.

@agobbifbk

Hello @ggalan87, thank you for the suggestion. We are indeed thinking of extending the set of arguments to expose! The sampler will indeed be something useful to control, but the loss function can also be part of the rework! Do you have any suggestions (keeping in mind that we need to keep it as abstract as possible, for compatibility between different data sources and model layers)?
THX

Labels
API design (API design & software architecture) · enhancement (New feature or request)
Development

Successfully merging this pull request may close these issues.

[ENH] rework TimeSeriesDataSet using LightningDataModule - experimental
4 participants