Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when saving TensorFlowModelDataset as partition #759

Closed
anabelchuinard opened this issue Aug 21, 2023 · 16 comments · Fixed by #978
Closed

Error when saving TensorFlowModelDataset as partition #759

anabelchuinard opened this issue Aug 21, 2023 · 16 comments · Fixed by #978
Assignees
Labels
bug Something isn't working

Comments

@anabelchuinard
Copy link

anabelchuinard commented Aug 21, 2023

Description

Can't save TensorFlowModelDataset objects as partition.

Context

I am dealing with a project where I have to train several models concurrently. I started writing my code using PartitionedDataset where each partition corresponds to the data relative to one training. When I am trying to save the resulting tensorflow models as a partition, I get an error. I wonder is this has to do with the fact that those inherit from the AbstractVersionedDataset instead of the AbstractDataset. And if yes, I am interested to know if there is any workaround for batch saving those.

This is the instance of my catalog corresponding to the models I want to save:

tensorflow_models:
  type: PartitionedDataset
  path: data/derived/ML/models
  filename_suffix: ".hdf5"
  dataset:
    type: kedro.extras.datasets.tensorflow.TensorFlowModelDataset

Note: Saving one model (not as partition) works.

Steps to Reproduce

  1. Generate a bunch of trained models
  2. Try to save them in a partition as TensorFlowModelDataset objects

Expected Result

Should save one .hdf5 file per partition with the name of the file being the associate dictionary key.

Actual Result

Getting this error:

DatasetError: Failed while saving data to data set PartitionedDataset(dataset_config={}, dataset_type=TensorFlowModelDataset,
path=...).
The first argument to `Layer.call` must always be passed.

Your Environment

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.12
  • Python version used (python -V): 3.9.16
  • Operating system and version: Mac M2
@astrojuanlu
Copy link
Member

Hi @anabelchuinard, thanks for opening this issue and sorry for the delay. It will take us some time but I'm labeling this issue so we don't lose track of it.

@astrojuanlu astrojuanlu added the Community Issue/PR opened by the open-source community label Sep 5, 2023
@merelcht
Copy link
Member

merelcht commented Jul 8, 2024

Hi @anabelchuinard, do you still need help fixing this issue?

@anabelchuinard
Copy link
Author

@merelcht I found a non-kedronic workaround for this but would love to know if there is now a kedronic way for batch-saving those models.

@merelcht
Copy link
Member

merelcht commented Jul 9, 2024

Using the PartitionedDataset is definitely the recommended Kedro way for batch saving. I've done some digging and it seems that the following lines are causing issues for using the TensorFlowModelDataset with PartitionedDataset:

if callable(partition_data):
partition_data = partition_data() # noqa: PLW2901

@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Jul 9, 2024
@merelcht merelcht changed the title Saving TensorFlowModelDataset as partition Error when saving TensorFlowModelDataset as partition Jul 9, 2024
@merelcht merelcht transferred this issue from kedro-org/kedro Jul 9, 2024
@merelcht merelcht added the bug Something isn't working label Jul 9, 2024
@merelcht merelcht moved this to To Do in Kedro Framework Aug 5, 2024
@ElenaKhaustova ElenaKhaustova self-assigned this Jan 6, 2025
@ElenaKhaustova ElenaKhaustova moved this from To Do to In Progress in Kedro Framework Jan 6, 2025
@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Jan 7, 2025

Cause of the issue

The issue is in how we implement partitioned dataset lazy saving. To postpone data loading, we require return Callable types in the dictionary fed to PartitionedDataset instead of the actual object.

if callable(partition_data):
partition_data = partition_data() # noqa: PLW2901

When saving the data, we check if the Callable type was passed and call it to get the actual object. Since the TensorFlow model is callable, we make this call when saving, which causes the above error, though the user didn't mean to apply lazy saving.

So PartitionedDataset cannot save Callable types now, unless they're wrapped with another Callable, for example, lambda.

Current workaround

@anabelchuinard - To make PartitionedDataset save Callable in the current Kedro version you need to wrap an object as if you wanted to do a lazy saving:

save_dict = {
	"tensorflow_model_32": models["tensorflow_model_32"](),
	"tensorflow_model_64": models["tensorflow_model_64"](),
}

# Tensorflow model can be wrapped with lambda, to avoid calling it when saving
save_dict = {
	"tensorflow_model_32": lambda: models["tensorflow_model_32"](),
	"tensorflow_model_64": lambda: models["tensorflow_model_64"](),
}

Suggested fix

Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978

Following PR to update docs

kedro-org/kedro#4402

@noklam
Copy link
Contributor

noklam commented Jan 7, 2025

Suggested fix
Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978

To me this seems to be a niche case, and changing PartitionedDataset to only accept lambda is a bigger breaking change. Any useful callable will likely be more complicated than a simple lambda. Maybe we can disable lazy loading/saving (default enable) when specified?

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Jan 8, 2025

Suggested fix
Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978

To me this seems to be a niche case, and changing PartitionedDataset to only accept lambda is a bigger breaking change. Any useful callable will likely be more complicated than a simple lambda. Maybe we can disable lazy loading/saving (default enable) when specified?

I see the point but I think the issue is a little bit broader than this case. Particularly I don't think it's right to call any callable object and use this check to decide if we apply lazy saving. This affects all the ml-models cases (tensorflow, pytorch, scikit-learn, etc.) and potentially can also execute some unwanted code implemented in __call__. Moreover, it's not intuitive for users to wrap their objects to avoid such a behaviour.

In the solution suggested I tried to narrow down these cases from callable to lamda, so there's less chance to get them.

As an alternative, we can consider making lazy saving a default behaviour so we internally wrap and unwrap objects automatically. But here, the question is whether we need to make it the only option (as it is for lazy loading) or provide some interface to disable it.

@DimedS
Copy link
Member

DimedS commented Jan 8, 2025

Thanks for the investigation and PR, @ElenaKhaustova! I agree with @noklam that relying solely on lambda functions for lazy saving doesn't seem like a generic solution. While it is a breaking change, it's hard to determine how much it will impact users. In my opinion, it would be better to avoid treating all Callables as participants in lazy saving by default. However, this would also be a breaking change. As a simpler alternative, we could provide an option to disable lazy saving, as you suggested.

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Jan 10, 2025

@noklam, @DimedS, @astrojuanlu

Based on the above arguments, my suggestion would be to make lazy saving a default behaviour like it's done for lazy loading now. For that, we can wrap and unwrap objects internally (instead of asking users to do so manually like we do now), which will guarantee that the Callable we get is expected to be called.

The other question is whether we should provide an option to disable lazy saving. Are there any known cases when disabling it might be critical? Note that we don't have such an option for lazy loading, so it's always enabled.

Please see the edited suggestion below.

@merelcht merelcht moved this from In Review to To Do in Kedro Framework Jan 13, 2025
@ElenaKhaustova ElenaKhaustova moved this from To Do to In Progress in Kedro Framework Jan 13, 2025
@DimedS
Copy link
Member

DimedS commented Jan 14, 2025

Based on the above arguments, my suggestion would be to make lazy saving a default behaviour like it's done for lazy loading now. For that, we can wrap and unwrap objects internally (instead of asking users to do so manually like we do now), which will guarantee that the Callable we get is expected to be called.

Hi @ElenaKhaustova,
Could you please explain how lazy saving will work? For instance, if I want to enable lazy saving and have a function in one partition that executes some code and returns a pandas DataFrame, how should I modify my function to align with your proposal of wrapping all partitions?

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Jan 14, 2025

@DimedS

Could you please explain how lazy saving will work?

I think the easiest way with minimal changes will be to add lazy argument to save() function with True default value:

def save(self, data: dict[str, Any], lazy: bool =True) -> None:

Then:

  • If the input object is callable and lazy=True we unwrap it
  • If the input object is not callable and lazy=True we do nothing
  • If the input object is callable and lazy=False we do nothing
  • If the input object is not callable and lazy=False we do nothing
  • Lazy saving will be enabled by default similar to lazy loading

So a user will still need to wrap the object as it was required before and this behaviour won't change. But there will be a proper option to disable it. Now in case of working will callable, like a TF model, one needs to wrap it to avoid its calling: #759 (comment)

@DimedS
Copy link
Member

DimedS commented Jan 14, 2025

Thanks, @ElenaKhaustova! If I understand correctly, the default behavior will remain the same as the current one. However, we are adding the option to use lazy=False, which will prevent Callables from being unwrapped, allowing users to apply it in scenarios like TensorFlow. Is that correct? If so, I really like this idea!

@ElenaKhaustova
Copy link
Contributor

Thanks, @ElenaKhaustova! If I understand correctly, the default behavior will remain the same as the current one. However, we are adding the option to use lazy=False, which will prevent Callables from being unwrapped, allowing users to apply it in scenarios like TensorFlow. Is that correct? If so, I really like this idea!

Yes, exactly! That's the way to avoid a breaking change.

@merelcht
Copy link
Member

I think the easiest way with minimal changes will be to add lazy argument to save() function with True default value:

def save(self, data: dict[str, Any], lazy=True) -> None:

This sounds like a good and clean solution to me.

@astrojuanlu
Copy link
Member

@ElenaKhaustova Would users be able to toggle that from catalog.yml?

@ElenaKhaustova
Copy link
Contributor

ElenaKhaustova commented Jan 14, 2025

@ElenaKhaustova Would users be able to toggle that from catalog.yml?

Yes, I think we should also add it to make sure disabling is possible not only programmatically but with kedro run as well.

@ElenaKhaustova ElenaKhaustova moved this from In Progress to In Review in Kedro Framework Jan 16, 2025
@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro Framework Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
6 participants