Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(datasets): Add parameter to enable/disable lazy saving for PartitionedDataset #978

Merged
merged 21 commits into from
Jan 22, 2025

Conversation

ElenaKhaustova
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova commented Jan 7, 2025

Description

Fixes #759

Cause of the issue: #759 (comment)

Dataset lazy saving docs: https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-lazy-saving

Development notes

  1. Added a parameter (save_lazily) to enable/disable lazy saving for PartitionedDataset. The parameter is enabled by default but users still need to wrap objects with callable as it was required before to apply lazy saving, so the change is not breaking.
  2. We didn't add a similar argument for save function as was suggested initially since it contradicts the definition of AbstractDataset.save
  3. Added callable object saving and loading test
  4. Following docs update PR: Update partitioned dataset lazy saving docs kedro#4402

How to test without TensorFlow

The following code will execute without errors and save two objects if save_lazily is set to False. The code will fail and TestSaveCallable.__calll__ will be called if save_lazily is set to True or not set.

from kedro_datasets.partitions import PartitionedDataset


class TestSaveCallable:
    def __init__(self, a: int):
        print(f"--- Initialized {a} ---")

    def __call__(self, *args, **kwargs):
        print("--- Called ---")


class TestSaveNotCallable:
    def __init__(self, a: int):
        print(f"--- Initialized {a} ---")


def main():
    partitioned_dataset = PartitionedDataset(
        path="data/01_raw/pickle",
        dataset="kedro_datasets.pickle.PickleDataset",
        filename_suffix=".pkl",
        save_lazily=False
    )

    save_dict = {
        "object_1": TestSaveNotCallable(1),
        "object_2": TestSaveCallable(2),
    }

    partitioned_dataset.save(save_dict)


if __name__ == "__main__":
    main()

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think callable data (that is not intend to be lazy saved) is rare and is not worth to break PartitionedDataset for it. Left the comment in the original issue since I saw two PRs are associated.

#759 (comment)

@ElenaKhaustova ElenaKhaustova marked this pull request as draft January 14, 2025 14:45
@ElenaKhaustova ElenaKhaustova changed the title fix(datasets): Save callable with PartitionedDataset fix(datasets): Add parameter to enable/disable lazy saving for PartitionedDataset Jan 15, 2025
Signed-off-by: Elena Khaustova <[email protected]>
Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review January 16, 2025 12:08
Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @ElenaKhaustova ! LGTM!

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice and clean solution 👌

Signed-off-by: Elena Khaustova <[email protected]>
@ElenaKhaustova ElenaKhaustova requested a review from noklam January 20, 2025 10:23
Copy link
Member

@Galileo-Galilei Galileo-Galilei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the solution, but a bit confused about the naming choice which I find inconsistent with other datasets. What's the rationale of not using save_args and add save_lazily a key of the dictionary?

@ElenaKhaustova ElenaKhaustova merged commit 1a5e0fa into main Jan 22, 2025
51 of 52 checks passed
@ElenaKhaustova ElenaKhaustova deleted the fix/759-model-as-partition branch January 22, 2025 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when saving TensorFlowModelDataset as partition