fix(datasets): Add parameter to enable/disable lazy saving for `PartitionedDataset` #978

ElenaKhaustova · 2025-01-07T12:24:59Z

Description

Fixes #759

Dataset lazy saving docs: https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html#partitioned-dataset-lazy-saving

Development notes

Added a parameter (save_lazily) to enable or disable lazy saving for PartitionedDataset. The parameter is enabled by default, and users still need to wrap objects in a callable to apply lazy saving, as was required before, so the change is not breaking.
We did not add a similar argument to the save function, as was initially suggested, because it contradicts the definition of AbstractDataset.save.
Added a test for saving and loading callable objects.
Follow-up docs PR: Update partitioned dataset lazy saving docs kedro#4402
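
To illustrate the behaviour described in these notes, here is a minimal, self-contained sketch of how such a flag can gate the callable check in a partition save loop. It is not the actual PartitionedDataset code: `save_partitions`, `save_one` and `Model` are hypothetical names introduced only for this example, and the real implementation may differ in detail.

```python
from typing import Any, Callable


def save_partitions(
    data: dict[str, Any],
    save_one: Callable[[str, Any], None],
    save_lazily: bool = True,
) -> None:
    """Hypothetical stand-in for a partition save loop."""
    for partition_id, partition_data in sorted(data.items()):
        # Callables are invoked only when lazy saving is enabled.
        if save_lazily and callable(partition_data):
            partition_data = partition_data()
        save_one(partition_id, partition_data)


class Model:
    """A callable object that should be stored as-is (for example, a model)."""

    def __call__(self) -> str:
        return "called"


model = Model()

# Default behaviour: every callable value is treated as a lazy producer.
lazy_store: dict[str, Any] = {}
save_partitions({"a": lambda: [1, 2], "m": model}, lazy_store.__setitem__)

# Lazy saving disabled: the callable object itself is saved.
eager_store: dict[str, Any] = {}
save_partitions({"m": model}, eager_store.__setitem__, save_lazily=False)
```

With the default, the `Model` instance is called and its return value is stored instead of the object, which is the situation the new parameter is meant to avoid.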

How to test without TensorFlow

The following code will execute without errors and save two objects if save_lazily is set to False. If save_lazily is set to True or not set, TestSaveCallable.__call__ will be called and the code will fail.

from kedro_datasets.partitions import PartitionedDataset


class TestSaveCallable:
    def __init__(self, a: int):
        print(f"--- Initialized {a} ---")

    def __call__(self, *args, **kwargs):
        print("--- Called ---")


class TestSaveNotCallable:
    def __init__(self, a: int):
        print(f"--- Initialized {a} ---")


def main():
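    partitioned_dataset = PartitionedDataset(
        path="data/01_raw/pickle",
        dataset="kedro_datasets.pickle.PickleDataset",
        filename_suffix=".pkl",
        # Disable lazy saving so callable partitions are saved as-is, not invoked.
        save_lazily=False,
    )

    save_dict = {
        "object_1": TestSaveNotCallable(1),
        # Callable object: with lazy saving enabled it would be called instead of saved.
        "object_2": TestSaveCallable(2),
    }

    partitioned_dataset.save(save_dict)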


if __name__ == "__main__":
    main()
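
As a rough illustration of why the lazy path fails for the snippet above, the sketch below shows what each mode would hand to the underlying dataset. It does not import Kedro; `CallablePartition` is a hypothetical stand-in for TestSaveCallable. The assumption here is that the underlying dataset rejects `None` on save (to my knowledge, Kedro's AbstractDataset.save raises an error for `None`), which would explain the failure.

```python
class CallablePartition:
    """Hypothetical stand-in for TestSaveCallable from the snippet above."""

    def __call__(self, *args, **kwargs):
        # No return statement, so calling the object yields None.
        print("--- Called ---")


partition = CallablePartition()

# Lazy saving enabled: the callable is invoked and its result is what gets saved.
lazy_payload = partition()

# Lazy saving disabled: the object itself is what gets saved.
eager_payload = partition
```

In the lazy case the payload is `None` rather than the object, so nothing useful would reach the pickle file even if the save did not raise.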

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Updated jsonschema/kedro-catalog-X.XX.json if necessary
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes
Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

Signed-off-by: Elena Khaustova <[email protected]>

noklam

I think callable data that is not intended to be lazily saved is rare, and it is not worth breaking PartitionedDataset for it. I left a comment in the original issue since I saw two PRs are associated with it.

#759 (comment)

Signed-off-by: Elena Khaustova <[email protected]>

DimedS

Thanks for the PR, @ElenaKhaustova ! LGTM!

merelcht

Nice and clean solution 👌

kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py

Signed-off-by: Elena Khaustova <[email protected]>

Galileo-Galilei

I really like the solution, but I am a bit confused about the naming choice, which I find inconsistent with other datasets. What is the rationale for not using save_args and adding save_lazily as a key of that dictionary?

kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py

ElenaKhaustova added 9 commits January 7, 2025 12:21

Replaced callable check

aaf4a72

Signed-off-by: Elena Khaustova <[email protected]>

Updated lazy_save test

f35b850

Signed-off-by: Elena Khaustova <[email protected]>

Added test_callable_save

262c059

Signed-off-by: Elena Khaustova <[email protected]>

Fixed lint

e65368d

Signed-off-by: Elena Khaustova <[email protected]>

Fixed docs links

f3388d1

Signed-off-by: Elena Khaustova <[email protected]>

Fixed all docs links

61903eb

Signed-off-by: Elena Khaustova <[email protected]>

Updated release notes

51d62ef

Signed-off-by: Elena Khaustova <[email protected]>

Fixed all docs links

f49a1f8

Signed-off-by: Elena Khaustova <[email protected]>

Fixed typo

3144ba8

Signed-off-by: Elena Khaustova <[email protected]>

This was referenced Jan 7, 2025

Error when saving TensorFlowModelDataset as partition #759

Closed

Update PartitionedDataset lazy saving docs page kedro-org/kedro#4401

Closed

Update partitioned dataset lazy saving docs kedro-org/kedro#4402

Merged

ElenaKhaustova marked this pull request as ready for review January 7, 2025 16:00

ElenaKhaustova requested review from ravi-kumar-pilla, noklam, DimedS and ankatiyar January 7, 2025 16:01

noklam reviewed Jan 7, 2025

ElenaKhaustova marked this pull request as draft January 14, 2025 14:45

ElenaKhaustova added 9 commits January 14, 2025 16:02

Merge branch 'main' into fix/759-model-as-partition

849d895

Signed-off-by: Elena Khaustova <[email protected]>

Added argument to disable lazy saving

b1d0a1d

Signed-off-by: Elena Khaustova <[email protected]>

Removed save function argument

e2755c3

Signed-off-by: Elena Khaustova <[email protected]>

Updated unit test

c49c730

Signed-off-by: Elena Khaustova <[email protected]>

Fixed lint

3df7213

Signed-off-by: Elena Khaustova <[email protected]>

Updated related docs

f6bc42b

Signed-off-by: Elena Khaustova <[email protected]>

Revert test changes

8365e10

Signed-off-by: Elena Khaustova <[email protected]>

Merge branch 'main' into fix/759-model-as-partition

34e3813

Updated baseline

a9d294f

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova changed the title ~~fix(datasets): Save callable with PartitionedDataset~~ fix(datasets): Add parameter to enable/disable lazy saving for PartitionedDataset Jan 15, 2025

Updated release notes

ff2867c

Signed-off-by: Elena Khaustova <[email protected]>

Updated release notes

e4278e2

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova marked this pull request as ready for review January 16, 2025 12:08

ElenaKhaustova requested review from merelcht and astrojuanlu January 16, 2025 12:08

DimedS approved these changes Jan 16, 2025

merelcht approved these changes Jan 17, 2025

kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py (outdated, resolved)

Updated docstrings

32ebe56

Signed-off-by: Elena Khaustova <[email protected]>

ElenaKhaustova requested review from idanov, deepyaman, Galileo-Galilei, datajoely, rashidakanchwala, lrcouto and SajidAlamQB January 17, 2025 11:45

ankatiyar approved these changes Jan 17, 2025

SajidAlamQB approved these changes Jan 20, 2025

ElenaKhaustova requested a review from tynandebold January 20, 2025 10:16

tynandebold approved these changes Jan 20, 2025

ElenaKhaustova requested a review from noklam January 20, 2025 10:23

datajoely approved these changes Jan 20, 2025

noklam approved these changes Jan 20, 2025

rashidakanchwala approved these changes Jan 20, 2025

lrcouto approved these changes Jan 20, 2025

Galileo-Galilei reviewed Jan 20, 2025

kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py (resolved)

Galileo-Galilei approved these changes Jan 22, 2025

ElenaKhaustova merged commit 1a5e0fa into main Jan 22, 2025
51 of 52 checks passed

ElenaKhaustova deleted the fix/759-model-as-partition branch January 22, 2025 12:28
