Update partitioned dataset lazy saving docs (#4402)
* Updated Partitioned dataset lazy saving docs

Signed-off-by: Elena Khaustova <[email protected]>

* Updated release notes

Signed-off-by: Elena Khaustova <[email protected]>

* Fixed typo

Signed-off-by: Elena Khaustova <[email protected]>

* Updated docs based on new solution

Signed-off-by: Elena Khaustova <[email protected]>

* Applied review comments

Signed-off-by: Elena Khaustova <[email protected]>

---------

Signed-off-by: Elena Khaustova <[email protected]>
ElenaKhaustova authored Jan 22, 2025
1 parent fba7c53 commit 9ee181f
Showing 2 changed files with 20 additions and 0 deletions.
1 change: 1 addition & 0 deletions RELEASE.md
@@ -14,6 +14,7 @@
* Safeguard hooks when user incorrectly registers a hook class in settings.py.
* Fixed parsing paths with query and fragment.
* Remove lowercase transformation in regex validation.
* Updated `Partitioned dataset lazy saving` docs page.

## Breaking changes to the API
## Documentation changes
19 changes: 19 additions & 0 deletions docs/source/data/partitioned_and_incremental_datasets.md
@@ -175,6 +175,7 @@ new_partitioned_dataset:
  path: s3://my-bucket-name
  dataset: pandas.CSVDataset
  filename_suffix: ".csv"
  save_lazily: True
```
Here is the node definition:
@@ -238,6 +239,24 @@ def create_partitions() -> Dict[str, Callable[[], Any]]:
When using lazy saving, the dataset will be written _after_ the `after_node_run` [hook](../hooks/introduction).
```

```{note}
Lazy saving is the default behaviour, meaning that if a `Callable` type is provided, the dataset will be written _after_ the `after_node_run` hook is executed.
```
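
The deferred evaluation described in the note can be sketched in plain Python. This is a minimal illustration, not Kedro code: `make_partition`, the `computed` list, and the final loop are hypothetical stand-ins, and plain lists replace DataFrames so that the example has no dependencies. It shows that a node returning callables computes nothing until each callable is invoked at save time.

```python
from typing import Any, Callable, Dict, List

# Records which partitions have actually been computed (illustration only).
computed: List[str] = []


def make_partition(name: str) -> Callable[[], List[int]]:
    """Return a zero-argument callable that builds the partition on demand."""

    def build() -> List[int]:
        computed.append(name)
        return [1, 2, 3]

    return build


def create_partitions() -> Dict[str, Callable[[], Any]]:
    # Node output: partition IDs mapped to callables, not to data.
    return {f"part/{name}": make_partition(name) for name in ("foo", "bar")}


partitions = create_partitions()
# Nothing has been computed yet; only the callables exist.
assert computed == []

# Hypothetical stand-in for the save step: each callable is invoked
# only at the moment its partition is written.
for partition_id, build in sorted(partitions.items()):
    data = build()
```

Because each partition is built only when it is about to be written, the full set of partitions never has to be held in memory at once.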

In certain cases, it might be useful to disable lazy saving, such as when your object is already a `Callable` (e.g., a TensorFlow model) and you do not intend to save it lazily.
To disable lazy saving, set the `save_lazily` parameter to `False`:

```yaml
# conf/base/catalog.yml

new_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name
  dataset: pandas.CSVDataset
  filename_suffix: ".csv"
  save_lazily: False
```
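
The effect of the flag on a callable object can be sketched as follows. This is a simplified illustration under stated assumptions, not Kedro's actual implementation: `CallableModel` and `resolve_for_saving` are hypothetical names introduced only for this example.

```python
from typing import Any


class CallableModel:
    """Hypothetical stand-in for an object that is callable, such as a model."""

    def __call__(self, x: float) -> float:
        return 2 * x


def resolve_for_saving(partition_data: Any, save_lazily: bool) -> Any:
    """Decide what gets written for one partition (illustration only)."""
    if save_lazily and callable(partition_data):
        # Lazy saving: call the object with no arguments to obtain the data.
        return partition_data()
    # Lazy saving disabled, or the data is not callable: save the object as-is.
    return partition_data


model = CallableModel()

# With lazy saving disabled, the callable object itself is what gets saved.
assert resolve_for_saving(model, save_lazily=False) is model

# With lazy saving enabled, a zero-argument callable is evaluated first.
assert resolve_for_saving(lambda: [1, 2], save_lazily=True) == [1, 2]
```

In this sketch, passing `model` with `save_lazily=True` would raise a `TypeError`, because the object would be called without the argument it expects. That is the kind of situation in which disabling lazy saving is useful.
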
## Incremental datasets
{class}`IncrementalDataset<kedro-datasets:kedro_datasets.partitions.IncrementalDataset>` is a subclass of `PartitionedDataset`, which stores the information about the last processed partition in the so-called `checkpoint`. `IncrementalDataset` addresses the use case when partitions have to be processed incrementally, that is, each subsequent pipeline run should process just the partitions which were not processed by the previous runs.