
Is dataset caching persistent across runs? #32

Open
charlesbmi opened this issue Jul 12, 2024 · 1 comment


@charlesbmi

Thanks for creating this awesome project! I am excited to use it as a plugin for Kedro pipeline-parameter sweeps (e.g. via Hydra or Optuna).

I was interested in this point in the README:

you can run the same session multiple times with many speed optimisation (including dataset caching)

but I couldn't find any information about it in the codebase. Is it implemented? If so, is the dataset cached to disk across session runs, or is it just kedro.io.CachedDataSet under the hood?

@takikadiri
Owner

Hi charlesbmi, I'm glad you like the project!

Yes, the cached datasets are persisted across runs. kedro-boot caches/preloads some datasets as MemoryDataset in order to speed up runs and achieve low latency. The process of preparing the catalog for multiple runs is called catalog compilation. You can dry-run the compilation process with kedro boot compile --pipeline your_pipeline; the artifact datasets that would be cached are listed in the compilation report.
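Conceptually, the compilation step works like the following pure-Python sketch. The names here (compile_catalog, run, the "features"/"params" datasets) are illustrative only, not the actual kedro-boot API: datasets the application will not inject are loaded once into memory, and every run reuses them.

```python
# Conceptual sketch of catalog "compilation" (illustrative names only,
# not the kedro-boot API): datasets that stay stable across runs are
# preloaded into memory once; only app-injected inputs change per run.

def compile_catalog(catalog_loaders, injected_names):
    """Preload every dataset except the ones the app will inject."""
    memory_cache = {}
    for name, loader in catalog_loaders.items():
        if name not in injected_names:
            memory_cache[name] = loader()  # loaded once, reused per run
    return memory_cache

def run(memory_cache, injected, node):
    """Merge cached datasets with per-run injected values and run a node."""
    inputs = {**memory_cache, **injected}  # injected values take priority
    return node(inputs)

# Hypothetical usage: "features" is preloaded once; "params" is injected
# fresh on each run by the thin application.
loaders = {"features": lambda: [1, 2, 3]}
cache = compile_catalog(loaders, injected_names={"params"})
result = run(cache, {"params": {"lr": 0.1}},
             lambda d: (d["features"], d["params"]["lr"]))
```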

In your use case, you would have a thin application that injects some parameters into your pipelines; kedro-boot would preload all the other datasets as MemoryDataset, since they do not change between runs.
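For a Hydra- or Optuna-style sweep, the thin application amounts to a loop over parameter sets against the same preloaded catalog. A minimal stand-in (plain Python, no kedro-boot API implied) showing why the expensive dataset is loaded only once:

```python
# Minimal stand-in for a parameter sweep over a preloaded catalog:
# the expensive dataset is loaded once; each "run" only varies params.

load_count = 0

def load_expensive_dataset():
    """Pretend this hits disk or a database; we count the calls."""
    global load_count
    load_count += 1
    return list(range(100))

# "Compile" once: preload the dataset into memory before any runs.
cached = {"data": load_expensive_dataset()}

def pipeline_run(catalog, params):
    # Toy pipeline: scale the cached data by a swept parameter.
    return [x * params["scale"] for x in catalog["data"]]

# Sweep three parameter sets; the catalog is never reloaded.
results = [pipeline_run(cached, {"scale": s}) for s in (1, 2, 3)]
```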

Let us know if it works for you.
