
Is dataset caching persistent across runs? #32

Open
charlesbmi opened this issue Jul 12, 2024 · 1 comment


@charlesbmi

Thanks for creating this awesome project! I am excited to use it as a plugin for Kedro pipeline-parameter sweeps (e.g. via Hydra or Optuna).

I was interested in this point in the README:

you can run the same session multiple times with many speed optimisation (including dataset caching)

but I couldn't find any information about it in the codebase. Is it implemented? If so, is the dataset cached to disk across session runs, or is it just kedro.io.CachedDataSet under the hood?

@takikadiri
Owner

Hi charlesbmi, I'm glad you like the project!

Yes, the cached datasets are persisted across runs. kedro-boot caches/preloads some datasets as MemoryDataset in order to speed up runs and achieve low latency. The process of preparing the catalog for multiple runs is called catalog compilation. You can dry-run the compilation process with kedro boot compile --pipeline your_pipeline; the artifact datasets that would be cached are listed in the compilation report.
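Conceptually, the compilation step works like the following pure-Python sketch. The names here (compile_catalog, run, the "features"/"params" datasets) are illustrative only, not the actual kedro-boot API: datasets the application will not inject are loaded once into memory, and every run reuses them.

```python
# Conceptual sketch of catalog "compilation" (illustrative names only,
# not the kedro-boot API): datasets that stay stable across runs are
# preloaded into memory once; only app-injected inputs change per run.

def compile_catalog(catalog_loaders, injected_names):
    """Preload every dataset except the ones the app will inject."""
    memory_cache = {}
    for name, loader in catalog_loaders.items():
        if name not in injected_names:
            memory_cache[name] = loader()  # loaded once, reused per run
    return memory_cache

def run(memory_cache, injected, node):
    """Merge cached datasets with per-run injected values and run a node."""
    inputs = {**memory_cache, **injected}  # injected values take priority
    return node(inputs)

# Hypothetical usage: "features" is preloaded once; "params" is injected
# fresh on each run by the thin application.
loaders = {"features": lambda: [1, 2, 3]}
cache = compile_catalog(loaders, injected_names={"params"})
result = run(cache, {"params": {"lr": 0.1}},
             lambda d: (d["features"], d["params"]["lr"]))
```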

In your use case, you would have a thin application that injects some parameters into your pipelines; kedro-boot would preload all the other datasets as MemoryDataset, since they do not change between runs.
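For a Hydra- or Optuna-style sweep, the thin application amounts to a loop over parameter sets against the same preloaded catalog. A minimal stand-in (plain Python, no kedro-boot API implied) showing why the expensive dataset is loaded only once:

```python
# Minimal stand-in for a parameter sweep over a preloaded catalog:
# the expensive dataset is loaded once; each "run" only varies params.

load_count = 0

def load_expensive_dataset():
    """Pretend this hits disk or a database; we count the calls."""
    global load_count
    load_count += 1
    return list(range(100))

# "Compile" once: preload the dataset into memory before any runs.
cached = {"data": load_expensive_dataset()}

def pipeline_run(catalog, params):
    # Toy pipeline: scale the cached data by a swept parameter.
    return [x * params["scale"] for x in catalog["data"]]

# Sweep three parameter sets; the catalog is never reloaded.
results = [pipeline_run(cached, {"scale": s}) for s in (1, 2, 3)]
```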

Let us know if it works for you.
