Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models.
Suppose you want to version master_table
. To enable versioning, simply add a versioned
entry in catalog.yml
as follows:
master_table:
type: pandas.CSVDataSet
filepath: data/03_primary/master_table.csv
versioned: true
The DataCatalog
will create a versioned CSVDataSet
called master_table
. The actual csv file location will be data/03_primary/master_table.csv/<version>/master_table.csv
, where the first /master_table.csv/
is a directory and <version>
corresponds to a global save version string formatted as YYYY-MM-DDThh.mm.ss.sssZ
.
In a similar way, you can version your machine learning model. Enable versioning for regressor
as follow:
regressor:
type: pickle.PickleDataSet
filepath: data/06_models/regressor.pickle
versioned: true
This will save versioned pickle models every time you run the pipeline.
Note: The list of the datasets supporting versioning can be find in the documentation.
By default, the DataCatalog
will load the latest version of the dataset. However, you can run the pipeline with a particular versioned dataset with --load-version
flag as follows:
kedro run --load-version="master_table:YYYY-MM-DDThh.mm.ss.sssZ"
where --load-version
contains a dataset name and a version timestamp separated by :
.