In this section, we discuss the data set-up phase. The steps are as follows:
- Add datasets to your data/ folder, according to data engineering convention
- Register the datasets with the Data Catalog in conf/base/catalog.yml, which is the registry of all data sources available for use by the project. This ensures that your code is reproducible when it references datasets in different locations and/or environments.
You can find further information about the Data Catalog in specific documentation covering advanced usage.
The spaceflights tutorial makes use of fictional datasets of companies shuttling customers to the Moon and back. You will use the data to train a model to predict the price of shuttle hire. However, before you get to train the model, you will need to prepare the data for model building by creating a master table.
The spaceflights tutorial has three files and uses two data formats: .csv and .xlsx. Download and save the files to the data/01_raw/ folder of your project directory:
- reviews.csv
- companies.csv
- shuttles.xlsx
Here are some examples of how you can download the files from GitHub to the data/01_raw directory inside your project:
Using cURL in a Unix terminal:
# reviews
curl -o data/01_raw/reviews.csv https://quantumblacklabs.github.io/kedro/reviews.csv
# companies
curl -o data/01_raw/companies.csv https://quantumblacklabs.github.io/kedro/companies.csv
# shuttles
curl -o data/01_raw/shuttles.xlsx https://quantumblacklabs.github.io/kedro/shuttles.xlsx
Using cURL for Windows:
curl -o data\01_raw\reviews.csv https://quantumblacklabs.github.io/kedro/reviews.csv
curl -o data\01_raw\companies.csv https://quantumblacklabs.github.io/kedro/companies.csv
curl -o data\01_raw\shuttles.xlsx https://quantumblacklabs.github.io/kedro/shuttles.xlsx
Using Wget in a Unix terminal:
# reviews
wget -O data/01_raw/reviews.csv https://quantumblacklabs.github.io/kedro/reviews.csv
# companies
wget -O data/01_raw/companies.csv https://quantumblacklabs.github.io/kedro/companies.csv
# shuttles
wget -O data/01_raw/shuttles.xlsx https://quantumblacklabs.github.io/kedro/shuttles.xlsx
Using Wget for Windows:
wget -O data\01_raw\reviews.csv https://quantumblacklabs.github.io/kedro/reviews.csv
wget -O data\01_raw\companies.csv https://quantumblacklabs.github.io/kedro/companies.csv
wget -O data\01_raw\shuttles.xlsx https://quantumblacklabs.github.io/kedro/shuttles.xlsx
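If neither cURL nor Wget is available, the same downloads can be sketched with the Python standard library. The URLs and file names come from the commands above; the helper names build_download_plan and download_all are hypothetical and exist only for this illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names are taken from the cURL/Wget commands above.
BASE_URL = "https://quantumblacklabs.github.io/kedro"
FILES = ["reviews.csv", "companies.csv", "shuttles.xlsx"]


def build_download_plan(dest_dir="data/01_raw"):
    """Return (url, destination path) pairs without touching the network."""
    dest = Path(dest_dir)
    return [(f"{BASE_URL}/{name}", dest / name) for name in FILES]


def download_all(dest_dir="data/01_raw"):
    """Download every file in the plan, creating the folder if needed."""
    for url, path in build_download_plan(dest_dir):
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(path))  # writes the response body to `path`


# Run from the project root to fetch the files:
# download_all()
```

Building the paths with pathlib means the same script should work with both Unix and Windows path separators.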
You now need to register the datasets so they can be loaded by Kedro. All Kedro projects have a conf/base/catalog.yml file, and you register each dataset by adding a named entry into the .yml file. The entry should include the following:
- File location (path)
- Parameters for the given dataset
- Type of data
- Versioning
Kedro supports a number of different data types, which are listed in the API documentation. Kedro uses fsspec to read data from a variety of data stores including local file systems, network file systems, cloud object stores and HDFS.
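To make the idea of multiple data stores concrete: a filepath can carry a protocol prefix such as s3:// or hdfs://, and a path with no prefix refers to the local file system. The following is a minimal sketch of that convention using only the standard library. It is not how fsspec or Kedro implement it, and the helper guess_protocol as well as the bucket and host names are hypothetical.

```python
from urllib.parse import urlsplit


def guess_protocol(filepath):
    """Return the protocol prefix of a filepath, defaulting to the local file system."""
    scheme = urlsplit(filepath).scheme
    # A relative path such as data/01_raw/companies.csv has no scheme.
    return scheme or "file"


# Hypothetical locations, for illustration only.
examples = [
    "data/01_raw/companies.csv",
    "s3://my-bucket/companies.csv",
    "hdfs://namenode/data/reviews.csv",
]
protocols = [guess_protocol(path) for path in examples]
```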
For the spaceflights data, first register the csv datasets by adding this snippet to the end of the conf/base/catalog.yml file:
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv
To check whether Kedro can load the data correctly, open a kedro ipython session and run:
companies = catalog.load("companies")
companies.head()
The command loads the dataset named companies (as per the top-level key in catalog.yml) from the underlying filepath data/01_raw/companies.csv into the variable companies, which is of type pandas.DataFrame. The head method from pandas then displays the first five rows of the DataFrame.
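The lookup that catalog.load performs can be pictured as a mapping from dataset name to file path, followed by a read of that file. The sketch below is a toy illustration of that idea using only the standard library. TinyCatalog and head are hypothetical names, the rows come back as plain dictionaries rather than a pandas.DataFrame, and none of this reflects Kedro's actual implementation.

```python
import csv
from pathlib import Path


class TinyCatalog:
    """Hypothetical stand-in for a data catalog: maps names to CSV file paths."""

    def __init__(self, entries):
        # `entries` mirrors the shape of catalog.yml: {name: {"filepath": ...}}
        self._entries = entries

    def load(self, name):
        """Look up the entry by name and read the file it points to."""
        filepath = Path(self._entries[name]["filepath"])
        with filepath.open(newline="") as handle:
            return list(csv.DictReader(handle))


def head(rows, n=5):
    """Return the first n rows, similar in spirit to DataFrame.head()."""
    return rows[:n]


# Usage, assuming data/01_raw/companies.csv exists:
# catalog = TinyCatalog({"companies": {"filepath": "data/01_raw/companies.csv"}})
# head(catalog.load("companies"))
```

Because the calling code refers only to the name companies, the file can move to a different location by editing the entry alone.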
When you have finished, close the ipython session as follows:
exit()
Now register the xlsx dataset by adding this snippet to the end of the conf/base/catalog.yml file:
shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
To test that everything works as expected, load the dataset within a new kedro ipython session and display its first five rows:
shuttles = catalog.load("shuttles")
shuttles.head()
When you have finished, close the ipython session as follows:
exit()
Kedro supports a number of datasets out of the box, but you can also add support for any proprietary data format or filesystem in your pipeline.
You can find further information about how to add support for custom datasets in specific documentation covering advanced usage.
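As a rough illustration of what a custom dataset involves, the sketch below defines load, save and describe behaviour for a JSON Lines file. In Kedro itself a custom dataset would typically subclass kedro.io.AbstractDataSet and implement _load, _save and _describe; the base class is omitted here so the example runs without Kedro installed, and the class name JSONLinesDataSet and the file format are assumptions chosen for this example.

```python
import json
from pathlib import Path


class JSONLinesDataSet:
    """Hypothetical dataset for a JSON Lines file (one JSON object per line)."""

    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def _load(self):
        # Parse each non-empty line as one record.
        with self._filepath.open() as handle:
            return [json.loads(line) for line in handle if line.strip()]

    def _save(self, data):
        # Write one JSON object per line.
        with self._filepath.open("w") as handle:
            for record in data:
                handle.write(json.dumps(record) + "\n")

    def _describe(self):
        # Summarise the configuration, for example for logging.
        return {"filepath": str(self._filepath)}
```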