This section shows how to add datasets to the project's data folder. It also reviews how those datasets are registered in Kedro's Data Catalog, which is the registry of all data sources available for use by the project.
The spaceflights tutorial makes use of three fictional datasets of companies shuttling customers to the Moon and back. The data comes in two different formats: .csv and .xlsx:
- companies.csv contains data about space travel companies, such as their location, fleet count and rating
- reviews.csv is a set of reviews from customers for categories, such as comfort and price
- shuttles.xlsx is a set of attributes for spacecraft across the fleet, such as their engine type and passenger capacity
The spaceflights starter has already added the datasets to the data/01_raw folder of your project.
The following information about a dataset must be registered before Kedro can load it:
- File location (path)
- Parameters for the given dataset
- Type of data
- Versioning
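As a rough illustration of how the four items above map onto a catalog entry, the sketch below writes one entry as the Python dict that the YAML would parse to. The key names (type, filepath, load_args, save_args, versioned) are the ones used in Kedro catalog files. The argument values and the summarise_entry helper are hypothetical and exist only for this example.

```python
# A catalog entry expressed as a plain dict (illustrative values only).
entry = {
    "type": "pandas.CSVDataset",              # type of data
    "filepath": "data/01_raw/companies.csv",  # file location (path)
    "load_args": {"sep": ","},                # parameters used when loading
    "save_args": {"index": False},            # parameters used when saving
    "versioned": False,                       # versioning switch
}


def summarise_entry(entry):
    """Group an entry's keys under the four headings listed above.

    Hypothetical helper for illustration; it is not part of Kedro.
    """
    return {
        "location": entry["filepath"],
        "parameters": {
            "load_args": entry.get("load_args", {}),
            "save_args": entry.get("save_args", {}),
        },
        "type": entry["type"],
        "versioning": entry.get("versioned", False),
    }


summary = summarise_entry(entry)
```

Only type and filepath appear in the spaceflights entries for the csv files. The other keys are optional and are omitted when the defaults are suitable.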
Open conf/base/catalog.yml for the spaceflights project to inspect the contents. The two csv datasets are registered as follows:
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataset
  filepath: data/01_raw/reviews.csv
Likewise for the xlsx dataset:
shuttles:
  type: pandas.ExcelDataset
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
The additional key, load_args, holds options that are passed to the Excel file read method (pd.read_excel) as keyword arguments. Although not specified here, the equivalent key for output is save_args, and its values would be passed to the pd.DataFrame.to_excel method.
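The forwarding of load_args and save_args can be pictured with a small stand-in class. This is a simplified sketch and not Kedro's implementation: ToyCSVDataset is a hypothetical name, and it uses the standard library csv module in place of pandas so that it runs without extra dependencies. The point it illustrates is that each dictionary is unpacked into keyword arguments of the underlying read or write call.

```python
import csv
import tempfile
from pathlib import Path


class ToyCSVDataset:
    """Hypothetical stand-in for a dataset class; not Kedro's implementation."""

    def __init__(self, filepath, load_args=None, save_args=None):
        self._filepath = Path(filepath)
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def load(self):
        with self._filepath.open(newline="") as f:
            # load_args are unpacked as keyword arguments to the reader.
            return list(csv.reader(f, **self._load_args))

    def save(self, rows):
        with self._filepath.open("w", newline="") as f:
            # save_args are unpacked as keyword arguments to the writer.
            csv.writer(f, **self._save_args).writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    dataset = ToyCSVDataset(
        Path(tmp) / "shuttles.csv",
        load_args={"delimiter": ";"},
        save_args={"delimiter": ";"},
    )
    dataset.save([["id", "engine_type"], ["63561", "Quantum"]])
    rows = dataset.load()
```

In the shuttles entry, the same pattern means that engine: openpyxl ends up as the engine keyword argument of pd.read_excel.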
Open a kedro ipython session in your terminal from the project root directory:
kedro ipython
Then type the following into the IPython prompt to test loading some csv data:
companies = catalog.load("companies")
companies.head()
- The first command creates a variable (companies), which is of type pandas.DataFrame, and loads the dataset (also named companies, as per the top-level key in catalog.yml) from the underlying filepath data/01_raw/companies.csv.
- The head method from pandas displays the first five rows of the DataFrame.
INFO Loading data from 'companies' (CSVDataset)
Out[1]:
id company_rating company_location total_fleet_count iata_approved
0 35029 100% Niue 4.0 f
1 30292 67% Anguilla 6.0 f
2 19032 67% Russian Federation 4.0 f
3 8238 91% Barbados 15.0 t
4 30342 NaN Sao Tome and Principe 2.0 t
Similarly, to test that the xlsx data is loaded as expected:
shuttles = catalog.load("shuttles")
shuttles.head()
You should see output such as the following:
INFO Loading data from 'shuttles' (ExcelDataset)
Out[1]:
id shuttle_location shuttle_type engine_type ... d_check_complete moon_clearance_complete price company_id
0 63561 Niue Type V5 Quantum ... f f $1,325.0 35029
1 36260 Anguilla Type V5 Quantum ... t f $1,780.0 30292
2 57015 Russian Federation Type V5 Quantum ... f f $1,715.0 19032
3 14035 Barbados Type V5 Plasma ... f f $4,770.0 8238
4 10036 Sao Tome and Principe Type V2 Plasma ... f f $2,820.0 30342
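Both catalog.load calls above resolve a dataset name to its catalog entry and then read from the registered filepath. The toy sketch below mimics that lookup with standard library tools only. ToyCatalog, RAW_FILES and CATALOG_CONFIG are hypothetical names, the file content is held in memory rather than on disk, and the real DataCatalog does considerably more (dataset types, versioning, credentials).

```python
import csv
import io

# In-memory stand-in for data/01_raw/companies.csv (two rows, two columns).
RAW_FILES = {
    "data/01_raw/companies.csv": "id,company_location\n35029,Niue\n30292,Anguilla\n",
}

# Mirrors the shape of a catalog file: top-level key -> entry with a filepath.
CATALOG_CONFIG = {
    "companies": {"filepath": "data/01_raw/companies.csv"},
}


class ToyCatalog:
    """Simplified illustration of name-based loading; not Kedro's DataCatalog."""

    def __init__(self, config):
        self._config = config

    def load(self, name):
        if name not in self._config:
            raise KeyError(f"Dataset '{name}' is not registered")
        filepath = self._config[name]["filepath"]
        text = RAW_FILES[filepath]
        # Each row becomes a dict keyed by the header line.
        return list(csv.DictReader(io.StringIO(text)))


catalog = ToyCatalog(CATALOG_CONFIG)
companies = catalog.load("companies")
first_rows = companies[:5]  # loosely analogous to DataFrame.head()
```

Requesting a name that has no entry raises an error in this sketch, which is why a dataset must be registered in catalog.yml before it can be loaded by name.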
When you have finished, close the ipython session with exit().
.. youtube:: rl2cncGxyts
   :width: 100%
Kedro supports numerous datasets out of the box (see the kedro_datasets package), but you can also add support for any proprietary data format or filesystem.
You can find further information about how to add support for custom datasets in specific documentation covering advanced usage.
Kedro uses fsspec to read data from a variety of data stores including local file systems, network file systems, HDFS, and all of the widely-used cloud object stores.
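With fsspec, the data store is typically selected by the protocol prefix of the filepath, for example s3:// or hdfs://, while a path with no prefix is treated as local. The sketch below illustrates that idea with the standard library only. The split_protocol helper, the bucket name and the host name are hypothetical, and this is not how Kedro or fsspec parse paths internally (for instance, it does not handle Windows drive letters).

```python
from urllib.parse import urlsplit


def split_protocol(filepath):
    """Return (protocol, path), defaulting to 'file' when no prefix is given.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(filepath)
    if not parts.scheme:
        return "file", filepath
    # For store URLs, the bucket or host is part of the path that follows.
    return parts.scheme, parts.netloc + parts.path


local = split_protocol("data/01_raw/companies.csv")
remote = split_protocol("s3://my-bucket/data/01_raw/companies.csv")
hdfs = split_protocol("hdfs://namenode/data/01_raw/companies.csv")
```

In a catalog entry, this means that changing only the filepath prefix is usually enough to point a dataset at a different store, provided the matching fsspec backend is installed.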