The Merlin dataloader lets you rapidly load tabular data for training deep learning models with TensorFlow, PyTorch, or JAX.

NVIDIA-Merlin/dataloader

Latest commit: 1441a12 · Oct 17, 2023

The merlin-dataloader lets you quickly train recommender models with TensorFlow, PyTorch and JAX. It removes data loading, typically the biggest bottleneck in training recommender models, by providing GPU-optimized dataloaders that read data directly into GPU memory and then hand it to TensorFlow and PyTorch as a zero-copy transfer using DLPack.
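The zero-copy idea can be illustrated without a GPU. The sketch below is not Merlin code and does not use DLPack. As a stand-in, it uses Python's built-in `memoryview` to show what sharing a buffer (rather than copying it) means: both objects refer to the same memory, so no data is moved. The names `source`, `shared` and `copied` are illustrative only.

```python
from array import array

# A buffer standing in for a column of data that is already loaded.
source = array("f", [1.0, 2.0, 3.0, 4.0])

# Zero-copy: the memoryview references the same memory as `source`.
shared = memoryview(source)

# For contrast, an explicit copy allocates new memory and duplicates the data.
copied = array("f", source)

# A write through the original is visible through the shared view,
# but not through the independent copy.
source[0] = 42.0

print(shared[0])  # value seen through the zero-copy view
print(copied[0])  # value seen through the copy
```

DLPack applies the same principle across frameworks and devices: the consumer receives a handle to the producer's existing tensor memory instead of a duplicate.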

The benefits of the Merlin Dataloader include:

  • More than 10x faster than native framework dataloaders
  • Handles larger-than-memory datasets
  • Per-epoch shuffling
  • Distributed training
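To make the last three items concrete, here is a minimal pure-Python sketch of how a loader could combine them. It is an illustration under stated assumptions, not Merlin's implementation: `iter_batches`, its parameters, and the idea of representing each partition as a zero-argument callable are all hypothetical. Only one partition is held in memory at a time, the order is reshuffled from `(seed, epoch)` on every epoch, and each worker (`rank` out of `world_size`) reads a disjoint subset of partitions.

```python
import random
from typing import Callable, Iterator, List, Sequence


def iter_batches(
    partitions: Sequence[Callable[[], List[int]]],
    batch_size: int,
    epoch: int,
    rank: int = 0,
    world_size: int = 1,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield batches for one epoch, loading one partition at a time."""
    # Per-epoch shuffling: the random stream depends on (seed, epoch), so
    # every epoch gets a different but reproducible order.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(len(partitions)))
    rng.shuffle(order)

    # Distributed training: all workers compute the same `order`, then each
    # takes every `world_size`-th partition, so no partition is read twice.
    my_partitions = order[rank::world_size]

    for idx in my_partitions:
        # Larger-than-memory data: only this partition is materialized.
        rows = list(partitions[idx]())
        rng.shuffle(rows)
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]


if __name__ == "__main__":
    # Four hypothetical partitions of ten row ids each, loaded lazily.
    demo = [lambda lo=lo: list(range(lo, lo + 10)) for lo in range(0, 40, 10)]
    for epoch in range(2):
        first = next(iter_batches(demo, batch_size=4, epoch=epoch))
        print(f"epoch {epoch}: first batch {first}")
```

Shuffling partition order and then rows within a partition is a common compromise: it avoids loading the whole dataset just to shuffle it, at the cost of a less uniform shuffle than a full global permutation.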

Installation

Merlin-dataloader requires Python version 3.7+. Additionally, GPU support requires CUDA 11.0+.

To install using Conda:

conda install -c nvidia -c rapidsai -c numba -c conda-forge merlin-dataloader python=3.7 cudatoolkit=11.2

To install from PyPI:

pip install merlin-dataloader
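After installing, a quick sanity check can confirm that the interpreter meets the stated Python requirement and that the package is importable. This is a minimal sketch using only the standard library; the helper names `python_is_supported` and `module_available` are hypothetical and not part of merlin-dataloader.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7)  # minimum Python version stated in this README


def python_is_supported(version=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def module_available(name: str) -> bool:
    """Return True if `name` can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when a parent package such as `merlin`
        # is not installed at all.
        return False


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("merlin.dataloader importable:", module_available("merlin.dataloader"))
```

The check does not verify CUDA or GPU availability; it only covers the Python side of the installation.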

Docker containers with merlin-dataloader and its dependencies preinstalled are also available on NGC.

Basic Usage

# Get a Merlin dataset from a set of Parquet files.
# PARQUET_FILE_PATHS is a placeholder for your own Parquet file path(s).
import merlin.io
import tensorflow as tf
dataset = merlin.io.Dataset(PARQUET_FILE_PATHS, engine="parquet")

# Create a TensorFlow dataloader from the dataset, loading 65,536 rows
# per batch
from merlin.dataloader.tensorflow import Loader
loader = Loader(dataset, batch_size=65536)

# Get a single batch of data. Inputs will be a dictionary mapping column
# names to TensorFlow tensors
inputs, target = next(loader)

# Train a Keras model with the dataloader
# (define and compile your own model in place of the ellipsis)
model = tf.keras.Model( ... )
model.fit(loader, epochs=5)