
Data loading & preprocessing pipeline feature #1282

Closed
@ageron

Description


The DataLoader is nice, but if I understand correctly it requires the dataset to fit in memory. For large datasets that don't fit in memory, it would be nice to have an easy way to load & preprocess the data efficiently, similar to TensorFlow's tf.data API. Maybe something like this exists already?
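For context, this is a minimal sketch of the in-memory pattern I mean (assuming the standard `Flux.DataLoader` API; the arrays here are placeholders and must already fit in RAM):

```julia
using Flux

X = rand(Float32, 28, 28, 1, 60_000)  # all features held in memory
Y = rand(0:9, 60_000)                 # all labels held in memory

# DataLoader shuffles and batches, but over data that is already loaded
loader = Flux.DataLoader((X, Y); batchsize=32, shuffle=true)
for (x, y) in loader
    # train on one batch
end
```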

If not, perhaps one option would be to provide custom transducers to make it possible to write things like:

```julia
data = (csv_file_paths |> Shuffle(length(csv_file_paths)) |> Interleave(CSV.File; threads=4)
        |> Map(preprocess_sample) |> Shuffle(100_000) |> Batch(32) |> Prefetch(1))
```

This would visit the CSV files in random order, read 4 of them at a time and interleave their records, preprocess every record, shuffle the records using a 100,000-element buffer, batch them with batch size 32, and prefetch 1 batch (so the CPU can prepare the next batch while the GPU is working on the current one). The resulting stream could then be used for training.
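To make the intent concrete, here is a rough hand-rolled sketch of such a streaming pipeline using plain `Channel`s (this is not a proposed API, just an illustration; `csv_file_paths` and `preprocess_sample` are the placeholders from the example above, and the buffer/flush logic is simplified):

```julia
using CSV, Random

function record_stream(csv_file_paths; buffer_size=100_000, batch_size=32)
    Channel(1) do out                       # channel of size 1 ≈ prefetch one batch
        buffer = Any[]
        for path in shuffle(csv_file_paths)         # visit files in random order
            for row in CSV.File(path)               # stream one file's records
                push!(buffer, preprocess_sample(row))
                if length(buffer) >= buffer_size    # shuffle buffer is full
                    shuffle!(buffer)
                    while length(buffer) >= batch_size
                        put!(out, [popfirst!(buffer) for _ in 1:batch_size])
                    end
                end
            end
        end
        shuffle!(buffer)                            # flush the remaining records
        while !isempty(buffer)
            put!(out, splice!(buffer, 1:min(batch_size, length(buffer))))
        end
    end
end

# for batch in record_stream(csv_file_paths)
#     train_step!(model, batch)   # hypothetical training step
# end
```

The proposed transducer-style API would essentially package these stages (shuffle, interleave, map, batch, prefetch) as composable, reusable building blocks instead of one ad-hoc loop.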
