
Data loading & preprocessing pipeline feature #1282

Closed
@ageron

Description


The DataLoader is nice, but if I understand correctly it requires the dataset to fit in memory. For large datasets that don't fit in memory, it would be nice to have an easy way to load & preprocess the data efficiently, similar to TensorFlow's tf.data API. Maybe something like this exists already?
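For context, this is a minimal sketch of the in-memory pattern I mean (assuming the standard `Flux.DataLoader` API; the arrays here are placeholders and must already fit in RAM):

```julia
using Flux

X = rand(Float32, 28, 28, 1, 60_000)  # all features held in memory
Y = rand(0:9, 60_000)                 # all labels held in memory

# DataLoader shuffles and batches, but over data that is already loaded
loader = Flux.DataLoader((X, Y); batchsize=32, shuffle=true)
for (x, y) in loader
    # train on one batch
end
```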

If not, perhaps one option would be to provide custom transducers to make it possible to write things like:

```julia
data = (csv_file_paths |> Shuffle(length(csv_file_paths)) |> Interleave(CSV.File; threads=4)
        |> Map(preprocess_sample) |> Shuffle(100_000) |> Batch(32) |> Prefetch(1))
```

This would visit the CSV files in random order, read 4 of them at a time and interleave their records, preprocess every record, shuffle the records using a 100,000-element buffer, batch them with batch size 32, and prefetch 1 batch (so the CPU can prepare the next batch while the GPU is working on the current one). The resulting stream could then be used for training.
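To make the intent concrete, here is a rough hand-rolled sketch of such a streaming pipeline using plain `Channel`s (this is not a proposed API, just an illustration; `csv_file_paths` and `preprocess_sample` are the placeholders from the example above, and the buffer/flush logic is simplified):

```julia
using CSV, Random

function record_stream(csv_file_paths; buffer_size=100_000, batch_size=32)
    Channel(1) do out                       # channel of size 1 ≈ prefetch one batch
        buffer = Any[]
        for path in shuffle(csv_file_paths)         # visit files in random order
            for row in CSV.File(path)               # stream one file's records
                push!(buffer, preprocess_sample(row))
                if length(buffer) >= buffer_size    # shuffle buffer is full
                    shuffle!(buffer)
                    while length(buffer) >= batch_size
                        put!(out, [popfirst!(buffer) for _ in 1:batch_size])
                    end
                end
            end
        end
        shuffle!(buffer)                            # flush the remaining records
        while !isempty(buffer)
            put!(out, splice!(buffer, 1:min(batch_size, length(buffer))))
        end
    end
end

# for batch in record_stream(csv_file_paths)
#     train_step!(model, batch)   # hypothetical training step
# end
```

The proposed transducer-style API would essentially package these stages (shuffle, interleave, map, batch, prefetch) as composable, reusable building blocks instead of one ad-hoc loop.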
