Description
The DataLoader is nice, but if I understand correctly it requires the dataset to fit in memory. For large datasets that don't fit in memory, it would be nice to have an easy way to load & preprocess the data efficiently, similar to TensorFlow's tf.data API. Maybe something like this exists already?
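For what it's worth, the closest I can get today is stitching lazy iterators together by hand. A rough sketch (the names make_batches and preprocess_sample are just placeholders, and preprocess_sample stands in for whatever per-record work the model needs):

using CSV, Random

# Placeholder; real feature extraction / parsing would go here.
preprocess_sample(row) = row

# Stream records from many CSV files without loading them all into memory:
# visit files in random order, read rows lazily, preprocess each record,
# and group the results into batches.
function make_batches(csv_file_paths; batchsize = 32)
    files = shuffle(csv_file_paths)                           # random file order
    records = Iterators.flatten(CSV.Rows(f) for f in files)   # lazy row stream
    samples = Iterators.map(preprocess_sample, records)       # per-record preprocessing
    return Iterators.partition(samples, batchsize)            # lazy batches
end

This works, but it is verbose, reads one file at a time, and has no shuffle buffer or prefetching.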
If not, perhaps one option would be to provide custom transducers to make it possible to write things like:
data = csv_file_paths |>
    Shuffle(length(csv_file_paths)) |>
    Interleave(CSV.File; threads=4) |>
    Map(preprocess_sample) |>
    Shuffle(100_000) |>
    Batch(32) |>
    Prefetch(1)
This would load records from multiple files (in random file order), pick 4 files at a time and interleave their records, preprocess every record, shuffle the records using a 100,000-element buffer, batch them with batch size 32, and prefetch 1 batch (so the CPU can prepare the next batch while the GPU is working on the previous one). The resulting data could then be used for training.
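None of these stages exist as transducers yet, but most of them should need fairly little code. As a rough illustration of what the Prefetch stage could do under the hood (the prefetch name and depth keyword are made up), the sketch below wraps any iterable of batches in a bounded Channel fed by a background task, so the next batch is prepared while the current one is being consumed:

# Rough sketch of a Prefetch stage: a background task pushes up to `depth`
# ready batches into a bounded Channel while the consumer takes them out.
function prefetch(batches; depth = 1)
    return Channel{Any}(depth; spawn = true) do ch
        for batch in batches
            put!(ch, batch)   # blocks once `depth` batches are already waiting
        end
    end
end

# A Channel is iterable, so the training loop could simply be
#   for batch in prefetch(make_batches(csv_file_paths); depth = 1)
#       train_step!(model, batch)    # hypothetical training step
#   end

The buffered Shuffle, Batch, and Interleave stages could be built as similarly small iterator/transducer wrappers; the value of an official API would be composing them with a consistent, tf.data-like vocabulary.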