@ayushb03 ayushb03 commented Mar 6, 2025

Problem

Currently, Baler only supports loading data from NumPy arrays stored in .npz files. This limits the tool's flexibility, especially when working with datasets that require custom loading or preprocessing, or when trying to use standard PyTorch datasets from libraries like torchvision.

Solution

This PR adds support for loading and using external PyTorch torch.utils.data.Dataset objects directly with Baler. The implementation allows users to:

  • Use custom dataset implementations
  • Leverage pre-existing PyTorch datasets
  • Apply custom transformations during data loading
  • Maintain all of Baler's existing autoencoder functionality with these external datasets

Implementation Details

  1. Added a new load_external_dataset function in data_processing.py that takes a PyTorch Dataset and returns train and validation DataLoaders
  2. Modified perform_training in baler.py to detect and load external datasets when provided
  3. Updated the train function in training.py to accept DataLoaders directly as an alternative to NumPy arrays
  4. Added documentation and examples for using external datasets
  5. Created unit tests for the new functionality
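
A minimal sketch of what step 1 could look like, assuming a plain random split. The names `split_into_loaders` and `RandomRows`, and all parameter names, are hypothetical illustrations and are not taken from Baler's `load_external_dataset`:

```python
import torch
from torch.utils.data import DataLoader, Dataset, random_split


class RandomRows(Dataset):
    """Toy dataset (hypothetical): each item is one row of random features."""

    def __init__(self, n_rows: int, n_features: int):
        self.data = torch.randn(n_rows, n_features)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self.data[idx]


def split_into_loaders(dataset, test_size=0.2, batch_size=32, shuffle=True, seed=0):
    """Hypothetical helper: split a Dataset and wrap both parts in DataLoaders."""
    n_val = int(round(len(dataset) * test_size))
    n_train = len(dataset) - n_val
    # A seeded generator makes the train/validation split repeatable.
    generator = torch.Generator().manual_seed(seed)
    train_set, val_set = random_split(dataset, [n_train, n_val], generator=generator)
    # Note: shuffling here uses PyTorch's global RNG; only the split is seeded.
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=shuffle)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader


train_loader, val_loader = split_into_loaders(RandomRows(100, 4), test_size=0.2, batch_size=16)
```

The validation loader is left unshuffled in this sketch so that evaluation batches come out in a stable order; whether Baler makes the same choice is not stated in this PR.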

Documentation Added

  • Created a comprehensive guide in docs/guides/external_dataset.md
  • Added a working example in docs/guides/external_dataset_example.py
  • Updated CHANGELOG.md

Testing

Unit tests verify that:

  • External datasets can be correctly loaded and split into train/validation sets
  • DataLoaders created from external datasets have the correct batch size and structure
  • Reproducibility is maintained when deterministic mode is enabled
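
As a rough illustration of the reproducibility check (not the PR's actual test code), the sketch below assumes that determinism comes from passing a seeded `torch.Generator` to the `DataLoader`. The helper name `first_batch` is hypothetical:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def first_batch(seed: int) -> torch.Tensor:
    """Hypothetical helper: return the first shuffled batch for a given seed."""
    data = torch.arange(40, dtype=torch.float32).reshape(10, 4)
    dataset = TensorDataset(data)
    # The generator drives the shuffle order, so the same seed gives the same order.
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(dataset, batch_size=5, shuffle=True, generator=generator)
    # TensorDataset items are 1-tuples, so each batch is a 1-element list.
    (batch,) = next(iter(loader))
    return batch


same_seed_matches = torch.equal(first_batch(0), first_batch(0))
```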

Limitations

  • Currently, compression and decompression phases still require NumPy array inputs
  • Dataset items must be directly usable by the model without additional preprocessing beyond what the Dataset class provides
  • If a dataset returns tuples (e.g., data and labels), users will need to create a wrapper that only returns the data portion
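
The wrapper mentioned in the last limitation could look roughly like this. `DataOnly` is a hypothetical name, and the sketch assumes the data portion is the first element of each item:

```python
import torch
from torch.utils.data import Dataset, TensorDataset


class DataOnly(Dataset):
    """Hypothetical wrapper: expose only the first element of each tuple item."""

    def __init__(self, base: Dataset):
        self.base = base

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, idx: int):
        # Assumes items look like (data, label, ...); everything after data is dropped.
        return self.base[idx][0]


# A labelled dataset whose items are (features, label) tuples.
labelled = TensorDataset(torch.randn(8, 3), torch.zeros(8, dtype=torch.long))
unlabelled = DataOnly(labelled)
```

The wrapped dataset can then be passed wherever a data-only `Dataset` is expected.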

Closes

Closes #382

ayushb03 added 3 commits March 7, 2025 00:36
- Extend Baler to support using external PyTorch Dataset objects directly
- Update README with documentation about external dataset support
- Modify baler.py, data_processing.py, and training.py to handle external datasets
- Add flexible dataset loading with support for custom dataset initialization and configuration
- Implement deterministic and non-deterministic data loading for external datasets
@neogyk
Contributor

neogyk commented Mar 27, 2025

Hello, @ayushb03. Can we add additional test cases for this PR?

@ayushb03
Author

Hey @neogyk, I'll add them soon.

@neogyk
Contributor

neogyk commented Apr 28, 2025

@ayushb03, how is the progress on this issue?

- Refactor load_external_dataset to manage edge cases for test_size values (0.0 and 1.0).
- Improve documentation for function parameters and return types.
- Add new test cases for edge cases and dataset loading without shuffling.
- Rename SimpleDataset to SyntheticFeatureDataset for clarity.
@ayushb03
Author

ayushb03 commented May 5, 2025

Hey @neogyk! I’ve pushed the latest changes, including updated test cases and improved edge-case handling for load_external_dataset. They cover scenarios like extreme test_size values, reproducibility, ordering, and support for complex dataset structures. Sorry for the delay, I got a bit packed with work. I'm also adding a quick summary of the test cases.

Test Dataset Types

  • SyntheticFeatureDataset: Random tensors
  • SequentialFeatureDataset: Numeric sequences
  • FeatureLabelDictDataset: Feature-label pairs

Summary

Basic Loading

  • Splits into train/validation
  • Verifies shapes and batch sizes
  • Tests parameters (batch size, shuffle)

Reproducibility

  • Identical batches with same seed
  • Confirms deterministic behavior

Edge Cases

  • Handles test_size 0.0 and 1.0
  • Validates empty sets

Order Preservation

  • Maintains order with shuffle=False
  • Validates subset selection
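
To make the edge-case and ordering checks concrete, here is a sketch of an order-preserving split, assuming the validation set is taken from the tail of the dataset. `ordered_split` is a hypothetical helper and may differ from how `load_external_dataset` selects subsets:

```python
import torch
from torch.utils.data import Subset, TensorDataset


def ordered_split(dataset, test_size: float):
    """Hypothetical helper: split without shuffling, keeping the original order."""
    if not 0.0 <= test_size <= 1.0:
        raise ValueError("test_size must be between 0.0 and 1.0")
    n_val = int(round(len(dataset) * test_size))
    n_train = len(dataset) - n_val
    # Contiguous index ranges keep items in their original order.
    train = Subset(dataset, range(n_train))
    val = Subset(dataset, range(n_train, len(dataset)))
    return train, val


dataset = TensorDataset(torch.arange(10))
train, val = ordered_split(dataset, test_size=0.2)
train_values = [int(train[i][0]) for i in range(len(train))]

# Extreme values leave one of the two subsets empty.
_, empty_val = ordered_split(dataset, test_size=0.0)
empty_train, _ = ordered_split(dataset, test_size=1.0)
```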

Complex Structures

  • Loads dictionary datasets
  • Preserves structure and dimensions
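
For reference, PyTorch's default collation keeps dictionary structure when batching, which is what the dictionary test relies on. The sketch below uses a hypothetical `TinyDictDataset`, not the PR's `FeatureLabelDictDataset`:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TinyDictDataset(Dataset):
    """Toy dataset (hypothetical): each item is a dict with features and a label."""

    def __init__(self, n_items: int):
        self.features = torch.randn(n_items, 4)
        self.labels = torch.arange(n_items)

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, idx: int):
        return {"features": self.features[idx], "label": self.labels[idx]}


loader = DataLoader(TinyDictDataset(6), batch_size=3, shuffle=False)
# Each batch is a dict with the same keys, with values stacked along dim 0.
batch = next(iter(loader))
```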

Development

Successfully merging this pull request may close these issues.

Add the support of loading and operating external dataset torch.utils.data.Dataset