Dataset tools #2100
- Introduced dataset tools for LeRobotDataset, including functions for deleting episodes, splitting datasets, adding/removing features, and merging datasets.
- Added an example script demonstrating the usage of these utilities.
- Implemented comprehensive tests for all new functionalities to ensure reliability and correctness.
- copy unchanged video and parquet files to avoid recreating the entire dataset
- remove hardcoded split names
add lerobot-edit-dataset shortcut
Pull Request Overview
This PR introduces dataset editing tools for LeRobotDataset, enabling users to modify datasets through operations like deleting episodes, splitting datasets, adding/removing features, and merging datasets. The implementation includes comprehensive functionality with CLI support.
- Comprehensive dataset tools including delete episodes, split, merge, and feature manipulation
- Command-line interface script with configurable operations and examples
- Complete test coverage for all new functionalities
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description | 
|---|---|
| tests/datasets/test_dataset_tools.py | Comprehensive test suite covering all dataset tool operations | 
| src/lerobot/scripts/lerobot_edit_dataset.py | CLI script for dataset editing operations with detailed usage examples | 
| src/lerobot/datasets/dataset_tools.py | Core implementation of dataset manipulation functions | 
| pyproject.toml | Added CLI shortcut for the dataset editing script | 
| examples/dataset/use_dataset_tools.py | Example script demonstrating usage of dataset tools | 
Comments suppressed due to low confidence (1)
src/lerobot/datasets/dataset_tools.py:1
- These imports should be moved to the top of the file rather than being placed inside the function to follow Python import conventions.
#!/usr/bin/env python
Co-authored-by: Copilot <[email protected]> Signed-off-by: Michel Aractingi <[email protected]>
| Hey Michel, this looks great and will be very useful! I like the public API and I plan to update LeRobot Data Studio to use this API, which will also bring datasets v3 support to that app and make it easier to maintain across dataset version upgrades.
I'm running into an error when running the delete command. I think this might be a dataset conversion problem where not all of the files are getting copied locally. I encountered a pyav codec error when converting this dataset from v2.1 to v3, so maybe that is related. I opened a PR for that here: #2115
╰➤ python -m lerobot.scripts.lerobot_edit_dataset --repo_id jackvial/screwdriver_panel_center_080225_16_e5 --operation.type delete_episodes --operation.episode_indices "[0, 2]"
Generating train split: 244 examples [00:00, 100069.44 examples/s]
Generating train split: 244 examples [00:00, 113084.00 examples/s]
Generating train split: 244 examples [00:00, 113284.28 examples/s]
Generating train split: 244 examples [00:00, 119907.46 examples/s]
Generating train split: 244 examples [00:00, 122067.05 examples/s]
INFO 2025-10-04 15:11:11 _dataset.py:155 Deleting episodes [0, 2] from jackvial/screwdriver_panel_center_080225_16_e5
INFO 2025-10-04 15:11:11 set_tools.py:98 Deleting 2 episodes from dataset
INFO 2025-10-04 15:11:11 et_tools.py:540 Processing videos for observation.images.screwdriver
Processing observation.images.screwdriver video files:   0%|                    | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):                                                                    
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/jack/code/lerobot/src/lerobot/scripts/lerobot_edit_dataset.py", line 277, in <module>
    main()
  File "/home/jack/code/lerobot/src/lerobot/scripts/lerobot_edit_dataset.py", line 273, in main
    edit_dataset()
  File "/home/jack/code/lerobot/src/lerobot/configs/parser.py", line 225, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/scripts/lerobot_edit_dataset.py", line 260, in edit_dataset
    handle_delete_episodes(cfg)
  File "/home/jack/code/lerobot/src/lerobot/scripts/lerobot_edit_dataset.py", line 156, in handle_delete_episodes
    new_dataset = delete_episodes(
                  ^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/dataset_tools.py", line 121, in delete_episodes
    video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/dataset_tools.py", line 611, in _copy_and_reindex_videos
    frames = decode_video_frames(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/video_utils.py", line 69, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/video_utils.py", line 248, in decode_video_frames_torchcodec
    decoder = decoder_cache.get_decoder(str(video_path))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/video_utils.py", line 192, in get_decoder
    file_handle = fsspec.open(video_path).__enter__()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/core.py", line 105, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/spec.py", line 1310, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/implementations/local.py", line 201, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/implementations/local.py", line 365, in __init__
    self._open()
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/implementations/local.py", line 370, in _open
    self.f = open(self.path, mode=self.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/jack/.cache/huggingface/lerobot/jackvial/screwdriver_panel_center_080225_16_e5/videos/observation.images.screwdriver/chunk-000/file-000.mp4'
I tried with a fresh dataset after fixing the dataset conversion bug and I'm getting the same error. The dataset looks good on the hub, but only the metadata is present on the local disk:
╰➤ ls -lah /home/jack/.cache/huggingface/lerobot/jackvial/screwdriver_attach_panel_ls_080125_9_e8/
total 32K
drwxrwxr-x   3 jack jack 4.0K Oct  4 15:19 .
drwxrwxr-x 251 jack jack  20K Oct  4 15:19 ..
drwxrwxr-x   2 jack jack 4.0K Oct  4 15:19 meta
╰➤ python -m lerobot.scripts.lerobot_edit_dataset --repo_id jackvial/screwdriver_attach_panel_ls_080125_9_e8 --operation.type delete_episodes --operation.episode_indices "[0, 2]"
stats.json: 7.23kB [00:00, 11.0MB/s]                                            | 0/4 [00:00<?, ?it/s]
info.json: 4.04kB [00:00, 11.1MB/s]rquet:   0%|                           | 0.00/57.7k [00:00<?, ?B/s]
meta/tasks.parquet: 100%|████████████████████████████████████████| 2.92k/2.92k [00:00<00:00, 8.40kB/s]
meta/episodes/chunk-000/file-000.parquet: 100%|███████████████████| 57.7k/57.7k [00:00<00:00, 119kB/s]
Fetching 4 files: 100%|█████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.83it/s]
Generating train split: 8 examples [00:00, 365.09 examples/s]
README.md: 4.53kB [00:00, 23.5MB/s]                                            | 0/10 [00:00<?, ?it/s]
.gitattributes: 2.46kB [00:00, 18.9MB/s]                                  | 0.00/60.3k [00:00<?, ?B/s]
data/chunk-000/file-000.parquet: 100%|████████████████████████████| 60.3k/60.3k [00:00<00:00, 424kB/s]
videos/observation.images.side/chunk-000(…): 100%|███████████████| 15.5M/15.5M [00:00<00:00, 35.7MB/s]
videos/observation.images.top/chunk-000/(…): 100%|███████████████| 22.6M/22.6M [00:00<00:00, 33.5MB/s]
videos/observation.images.screwdriver/ch(…): 100%|███████████████| 31.4M/31.4M [00:00<00:00, 44.6MB/s]
Fetching 10 files: 100%|██████████████████████████████████████████████| 10/10 [00:00<00:00, 11.12it/s]
Generating train split: 1822 examples [00:00, 524503.90 examples/s]31.4M/31.4M [00:00<00:00, 47.0MB/s]
INFO 2025-10-04 15:19:08 _dataset.py:155 Deleting episodes [0, 2] from jackvial/screwdriver_attach_panel_ls_080125_9_e8
INFO 2025-10-04 15:19:08 set_tools.py:98 Deleting 2 episodes from dataset
INFO 2025-10-04 15:19:08 et_tools.py:540 Processing videos for observation.images.screwdriver
Processing observation.images.screwdriver video files:   0%|                    | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):                                                                    
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/jack/code/lerobot/src/lerobot/scripts/lerobot_edit_dataset.py", line 277, in <module>
    main()
  File "/home/jack/code/lerobot/src/lerobot/scripts/lerobot_edit_dataset.py", line 273, in main
    edit_dataset()
  File "/home/jack/code/lerobot/src/lerobot/configs/parser.py", line 225, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/scripts/lerobot_edit_dataset.py", line 260, in edit_dataset
    handle_delete_episodes(cfg)
  File "/home/jack/code/lerobot/src/lerobot/scripts/lerobot_edit_dataset.py", line 156, in handle_delete_episodes
    new_dataset = delete_episodes(
                  ^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/dataset_tools.py", line 121, in delete_episodes
    video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/dataset_tools.py", line 611, in _copy_and_reindex_videos
    frames = decode_video_frames(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/video_utils.py", line 69, in decode_video_frames
    return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/video_utils.py", line 248, in decode_video_frames_torchcodec
    decoder = decoder_cache.get_decoder(str(video_path))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/src/lerobot/datasets/video_utils.py", line 192, in get_decoder
    file_handle = fsspec.open(video_path).__enter__()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/core.py", line 105, in __enter__
    f = self.fs.open(self.path, mode=mode)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/spec.py", line 1310, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/implementations/local.py", line 201, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/implementations/local.py", line 365, in __init__
    self._open()
  File "/home/jack/code/lerobot/venv/lib/python3.12/site-packages/fsspec/implementations/local.py", line 370, in _open
    self.f = open(self.path, mode=self.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/jack/.cache/huggingface/lerobot/jackvial/screwdriver_attach_panel_ls_080125_9_e8/videos/observation.images.screwdriver/chunk-000/file-000.mp4' | 
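Both tracebacks fail on a video file that the edit operation expects on disk but that was never downloaded. A pre-flight check along the following lines could surface the problem before any processing starts. This is only a sketch: `missing_video_files` is a hypothetical helper, not part of the PR, and the `videos/<key>/chunk-XXX/file-XXX.mp4` layout is assumed from the path in the error message.

```python
import tempfile
from pathlib import Path


def missing_video_files(root, video_keys, chunk_index=0, file_index=0):
    """Return the expected video paths under `root` that are not on disk.

    Hypothetical helper; the directory layout is an assumption taken from
    the path shown in the FileNotFoundError above.
    """
    missing = []
    for key in video_keys:
        path = (
            Path(root) / "videos" / key
            / f"chunk-{chunk_index:03d}" / f"file-{file_index:03d}.mp4"
        )
        if not path.is_file():
            missing.append(path)
    return missing


# Demo on a throwaway directory that mimics a partially downloaded copy.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "meta").mkdir()
    present = root / "videos" / "observation.images.top" / "chunk-000"
    present.mkdir(parents=True)
    (present / "file-000.mp4").touch()

    keys = ["observation.images.top", "observation.images.screwdriver"]
    missing = missing_video_files(root, keys)
    # The camera key is the third path component from the end.
    missing_keys = [p.parts[-3] for p in missing]
    print(missing_keys)
```

Running such a check against the local cache directory would make it clear whether the dataset needs to be downloaded again before editing.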
…etting new_repo_id
| lerobot-train --policy.path=lerobot/smolvla_base --dataset.repo_id=Keith-Luo/pick_wine_bottle_and_pour --batch_size=64 --steps=20000 --output_dir=outputs/train/pick_wine_bottle_and_pour_smovla --job_name=pick_wine_bottle_and_pour_smolvla --policy.device=cuda --wandb.enable=true --policy.repo_id=Keith-Luo/pick_wine_bottle_and_pour_smolvla
Are there still some bugs in the merging script? The deleting script is okay for training. | 
| Hey @Keith-Luo, I just tried to train with a merged dataset and couldn't reproduce your error. Are you sure this is not a bug in your dataset? | 
| 
 Maybe I can upload my dataset, and can you please try again? Hold on a second | 
| 
 Hey @michel-aractingi, yes I can test again this evening | 
| @michel-aractingi delete episodes and push to hub are looking good now. Here's what I tested:
python -m lerobot.scripts.lerobot_edit_dataset --repo_id jackvial/screwdriver_attach_panel_rs_080125_20_e5 --new_repo_id jackvial/screwdriver_attach_panel_rs_080125_20_e5_edited_4 --operation.type delete_episodes --operation.episode_indices "[0, 2]" --push_to_hub=true
Successfully created new dataset jackvial/screwdriver_attach_panel_rs_080125_20_e5_edited_4
python -m lerobot.scripts.lerobot_edit_dataset --repo_id jackvial/screwdriver_attach_panel_rs_080125_20_e5 --new_repo_id jackvial/screwdriver_attach_panel_rs_080125_20_e5_edited_5 --operation.type delete_episodes --operation.episode_indices "[0, 2, 4]" --push_to_hub=true
Successfully created new dataset jackvial/screwdriver_attach_panel_rs_080125_20_e5_edited_5 | 
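The traceback earlier in this thread shows `delete_episodes` passing an `episode_mapping` to `_copy_and_reindex_videos`. Assuming that mapping goes from each kept old episode index to its new contiguous index, deleting episodes amounts to something like the sketch below. `build_episode_mapping` is a hypothetical name used for illustration, not the PR's implementation.

```python
def build_episode_mapping(total_episodes, deleted):
    """Map each kept old episode index to its new contiguous index.

    Hypothetical illustration of the reindexing step; the real
    implementation in dataset_tools.py may differ.
    """
    to_delete = set(deleted)
    invalid = sorted(i for i in to_delete if not 0 <= i < total_episodes)
    if invalid:
        raise ValueError(f"Episode indices out of range: {invalid}")
    kept = [i for i in range(total_episodes) if i not in to_delete]
    if not kept:
        raise ValueError("Cannot delete every episode in the dataset")
    # Kept episodes are renumbered 0..len(kept)-1 in their original order.
    return {old: new for new, old in enumerate(kept)}


mapping = build_episode_mapping(5, [0, 2])
print(mapping)  # {1: 0, 3: 1, 4: 2}
```

With five episodes and `[0, 2]` deleted, the remaining episodes 1, 3, and 4 become episodes 0, 1, and 2 of the new dataset.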
| @michel-aractingi split looks good. It's worth emphasizing in the docs/comments that the split names can be anything you like; the examples of train, val, and test might make users think those are special names that must be used.
Split By Percentage
python -m lerobot.scripts.lerobot_edit_dataset \
        --repo_id jackvial/screwdriver_attach_panel_rs_080125_20_e5 \
        --operation.type split \
        --operation.splits '{"train": 0.8, "val": 0.2}' --push_to_hub=true
Split By Episode Selection
python -m lerobot.scripts.lerobot_edit_dataset --repo_id jackvial/screwdriver_attach_panel_rs_080125_20_e5 --operation.type split --operation.splits '{"some_split": [0, 1, 2, 3], "some_other_split": [4]}' --push_to_hub=true | 
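The two split forms above can be read as one rule: each split name maps either to a fraction of the episodes or to an explicit list of episode indices, and the names themselves carry no special meaning. A minimal sketch of that resolution follows. `resolve_splits` is a hypothetical function, and both the sequential assignment of fraction-based splits and the error on an empty split are assumptions of this sketch rather than confirmed behaviour of the tool.

```python
import math


def resolve_splits(total_episodes, splits):
    """Turn {"name": fraction} or {"name": [indices]} into {"name": [indices]}.

    Hypothetical sketch; split names are arbitrary labels.
    """
    values = list(splits.values())
    if all(isinstance(v, float) for v in values):
        if sum(values) > 1.0 + 1e-9:
            raise ValueError("Split fractions must sum to at most 1.0")
        resolved, start = {}, 0
        for name, fraction in splits.items():
            # Small epsilon guards against float error such as 0.29 * 100.
            count = math.floor(total_episodes * fraction + 1e-9)
            if count == 0:
                raise ValueError(f"Split '{name}' would contain zero episodes")
            resolved[name] = list(range(start, start + count))
            start += count
        return resolved
    if all(isinstance(v, list) for v in values):
        return {name: sorted(indices) for name, indices in splits.items()}
    raise TypeError("Use either fractions or index lists, not a mix of both")


by_fraction = resolve_splits(5, {"train": 0.8, "val": 0.2})
by_indices = resolve_splits(5, {"some_split": [0, 1, 2, 3], "some_other_split": [4]})
print(by_fraction)
print(by_indices)
```

For a five-episode dataset both calls select the same episodes, which shows that `train`/`val` and `some_split`/`some_other_split` are interchangeable labels.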
| @michel-aractingi merge looks good but maybe change the name of
python -m lerobot.scripts.lerobot_edit_dataset \
        --repo_id jackvial/screwdriver_attach_panel_rs_080125_20_e5_merged_2 \
        --operation.type merge \
        --operation.repo_ids "['jackvial/screwdriver_attach_panel_rs_080125_20_e5_some_split', 'jackvial/screwdriver_attach_panel_rs_080125_20_e5_some_other_split']" --push_to_hub=true | 
| @michel-aractingi Remove feature looks good
python -m lerobot.scripts.lerobot_edit_dataset \
        --repo_id jackvial/screwdriver_attach_panel_rs_080125_20_e5_split_to_remove_feature_from_0 \
        --operation.type remove_feature \
        --operation.feature_names "['observation.images.top']" --push_to_hub=true
Some considerations for a future version of these dataset tools: 
 | 
1. **Delete Episodes** - Remove specific episodes from a dataset
2. **Split Dataset** - Divide a dataset into multiple smaller datasets
3. **Merge Datasets** - Combine multiple datasets into one
4. **Add Features** - Add new features to a dataset
Would be good to add an example for add feature
Agreed. I think this and the renaming of features would make nice small PRs for the community to take on, to get them engaged in the dataset.
…ata info when splitting and aggregating datasets
… merging
There were three critical bugs in aggregate.py that prevented correct dataset merging:
1. Video file indices: Changed from += to = assignment to correctly reference merged video files
2. Video timestamps: Implemented per-source-file offset tracking to maintain continuous timestamps when merging split datasets (was causing non-monotonic timestamp warnings)
3. File rotation offsets: Store timestamp offsets after rotation decision to prevent out-of-bounds frame access (was causing "Invalid frame index" errors with small file size limits)
Changes:
- Updated update_meta_data() to apply per-source-file timestamp offsets
- Updated aggregate_videos() to track offsets correctly during file rotation
- Added get_video_duration_in_s import for duration calculation
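The second and third fixes in this commit message concern ordering: when source videos are concatenated into destination files, each source needs a timestamp offset equal to the duration already written to its destination, and that offset must be recorded only after deciding whether to rotate to a new destination file. The sketch below illustrates the idea. `SourceVideo` and `plan_concatenation` are hypothetical names and the size-based rotation rule is an assumption; this is not the code in aggregate.py.

```python
from dataclasses import dataclass


@dataclass
class SourceVideo:
    name: str
    duration_s: float
    size_mb: float


def plan_concatenation(sources, max_file_size_mb):
    """Return {source name: (destination file index, timestamp offset in s)}.

    Hypothetical sketch of per-source-file offset tracking.
    """
    plan = {}
    dest_index, dest_duration, dest_size = 0, 0.0, 0.0
    for src in sources:
        # 1) Decide on rotation first: start a fresh destination file if
        #    appending this source would exceed the size limit.
        if dest_size > 0 and dest_size + src.size_mb > max_file_size_mb:
            dest_index += 1
            dest_duration, dest_size = 0.0, 0.0
        # 2) Only then record the offset, so it always refers to the file
        #    the source is actually appended to.
        plan[src.name] = (dest_index, dest_duration)
        dest_duration += src.duration_s
        dest_size += src.size_mb
    return plan


sources = [
    SourceVideo("a", duration_s=10.0, size_mb=40.0),
    SourceVideo("b", duration_s=5.0, size_mb=40.0),
    SourceVideo("c", duration_s=8.0, size_mb=40.0),
]
plan = plan_concatenation(sources, max_file_size_mb=100.0)
print(plan)  # {'a': (0, 0.0), 'b': (0, 10.0), 'c': (1, 0.0)}
```

If the offset for `c` were stored before the rotation check, it would be 15.0 seconds into a file that contains only 8 seconds of video, which is the kind of out-of-bounds access the commit message describes.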
…se that the split size results in zero episodes
Signed-off-by: Steven Palma <[email protected]>
| Hi @michel-aractingi, does it support merging two copies of the same dataset into one dataset? I ran into some problems with this use case. | 
Dataset Editing Tools
- examples/dataset/use_dataset_tools.py: example script demonstrating usage of the dataset tools.
- src/lerobot/scripts/lerobot_edit_dataset.py: run a configurable script to edit your dataset with a simple CLI.
Usage examples
Delete episodes 0, 2, and 5 from a dataset:
Delete episodes and save to a new dataset:
Split dataset by fractions:
Split dataset by episode indices:
Split into more than two splits:
Merge multiple datasets:
Remove camera feature:
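The flags used in the review comments above (`--repo_id`, `--new_repo_id`, `--operation.type`, `--operation.episode_indices`, `--operation.splits`, `--operation.repo_ids`, `--operation.feature_names`, `--push_to_hub`) suggest what invocations for these seven cases might look like. The sketch below assembles them as argument lists. `edit_dataset_argv` is a hypothetical helper, `your_user/your_dataset` and the derived repository names are placeholders, and the exact option values are assumptions modelled on the commands in this thread.

```python
def edit_dataset_argv(repo_id, operation_type, new_repo_id=None,
                      push_to_hub=False, **operation_args):
    """Build an argument list for the dataset editing script.

    Hypothetical helper; flag names are copied from the commands quoted in
    the review comments, values are passed through as literal strings.
    """
    argv = ["python", "-m", "lerobot.scripts.lerobot_edit_dataset",
            "--repo_id", repo_id]
    if new_repo_id is not None:
        argv += ["--new_repo_id", new_repo_id]
    argv += ["--operation.type", operation_type]
    for key, value in operation_args.items():
        argv += [f"--operation.{key}", value]
    if push_to_hub:
        argv.append("--push_to_hub=true")
    return argv


REPO = "your_user/your_dataset"  # placeholder repository id

examples = {
    "delete": edit_dataset_argv(REPO, "delete_episodes",
                                episode_indices="[0, 2, 5]"),
    "delete_to_new": edit_dataset_argv(REPO, "delete_episodes",
                                       new_repo_id=REPO + "_edited",
                                       episode_indices="[0, 2, 5]"),
    "split_fractions": edit_dataset_argv(
        REPO, "split", splits='{"train": 0.8, "val": 0.2}'),
    "split_indices": edit_dataset_argv(
        REPO, "split",
        splits='{"some_split": [0, 1, 2, 3], "some_other_split": [4]}'),
    "split_three": edit_dataset_argv(
        REPO, "split", splits='{"train": 0.6, "val": 0.2, "test": 0.2}'),
    # For merge, --repo_id names the merged output, as in the comment above.
    "merge": edit_dataset_argv(
        REPO + "_merged", "merge",
        repo_ids=f"['{REPO}_train', '{REPO}_val']"),
    "remove_feature": edit_dataset_argv(
        REPO, "remove_feature", feature_names="['observation.images.top']"),
}

for name, argv in examples.items():
    print(name, argv)
```

Each list can be passed to `subprocess.run(argv, check=True)` in an environment where lerobot is installed; adding `push_to_hub=True` appends `--push_to_hub=true` as in the tested commands.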