
Commit b8f7e40

michel-aractingi, CarolinePascal, jackvial, and imstevenpmwork authored
Dataset tools (#2100)
* feat(dataset-tools): add dataset utilities and example script
  - Introduced dataset tools for LeRobotDataset, including functions for deleting episodes, splitting datasets, adding/removing features, and merging datasets.
  - Added an example script demonstrating the usage of these utilities.
  - Implemented comprehensive tests for all new functionalities to ensure reliability and correctness.
* style fixes
* move example to dataset dir
* missing license
* fixes mostly path
* clean comments
* move tests to functions instead of class based
* - fix video editing: decode, delete frames and re-encode video
  - copy unchanged video and parquet files to avoid recreating the entire dataset
* Fortify tooling tests
* Fix type issue resulting from saving numpy arrays with shape 3,1,1
* added lerobot_edit_dataset
* - revert changes in examples
  - remove hardcoded split names
* update comment
* fix comment add lerobot-edit-dataset shortcut
* Apply suggestion from @Copilot
  Co-authored-by: Copilot <[email protected]>
  Signed-off-by: Michel Aractingi <[email protected]>
* style nit after copilot review
* fix: bug in dataset root when editing the dataset in place (without setting new_repo_id)
* Fix bug in aggregate.py when accumulating video timestamps; add tests to fortify aggregate videos
* Added missing output repo id
* migrate delete episode to using pyav instead of decoding, writing frames to disk and encoding again.
  Co-authored-by: Caroline Pascal <[email protected]>
* added modified suffix in case repo_id is not set in delete_episode
* adding docs for dataset tools
* bump av version and add back time_base assignment
* linter
* modified push_to_hub logic in lerobot_edit_dataset
* fix(progress bar): fixing the progress bar issue in dataset tools
* chore(concatenate): removing no longer needed concatenate_datasets usage
* fix(file sizes forwarding): forwarding files and chunk sizes in metadata info when splitting and aggregating datasets
* style fix
* refactor(aggregate): Fix video indexing and timestamp bugs in dataset merging
  There were three critical bugs in aggregate.py that prevented correct dataset merging:
  1. Video file indices: Changed from += to = assignment to correctly reference merged video files
  2. Video timestamps: Implemented per-source-file offset tracking to maintain continuous timestamps when merging split datasets (was causing non-monotonic timestamp warnings)
  3. File rotation offsets: Store timestamp offsets after rotation decision to prevent out-of-bounds frame access (was causing "Invalid frame index" errors with small file size limits)
  Changes:
  - Updated update_meta_data() to apply per-source-file timestamp offsets
  - Updated aggregate_videos() to track offsets correctly during file rotation
  - Added get_video_duration_in_s import for duration calculation
* Improved docs for split dataset and added a check for the possible case that the split size results in zero episodes
* chore(docs): update merge documentation details
  Signed-off-by: Steven Palma <[email protected]>

---------

Co-authored-by: CarolinePascal <[email protected]>
Co-authored-by: Jack Vial <[email protected]>
Co-authored-by: Steven Palma <[email protected]>
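The timestamp fixes listed in the commit message can be illustrated with a small sketch. This is not the code from `aggregate.py`: `SourceVideo`, `plan_video_merge`, and `shift_timestamps` are hypothetical names, and durations and file sizes are passed in directly instead of being read from disk. It only shows the two ideas the message names, a per-source-file offset equal to the duration already written to the destination file, and storing that offset after the rotation decision rather than before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceVideo:
    """Hypothetical stand-in for one source video file."""

    name: str
    duration_s: float
    size_mb: float


def plan_video_merge(sources, max_file_size_mb):
    """Map each source video to (destination file index, timestamp offset in seconds)."""
    placements = {}
    file_index = 0
    written_mb = 0.0
    written_s = 0.0
    for src in sources:
        # Rotation decision comes first: start a new destination file when
        # appending this source would exceed the size limit.
        if written_mb > 0 and written_mb + src.size_mb > max_file_size_mb:
            file_index += 1
            written_mb = 0.0
            written_s = 0.0
        # Only now store the offset, so it refers to the file the source
        # actually lands in (0.0 right after a rotation).
        placements[src.name] = (file_index, written_s)
        written_mb += src.size_mb
        written_s += src.duration_s
    return placements


def shift_timestamps(episodes, placements):
    """Rewrite episode timestamps from source-file time to destination-file time."""
    shifted = []
    for ep in episodes:
        file_index, offset_s = placements[ep["source"]]
        shifted.append(
            {
                "file_index": file_index,  # assigned (=), not accumulated (+=)
                "from_s": ep["from_s"] + offset_s,
                "to_s": ep["to_s"] + offset_s,
            }
        )
    return shifted


sources = [
    SourceVideo("a.mp4", duration_s=10.0, size_mb=40.0),
    SourceVideo("b.mp4", duration_s=5.0, size_mb=40.0),
    SourceVideo("c.mp4", duration_s=8.0, size_mb=40.0),
]
placements = plan_video_merge(sources, max_file_size_mb=100.0)
episodes = [
    {"source": "b.mp4", "from_s": 1.0, "to_s": 4.0},
    {"source": "c.mp4", "from_s": 0.0, "to_s": 8.0},
]
merged_episodes = shift_timestamps(episodes, placements)
```

If the offset for `c.mp4` were recorded before the rotation check, it would be 15.0 s into a file that only holds 8 s of video, which is the kind of out-of-bounds access the third fix describes.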
1 parent 656fc0f commit b8f7e40

13 files changed: +2593 -30 lines changed


docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -25,6 +25,8 @@
       title: Using LeRobotDataset
     - local: porting_datasets_v3
       title: Porting Large Datasets
+    - local: using_dataset_tools
+      title: Using the Dataset Tools
   title: "Datasets"
 - sections:
     - local: act
Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
+# Using Dataset Tools
+
+This guide covers the dataset tools available in LeRobot for modifying and editing existing datasets.
+
+## Overview
+
+LeRobot provides several utilities for manipulating datasets:
+
+1. **Delete Episodes** - Remove specific episodes from a dataset
+2. **Split Dataset** - Divide a dataset into multiple smaller datasets
+3. **Merge Datasets** - Combine multiple datasets into one. The datasets must have identical features, and episodes are concatenated in the order specified in `repo_ids`
+4. **Add Features** - Add new features to a dataset
+5. **Remove Features** - Remove features from a dataset
+
+The core implementation is in `lerobot.datasets.dataset_tools`.
+An example script detailing how to use the tools API is available in `examples/dataset/use_dataset_tools.py`.
+
+## Command-Line Tool: lerobot-edit-dataset
+
+`lerobot-edit-dataset` is a command-line script for editing datasets. It can be used to delete episodes, split datasets, merge datasets, and remove features.
+
+Run `lerobot-edit-dataset --help` for more information on the configuration of each operation.
+
+### Usage Examples
+
+#### Delete Episodes
+
+Remove specific episodes from a dataset. This is useful for filtering out undesired data.
+
+```bash
+# Delete episodes 0, 2, and 5 (modifies original dataset)
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type delete_episodes \
+    --operation.episode_indices "[0, 2, 5]"
+
+# Delete episodes and save to a new dataset (preserves original dataset)
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --new_repo_id lerobot/pusht_after_deletion \
+    --operation.type delete_episodes \
+    --operation.episode_indices "[0, 2, 5]"
+```
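As a rough illustration of what deleting episodes implies for indexing, the sketch below computes how surviving episodes could be renumbered. It assumes the remaining episodes are re-indexed contiguously from zero, which the guide does not state explicitly, and `reindex_after_deletion` is a hypothetical helper, not part of `lerobot.datasets.dataset_tools`.

```python
def reindex_after_deletion(total_episodes, episode_indices):
    """Return {old_index: new_index} for the episodes that survive deletion.

    Assumption: survivors keep their relative order and are renumbered 0..N-1.
    """
    to_delete = set(episode_indices)
    invalid = sorted(i for i in to_delete if not 0 <= i < total_episodes)
    if invalid:
        raise ValueError(f"Episode indices out of range: {invalid}")
    if len(to_delete) == total_episodes:
        raise ValueError("Refusing to delete every episode in the dataset")
    kept = [i for i in range(total_episodes) if i not in to_delete]
    return {old: new for new, old in enumerate(kept)}


# Mirrors the CLI example above on a hypothetical 7-episode dataset.
mapping = reindex_after_deletion(total_episodes=7, episode_indices=[0, 2, 5])
```

Under that assumption, any code that stored episode indices before the deletion would need this mapping to address the same episodes afterwards.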
+
+#### Split Dataset
+
+Divide a dataset into multiple subsets.
+
+```bash
+# Split by fractions (e.g. 80% train, 10% test, 10% val)
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type split \
+    --operation.splits '{"train": 0.8, "test": 0.1, "val": 0.1}'
+
+# Split by specific episode indices
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type split \
+    --operation.splits '{"task1": [0, 1, 2, 3], "task2": [4, 5]}'
+```
+
+Split names are unconstrained and chosen by the user. Resulting datasets are saved under the repo id with the split name appended, e.g. `lerobot/pusht_train`, `lerobot/pusht_task1`, `lerobot/pusht_task2`.
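The naming rule and the zero-episode check mentioned in the commit message can be sketched as follows. `plan_fraction_split` is a hypothetical helper, and two details are assumptions rather than documented behaviour: fractions are expected to sum to at most 1, and episode counts are rounded down.

```python
def plan_fraction_split(repo_id, total_episodes, splits):
    """Map each output repo id to the episode indices it would receive."""
    if sum(splits.values()) > 1.0 + 1e-9:
        raise ValueError("Split fractions must sum to at most 1.0")
    plan = {}
    start = 0
    for name, fraction in splits.items():
        count = int(total_episodes * fraction)  # assumption: round down
        if count == 0:
            raise ValueError(f"Split '{name}' would contain zero episodes")
        # The split name is appended to the source repo id.
        plan[f"{repo_id}_{name}"] = list(range(start, start + count))
        start += count
    return plan


plan = plan_fraction_split("lerobot/pusht", 10, {"train": 0.8, "val": 0.2})
```

With ten episodes this assigns episodes 0 to 7 to `lerobot/pusht_train` and episodes 8 and 9 to `lerobot/pusht_val`; a fraction small enough to yield no episodes is rejected instead of producing an empty dataset.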
+
+#### Merge Datasets
+
+Combine multiple datasets into a single dataset.
+
+```bash
+# Merge train and validation splits back into one dataset
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht_merged \
+    --operation.type merge \
+    --operation.repo_ids "['lerobot/pusht_train', 'lerobot/pusht_val']"
+```
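The two merge rules stated in the overview (identical features, episodes concatenated in the order of `repo_ids`) can be sketched in a few lines. `plan_merge` and the dictionary layout are hypothetical and only model the bookkeeping, not the actual file handling.

```python
def plan_merge(datasets):
    """Return ({repo_id: first episode index in the merged dataset}, total episodes).

    `datasets` is a list of dicts with 'repo_id', 'features' and 'total_episodes',
    given in the order they should be concatenated.
    """
    if not datasets:
        raise ValueError("Need at least one dataset to merge")
    reference = datasets[0]["features"]
    offsets = {}
    next_index = 0
    for ds in datasets:
        # Merging is only defined for datasets with identical features.
        if ds["features"] != reference:
            raise ValueError(f"Features of {ds['repo_id']} differ from the first dataset")
        offsets[ds["repo_id"]] = next_index
        next_index += ds["total_episodes"]
    return offsets, next_index


features = {"action": {"dtype": "float32", "shape": (2,)}}
offsets, total = plan_merge(
    [
        {"repo_id": "lerobot/pusht_train", "features": features, "total_episodes": 8},
        {"repo_id": "lerobot/pusht_val", "features": features, "total_episodes": 2},
    ]
)
```

Swapping the order of the two inputs would place the validation episodes first, which is why the order of `repo_ids` matters.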
+
+#### Remove Features
+
+Remove features from a dataset.
+
+```bash
+# Remove a camera feature
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type remove_feature \
+    --operation.feature_names "['observation.images.top']"
+```
+
+### Push to Hub
+
+Add the `--push_to_hub` flag to any command to automatically upload the resulting dataset to the Hugging Face Hub:
+
+```bash
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --new_repo_id lerobot/pusht_after_deletion \
+    --operation.type delete_episodes \
+    --operation.episode_indices "[0, 2, 5]" \
+    --push_to_hub
+```
+
+Adding features is available through the Python API (`add_feature` in `lerobot.datasets.dataset_tools`) but is not yet covered by `lerobot-edit-dataset`.
examples/dataset/use_dataset_tools.py

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Example script demonstrating dataset tools utilities.
+
+This script shows how to:
+1. Delete episodes from a dataset
+2. Split a dataset into train/val sets
+3. Add/remove features
+4. Merge datasets
+
+Usage:
+    python examples/dataset/use_dataset_tools.py
+"""
+
+import numpy as np
+
+from lerobot.datasets.dataset_tools import (
+    add_feature,
+    delete_episodes,
+    merge_datasets,
+    remove_feature,
+    split_dataset,
+)
+from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+
+def main():
+    dataset = LeRobotDataset("lerobot/pusht")
+
+    print(f"Original dataset: {dataset.meta.total_episodes} episodes, {dataset.meta.total_frames} frames")
+    print(f"Features: {list(dataset.meta.features.keys())}")
+
+    print("\n1. Deleting episodes 0 and 2...")
+    filtered_dataset = delete_episodes(dataset, episode_indices=[0, 2], repo_id="lerobot/pusht_filtered")
+    print(f"Filtered dataset: {filtered_dataset.meta.total_episodes} episodes")
+
+    print("\n2. Splitting dataset into train/val...")
+    splits = split_dataset(
+        dataset,
+        splits={"train": 0.8, "val": 0.2},
+    )
+    print(f"Train split: {splits['train'].meta.total_episodes} episodes")
+    print(f"Val split: {splits['val'].meta.total_episodes} episodes")
+
+    print("\n3. Adding a reward feature...")
+
+    reward_values = np.random.randn(dataset.meta.total_frames).astype(np.float32)
+    dataset_with_reward = add_feature(
+        dataset,
+        feature_name="reward",
+        feature_values=reward_values,
+        feature_info={
+            "dtype": "float32",
+            "shape": (1,),
+            "names": None,
+        },
+        repo_id="lerobot/pusht_with_reward",
+    )
+
+    def compute_success(row_dict, episode_index, frame_index):
+        episode_length = 10
+        return float(frame_index >= episode_length - 10)
+
+    dataset_with_success = add_feature(
+        dataset_with_reward,
+        feature_name="success",
+        feature_values=compute_success,
+        feature_info={
+            "dtype": "float32",
+            "shape": (1,),
+            "names": None,
+        },
+        repo_id="lerobot/pusht_with_reward_and_success",
+    )
+
+    print(f"New features: {list(dataset_with_success.meta.features.keys())}")
+
+    print("\n4. Removing the success feature...")
+    dataset_cleaned = remove_feature(
+        dataset_with_success, feature_names="success", repo_id="lerobot/pusht_cleaned"
+    )
+    print(f"Features after removal: {list(dataset_cleaned.meta.features.keys())}")
+
+    print("\n5. Merging train and val splits back together...")
+    merged = merge_datasets([splits["train"], splits["val"]], output_repo_id="lerobot/pusht_merged")
+    print(f"Merged dataset: {merged.meta.total_episodes} episodes")
+
+    print("\n6. Complex workflow example...")
+
+    if len(dataset.meta.camera_keys) > 1:
+        camera_to_remove = dataset.meta.camera_keys[0]
+        print(f"Removing camera: {camera_to_remove}")
+        dataset_no_cam = remove_feature(
+            dataset, feature_names=camera_to_remove, repo_id="pusht_no_first_camera"
+        )
+        print(f"Remaining cameras: {dataset_no_cam.meta.camera_keys}")
+
+    print("\nDone! Check ~/.cache/huggingface/lerobot/ for the created datasets.")
+
+
+if __name__ == "__main__":
+    main()
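In the example script, `compute_success` hard-codes `episode_length = 10`, so `frame_index >= episode_length - 10` holds for every frame and the feature is always 1.0. A version driven by real episode lengths might look like the sketch below. It is plain Python with no LeRobot dependency: `make_success_fn` and `evaluate_feature` are hypothetical helpers, and the loop in `evaluate_feature` is only an assumption about how a callable with the `(row_dict, episode_index, frame_index)` signature could be applied per frame.

```python
def make_success_fn(episode_lengths, last_n=10):
    """Build a per-frame callable that flags the last `last_n` frames of each episode."""

    def compute_success(row_dict, episode_index, frame_index):
        # Same signature as the callable passed to add_feature in the script.
        return float(frame_index >= episode_lengths[episode_index] - last_n)

    return compute_success


def evaluate_feature(rows, feature_fn):
    """Apply a per-frame callable to every row (stand-in for the library's loop)."""
    return [feature_fn(row, row["episode_index"], row["frame_index"]) for row in rows]


# Two toy episodes with 3 and 2 frames.
episode_lengths = [3, 2]
rows = [
    {"episode_index": ep, "frame_index": frame}
    for ep, length in enumerate(episode_lengths)
    for frame in range(length)
]
success = evaluate_feature(rows, make_success_fn(episode_lengths, last_n=1))
```

Here only the final frame of each toy episode is flagged, so the values differ from frame to frame instead of being constant.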

pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -67,7 +67,7 @@ dependencies = [
     "cmake>=3.29.0.1,<4.2.0",
     "einops>=0.8.0,<0.9.0",
     "opencv-python-headless>=4.9.0,<4.13.0",
-    "av>=14.2.0,<16.0.0",
+    "av>=15.0.0,<16.0.0",
     "jsonlines>=4.0.0,<5.0.0",
     "packaging>=24.2,<26.0",
     "pynput>=1.7.7,<1.9.0",
@@ -175,6 +175,7 @@ lerobot-dataset-viz="lerobot.scripts.lerobot_dataset_viz:main"
 lerobot-info="lerobot.scripts.lerobot_info:main"
 lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
 lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
+lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
 
 # ---------------- Tool Configurations ----------------
 [tool.setuptools.packages.find]
