Dataset tools #2100
Changes from 32 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,102 @@ | ||
| # Using Dataset Tools | ||
|
|
||
| This guide covers the dataset utilities available in LeRobot for editing existing datasets. | ||
|
|
||
| ## Overview | ||
|
|
||
| LeRobot provides several utilities for manipulating datasets: | ||
|
|
||
| 1. **Delete Episodes** - Remove specific episodes from a dataset | ||
| 2. **Split Dataset** - Divide a dataset into multiple smaller datasets | ||
| 3. **Merge Datasets** - Combine multiple datasets into one | ||
| 4. **Add Features** - Add new features to a dataset | ||
|
Review comment: Would be good to add an example for add feature. Reply: Agreed, I think this and the renaming features would make nice small PRs for the community to do to get them engaged in the dataset |
||
| 5. **Remove Features** - Remove features from a dataset | ||
|
|
||
| The core implementation is in `lerobot.datasets.dataset_tools`. | ||
| An example script detailing how to use the tools API is available in `examples/dataset/use_dataset_tools.py`. | ||
|
|
||
| ## Command-Line Tool: lerobot-edit-dataset | ||
|
|
||
| `lerobot-edit-dataset` is a command-line script for editing datasets. It can be used to delete episodes, split datasets, merge datasets, and remove features. | ||
|
|
||
| Run `lerobot-edit-dataset --help` for more information on the configuration of each operation. | ||
|
|
||
| ### Usage Examples | ||
|
|
||
| #### Delete Episodes | ||
|
|
||
| Remove specific episodes from a dataset. This is useful for filtering out undesired data. | ||
|
|
||
| ```bash | ||
| # Delete episodes 0, 2, and 5 (modifies original dataset) | ||
| lerobot-edit-dataset \ | ||
| --repo_id lerobot/pusht \ | ||
| --operation.type delete_episodes \ | ||
| --operation.episode_indices "[0, 2, 5]" | ||
|
|
||
| # Delete episodes and save to a new dataset (preserves original dataset) | ||
| lerobot-edit-dataset \ | ||
| --repo_id lerobot/pusht \ | ||
| --new_repo_id lerobot/pusht_after_deletion \ | ||
| --operation.type delete_episodes \ | ||
| --operation.episode_indices "[0, 2, 5]" | ||
| ``` | ||
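When deletions are driven from a script (for example, from a list of bad episodes produced by a quality check), the command line can be assembled programmatically. The following is a minimal sketch that assumes only the flags shown in the examples above; `build_delete_command` is a hypothetical helper and not part of LeRobot.

```python
import json
import shlex


def build_delete_command(repo_id, episode_indices, new_repo_id=None):
    """Return the argv list for a delete_episodes run (hypothetical helper)."""
    # Sort and de-duplicate so the generated command is stable across runs.
    indices = sorted(set(episode_indices))
    if not indices:
        raise ValueError("episode_indices must not be empty")
    if indices[0] < 0:
        raise ValueError("episode indices must be non-negative")

    argv = ["lerobot-edit-dataset", "--repo_id", repo_id]
    if new_repo_id is not None:
        # Writing to a new repo id preserves the original dataset.
        argv += ["--new_repo_id", new_repo_id]
    argv += [
        "--operation.type", "delete_episodes",
        # json.dumps renders the list in the same "[0, 2, 5]" form used above.
        "--operation.episode_indices", json.dumps(indices),
    ]
    return argv


argv = build_delete_command(
    "lerobot/pusht", [5, 0, 2, 2], new_repo_id="lerobot/pusht_after_deletion"
)
# shlex.join quotes the bracketed list so the line can be pasted into a shell.
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where `lerobot-edit-dataset` is installed.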
|
|
||
| #### Split Dataset | ||
|
|
||
| Divide a dataset into multiple subsets. | ||
|
|
||
| ```bash | ||
| # Split by fractions (e.g. 60% train, 20% test, 20% val) | ||
| lerobot-edit-dataset \ | ||
| --repo_id lerobot/pusht \ | ||
| --operation.type split \ | ||
| --operation.splits '{"train": 0.6, "test": 0.2, "val": 0.2}' | ||
|
||
|
|
||
| # Split by specific episode indices | ||
| lerobot-edit-dataset \ | ||
| --repo_id lerobot/pusht \ | ||
| --operation.type split \ | ||
| --operation.splits '{"train": [0, 1, 2, 3], "val": [4, 5]}' | ||
| ``` | ||
|
|
||
| Resulting datasets are saved under the repo id with the split name appended, e.g. `lerobot/pusht_train`, `lerobot/pusht_test`, `lerobot/pusht_val`. | ||
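To make the fractional split and the naming rule concrete, here is a small sketch of how fractions could map to episode indices and output repo ids. It is illustrative only: `plan_fraction_split` and `split_repo_ids` are hypothetical helpers, the contiguous assignment is an assumption (the actual tool may assign episodes differently), and rejecting fractions that sum to more than 1.0 is a choice made for this sketch.

```python
import math


def plan_fraction_split(total_episodes, fractions):
    """Map split names to contiguous episode-index lists (illustrative only)."""
    if any(fraction <= 0 for fraction in fractions.values()):
        raise ValueError("each fraction must be positive")
    total_fraction = sum(fractions.values())
    if total_fraction > 1.0 + 1e-9:
        raise ValueError(f"fractions sum to {total_fraction:.2f}, which exceeds 1.0")

    plan = {}
    start = 0
    for name, fraction in fractions.items():
        # Floor (with a small tolerance for float error) so splits never overlap
        # or run past the last episode.
        count = math.floor(total_episodes * fraction + 1e-9)
        plan[name] = list(range(start, start + count))
        start += count
    return plan


def split_repo_ids(repo_id, split_names):
    """Append each split name to the repo id, as described above."""
    return {name: f"{repo_id}_{name}" for name in split_names}


plan = plan_fraction_split(10, {"train": 0.6, "test": 0.2, "val": 0.2})
names = split_repo_ids("lerobot/pusht", plan)
for split_name, episodes in plan.items():
    print(names[split_name], episodes)
```

Because counts are floored, a few trailing episodes can be left unassigned when the fractions do not divide the episode count evenly.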
|
|
||
| #### Merge Datasets | ||
|
|
||
| Combine multiple datasets into a single dataset. | ||
|
|
||
| ```bash | ||
| # Merge train and validation splits back into one dataset | ||
| lerobot-edit-dataset \ | ||
| --repo_id lerobot/pusht_merged \ | ||
| --operation.type merge \ | ||
| --operation.repo_ids "['lerobot/pusht_train', 'lerobot/pusht_val']" | ||
| ``` | ||
|
|
||
| #### Remove Features | ||
|
|
||
| Remove features from a dataset. | ||
|
|
||
| ```bash | ||
| # Remove a camera feature | ||
| lerobot-edit-dataset \ | ||
| --repo_id lerobot/pusht \ | ||
| --operation.type remove_feature \ | ||
| --operation.feature_names "['observation.images.top']" | ||
| ``` | ||
|
|
||
| ### Push to Hub | ||
|
|
||
| Add the `--push_to_hub` flag to any command to automatically upload the resulting dataset to the Hugging Face Hub: | ||
|
|
||
| ```bash | ||
| lerobot-edit-dataset \ | ||
| --repo_id lerobot/pusht \ | ||
| --new_repo_id lerobot/pusht_after_deletion \ | ||
| --operation.type delete_episodes \ | ||
| --operation.episode_indices "[0, 2, 5]" \ | ||
| --push_to_hub | ||
| ``` | ||
|
|
||
| Adding features is not yet covered by `lerobot-edit-dataset`; use the `add_feature` function from `lerobot.datasets.dataset_tools` instead.
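The example script in this PR passes `add_feature` either an array of per-frame values or a callable taking `(row_dict, episode_index, frame_index)`. The sketch below shows one way such a callable could be built so that it depends on each episode's real length. `make_success_fn` and the `episode_lengths` mapping are hypothetical, and treating `frame_index` as the position within the episode is an assumption.

```python
def make_success_fn(episode_lengths, last_n=10):
    """Build a per-frame callable that flags the last `last_n` frames of each episode.

    `episode_lengths` maps episode_index -> number of frames (hypothetical input).
    """

    def compute_success(row_dict, episode_index, frame_index):
        length = episode_lengths[episode_index]
        # 1.0 for the final `last_n` frames of the episode, 0.0 otherwise.
        return float(frame_index >= length - last_n)

    return compute_success


# Toy lengths for illustration; real values would come from the dataset metadata.
episode_lengths = {0: 25, 1: 12}
compute_success = make_success_fn(episode_lengths, last_n=10)

flags = [compute_success({}, 0, i) for i in range(episode_lengths[0])]
print(flags)

# With lerobot installed, the callable would be passed as
# feature_values=compute_success in an add_feature(...) call,
# following the signature used in examples/dataset/use_dataset_tools.py.
```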
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,117 @@ | ||
| #!/usr/bin/env python | ||
|
|
||
| # Copyright 2025 The HuggingFace Inc. team. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """ | ||
| Example script demonstrating dataset tools utilities. | ||
|
|
||
| This script shows how to: | ||
| 1. Delete episodes from a dataset | ||
| 2. Split a dataset into train/val sets | ||
| 3. Add/remove features | ||
| 4. Merge datasets | ||
|
|
||
| Usage: | ||
| python examples/dataset/use_dataset_tools.py | ||
| """ | ||
|
|
||
| import numpy as np | ||
|
|
||
| from lerobot.datasets.dataset_tools import ( | ||
| add_feature, | ||
| delete_episodes, | ||
| merge_datasets, | ||
| remove_feature, | ||
| split_dataset, | ||
| ) | ||
| from lerobot.datasets.lerobot_dataset import LeRobotDataset | ||
|
|
||
|
|
||
| def main(): | ||
| dataset = LeRobotDataset("lerobot/pusht") | ||
|
|
||
| print(f"Original dataset: {dataset.meta.total_episodes} episodes, {dataset.meta.total_frames} frames") | ||
| print(f"Features: {list(dataset.meta.features.keys())}") | ||
|
|
||
| print("\n1. Deleting episodes 0 and 2...") | ||
| filtered_dataset = delete_episodes(dataset, episode_indices=[0, 2], repo_id="lerobot/pusht_filtered") | ||
| print(f"Filtered dataset: {filtered_dataset.meta.total_episodes} episodes") | ||
|
|
||
| print("\n2. Splitting dataset into train/val...") | ||
| splits = split_dataset( | ||
| dataset, | ||
| splits={"train": 0.8, "val": 0.2}, | ||
| ) | ||
| print(f"Train split: {splits['train'].meta.total_episodes} episodes") | ||
| print(f"Val split: {splits['val'].meta.total_episodes} episodes") | ||
|
|
||
| print("\n3. Adding a reward feature...") | ||
|
|
||
| reward_values = np.random.randn(dataset.meta.total_frames).astype(np.float32) | ||
| dataset_with_reward = add_feature( | ||
| dataset, | ||
| feature_name="reward", | ||
| feature_values=reward_values, | ||
| feature_info={ | ||
| "dtype": "float32", | ||
| "shape": (1,), | ||
| "names": None, | ||
| }, | ||
| repo_id="lerobot/pusht_with_reward", | ||
| ) | ||
|
|
||
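| def compute_success(row_dict, episode_index, frame_index): | ||
| # Placeholder: with episode_length = 10 the condition reduces to frame_index >= 0, | ||
| # so every frame is marked as success; substitute the real episode length. | ||
| episode_length = 10 | ||
| return float(frame_index >= episode_length - 10) | ||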
|
|
||
| dataset_with_success = add_feature( | ||
| dataset_with_reward, | ||
| feature_name="success", | ||
| feature_values=compute_success, | ||
| feature_info={ | ||
| "dtype": "float32", | ||
| "shape": (1,), | ||
| "names": None, | ||
| }, | ||
| repo_id="lerobot/pusht_with_reward_and_success", | ||
| ) | ||
|
|
||
| print(f"New features: {list(dataset_with_success.meta.features.keys())}") | ||
|
|
||
| print("\n4. Removing the success feature...") | ||
| dataset_cleaned = remove_feature( | ||
| dataset_with_success, feature_names="success", repo_id="lerobot/pusht_cleaned" | ||
| ) | ||
| print(f"Features after removal: {list(dataset_cleaned.meta.features.keys())}") | ||
|
|
||
| print("\n5. Merging train and val splits back together...") | ||
| merged = merge_datasets([splits["train"], splits["val"]], output_repo_id="lerobot/pusht_merged") | ||
| print(f"Merged dataset: {merged.meta.total_episodes} episodes") | ||
|
|
||
| print("\n6. Complex workflow example...") | ||
|
|
||
| if len(dataset.meta.camera_keys) > 1: | ||
| camera_to_remove = dataset.meta.camera_keys[0] | ||
| print(f"Removing camera: {camera_to_remove}") | ||
| dataset_no_cam = remove_feature( | ||
| dataset, feature_names=camera_to_remove, repo_id="lerobot/pusht_no_first_camera" | ||
| ) | ||
| print(f"Remaining cameras: {dataset_no_cam.meta.camera_keys}") | ||
|
|
||
| print("\nDone! Check ~/.cache/huggingface/lerobot/ for the created datasets.") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |