Merged
Changes from 32 commits
39 commits
d3cff4c
feat(dataset-tools): add dataset utilities and example script
michel-aractingi Aug 12, 2025
01c4bf9
style fixes
michel-aractingi Oct 1, 2025
9a6b84e
move example to dataset dir
michel-aractingi Oct 1, 2025
da3cc59
missing lisence
michel-aractingi Oct 1, 2025
28afab3
fixes mostly path
michel-aractingi Oct 1, 2025
8ac6be8
clean comments
michel-aractingi Oct 1, 2025
90b0b47
move tests to functions instead of class based
michel-aractingi Oct 1, 2025
49b923f
- fix video editting, decode, delete frames and rencode video
michel-aractingi Oct 1, 2025
cbcc146
Fortify tooling tests
michel-aractingi Oct 1, 2025
2049802
Fix type issue resulting from saving numpy arrays with shape 3,1,1
michel-aractingi Oct 2, 2025
ef5fd5e
added lerobot_edit_dataset
michel-aractingi Oct 2, 2025
1ac4434
- revert changes in examples
michel-aractingi Oct 2, 2025
012d9ee
update comment
michel-aractingi Oct 2, 2025
fa593f2
fix comment
michel-aractingi Oct 2, 2025
0dc81f5
Merge branch 'main' into feat/dataset_tools
michel-aractingi Oct 2, 2025
298fb6f
Merge branch 'main' into feat/dataset_tools
michel-aractingi Oct 2, 2025
a8cc159
Apply suggestion from @Copilot
michel-aractingi Oct 2, 2025
e961c1f
style nit after copilot review
michel-aractingi Oct 2, 2025
4fd895d
fix: bug in dataset root when editing the dataset in place (without s…
michel-aractingi Oct 4, 2025
a55c962
Fix bug in aggregate.py when accumelating video timestamps; add tests…
michel-aractingi Oct 5, 2025
e96c5b7
Added missing output repo id
michel-aractingi Oct 5, 2025
cbd1cf4
Merge branch 'main' into feat/dataset_tools
michel-aractingi Oct 6, 2025
45bd6e7
migrate delete episode to using pyav instead of decoding, writing fra…
michel-aractingi Oct 7, 2025
d7776d5
added modified suffix in case repo_id is not set in delete_episode
michel-aractingi Oct 7, 2025
c48362c
adding docs for dataset tools
michel-aractingi Oct 7, 2025
6664f55
Merge branch 'main' into feat/dataset_tools
michel-aractingi Oct 7, 2025
3e60cf7
bump av version and add back time_base assignment
michel-aractingi Oct 7, 2025
a9954bf
linter
michel-aractingi Oct 7, 2025
f06d1f0
modified push_to_hub logic in lerobot_edit_dataset
michel-aractingi Oct 7, 2025
81411af
fix(progress bar): fixing the progress bar issue in dataset tools
CarolinePascal Oct 7, 2025
24c2f46
chore(concatenate): removing no longer needed concatenate_datasets usage
CarolinePascal Oct 7, 2025
c336c20
Merge branch 'main' into feat/dataset_tools
michel-aractingi Oct 8, 2025
9d66171
fix(file sizes forwarding): forwarding files and chunk sizes in metad…
CarolinePascal Oct 9, 2025
e09fb06
style fix
michel-aractingi Oct 9, 2025
3dc3787
refactor(aggregate): Fix video indexing and timestamp bugs in dataset…
michel-aractingi Oct 10, 2025
44ee441
Improved docs for split dataset and added a check for the possible ca…
michel-aractingi Oct 10, 2025
3d86d9b
Merge branch 'main' into feat/dataset_tools
imstevenpmwork Oct 10, 2025
fb612fc
chore(docs): update merge documentation details
imstevenpmwork Oct 10, 2025
68ead83
Merge branch 'main' into feat/dataset_tools
imstevenpmwork Oct 10, 2025
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -25,6 +25,8 @@
       title: Using LeRobotDataset
     - local: porting_datasets_v3
       title: Porting Large Datasets
+    - local: using_dataset_tools
+      title: Using the Dataset Tools
   title: "Datasets"
 - sections:
     - local: act
102 changes: 102 additions & 0 deletions docs/source/using_dataset_tools.mdx
@@ -0,0 +1,102 @@
# Using Dataset Tools

This guide covers the dataset tool utilities available in LeRobot for modifying and editing existing datasets.

## Overview

LeRobot provides several utilities for manipulating datasets:

1. **Delete Episodes** - Remove specific episodes from a dataset
2. **Split Dataset** - Divide a dataset into multiple smaller datasets
3. **Merge Datasets** - Combine multiple datasets into one
4. **Add Features** - Add new features to a dataset

Review comment (Contributor): Would be good to add an example for add feature

Reply (Collaborator Author): Agreed, I think this and the renaming features would make nice small PRs for the community to do to get them engaged in the dataset

5. **Remove Features** - Remove features from a dataset

The core implementation is in `lerobot.datasets.dataset_tools`.
An example script detailing how to use the tools API is available in `examples/dataset/use_dataset_tools.py`.

## Command-Line Tool: lerobot-edit-dataset

`lerobot-edit-dataset` is a command-line script for editing datasets. It can be used to delete episodes, split datasets, merge datasets, and remove features.

Run `lerobot-edit-dataset --help` for more information on the configuration of each operation.

### Usage Examples

#### Delete Episodes

Remove specific episodes from a dataset. This is useful for filtering out undesired data.

```bash
# Delete episodes 0, 2, and 5 (modifies original dataset)
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type delete_episodes \
--operation.episode_indices "[0, 2, 5]"

# Delete episodes and save to a new dataset (preserves original dataset)
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--new_repo_id lerobot/pusht_after_deletion \
--operation.type delete_episodes \
--operation.episode_indices "[0, 2, 5]"
```
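
For intuition about what deletion means for episode numbering, here is a minimal sketch. It assumes the surviving episodes are renumbered contiguously from 0, which this guide does not state explicitly. `remap_after_deletion` is a hypothetical helper written for illustration and is not part of `lerobot.datasets.dataset_tools`.

```python
def remap_after_deletion(total_episodes: int, deleted: list[int]) -> dict[int, int]:
    """Map each surviving old episode index to its new index (hypothetical helper)."""
    to_delete = set(deleted)
    out_of_range = sorted(i for i in to_delete if not 0 <= i < total_episodes)
    if out_of_range:
        raise ValueError(f"Episode indices out of range: {out_of_range}")
    kept = [i for i in range(total_episodes) if i not in to_delete]
    # Assumption: surviving episodes are renumbered contiguously starting at 0.
    return {old: new for new, old in enumerate(kept)}


mapping = remap_after_deletion(total_episodes=8, deleted=[0, 2, 5])
print(mapping)  # {1: 0, 3: 1, 4: 2, 6: 3, 7: 4}
```

Under that assumption, any external annotation keyed by the old episode index would need the same remapping after a deletion.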

#### Split Dataset

Divide a dataset into multiple subsets.

```bash
# Split by fractions (e.g. 60% train, 20% test, 20% val)
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type split \
--operation.splits '{"train": 0.6, "test": 0.2, "val": 0.2}'

# Split by specific episode indices
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type split \
--operation.splits '{"train": [0, 1, 2, 3], "val": [4, 5]}'
```

Resulting datasets are saved under the repo id with the split name appended, e.g. `lerobot/pusht_train`, `lerobot/pusht_test`, `lerobot/pusht_val`.
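
The following sketch shows one way a fractional split specification could be turned into per-split episode lists and output repo ids. It is an illustration only: `plan_splits` is a hypothetical helper, and the rules it applies (fractions may not sum to more than 1.0, episodes are assigned in order, counts are rounded down) are assumptions rather than the documented behavior of `split_dataset`. Only the `<repo_id>_<split>` naming comes from the text above.

```python
def plan_splits(repo_id: str, total_episodes: int, splits: dict[str, float]) -> dict[str, list[int]]:
    """Assign consecutive episode indices to each named split (hypothetical helper)."""
    total_fraction = sum(splits.values())
    # Assumption: fractions that sum to more than 1.0 are rejected.
    if total_fraction > 1.0 + 1e-9:
        raise ValueError(f"Split fractions sum to {total_fraction:.2f}, expected at most 1.0")

    plan = {}
    start = 0
    for name, fraction in splits.items():
        # Round down; the small epsilon guards against float error such as 28.999999.
        count = int(total_episodes * fraction + 1e-9)
        plan[f"{repo_id}_{name}"] = list(range(start, start + count))
        start += count
    return plan


plan = plan_splits("lerobot/pusht", 10, {"train": 0.6, "test": 0.2, "val": 0.2})
print({name: len(episodes) for name, episodes in plan.items()})
# {'lerobot/pusht_train': 6, 'lerobot/pusht_test': 2, 'lerobot/pusht_val': 2}
```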

#### Merge Datasets

Combine multiple datasets into a single dataset.

```bash
# Merge train and validation splits back into one dataset
lerobot-edit-dataset \
--repo_id lerobot/pusht_merged \
--operation.type merge \
--operation.repo_ids "['lerobot/pusht_train', 'lerobot/pusht_val']"
```
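
As a rough mental model of merging, the sketch below computes which episode index range each source dataset would occupy in the merged result. It assumes sources are appended in the order given and that each one is shifted by the running episode total; the guide does not spell this out, `merged_episode_ranges` is a hypothetical helper, and the episode counts are made up.

```python
def merged_episode_ranges(episode_counts: dict[str, int]) -> dict[str, range]:
    """Return the merged-index range occupied by each source dataset (hypothetical helper)."""
    ranges = {}
    start = 0
    for repo_id, count in episode_counts.items():
        # Assumption: each source is appended after the previous ones.
        ranges[repo_id] = range(start, start + count)
        start += count
    return ranges


ranges = merged_episode_ranges({"lerobot/pusht_train": 160, "lerobot/pusht_val": 40})
print(ranges["lerobot/pusht_val"])  # range(160, 200)
```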

#### Remove Features

Remove features from a dataset.

```bash
# Remove a camera feature
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type remove_feature \
--operation.feature_names "['observation.images.top']"
```
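
The CLI example above passes a list of feature names, while the example script in this PR passes a single name as a string. The sketch below shows how both forms could be normalized and validated against a feature dictionary. `drop_features` is a hypothetical helper, the feature dictionary is made up, and nothing here reflects how `remove_feature` is actually implemented.

```python
def drop_features(features: dict, feature_names) -> dict:
    """Return a copy of `features` without the named keys (hypothetical helper)."""
    # Accept either a single name or an iterable of names.
    names = [feature_names] if isinstance(feature_names, str) else list(feature_names)
    missing = [name for name in names if name not in features]
    if missing:
        raise KeyError(f"Unknown features: {missing}")
    return {key: value for key, value in features.items() if key not in names}


# Illustrative feature metadata, not read from a real dataset.
features = {
    "observation.images.top": {"dtype": "video"},
    "observation.state": {"dtype": "float32"},
    "action": {"dtype": "float32"},
}
print(list(drop_features(features, ["observation.images.top"])))
# ['observation.state', 'action']
```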

### Push to Hub

Add the `--push_to_hub` flag to any command to automatically upload the resulting dataset to the Hugging Face Hub:

```bash
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--new_repo_id lerobot/pusht_after_deletion \
--operation.type delete_episodes \
--operation.episode_indices "[0, 2, 5]" \
--push_to_hub
```

Adding features is supported by the `add_feature` function in `lerobot.datasets.dataset_tools`, but it is not yet exposed through `lerobot-edit-dataset`.
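
In the example script `examples/dataset/use_dataset_tools.py`, `add_feature` is given either an array of per-frame values or a callable taking `(row_dict, episode_index, frame_index)`. The sketch below builds such a callable that marks the last few frames of each episode as successful. It runs without LeRobot installed; `make_success_fn` and the `episode_lengths` dictionary are hypothetical, and how to obtain real episode lengths from a dataset is not shown here.

```python
def make_success_fn(episode_lengths: dict[int, int], last_n: int = 10):
    """Build a per-frame callable that flags the final `last_n` frames of each episode."""

    def compute_success(row_dict, episode_index, frame_index):
        # `episode_lengths` is supplied by the caller (an assumption of this sketch).
        episode_length = episode_lengths[episode_index]
        return float(frame_index >= episode_length - last_n)

    return compute_success


compute_success = make_success_fn({0: 50, 1: 30})
print(compute_success({}, 0, 39))  # 0.0
print(compute_success({}, 0, 40))  # 1.0
```

A function built this way could then be passed as `feature_values=` in the same way the example script passes its own `compute_success`.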
117 changes: 117 additions & 0 deletions examples/dataset/use_dataset_tools.py
@@ -0,0 +1,117 @@
#!/usr/bin/env python

# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Example script demonstrating dataset tools utilities.

This script shows how to:
1. Delete episodes from a dataset
2. Split a dataset into train/val sets
3. Add/remove features
4. Merge datasets

Usage:
python examples/dataset/use_dataset_tools.py
"""

import numpy as np

from lerobot.datasets.dataset_tools import (
    add_feature,
    delete_episodes,
    merge_datasets,
    remove_feature,
    split_dataset,
)
from lerobot.datasets.lerobot_dataset import LeRobotDataset


def main():
    dataset = LeRobotDataset("lerobot/pusht")

    print(f"Original dataset: {dataset.meta.total_episodes} episodes, {dataset.meta.total_frames} frames")
    print(f"Features: {list(dataset.meta.features.keys())}")

    print("\n1. Deleting episodes 0 and 2...")
    filtered_dataset = delete_episodes(dataset, episode_indices=[0, 2], repo_id="lerobot/pusht_filtered")
    print(f"Filtered dataset: {filtered_dataset.meta.total_episodes} episodes")

    print("\n2. Splitting dataset into train/val...")
    splits = split_dataset(
        dataset,
        splits={"train": 0.8, "val": 0.2},
    )
    print(f"Train split: {splits['train'].meta.total_episodes} episodes")
    print(f"Val split: {splits['val'].meta.total_episodes} episodes")

    print("\n3. Adding a reward feature...")

    reward_values = np.random.randn(dataset.meta.total_frames).astype(np.float32)
    dataset_with_reward = add_feature(
        dataset,
        feature_name="reward",
        feature_values=reward_values,
        feature_info={
            "dtype": "float32",
            "shape": (1,),
            "names": None,
        },
        repo_id="lerobot/pusht_with_reward",
    )

    def compute_success(row_dict, episode_index, frame_index):
        # Placeholder logic: episode_length is hard-coded to 10, so the threshold
        # is 0 and every frame is marked as success. A real implementation would
        # look up the actual length of the episode at `episode_index`.
        episode_length = 10
        return float(frame_index >= episode_length - 10)

    dataset_with_success = add_feature(
        dataset_with_reward,
        feature_name="success",
        feature_values=compute_success,
        feature_info={
            "dtype": "float32",
            "shape": (1,),
            "names": None,
        },
        repo_id="lerobot/pusht_with_reward_and_success",
    )

    print(f"New features: {list(dataset_with_success.meta.features.keys())}")

    print("\n4. Removing the success feature...")
    dataset_cleaned = remove_feature(
        dataset_with_success, feature_names="success", repo_id="lerobot/pusht_cleaned"
    )
    print(f"Features after removal: {list(dataset_cleaned.meta.features.keys())}")

    print("\n5. Merging train and val splits back together...")
    merged = merge_datasets([splits["train"], splits["val"]], output_repo_id="lerobot/pusht_merged")
    print(f"Merged dataset: {merged.meta.total_episodes} episodes")

    print("\n6. Complex workflow example...")

    if len(dataset.meta.camera_keys) > 1:
        camera_to_remove = dataset.meta.camera_keys[0]
        print(f"Removing camera: {camera_to_remove}")
        dataset_no_cam = remove_feature(
            dataset, feature_names=camera_to_remove, repo_id="pusht_no_first_camera"
        )
        print(f"Remaining cameras: {dataset_no_cam.meta.camera_keys}")

    print("\nDone! Check ~/.cache/huggingface/lerobot/ for the created datasets.")


if __name__ == "__main__":
    main()
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -67,7 +67,7 @@ dependencies = [
     "cmake>=3.29.0.1,<4.2.0",
     "einops>=0.8.0,<0.9.0",
     "opencv-python-headless>=4.9.0,<4.13.0",
-    "av>=14.2.0,<16.0.0",
+    "av>=15.0.0,<16.0.0",
     "jsonlines>=4.0.0,<5.0.0",
     "packaging>=24.2,<26.0",
     "pynput>=1.7.7,<1.9.0",
@@ -175,6 +175,7 @@ lerobot-dataset-viz="lerobot.scripts.lerobot_dataset_viz:main"
 lerobot-info="lerobot.scripts.lerobot_info:main"
 lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
 lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
+lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
 
 # ---------------- Tool Configurations ----------------
 [tool.setuptools.packages.find]
24 changes: 9 additions & 15 deletions src/lerobot/datasets/aggregate.py
@@ -236,6 +236,9 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
     Returns:
         dict: Updated videos_idx with current chunk and file indices.
     """
+    for key in videos_idx:
+        videos_idx[key]["episode_duration"] = src_meta.total_frames / src_meta.fps
+
     for key, video_idx in videos_idx.items():
         unique_chunk_file_pairs = {
             (chunk, file)
@@ -250,6 +253,8 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
         chunk_idx = video_idx["chunk"]
         file_idx = video_idx["file"]
 
+        rotated_to_new_file = False
+
         for src_chunk_idx, src_file_idx in unique_chunk_file_pairs:
             src_path = src_meta.root / DEFAULT_VIDEO_PATH.format(
                 video_key=key,
@@ -263,21 +268,15 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
                 file_index=file_idx,
             )
 
-            # If a new file is created, we don't want to increment the latest_duration
-            update_latest_duration = False
-
             if not dst_path.exists():
-                # First write to this destination file
                 dst_path.parent.mkdir(parents=True, exist_ok=True)
                 shutil.copy(str(src_path), str(dst_path))
-                continue  # not accumulating further, already copied the file in place
+                continue
 
-            # Check file sizes before appending
             src_size = get_video_size_in_mb(src_path)
             dst_size = get_video_size_in_mb(dst_path)
 
             if dst_size + src_size >= video_files_size_in_mb:
-                # Rotate to a new chunk/file
                 chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, chunk_size)
                 dst_path = dst_meta.root / DEFAULT_VIDEO_PATH.format(
                     video_key=key,
@@ -286,24 +285,19 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
                 )
                 dst_path.parent.mkdir(parents=True, exist_ok=True)
                 shutil.copy(str(src_path), str(dst_path))
+                rotated_to_new_file = True
             else:
-                # Get the timestamps shift for this video
-                timestamps_shift_s = dst_meta.info["total_frames"] / dst_meta.info["fps"]
-
-                # Append to existing video file
                 concatenate_video_files(
                     [dst_path, src_path],
                     dst_path,
                 )
-                # Update the latest_duration when appending (shifts timestamps!)
-                update_latest_duration = not update_latest_duration
 
         # Update the videos_idx with the final chunk and file indices for this key
         videos_idx[key]["chunk"] = chunk_idx
         videos_idx[key]["file"] = file_idx
 
-        if update_latest_duration:
-            videos_idx[key]["latest_duration"] += timestamps_shift_s
+        if rotated_to_new_file:
+            videos_idx[key]["latest_duration"] = 0
 
     return videos_idx
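
The placement logic in `aggregate_videos` can be hard to follow from the hunks alone, so here is a small simulation of the decision it makes for each source video: copy when the destination file does not exist yet, rotate to a new file when the combined size would reach the limit, and append otherwise. This is a sketch under assumptions, not the library code. It tracks sizes in a dictionary instead of touching video files, and `next_chunk_file` is a stand-in for `update_chunk_file_indices`, whose real behavior is not shown in this diff (it is assumed here to advance the file index and roll over to the next chunk after `chunk_size` files).

```python
def next_chunk_file(chunk_idx: int, file_idx: int, chunk_size: int) -> tuple[int, int]:
    """Assumed stand-in for update_chunk_file_indices: advance file, roll over chunks."""
    if file_idx + 1 >= chunk_size:
        return chunk_idx + 1, 0
    return chunk_idx, file_idx + 1


def place_videos(src_sizes_mb, max_file_size_mb, chunk_size, chunk_idx=0, file_idx=0):
    """Simulate copy / rotate / append decisions; returns (placements, rotated flag)."""
    dst_sizes = {}  # (chunk, file) -> accumulated size in MB
    placements = []
    rotated_to_new_file = False

    for src_size in src_sizes_mb:
        key = (chunk_idx, file_idx)
        if key not in dst_sizes:
            # Destination file does not exist yet: plain copy.
            dst_sizes[key] = src_size
            placements.append((key, "copy"))
            continue

        if dst_sizes[key] + src_size >= max_file_size_mb:
            # Too large to append: start a new destination file.
            chunk_idx, file_idx = next_chunk_file(chunk_idx, file_idx, chunk_size)
            key = (chunk_idx, file_idx)
            dst_sizes[key] = src_size
            placements.append((key, "rotate"))
            rotated_to_new_file = True
        else:
            # Fits: concatenate onto the existing destination file.
            dst_sizes[key] += src_size
            placements.append((key, "append"))

    return placements, rotated_to_new_file


placements, rotated = place_videos([40, 30, 50, 10], max_file_size_mb=100, chunk_size=2)
print(placements)
# [((0, 0), 'copy'), ((0, 0), 'append'), ((0, 1), 'rotate'), ((0, 1), 'append')]
print(rotated)  # True
```

In the diff, the `rotated_to_new_file` flag is what triggers resetting `latest_duration` to 0 for that video key, replacing the earlier `update_latest_duration` toggle.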