huggingface · michel-aractingi · Oct 10, 2025 · Aug 12, 2025 · Oct 1, 2025 · Oct 1, 2025
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -25,6 +25,8 @@
     title: Using LeRobotDataset
   - local: porting_datasets_v3
     title: Porting Large Datasets
+  - local: using_dataset_tools
+    title: Using the Dataset Tools
   title: "Datasets"
 - sections:
   - local: act

diff --git a/docs/source/using_dataset_tools.mdx b/docs/source/using_dataset_tools.mdx
@@ -0,0 +1,102 @@
+# Using Dataset Tools
+
+This guide covers the dataset tools available in LeRobot for modifying and editing existing datasets.
+
+## Overview
+
+LeRobot provides several utilities for manipulating datasets:
+
+1. **Delete Episodes** - Remove specific episodes from a dataset
+2. **Split Dataset** - Divide a dataset into multiple smaller datasets
+3. **Merge Datasets** - Combine multiple datasets into one
+4. **Add Features** - Add new features to a dataset
+5. **Remove Features** - Remove features from a dataset
+
+The core implementation is in `lerobot.datasets.dataset_tools`.
+An example script detailing how to use the tools API is available in `examples/dataset/use_dataset_tools.py`.
+
+## Command-Line Tool: lerobot-edit-dataset
+
+`lerobot-edit-dataset` is a command-line script for editing datasets. It can be used to delete episodes, split datasets, merge datasets, and remove features.
+
+Run `lerobot-edit-dataset --help` for more information on the configuration of each operation.
+
+### Usage Examples
+
+#### Delete Episodes
+
+Remove specific episodes from a dataset. This is useful for filtering out undesired data.
+
+```bash
+# Delete episodes 0, 2, and 5 (modifies original dataset)
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type delete_episodes \
+    --operation.episode_indices "[0, 2, 5]"
+
+# Delete episodes and save to a new dataset (preserves original dataset)
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --new_repo_id lerobot/pusht_after_deletion \
+    --operation.type delete_episodes \
+    --operation.episode_indices "[0, 2, 5]"
+```
+
+#### Split Dataset
+
+Divide a dataset into multiple subsets.
+
+```bash
+# Split by fractions (e.g. 60% train, 20% test, 20% val)
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type split \
+    --operation.splits '{"train": 0.6, "test": 0.2, "val": 0.2}'
+
+# Split by specific episode indices
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type split \
+    --operation.splits '{"train": [0, 1, 2, 3], "val": [4, 5]}'
+```
+
+Resulting datasets are saved under the repo id with the split name appended, e.g. `lerobot/pusht_train`, `lerobot/pusht_test`, `lerobot/pusht_val`.
+
+#### Merge Datasets
+
+Combine multiple datasets into a single dataset.
+
+```bash
+# Merge train and validation splits back into one dataset
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht_merged \
+    --operation.type merge \
+    --operation.repo_ids "['lerobot/pusht_train', 'lerobot/pusht_val']"
+```
+
+#### Remove Features
+
+Remove features from a dataset.
+
+```bash
+# Remove a camera feature
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type remove_feature \
+    --operation.feature_names "['observation.images.top']"
+```
+
+### Push to Hub
+
+Add the `--push_to_hub` flag to any command to automatically upload the resulting dataset to the Hugging Face Hub:
+
+```bash
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --new_repo_id lerobot/pusht_after_deletion \
+    --operation.type delete_episodes \
+    --operation.episode_indices "[0, 2, 5]" \
+    --push_to_hub
+```
+
+Adding features is not yet supported by `lerobot-edit-dataset`; use `add_feature` from `lerobot.datasets.dataset_tools` instead.
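
The fraction-based split described in the guide can be reasoned about with a small standalone sketch. It is not the implementation in `lerobot.datasets.dataset_tools`: the helper name `plan_fraction_split`, the contiguous assignment of episodes, and the rule that fractions may not sum to more than 1.0 are all assumptions made for illustration.

```python
def plan_fraction_split(total_episodes: int, splits: dict[str, float]) -> dict[str, list[int]]:
    """Assign contiguous episode indices to each named split.

    Hypothetical helper: it only illustrates how fractions could map to
    episode indices and does not call into LeRobot.
    """
    total_fraction = sum(splits.values())
    # Assumption: fractions describe disjoint parts of one dataset, so they cannot exceed 1.0.
    if total_fraction > 1.0 + 1e-9:
        raise ValueError(f"Split fractions sum to {total_fraction:.2f}, which exceeds 1.0")

    plan: dict[str, list[int]] = {}
    start = 0
    for name, fraction in splits.items():
        # The small epsilon guards against float products landing just below an integer.
        count = int(total_episodes * fraction + 1e-9)
        plan[name] = list(range(start, start + count))
        start += count
    return plan


plan = plan_fraction_split(10, {"train": 0.6, "test": 0.2, "val": 0.2})
for name, episodes in plan.items():
    print(name, episodes)
```

Under these assumptions, episodes left over when the fractions sum to less than 1.0 are simply not assigned to any split.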
diff --git a/examples/dataset/use_dataset_tools.py b/examples/dataset/use_dataset_tools.py
@@ -0,0 +1,117 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Example script demonstrating dataset tools utilities.
+
+This script shows how to:
+1. Delete episodes from a dataset
+2. Split a dataset into train/val sets
+3. Add/remove features
+4. Merge datasets
+
+Usage:
+    python examples/dataset/use_dataset_tools.py
+"""
+
+import numpy as np
+
+from lerobot.datasets.dataset_tools import (
+    add_feature,
+    delete_episodes,
+    merge_datasets,
+    remove_feature,
+    split_dataset,
+)
+from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+
+def main():
+    dataset = LeRobotDataset("lerobot/pusht")
+
+    print(f"Original dataset: {dataset.meta.total_episodes} episodes, {dataset.meta.total_frames} frames")
+    print(f"Features: {list(dataset.meta.features.keys())}")
+
+    print("\n1. Deleting episodes 0 and 2...")
+    filtered_dataset = delete_episodes(dataset, episode_indices=[0, 2], repo_id="lerobot/pusht_filtered")
+    print(f"Filtered dataset: {filtered_dataset.meta.total_episodes} episodes")
+
+    print("\n2. Splitting dataset into train/val...")
+    splits = split_dataset(
+        dataset,
+        splits={"train": 0.8, "val": 0.2},
+    )
+    print(f"Train split: {splits['train'].meta.total_episodes} episodes")
+    print(f"Val split: {splits['val'].meta.total_episodes} episodes")
+
+    print("\n3. Adding a reward feature...")
+
+    reward_values = np.random.randn(dataset.meta.total_frames).astype(np.float32)
+    dataset_with_reward = add_feature(
+        dataset,
+        feature_name="reward",
+        feature_values=reward_values,
+        feature_info={
+            "dtype": "float32",
+            "shape": (1,),
+            "names": None,
+        },
+        repo_id="lerobot/pusht_with_reward",
+    )
+
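+    def compute_success(row_dict, episode_index, frame_index):
+        # Placeholder logic: with the hardcoded episode_length of 10 the threshold is 0,
+        # so every frame is labeled 1.0. Use the real episode length to flag only the last 10 frames.
+        episode_length = 10
+        return float(frame_index >= episode_length - 10)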
+
+    dataset_with_success = add_feature(
+        dataset_with_reward,
+        feature_name="success",
+        feature_values=compute_success,
+        feature_info={
+            "dtype": "float32",
+            "shape": (1,),
+            "names": None,
+        },
+        repo_id="lerobot/pusht_with_reward_and_success",
+    )
+
+    print(f"New features: {list(dataset_with_success.meta.features.keys())}")
+
+    print("\n4. Removing the success feature...")
+    dataset_cleaned = remove_feature(
+        dataset_with_success, feature_names="success", repo_id="lerobot/pusht_cleaned"
+    )
+    print(f"Features after removal: {list(dataset_cleaned.meta.features.keys())}")
+
+    print("\n5. Merging train and val splits back together...")
+    merged = merge_datasets([splits["train"], splits["val"]], output_repo_id="lerobot/pusht_merged")
+    print(f"Merged dataset: {merged.meta.total_episodes} episodes")
+
+    print("\n6. Complex workflow example...")
+
+    if len(dataset.meta.camera_keys) > 1:
+        camera_to_remove = dataset.meta.camera_keys[0]
+        print(f"Removing camera: {camera_to_remove}")
+        dataset_no_cam = remove_feature(
+            dataset, feature_names=camera_to_remove, repo_id="lerobot/pusht_no_first_camera"
+        )
+        print(f"Remaining cameras: {dataset_no_cam.meta.camera_keys}")
+
+    print("\nDone! Check ~/.cache/huggingface/lerobot/ for the created datasets.")
+
+
+if __name__ == "__main__":
+    main()
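
The `compute_success` callable in the example script uses a hardcoded episode length, so its threshold is 0 and every frame is labeled as successful. One possible variant, assuming episode lengths are known up front, closes over a mapping from episode index to length. Only the `(row_dict, episode_index, frame_index)` signature comes from the example; `make_tail_success_fn` and `episode_lengths` are hypothetical names, and how the lengths are obtained from a dataset is left open.

```python
from typing import Callable


def make_tail_success_fn(
    episode_lengths: dict[int, int], tail: int = 10
) -> Callable[[dict, int, int], float]:
    """Build a per-frame callable that labels the last `tail` frames of each episode as 1.0.

    Hypothetical helper: `episode_lengths` maps episode index -> number of frames.
    """

    def compute_success(row_dict: dict, episode_index: int, frame_index: int) -> float:
        length = episode_lengths[episode_index]
        # Frames at or after (length - tail) count as success; shorter episodes are all success.
        return float(frame_index >= length - tail)

    return compute_success


# Toy lengths for illustration only; they are not read from a real dataset.
episode_lengths = {0: 25, 1: 8}
compute_success = make_tail_success_fn(episode_lengths, tail=10)
labels_ep0 = [compute_success({}, 0, i) for i in range(episode_lengths[0])]
labels_ep1 = [compute_success({}, 1, i) for i in range(episode_lengths[1])]
print(sum(labels_ep0), sum(labels_ep1))
```

The resulting callable has the same shape as the one passed to `add_feature` via `feature_values` in the script, so it could be swapped in there once real episode lengths are available.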
diff --git a/pyproject.toml b/pyproject.toml
@@ -67,7 +67,7 @@ dependencies = [
     "cmake>=3.29.0.1,<4.2.0",
     "einops>=0.8.0,<0.9.0",
     "opencv-python-headless>=4.9.0,<4.13.0",
-    "av>=14.2.0,<16.0.0",
+    "av>=15.0.0,<16.0.0",
     "jsonlines>=4.0.0,<5.0.0",
     "packaging>=24.2,<26.0",
     "pynput>=1.7.7,<1.9.0",
@@ -175,6 +175,7 @@ lerobot-dataset-viz="lerobot.scripts.lerobot_dataset_viz:main"
 lerobot-info="lerobot.scripts.lerobot_info:main"
 lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
 lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
+lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
 
 # ---------------- Tool Configurations ----------------
 [tool.setuptools.packages.find]

diff --git a/src/lerobot/datasets/aggregate.py b/src/lerobot/datasets/aggregate.py
@@ -236,6 +236,9 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
     Returns:
         dict: Updated videos_idx with current chunk and file indices.
     """
+    for key in videos_idx:
+        videos_idx[key]["episode_duration"] = src_meta.total_frames / src_meta.fps
+
     for key, video_idx in videos_idx.items():
         unique_chunk_file_pairs = {
             (chunk, file)
@@ -250,6 +253,8 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
         chunk_idx = video_idx["chunk"]
         file_idx = video_idx["file"]
 
+        rotated_to_new_file = False
+
         for src_chunk_idx, src_file_idx in unique_chunk_file_pairs:
             src_path = src_meta.root / DEFAULT_VIDEO_PATH.format(
                 video_key=key,
@@ -263,21 +268,15 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
                 file_index=file_idx,
             )
 
-            # If a new file is created, we don't want to increment the latest_duration
-            update_latest_duration = False
-
             if not dst_path.exists():
-                # First write to this destination file
                 dst_path.parent.mkdir(parents=True, exist_ok=True)
                 shutil.copy(str(src_path), str(dst_path))
-                continue  # not accumulating further, already copied the file in place
+                continue
 
-            # Check file sizes before appending
             src_size = get_video_size_in_mb(src_path)
             dst_size = get_video_size_in_mb(dst_path)
 
             if dst_size + src_size >= video_files_size_in_mb:
-                # Rotate to a new chunk/file
                 chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, chunk_size)
                 dst_path = dst_meta.root / DEFAULT_VIDEO_PATH.format(
                     video_key=key,
@@ -286,24 +285,19 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
                 )
                 dst_path.parent.mkdir(parents=True, exist_ok=True)
                 shutil.copy(str(src_path), str(dst_path))
+                rotated_to_new_file = True
             else:
-                # Get the timestamps shift for this video
-                timestamps_shift_s = dst_meta.info["total_frames"] / dst_meta.info["fps"]
-
                 # Append to existing video file
                 concatenate_video_files(
                     [dst_path, src_path],
                     dst_path,
                 )
-                # Update the latest_duration when appending (shifts timestamps!)
-                update_latest_duration = not update_latest_duration
 
-        # Update the videos_idx with the final chunk and file indices for this key
         videos_idx[key]["chunk"] = chunk_idx
         videos_idx[key]["file"] = file_idx
 
-        if update_latest_duration:
-            videos_idx[key]["latest_duration"] += timestamps_shift_s
+        if rotated_to_new_file:
+            videos_idx[key]["latest_duration"] = 0
 
     return videos_idx
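
The branch structure of the loop in `aggregate_videos` (copy when the destination file does not exist, rotate to a new file when the combined size reaches the limit, otherwise concatenate) can be simulated without touching any video files. The sketch below is a simplified model under stated assumptions: `plan_video_placement` is a hypothetical function, sizes are plain numbers in MB, and `next_chunk_file` is an assumed stand-in for `update_chunk_file_indices`, whose real behavior is not shown in this diff.

```python
def next_chunk_file(chunk_idx: int, file_idx: int, chunk_size: int) -> tuple[int, int]:
    """Assumed stand-in for update_chunk_file_indices: roll to the next chunk when the current one is full."""
    if file_idx == chunk_size - 1:
        return chunk_idx + 1, 0
    return chunk_idx, file_idx + 1


def plan_video_placement(
    src_sizes_mb: list[float], max_file_size_mb: float, chunk_size: int
) -> list[tuple[str, int, int]]:
    """Return (action, chunk_idx, file_idx) for each source file, in order."""
    chunk_idx, file_idx = 0, 0
    dst_sizes: dict[tuple[int, int], float] = {}  # simulated destination files
    placements = []
    for src_size in src_sizes_mb:
        key = (chunk_idx, file_idx)
        if key not in dst_sizes:
            # Destination does not exist yet: plain copy.
            dst_sizes[key] = src_size
            action = "copy"
        elif dst_sizes[key] + src_size >= max_file_size_mb:
            # Appending would reach the size limit: rotate to a new chunk/file.
            chunk_idx, file_idx = next_chunk_file(chunk_idx, file_idx, chunk_size)
            dst_sizes[(chunk_idx, file_idx)] = src_size
            action = "rotate"
        else:
            # Enough room left: append to the existing destination file.
            dst_sizes[key] += src_size
            action = "append"
        placements.append((action, chunk_idx, file_idx))
    return placements


placements = plan_video_placement([40, 30, 50, 10], max_file_size_mb=100, chunk_size=2)
for placement in placements:
    print(placement)
```

In this model the "rotate" action corresponds to the branch where the diff sets `rotated_to_new_file = True`, after which `latest_duration` for that video key is reset to 0.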