Add script to generate embedding for dataset #2138
base: main
Conversation
```python
    if isinstance(img, torch.Tensor):
        if img.ndim == 4:
            # Shape: (T, C, H, W) where T=1 for single timestamp
            img = img[0]  # Now (C, H, W)
        elif img.ndim == 3:
            # Shape: (C, H, W)
            pass
        else:
            raise ValueError(
                f"Unexpected video frame shape {img.shape} for camera {cam_key}. "
                f"Expected (T, C, H, W) or (C, H, W). Episode {ep_idx}, Frame {frame_idx}"
            )

        # Convert to numpy: (C, H, W) float32 [0, 1] -> (H, W, C) uint8 [0, 255]
        img_np = (img.permute(1, 2, 0).numpy() * 255).astype(np.uint8)
    else:
        img_np = np.array(img)
else:
    # Load from image file
    img = item[cam_key]
    # Convert to numpy if needed
    if isinstance(img, torch.Tensor):
        if img.ndim == 3:
            img_np = (img.permute(1, 2, 0).numpy() * 255).astype(np.uint8)
        else:
            raise ValueError(f"Unexpected image shape {img.shape} for camera {cam_key}")
    else:
        img_np = np.array(img)
```
Should we add testing for this part? This logic could be buggy if we have image datasets instead of videos. Better if we had tests to verify it.
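A minimal pytest sketch of such a test, assuming the conversion logic above is factored into a helper (the name `frame_to_numpy` is hypothetical, not from this PR):

```python
import numpy as np
import pytest
import torch


def frame_to_numpy(img: torch.Tensor) -> np.ndarray:
    """Hypothetical helper wrapping the conversion logic from the diff above."""
    if img.ndim == 4:  # (T, C, H, W) video frame, T=1
        img = img[0]
    elif img.ndim != 3:  # anything other than (C, H, W) is unsupported
        raise ValueError(f"Unexpected shape {img.shape}")
    return (img.permute(1, 2, 0).numpy() * 255).astype(np.uint8)


@pytest.mark.parametrize("shape", [(1, 3, 48, 64), (3, 48, 64)])
def test_frame_to_numpy_accepts_video_and_image_tensors(shape):
    img = torch.rand(shape)  # float32 in [0, 1], as the dataset produces
    out = frame_to_numpy(img)
    assert out.shape == (48, 64, 3)
    assert out.dtype == np.uint8


def test_frame_to_numpy_rejects_bad_shapes():
    with pytest.raises(ValueError):
        frame_to_numpy(torch.rand(48, 64))  # 2D input is not supported
```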
```python
self.batch_size = batch_size
self.model_name = model_name
logger.info(f"Loading DinoV2 model: {model_name}")
self.model = torch.hub.load("facebookresearch/dinov2", model_name)  # nosec B614
```
Is there a reason why you didn't use AutoModel from transformers here as well? We do it for the SAC encoder:

```python
self.image_enc_layers = AutoModel.from_pretrained(config.vision_encoder_name, trust_remote_code=True)
```

Something like:

```python
self.model = AutoModel.from_pretrained("facebook/dinov2-base")
```
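For reference, a minimal sketch of the transformers route (note the Hub id uses a hyphen, `facebook/dinov2-base`; `img_np` stands in for an (H, W, C) uint8 frame as produced above):

```python
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

inputs = processor(images=img_np, return_tensors="pt")  # resizes + normalizes
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0]  # CLS token as the image embedding
```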
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
```
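The body is elided in the diff view; a standard implementation (not necessarily the one in this PR) would be:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```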
nice
```python
for (chunk_idx, file_idx), frame_indices in tqdm(frames_by_file.items(), desc="Updating parquet files"):
    parquet_path = output_dataset.root / DEFAULT_DATA_PATH.format(
        chunk_index=chunk_idx, file_index=file_idx
    )

    # Load the parquet file
    df = pd.read_parquet(parquet_path)

    # Add embedding columns
    df["task_embedding"] = [all_task_embeddings[idx].tolist() for idx in frame_indices]

    for cam_key in dataset.meta.camera_keys:
        df[f"{cam_key}_embedding"] = [
            image_embeddings_dict[cam_key][idx].tolist() for idx in frame_indices
        ]

    # Save the updated parquet file
    df.to_parquet(parquet_path, index=False)
```
For this chunk I suggest doing it in a more efficient way, e.g. by using dataset_tools; I'll push a PR. Reading and writing the whole parquet file to disk will be slow and result in memory explosion for huge datasets.
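Until that PR lands, one lower-memory option is to stream the file in row-group batches with pyarrow instead of doing a full pandas round-trip. A minimal sketch, with a hypothetical helper that assumes the per-file embeddings are already ordered to match the file's rows:

```python
from pathlib import Path

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def append_embedding_column(parquet_path: Path, column_name: str, embeddings: np.ndarray) -> None:
    """Append one embedding column, streaming batches instead of loading the whole file."""
    source = pq.ParquetFile(parquet_path)
    tmp_path = parquet_path.with_suffix(".tmp.parquet")
    writer = None
    offset = 0
    for batch in source.iter_batches():
        table = pa.Table.from_batches([batch])
        # Slice the embeddings aligned with this batch's rows
        col = pa.array(embeddings[offset : offset + batch.num_rows].tolist())
        table = table.append_column(column_name, col)
        if writer is None:
            writer = pq.ParquetWriter(tmp_path, table.schema)
        writer.write_table(table)
        offset += batch.num_rows
    if writer is not None:
        writer.close()
        tmp_path.replace(parquet_path)  # atomic swap into place
```

This keeps only one batch (plus its embedding slice) in memory at a time, at the cost of rewriting the file once per added column unless the columns are appended together.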
This PR introduces a way to generate image and text embeddings for a dataset, to make training on a dataset for multiple epochs more efficient. For example, for learning a general reward we combine a specific dataset with OXE to improve generalization. In order to not recompute the image and text embeddings each time we finetune on OXE, we can use this script to add the embeddings to the dataset. We can additionally remove the videos in the dataset to save space.
Testing:
Both the generate and validate scripts were tested on this dataset: lerobot/utokyo_xarm_bimanual. The generated dataset can be found here: pepijn223/utokyo_xarm_bimanual_embeddings.