Hi, thank you for releasing OpenVLA‑OFT – this repository is extremely helpful.
I have a question about the intended data interface for fine‑tuning, especially regarding datasets that are not converted to RLDS.
### Background
I have demonstration data generated with Isaac GR00T (v1.6 / N1.6), structured in a LeRobot v2–style format, roughly as:
```
.
├─ meta
│  ├─ episodes.jsonl
│  ├─ modality.json   # -> GR00T LeRobot specific
│  ├─ info.json
│  └─ tasks.jsonl
├─ videos
│  └─ chunk-000
│     └─ observation.images.front
│        ├─ episode_000000.mp4
│        └─ episode_000001.mp4
└─ data
   └─ chunk-000
      ├─ episode_000000.parquet
      └─ episode_000001.parquet
```
Each episode contains:
- RGB images (front camera)
- robot state / proprio (stored in parquet)
- continuous actions
- a task description (language)
This format is compatible with how GR00T / LeRobot v2 store demonstrations.
### What I tried
I attempted to fine‑tune OpenVLA‑OFT by:
- writing a custom PyTorch `Dataset` / `DataLoader`
- producing batches shaped like:
  - `observation = { full_image, state, task_description }`
  - `actions` = (current + future action chunk)
- using action chunking + L1 regression, following `vla-scripts/finetune.py`
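For reference, the batch interface I implemented looks roughly like this. This is a minimal sketch with dummy data; the class name, dict keys, and `chunk_len` parameter are my own naming, not something taken from the repo:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class GrootEpisodeDataset(Dataset):
    """Sketch: yields one sample per timestep, with an action chunk
    covering the current + future actions (hypothetical interface)."""

    def __init__(self, episodes, chunk_len=8):
        # episodes: list of dicts with "images" (T,H,W,3), "state" (T,D_s),
        # "actions" (T,D_a), "task" (str) -- e.g. decoded from parquet/mp4
        self.episodes = episodes
        self.chunk_len = chunk_len
        self.index = [(e, t) for e, ep in enumerate(episodes)
                      for t in range(len(ep["actions"]))]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        e, t = self.index[i]
        ep = self.episodes[e]
        T = len(ep["actions"])
        # pad the chunk at the episode boundary by repeating the last action
        idxs = np.minimum(np.arange(t, t + self.chunk_len), T - 1)
        return {
            "observation": {
                "full_image": torch.from_numpy(ep["images"][t]),
                "state": torch.from_numpy(ep["state"][t]),
                "task_description": ep["task"],
            },
            "actions": torch.from_numpy(ep["actions"][idxs]),  # (chunk_len, D_a)
        }

# toy episode just to show the shapes
ep = {
    "images": np.zeros((10, 224, 224, 3), dtype=np.uint8),
    "state": np.zeros((10, 7), dtype=np.float32),
    "actions": np.zeros((10, 7), dtype=np.float32),
    "task": "pick up the block",
}
ds = GrootEpisodeDataset([ep], chunk_len=8)
sample = ds[9]
print(sample["actions"].shape)  # torch.Size([8, 7])
```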
However, this required significant patching of:
- `experiments/robot/openvla_utils.py`
- assumptions around RLDS, `dataset_statistics.json`, and pretrained action heads
At this point I’m unsure whether:
- this approach is supported but undocumented, or
- fine‑tuning is intentionally designed around RLDS only, and non‑RLDS inputs are out of scope.
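(For context, when patching around `dataset_statistics.json` I recomputed normalization statistics myself along these lines. This is a sketch with made-up data, and I'm assuming the file holds per-dimension mean/std and 1st/99th percentiles used for [-1, 1] normalization, which is what it appeared to contain; please correct me if that's wrong:)

```python
import json
import numpy as np

# all actions from all episodes, stacked into (N, D_a) -- dummy data here
actions = np.random.default_rng(0).normal(size=(1000, 7)).astype(np.float32)

stats = {
    "action": {
        "mean": actions.mean(axis=0).tolist(),
        "std": actions.std(axis=0).tolist(),
        "q01": np.quantile(actions, 0.01, axis=0).tolist(),
        "q99": np.quantile(actions, 0.99, axis=0).tolist(),
    }
}
print(json.dumps(stats["action"]["mean"]))

# normalize to [-1, 1] using the q01/q99 bounds (my assumption of the scheme)
q01 = np.array(stats["action"]["q01"])
q99 = np.array(stats["action"]["q99"])
normalized = np.clip(2 * (actions - q01) / (q99 - q01) - 1, -1, 1)
```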
### My question
What is the officially intended / recommended way to fine‑tune OpenVLA‑OFT with datasets that are not in RLDS format?
Specifically:
- Is converting datasets (like GR00T / LeRobot v2) to RLDS considered the expected path?
- Or is there a supported way to plug in a custom loader that outputs observation dicts + action chunks?
- If the latter is possible, is there an example or minimal reference for the expected batch format?
I fully understand if RLDS is the only supported interface – I mainly want to confirm the design intent so I can align with it instead of fighting the codebase.
Thanks again for your work, and for any guidance!