
FlowerVLA

Paper, Project Page, Finetuning Code

Moritz Reuss1, Hongyi Zhou1, Marcel Ruehle1, Ömer Erdinç Yağmurlu1, Fabian Otto2, Rudolf Lioutikov1

1Intuitive Robots Lab (IRL), Karlsruhe Institute of Technology (KIT) 2Microsoft Research

An efficient Vision-Language-Action Model for Robot Learning

FLOWER VLA is a lightweight, efficient Vision-Language-Action (VLA) policy for robotic manipulation that achieves state-of-the-art performance on multiple benchmarks. It is built on a rectified flow architecture with several key features:

  • Efficient Architecture: With fewer than 1B parameters, FLOWER is significantly smaller than other VLA models
  • Low Training Cost: Only requires ~200 GPU hours of pretraining
  • Low Memory Footprint: Uses <8GB of GPU memory for inference
  • SOTA Performance: Achieves state-of-the-art results on the CALVIN and LIBERO benchmarks

For the finetuning code for FLOWER on CALVIN and LIBERO, check out our other codebase: flower_vla_calvin

Installation

Requirements

  • Python 3.10
  • CUDA 11.8+
  • 24GB+ GPU memory for training (more is better)
  • 20GB+ disk space (most datasets can be loaded from Google Cloud)

Basic Setup

# Create conda environment
conda create -n flower python=3.10
conda activate flower

# Clone repository
git clone --recurse-submodules [email protected]:mbreuss/flower_vla.git
cd flower_vla

# Install requirements
pip install -r requirements_simpler.txt

Pretraining Guide

First, you need to choose a pretraining mix. Some datasets are not included in Google Cloud Storage and need to be loaded from local storage instead. Below are guides for the most important datasets and how to download them:

Dataset Preparation

Standard Datasets

Create a central dataset directory:

export DATA_DIR=~/tensorflow_datasets

Bridge Dataset

This is the recommended Bridge dataset from Berkeley, which is not part of OXE.

wget -r -np -nd -A '*' \
  https://rail.eecs.berkeley.edu/datasets/bridge_release/data/tfds/bridge_dataset/ \
  -P $DATA_DIR/bridge_dataset

BiPlay Dataset

BiPlay is a diverse bimanual ALOHA dataset; see its project page for details.

git lfs install
git clone https://huggingface.co/datasets/oier-mees/BiPlay \
  $DATA_DIR/aloha_play_dataset

Configuration Setup

FLOWER uses the Hugging Face accelerate library for efficient multi-GPU training. If you run it locally on a multi-GPU system, you can configure training with the following answers:

Accelerate Configuration

accelerate config

Example settings for 2-GPU training:

This machine
multi-GPU
1  # Number of machines
NO  # fp16
NO  # bf16
NO  # Gradient accumulation
NO  # Gradient clipping
NO  # CPU offload
2   # Number of GPUs
0,1 # GPU indices
yes # Use DDP
bf16 # Mixed precision type

For training on a Slurm cluster, we provide an example script used for pretraining FLOWER on 4 H100 GPUs. Note that setting a main process port is important for being able to download the required datasets from Google Cloud.

Training Configuration

Modify conf/training.yaml:

# Basic Training Settings
batch_size: 512  # Total batch size; higher is generally better
gradient_accumulation_steps: 4 # recommended for limited GPU memory settings to achieve larger effective batch sizes
max_train_steps: 500000
eval_every_n_steps: 10000  # runs a short validation-loss check for sanity. NOTE: the validation loss does not correlate with the evaluation success rate; it is normal for it to stagnate after some time while the model keeps improving.
max_eval_steps: 100 # how many batches to use for the validation loss

# Dataset Configuration
DATA_NAME: "trinity"  # data mix you want to use
DATA_PATH: "~/tensorflow_datasets"

# Optimization Settings
learning_rate_dit: 1e-4 # separate learning rates for the Flow Transformer and the VLM give the best results
learning_rate_vlm: 1e-5 # a lower lr for the VLM is crucial, while the higher one for the flow part also helps
weight_decay: 0.1 # high weight decay for the flow part and a low one for the VLM part

# Hardware Settings
num_workers: 8  # Adjust based on CPU cores
pin_memory: true 
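Under the usual accelerate/DDP semantics, the effective (global) batch size is the per-device batch multiplied by the gradient-accumulation steps and the number of GPUs. A minimal sketch of that arithmetic (the helper name is illustrative, not part of the codebase):

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    """Effective (global) batch size under DDP with gradient accumulation."""
    return per_device_batch * grad_accum_steps * num_gpus

# e.g. a per-device batch of 64 with 4 accumulation steps on 2 GPUs
# yields an effective batch size of 512
print(effective_batch_size(64, 4, 2))  # → 512
```

This is why gradient accumulation is recommended above: it lets you reach large effective batch sizes without holding the full batch in GPU memory at once.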

Training

Single Node Training

accelerate launch flower/training.py

Continue from checkpoint:

accelerate launch flower/training.py \
  +step=100 \
  +continue_training=/path/to/checkpoint_100

Multi-Node Training

# Node 1 (Master)
accelerate launch --multi_gpu --num_processes=2 \
  --main_process_ip="MASTER_IP" \
  --main_process_port=29500 \
  --num_machines=2 \
  --machine_rank=0 \
  flower/training.py

# Node 2
accelerate launch --multi_gpu --num_processes=2 \
  --main_process_ip="MASTER_IP" \
  --main_process_port=29500 \
  --num_machines=2 \
  --machine_rank=1 \
  flower/training.py

Enhanced Debugging

TensorFlow is a bit annoying to debug when adding new datasets and transforms. Therefore, use the debug_transforms.py script to get proper error messages.

export TORCH_DISTRIBUTED_DEBUG=DETAIL
python flower/test_dataloader.py
python flower/debug_transforms.py

Advanced Usage

You can create custom dataset mixes for pretraining and finetuning. The code for the OXE datasets is based on the code from Octo and OpenVLA.

Custom Dataset Mixes

Modify flower_vla/dataset/oxe/mixes.py:

CUSTOM_MIX = [
    ("bridge_dataset", 4.0),
    ("fractal20220817_data", 2.0),
    ("eef_droid", 0.2),
]
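The weight attached to each dataset controls how often it is sampled during training; conceptually, the weights are normalized into sampling probabilities. A rough sketch of that normalization (illustrative only; the actual sampling logic lives in the OXE data-loading code):

```python
def mix_to_probs(mix):
    """Normalize (dataset_name, weight) pairs into sampling probabilities."""
    total = sum(weight for _, weight in mix)
    return {name: weight / total for name, weight in mix}

CUSTOM_MIX = [
    ("bridge_dataset", 4.0),
    ("fractal20220817_data", 2.0),
    ("eef_droid", 0.2),
]

probs = mix_to_probs(CUSTOM_MIX)
# bridge_dataset dominates this mix: 4.0 / 6.2 ≈ 0.645
```

Weights are relative, so only their ratios matter; doubling every weight leaves the sampling distribution unchanged.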

Adding a new dataset

You need to handle several things to integrate a new dataset into the code:

  1. Define a dataset config in flower_vla/dataset/oxe/configs.py
  2. Define a transform for it in flower_vla/dataset/oxe/transforms.py
  3. Add the value for the frequency to flower_vla/dataset/utils/frequency_mapping.py
  4. Add it to the dataset index flower_vla/dataset/utils/dataset_index.py
  5. Add the desired action chunk length to flower_vla/dataset/utils/act_seq_mapping.py
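The five steps above can be sketched as follows. All names here (my_dataset, the config keys, the transform signature) are illustrative assumptions; check the existing entries in configs.py and transforms.py for the exact schema the codebase expects:

```python
# 1. Hypothetical config entry (flower_vla/dataset/oxe/configs.py);
#    the keys shown are assumptions, mirror an existing entry instead.
MY_DATASET_CONFIG = {
    "image_obs_keys": {"primary": "image"},
    "state_obs_keys": ["state"],
}

# 2. Hypothetical transform (flower_vla/dataset/oxe/transforms.py):
#    map the raw trajectory dict into the common observation/action format.
def my_dataset_transform(trajectory: dict) -> dict:
    trajectory["action"] = trajectory.pop("raw_action")
    return trajectory

# 3.-5. Register control frequency, dataset index, and action chunk length
#       in the respective mapping files (values below are made up).
FREQUENCY_MAPPING = {"my_dataset": 10}   # control frequency in Hz
DATASET_INDEX = {"my_dataset": 42}       # unique dataset id
ACT_SEQ_MAPPING = {"my_dataset": 8}      # action chunk length
```

Once all five registrations are in place, the dataset name can be referenced from a mix in flower_vla/dataset/oxe/mixes.py like any built-in dataset.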

Now you should be good to go. If you still encounter issues, use the debug_transforms.py script for testing. Otherwise, feel free to open an issue or write me an email.

Citation

If you found the code useful, please cite our work:

@inproceedings{
  reuss2025flower,
  title={{FLOWER}: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models},
  author={Moritz Reuss and Hongyi Zhou and Marcel R{\"u}hle and {\"O}mer Erdin{\c{c}} Ya{\u{g}}murlu and Fabian Otto and Rudolf Lioutikov},
  booktitle={9th Annual Conference on Robot Learning},
  year={2025},
  url={https://openreview.net/forum?id=JeppaebLRD}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This work is only possible because of the code from the following open-source projects and datasets:


About

[CoRL 2025] Pretraining code for FLOWER VLA on OXE
