# Qwen-GR00T: Training, Inference, and Serving

This guide covers how to train, run inference, and serve Qwen-GR00T models using FlagScale. Qwen-GR00T pairs a Qwen-VL vision-language backbone (Qwen3-VL or Qwen2.5-VL) with a DiT-based flow matching action head.

## Installation

### Clone Repository

```sh
git clone https://github.com/FlagOpen/FlagScale.git
cd FlagScale/
```

### Setup Conda Environment

Create a new conda environment for robotics training:

```sh
conda create -n flagos-robo python=3.12
conda activate flagos-robo
```

Install FlagScale and training dependencies:

```sh
cd FlagScale/
pip install ".[cuda-train]" --verbose
```

Install additional dependencies for downloading models/datasets:

```sh
# For HuggingFace Hub
pip install huggingface_hub

# For ModelScope (optional)
pip install modelscope
```

## Download Models

Download the base VLM. Qwen-GR00T supports Qwen3-VL and Qwen2.5-VL as the VLM backbone:

**Using HuggingFace Hub:**

```sh
cd FlagScale/
python examples/pi0/download.py \
    --repo_id Qwen/Qwen3-VL-4B-Instruct \
    --output_dir /workspace/models \
    --source huggingface
```

**Using ModelScope:**

```sh
cd FlagScale/
python examples/pi0/download.py \
    --repo_id Qwen/Qwen3-VL-4B-Instruct \
    --output_dir /workspace/models \
    --source modelscope
```

With `--output_dir /workspace/models`, the model is downloaded to:
- `/workspace/models/Qwen/Qwen3-VL-4B-Instruct`

## Training

### Prepare Dataset

FlagScale uses the **LeRobotDataset v3.0** format. For detailed information about the format structure, see the [LeRobotDataset v3.0 documentation](https://huggingface.co/docs/lerobot/en/lerobot-dataset-v3).

For example, to download the `libero_goal` dataset:

**Using HuggingFace Hub:**

```sh
cd FlagScale/
python examples/pi0/download.py \
    --repo_id IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot \
    --output_dir /workspace/datasets \
    --repo_type dataset \
    --source huggingface
```

**Using ModelScope:**

```sh
cd FlagScale/
python examples/pi0/download.py \
    --repo_id IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot \
    --output_dir /workspace/datasets \
    --repo_type dataset \
    --source modelscope
```

With `--output_dir /workspace/datasets`, the dataset is downloaded to:
- `/workspace/datasets/IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot`
### Edit Config

FlagScale uses a two-level configuration system:

1. **Experiment-level config** (`examples/qwen_gr00t/conf/train.yaml`): Defines experiment settings, environment variables, and resource allocation
2. **Task-level config** (`examples/qwen_gr00t/conf/train/qwen_gr00t.yaml`): Defines model, dataset, and training hyperparameters

#### Experiment-Level Config

Edit the experiment-level config for multi-GPU training:

```sh
cd FlagScale/
vim examples/qwen_gr00t/conf/train.yaml
```

Configure the following fields:

- `experiment.envs.CUDA_VISIBLE_DEVICES` - GPU devices to use (default: `"0,1,2,3,4,5,6,7"` for 8 GPUs)
- `experiment.envs.CUDA_DEVICE_MAX_CONNECTIONS` - Connection limit (typically `1`)
- `experiment.exp_name` - Experiment name
- `experiment.exp_dir` - Output directory for checkpoints and logs

#### Task-Level Config

Edit the task-level config for model and training settings:

```sh
cd FlagScale/
vim examples/qwen_gr00t/conf/train/qwen_gr00t.yaml
```

Configure the following fields:

**System settings** (training hyperparameters):
- `system.batch_size` - Batch size per GPU (default: `16`)
- `system.train_steps` - Total training steps (default: `30000`)
- `system.grad_clip_norm` - Gradient clipping norm (default: `1.0`)
- `system.use_amp` - Whether to use automatic mixed precision (default: `true`)
- `system.shuffle` - Whether to shuffle training data (default: `true`)
- `system.num_workers` - Number of data loading workers (default: `4`)
- `system.checkpoint.save_checkpoint` - Whether to save checkpoints (default: `true`)
- `system.checkpoint.save_freq` - Steps between checkpoints (default: `1000`)
- `system.checkpoint.output_directory` - Checkpoint output directory (default: `${experiment.exp_dir}`)

**Model settings**:
- `model.model_name` - Model name: `"qwen_gr00t"`
- `model.checkpoint_dir` - Path to the pretrained base VLM model (e.g., `/workspace/models/Qwen/Qwen3-VL-4B-Instruct`)
- `model.vlm.type` - VLM backbone type: `"qwen3-vl"` or `"qwen2.5-vl"`
- `model.qwenvl.base_vlm` - Path to the base VLM (same as `model.checkpoint_dir`)
- `model.qwenvl.attn_implementation` - Attention implementation (default: `"flash_attention_2"`)
- `model.qwenvl.vl_hidden_dim` - VLM hidden dimension (default: `2048`)
- `model.dino.dino_backbone` - DINOv2 backbone variant (default: `"dinov2_vits14"`)
- `model.action_model.use_state` - Whether to condition the action model on proprioceptive state (default: `false`)
- `model.action_model.type` - Action model type (default: `"flow_matching"`)
- `model.action_model.action_model_type` - DiT variant (default: `"DiT-B"`)
- `model.action_model.action_dim` - Action dimension (default: `7`)
- `model.action_model.state_dim` - State dimension (default: `7`)
- `model.action_model.future_action_window_size` - Future action window (default: `7`)
- `model.action_model.action_horizon` - Action horizon (default: `8`)
- `model.action_model.num_inference_timesteps` - Inference diffusion steps (default: `4`)
- `model.reduce_in_full_precision` - Whether to reduce gradients in FP32 (default: `true`)
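
To give intuition for `num_inference_timesteps`: a flow matching head produces an action by integrating a learned velocity field from noise toward the action over a few Euler steps. The toy sketch below (plain Python, not FlagScale code; the linear velocity field stands in for the DiT) shows the mechanics:

```python
import math

def euler_flow_sampling(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action)."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x

# Toy velocity field whose integral drives x toward a fixed "action".
target = [0.5, -0.2, 0.1]
velocity = lambda x, t: [ti - xi for ti, xi in zip(target, x)]

action = euler_flow_sampling(velocity, x0=[0.0, 0.0, 0.0], num_steps=4)
```

More steps trade latency for integration accuracy; the default of `4` reflects how few steps flow matching typically needs at inference.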

**Optimizer settings**:
- `model.optimizer.name` - Optimizer name (default: `"AdamW"`)
- `model.optimizer.lr` - Base learning rate (default: `2.5e-5`)
- `model.optimizer.betas` - Optimizer betas (default: `[0.9, 0.95]`)
- `model.optimizer.eps` - Optimizer epsilon (default: `1.0e-8`)
- `model.optimizer.weight_decay` - Weight decay (default: `1.0e-8`)
- `model.optimizer.param_groups` - Per-module learning rates:
  ```yaml
  param_groups:
    vlm:
      lr: 1.0e-05
    action_model:
      lr: 1.0e-04
  ```
- `model.optimizer.scheduler.name` - Scheduler name (default: `"cosine_with_min_lr"`)
- `model.optimizer.scheduler.warmup_steps` - Warmup steps (default: `5000`)
- `model.optimizer.scheduler.scheduler_kwargs.min_lr` - Minimum learning rate (default: `1.0e-6`)
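
For reference, `cosine_with_min_lr` with warmup is commonly defined as linear warmup followed by cosine decay to the floor (this is the usual definition, e.g. in `transformers`; FlagScale's exact implementation may differ):

```python
import math

def lr_at_step(step, base_lr=2.5e-5, min_lr=1.0e-6,
               warmup_steps=5000, total_steps=30000):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine
```

With the defaults above, the rate climbs to `2.5e-5` over the first 5000 steps and decays smoothly to `1.0e-6` by step 30000.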

**Module freezing** (optional):
```yaml
model:
  freeze:
    # Freeze VLM, train only action head
    freeze_patterns:
      - "qwen_vl_interface\\..*"
    # Optionally keep specific modules trainable
    keep_patterns:
      - "qwen_vl_interface\\.model\\.visual\\.merger\\..*"
```
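
The patterns are regular expressions matched against parameter names, with `keep_patterns` taking precedence over `freeze_patterns`. A stand-alone illustration of that precedence (plain `re`; the matching semantics and the example parameter names are assumptions, not FlagScale's actual loop):

```python
import re

freeze_patterns = [r"qwen_vl_interface\..*"]
keep_patterns = [r"qwen_vl_interface\.model\.visual\.merger\..*"]

def requires_grad(param_name):
    """A parameter trains unless a freeze pattern matches it and no keep pattern does."""
    if any(re.fullmatch(p, param_name) for p in keep_patterns):
        return True
    return not any(re.fullmatch(p, param_name) for p in freeze_patterns)

# Frozen: VLM weights; trainable: the visual merger and the action head.
frozen = requires_grad("qwen_vl_interface.model.language_model.layers.0.mlp.weight")
merger = requires_grad("qwen_vl_interface.model.visual.merger.mlp.0.weight")
head = requires_grad("action_model.net.blocks.0.attn.qkv.weight")
```

Note the double backslashes in the YAML above: they escape the `.` so it matches a literal dot rather than any character.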

**Data settings**:
- `data.data_path` - Path to LeRobot dataset root (e.g., `/workspace/datasets/IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot`)
- `data.vla_data.data_mix` - Dataset mix name (e.g., `"libero_goal_old"`)
- `data.vla_data.action_type` - Action type (e.g., `"delta_qpos"`)
- `data.vla_data.default_image_resolution` - Image resolution `[C, H, W]` (default: `[3, 224, 224]`)
- `data.vla_data.obs` - Observation image keys (default: `["image_0"]`)
- `data.observation_delta_indices` - Observation delta indices (default: `[0]`)
- `data.action_delta_indices` - Action delta indices (default: `[0,1,2,3,4,5,6,7]`)
- `data.preprocessor` - Preprocessor pipeline configuration
- `data.postprocessor` - Postprocessor pipeline configuration
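
Taken together, a minimal task-level config for the `libero_goal` example might look like this (illustrative paths; keep the remaining fields from the shipped `qwen_gr00t.yaml`):

```yaml
system:
  batch_size: 16
  train_steps: 30000

model:
  model_name: qwen_gr00t
  checkpoint_dir: /workspace/models/Qwen/Qwen3-VL-4B-Instruct
  vlm:
    type: qwen3-vl
  qwenvl:
    base_vlm: /workspace/models/Qwen/Qwen3-VL-4B-Instruct

data:
  data_path: /workspace/datasets/IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot
  vla_data:
    data_mix: libero_goal_old
    action_type: delta_qpos
```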

### Start Training
```sh
cd FlagScale/
flagscale train qwen_gr00t -c ./examples/qwen_gr00t/conf/train.yaml
```

Training logs are saved to `outputs/<exp_name>/logs/host_0_localhost.output` by default.

Checkpoints are saved to `${experiment.exp_dir}/checkpoints`.

### Stop Training
```sh
cd FlagScale/
flagscale train qwen_gr00t --stop
```

## Inference

### Prepare Inference Inputs

You can extract inference inputs (images, state, task) from a dataset using the provided script:

```sh
cd FlagScale/
python examples/pi0/dump_dataset_inputs.py \
    --dataset_root /workspace/datasets/IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot \
    --output_dir ./qwen_gr00t_inference_inputs \
    --frame_index 100
```

This will create:
- `frame_100_observation_images_*.jpg` - Image files
- `frame_100_state.pt` - State tensor
- `frame_100_task.txt` - Task prompt
- `extraction_summary.json` - Summary of extracted files

### Edit Config

```sh
cd FlagScale/
vim examples/qwen_gr00t/conf/inference/qwen_gr00t.yaml
```

Configure the following fields:

**Engine settings:**
- `engine.model_variant` - Model variant (default: `"QwenGr00t"`)
- `engine.model` - Path to trained checkpoint (e.g., `/workspace/outputs/qwen_gr00t_train/checkpoints/last`)
- `engine.device` - Device to use (e.g., `"cuda"`)

**Generate settings:**
- `generate.images` - Dictionary mapping image keys to file paths:
  ```yaml
  images:
    observation.images.wrist_image: /path/to/wrist_image.jpg
    observation.images.image: /path/to/image.jpg
  ```
- `generate.state_path` - Path to state tensor file (`.pt` file)
- `generate.task_path` - Path to task prompt file (`.txt` file)
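
A filled-in sketch using the inputs dumped in the previous step (the image key and exact filenames are illustrative; take the real ones from `extraction_summary.json`):

```yaml
engine:
  model_variant: QwenGr00t
  model: /workspace/outputs/qwen_gr00t_train/checkpoints/last
  device: cuda

generate:
  images:
    observation.images.image: ./qwen_gr00t_inference_inputs/frame_100_observation_images_image.jpg
  state_path: ./qwen_gr00t_inference_inputs/frame_100_state.pt
  task_path: ./qwen_gr00t_inference_inputs/frame_100_task.txt
```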

### Run Inference

```sh
cd FlagScale/
flagscale inference qwen_gr00t -c ./examples/qwen_gr00t/conf/inference.yaml
```

Inference logs are saved to `outputs/qwen_gr00t_inference/inference_logs/host_0_localhost.output` by default.

The predicted action tensor is printed to the console and saved in the log file.

## Serving

### Edit Config

```sh
cd FlagScale/
vim examples/qwen_gr00t/conf/serve/qwen_gr00t.yaml
```

Configure the following fields:

**Engine arguments:**
- `engine_args.host` - Server host (default: `"0.0.0.0"`)
- `engine_args.port` - Server port (default: `5000`)
- `engine_args.model_variant` - Model variant (default: `"QwenGr00t"`)
- `engine_args.model` - Path to trained checkpoint (e.g., `/workspace/outputs/qwen_gr00t_train/checkpoints/last`)
- `engine_args.device` - Device to use (e.g., `"cuda"`)
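
Combined, a minimal serve config might look like this (checkpoint path is illustrative; keep any remaining fields from the shipped `qwen_gr00t.yaml`):

```yaml
engine_args:
  host: "0.0.0.0"
  port: 5000
  model_variant: QwenGr00t
  model: /workspace/outputs/qwen_gr00t_train/checkpoints/last
  device: cuda
```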

### Run Serving

```sh
cd FlagScale/
flagscale serve qwen_gr00t -c ./examples/qwen_gr00t/conf/serve.yaml
```

Serving logs are saved to `outputs/<exp_name>/logs/host_0_localhost.output` by default.

### Stop Serving

```sh
cd FlagScale/
flagscale serve qwen_gr00t --stop
```