
Commit bcf923d

[Model] Add Qwen-GR00T VLA model (flagos-ai#1120)
### PR Category

Train/Inference/Serve

### PR Types

New Features

### PR Description

- Add a VLA model framework (`flagscale/models/vla/`) that supports composing a VLM backbone with an action model through a registry
- Add Qwen-GR00T, with support for FSDP2 distributed training, single-GPU inference, and WebSocket-based serving
- Tested on LIBERO-Goal, achieving results comparable to StarVLA (baseline 97.4%)

| Task | flagscale_zero2 | flagscale_fsdp_uniform_bf16 | flagscale_fsdp_uniform_bf16_without_hardcode_images_order |
| -- | -- | -- | -- |
| 0: open middle drawer | 50/50 | 50/50 | 50/50 |
| 1: put bowl on stove | 50/50 | 49/50 | 47/50 |
| 2: wine bottle on top of cabinet | 48/50 | 45/50 | 46/50 |
| 3: open top drawer, put bowl inside | 49/50 | 48/50 | 45/50 |
| 4: put bowl on top of cabinet | 50/50 | 48/50 | 49/50 |
| 5: push plate to front of stove | 49/50 | 50/50 | 49/50 |
| 6: put cream cheese in bowl | 49/50 | 46/50 | 48/50 |
| 7: turn on stove | 50/50 | 50/50 | 50/50 |
| 8: put bowl on plate | 49/50 | 50/50 | 48/50 |
| 9: wine bottle on rack | 50/50 | 49/50 | 50/50 |
| Total | 494/500 (98.8%) | 485/500 (97.0%) | 482/500 (96.4%) |

Co-authored-by: MC952-arch <MC952-arch@qq.com>
1 parent 97f3096 commit bcf923d

40 files changed (+3594 −105 lines)

.github/workflows/unit_tests_common.yml

Lines changed: 3 additions & 0 deletions
```diff
@@ -165,6 +165,9 @@ jobs:
           --pip-deps typer \
           --retry-count 3
 
+          # TODO: temp solution to install newly added deps, remove once the new image is built
+          pip install qwen_vl_utils==0.0.14 diffusers==0.36.0 websocket-client==1.8.0 websocket==0.2.1 websockets==15.0.1 msgpack==1.1.0
+
           # Copy test data (keep existing logic)
           mkdir -p /opt/data
           cp -r /home/gitlab-runner/data/Megatron-LM/* /opt/data/ || true
```

examples/qwen_gr00t/README.md

Lines changed: 308 additions & 0 deletions
# Qwen-GR00T: Training, Inference, and Serving

This guide covers how to train, run inference, and serve Qwen-GR00T models using FlagScale. Qwen-GR00T uses a Qwen3-VL backbone as the vision-language model with a DiT-based flow matching action head.

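For intuition, the sampling loop of a flow matching head can be sketched in a few lines. This is a minimal numpy illustration, not the FlagScale implementation: `velocity_fn` stands in for the trained DiT, and the defaults mirror the config values described below (`action_horizon: 8`, `action_dim: 7`, `num_inference_timesteps: 4`).

```python
import numpy as np

def sample_actions(velocity_fn, action_horizon=8, action_dim=7, num_steps=4, rng=None):
    """Euler integration of a learned velocity field from noise (t=0) to actions (t=1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((action_horizon, action_dim))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # follow the flow toward the data distribution
    return x

# Stand-in for the trained DiT: a toy field that pulls every sample toward zero.
actions = sample_actions(lambda x, t: -x, num_steps=4)
```

With only 4 integration steps, inference stays cheap; the trade-off between step count and action quality is controlled by `num_inference_timesteps` in the task config.
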
## Installation

### Clone Repository

```sh
git clone https://github.com/FlagOpen/FlagScale.git
cd FlagScale/
```

### Setup Conda Environment

Create a new conda environment for robotics training:

```sh
conda create -n flagos-robo python=3.12
conda activate flagos-robo
```

Install FlagScale and training dependencies:

```sh
cd FlagScale/
pip install ".[cuda-train]" --verbose
```

Install additional dependencies for downloading models/datasets:

```sh
# For HuggingFace Hub
pip install huggingface_hub

# For ModelScope (optional)
pip install modelscope
```

## Download Models

Download the base VLM model. Qwen-GR00T supports Qwen3-VL and Qwen2.5-VL as the VLM backbone:

**Using HuggingFace Hub:**

```sh
cd FlagScale/
python examples/pi0/download.py \
    --repo_id Qwen/Qwen3-VL-4B-Instruct \
    --output_dir /workspace/models \
    --source huggingface
```

**Using ModelScope:**

```sh
cd FlagScale/
python examples/pi0/download.py \
    --repo_id Qwen/Qwen3-VL-4B-Instruct \
    --output_dir /workspace/models \
    --source modelscope
```

The model will be downloaded to (example with `/workspace/models`):
- `/workspace/models/Qwen/Qwen3-VL-4B-Instruct`

## Training

### Prepare Dataset

FlagScale uses the **LeRobotDataset v3.0** format. For detailed information about the format structure, see the [LeRobotDataset v3.0 documentation](https://huggingface.co/docs/lerobot/en/lerobot-dataset-v3).

For example, to download the `libero_goal` dataset:

**Using HuggingFace Hub:**

```sh
cd FlagScale/
python examples/pi0/download.py \
    --repo_id IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot \
    --output_dir /workspace/datasets \
    --repo_type dataset \
    --source huggingface
```

**Using ModelScope:**

```sh
cd FlagScale/
python examples/pi0/download.py \
    --repo_id IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot \
    --output_dir /workspace/datasets \
    --repo_type dataset \
    --source modelscope
```

The dataset will be downloaded to (example with `/workspace/datasets`):
- `/workspace/datasets/IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot`

### Edit Config

FlagScale uses a two-level configuration system:

1. **Experiment-level config** (`examples/qwen_gr00t/conf/train.yaml`): Defines experiment settings, environment variables, and resource allocation
2. **Task-level config** (`examples/qwen_gr00t/conf/train/qwen_gr00t.yaml`): Defines model, dataset, and training hyperparameters

#### Experiment-Level Config

Edit the experiment-level config for multi-GPU training:

```sh
cd FlagScale/
vim examples/qwen_gr00t/conf/train.yaml
```

Configure the following fields:

- `experiment.envs.CUDA_VISIBLE_DEVICES` - GPU devices to use (default: `"0,1,2,3,4,5,6,7"` for 8 GPUs)
- `experiment.envs.CUDA_DEVICE_MAX_CONNECTIONS` - Connection limit (typically `1`)
- `experiment.exp_name` - Experiment name
- `experiment.exp_dir` - Output directory for checkpoints and logs

#### Task-Level Config

Edit the task-level config for model and training settings:

```sh
cd FlagScale/
vim examples/qwen_gr00t/conf/train/qwen_gr00t.yaml
```

Configure the following fields:

**System settings** (training hyperparameters):
- `system.batch_size` - Batch size per GPU (default: `16`)
- `system.train_steps` - Total training steps (default: `30000`)
- `system.grad_clip_norm` - Gradient clipping norm (default: `1.0`)
- `system.use_amp` - Whether to use automatic mixed precision (default: `true`)
- `system.shuffle` - Whether to shuffle training data (default: `true`)
- `system.num_workers` - Number of data loading workers (default: `4`)
- `system.checkpoint.save_checkpoint` - Whether to save checkpoints (default: `true`)
- `system.checkpoint.save_freq` - Steps between checkpoints (default: `1000`)
- `system.checkpoint.output_directory` - Checkpoint output directory (default: `${experiment.exp_dir}`)

**Model settings**:
- `model.model_name` - Model name: `"qwen_gr00t"`
- `model.checkpoint_dir` - Path to the pretrained base VLM model (e.g., `/workspace/models/Qwen/Qwen3-VL-4B-Instruct`)
- `model.vlm.type` - VLM backbone type: `"qwen3-vl"` or `"qwen2.5-vl"`
- `model.qwenvl.base_vlm` - Path to the base VLM (same as `model.checkpoint_dir`)
- `model.qwenvl.attn_implementation` - Attention implementation (default: `"flash_attention_2"`)
- `model.qwenvl.vl_hidden_dim` - VLM hidden dimension (default: `2048`)
- `model.dino.dino_backbone` - DINOv2 backbone variant (default: `"dinov2_vits14"`)
- `model.action_model.use_state` - Whether to condition the action model on proprioceptive state (default: `false`)
- `model.action_model.type` - Action model type (default: `"flow_matching"`)
- `model.action_model.action_model_type` - DiT variant (default: `"DiT-B"`)
- `model.action_model.action_dim` - Action dimension (default: `7`)
- `model.action_model.state_dim` - State dimension (default: `7`)
- `model.action_model.future_action_window_size` - Future action window (default: `7`)
- `model.action_model.action_horizon` - Action horizon (default: `8`)
- `model.action_model.num_inference_timesteps` - Inference diffusion steps (default: `4`)
- `model.reduce_in_full_precision` - Whether to reduce gradients in FP32 (default: `true`)

**Optimizer settings**:
- `model.optimizer.name` - Optimizer name (default: `"AdamW"`)
- `model.optimizer.lr` - Base learning rate (default: `2.5e-5`)
- `model.optimizer.betas` - Optimizer betas (default: `[0.9, 0.95]`)
- `model.optimizer.eps` - Optimizer epsilon (default: `1.0e-8`)
- `model.optimizer.weight_decay` - Weight decay (default: `1.0e-8`)
- `model.optimizer.param_groups` - Per-module learning rates:

  ```yaml
  param_groups:
    vlm:
      lr: 1.0e-05
    action_model:
      lr: 1.0e-04
  ```
- `model.optimizer.scheduler.name` - Scheduler name (default: `"cosine_with_min_lr"`)
- `model.optimizer.scheduler.warmup_steps` - Warmup steps (default: `5000`)
- `model.optimizer.scheduler.scheduler_kwargs.min_lr` - Minimum learning rate (default: `1.0e-6`)

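Conceptually, the `param_groups` mapping above translates into optimizer parameter groups keyed by name prefix. A hypothetical sketch of that mapping (not FlagScale's actual code; `build_param_groups` and its inputs are illustrative):

```python
def build_param_groups(named_params, group_lrs, default_lr):
    """Assign each parameter to the group whose name prefix matches it."""
    groups = {name: {"params": [], "lr": cfg["lr"]} for name, cfg in group_lrs.items()}
    default = {"params": [], "lr": default_lr}
    for pname, param in named_params:
        for gname, group in groups.items():
            if pname.startswith(gname + "."):
                group["params"].append(param)
                break
        else:
            # no prefix matched: fall through to the base learning rate
            default["params"].append(param)
    return list(groups.values()) + [default]

cfg = {"vlm": {"lr": 1.0e-5}, "action_model": {"lr": 1.0e-4}}
params = [("vlm.layer.weight", "w0"), ("action_model.head.bias", "w1"), ("dino.proj", "w2")]
groups = build_param_groups(params, cfg, default_lr=2.5e-5)
```

Parameters not covered by any group (here `dino.proj`) fall back to the base `model.optimizer.lr`.
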
**Module freezing** (optional):
```yaml
model:
  freeze:
    # Freeze VLM, train only action head
    freeze_patterns:
      - "qwen_vl_interface\\..*"
    # Optionally keep specific modules trainable
    keep_patterns:
      - "qwen_vl_interface\\.model\\.visual\\.merger\\..*"
```
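One way such patterns could be applied — a hedged sketch, not FlagScale's actual freeze logic (the matching semantics, e.g. `fullmatch` vs. `search`, are an assumption here):

```python
import re

def apply_freeze(param_names, freeze_patterns, keep_patterns=()):
    """Return the parameter names that stay trainable after freezing."""
    frozen = set()
    for name in param_names:
        if any(re.fullmatch(p, name) for p in freeze_patterns):
            frozen.add(name)
    for name in list(frozen):
        if any(re.fullmatch(p, name) for p in keep_patterns):
            frozen.discard(name)  # keep_patterns override freeze_patterns
    return [n for n in param_names if n not in frozen]

names = [
    "qwen_vl_interface.model.lm_head.weight",
    "qwen_vl_interface.model.visual.merger.mlp.weight",
    "action_model.net.weight",
]
# Note: the YAML "\\." escapes reduce to a literal "\." in the regex.
trainable = apply_freeze(
    names,
    [r"qwen_vl_interface\..*"],
    [r"qwen_vl_interface\.model\.visual\.merger\..*"],
)
```

With the patterns above, the whole VLM is frozen except the visual merger, while the action head remains fully trainable.
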

**Data settings**:
- `data.data_path` - Path to LeRobot dataset root (e.g., `/workspace/datasets/IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot`)
- `data.vla_data.data_mix` - Dataset mix name (e.g., `"libero_goal_old"`)
- `data.vla_data.action_type` - Action type (e.g., `"delta_qpos"`)
- `data.vla_data.default_image_resolution` - Image resolution `[C, H, W]` (default: `[3, 224, 224]`)
- `data.vla_data.obs` - Observation image keys (default: `["image_0"]`)
- `data.observation_delta_indices` - Observation delta indices (default: `[0]`)
- `data.action_delta_indices` - Action delta indices (default: `[0,1,2,3,4,5,6,7]`)
- `data.preprocessor` - Preprocessor pipeline configuration
- `data.postprocessor` - Postprocessor pipeline configuration

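The delta indices are offsets relative to the current frame `t`: `[0]` keeps only the current observation, while `[0,1,...,7]` gathers an 8-step action chunk matching `action_horizon: 8`. A small sketch of that indexing, where clamping at the trajectory end is an illustrative assumption (the dataset's actual end-of-episode padding may differ):

```python
def select_by_delta(sequence, t, delta_indices):
    """Gather items at t + delta for each delta, clamping at the trajectory end."""
    last = len(sequence) - 1
    return [sequence[min(t + d, last)] for d in delta_indices]

actions = [f"a{i}" for i in range(10)]  # a 10-step toy trajectory
chunk = select_by_delta(actions, t=4, delta_indices=[0, 1, 2, 3, 4, 5, 6, 7])
# chunk covers a4..a9, with the final action repeated once the trajectory ends
```
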
### Start Training

```sh
cd FlagScale/
flagscale train qwen_gr00t -c ./examples/qwen_gr00t/conf/train.yaml
```

Training logs are saved to `outputs/<exp_name>/logs/host_0_localhost.output` by default.

Checkpoints are saved to `${experiment.exp_dir}/checkpoints`.

### Stop Training

```sh
cd FlagScale/
flagscale train qwen_gr00t --stop
```

## Inference

### Prepare Inference Inputs

You can extract inference inputs (images, state, task) from a dataset using the provided script:

```sh
cd FlagScale/
python examples/pi0/dump_dataset_inputs.py \
    --dataset_root /workspace/datasets/IPEC-COMMUNITY/libero_goal_no_noops_1.0.0_lerobot \
    --output_dir ./qwen_gr00t_inference_inputs \
    --frame_index 100
```

This will create:
- `frame_100_observation_images_*.jpg` - Image files
- `frame_100_state.pt` - State tensor
- `frame_100_task.txt` - Task prompt
- `extraction_summary.json` - Summary of extracted files

### Edit Config

```sh
cd FlagScale/
vim examples/qwen_gr00t/conf/inference/qwen_gr00t.yaml
```

Configure the following fields:

**Engine settings:**
- `engine.model_variant` - Model variant (default: `"QwenGr00t"`)
- `engine.model` - Path to trained checkpoint (e.g., `/workspace/outputs/qwen_gr00t_train/checkpoints/last`)
- `engine.device` - Device to use (e.g., `"cuda"`)

**Generate settings:**
- `generate.images` - Dictionary mapping image keys to file paths:

  ```yaml
  images:
    observation.images.wrist_image: /path/to/wrist_image.jpg
    observation.images.image: /path/to/image.jpg
  ```

- `generate.state_path` - Path to state tensor file (`.pt` file)
- `generate.task_path` - Path to task prompt file (`.txt` file)

### Run Inference

```sh
cd FlagScale/
flagscale inference qwen_gr00t -c ./examples/qwen_gr00t/conf/inference.yaml
```

Inference logs are saved to `outputs/qwen_gr00t_inference/inference_logs/host_0_localhost.output` by default.

The predicted action tensor is printed to the console and saved in the log file.

## Serving

### Edit Config

```sh
cd FlagScale/
vim examples/qwen_gr00t/conf/serve/qwen_gr00t.yaml
```

Configure the following fields:

**Engine arguments:**
- `engine_args.host` - Server host (default: `"0.0.0.0"`)
- `engine_args.port` - Server port (default: `5000`)
- `engine_args.model_variant` - Model variant (default: `"QwenGr00t"`)
- `engine_args.model` - Path to trained checkpoint (e.g., `/workspace/outputs/qwen_gr00t_train/checkpoints/last`)
- `engine_args.device` - Device to use (e.g., `"cuda"`)

### Run Serving

```sh
cd FlagScale/
flagscale serve qwen_gr00t -c ./examples/qwen_gr00t/conf/serve.yaml
```

Serving logs are saved to `outputs/<exp_name>/logs/host_0_localhost.output` by default.

### Stop Serving

```sh
cd FlagScale/
flagscale serve qwen_gr00t --stop
```

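For a rough idea of how a client might talk to the server: the serve entrypoint is WebSocket-based and the dependency list includes `websocket-client` and `msgpack`, so the sketch below assumes msgpack-encoded messages. The payload keys, `build_payload`, and `send_observation` are hypothetical; the real request/response schema is whatever `flagscale/serve/run_serve_qwen_gr00t.py` implements.

```python
def build_payload(image_paths, state, task):
    """Bundle one observation into a plain dict ready for serialization.

    The key names here are illustrative, not the server's actual schema.
    """
    return {"images": dict(image_paths), "state": list(state), "task": task}

def send_observation(payload, url="ws://localhost:5000"):
    # websocket-client and msgpack are among the serve dependencies;
    # the binary framing below is an assumption, not a documented protocol.
    import msgpack
    import websocket
    ws = websocket.create_connection(url)
    try:
        ws.send_binary(msgpack.packb(payload))
        return msgpack.unpackb(ws.recv())
    finally:
        ws.close()

payload = build_payload(
    {"observation.images.image": "qwen_gr00t_inference_inputs/frame_100_observation_images_image.jpg"},
    state=[0.0] * 7,
    task="open the middle drawer",
)
# send_observation(payload) would return the server's predicted action chunk
```
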
Lines changed: 26 additions & 0 deletions

```yaml
defaults:
  - _self_
  - inference: qwen_gr00t

experiment:
  exp_name: qwen_gr00t_inference
  exp_dir: outputs/${experiment.exp_name}
  model: /models/qwen_gr00t
  task:
    type: inference
    backend: vllm # TODO: Remove this restriction
    entrypoint: flagscale/inference/inference_qwen_gr00t.py
  runner:
    hostfile: null
    cmds:
      before_start: null
  envs:
    CUDA_VISIBLE_DEVICES: 0
    CUDA_DEVICE_MAX_CONNECTIONS: 1
    # Optionally, set HF_HOME and HF_ENDPOINT

action: run

hydra:
  run:
    dir: ${experiment.exp_dir}/hydra
```
Lines changed: 11 additions & 0 deletions

```yaml
engine:
  model_variant: "QwenGr00t"
  model: /workspace/models/qwen_gr00t_train/checkpoints/last
  device: "cuda"

generate:
  images:
    observation.images.wrist_image: qwen_gr00t_inference_inputs/frame_100_observation_images_wrist_image.jpg
    observation.images.image: qwen_gr00t_inference_inputs/frame_100_observation_images_image.jpg
  state_path: qwen_gr00t_inference_inputs/frame_100_state.pt
  task_path: qwen_gr00t_inference_inputs/frame_100_task.txt
```
Lines changed: 23 additions & 0 deletions

```yaml
defaults:
  - _self_
  - serve: qwen_gr00t

experiment:
  exp_name: qwen_gr00t_serve
  exp_dir: outputs/${experiment.exp_name}
  task:
    type: serve
    entrypoint: flagscale/serve/run_serve_qwen_gr00t.py
  runner:
    hostfile: null
  deploy:
    use_fs_serve: true
  envs:
    CUDA_VISIBLE_DEVICES: 0
    CUDA_DEVICE_MAX_CONNECTIONS: 1

action: run

hydra:
  run:
    dir: ${experiment.exp_dir}/hydra
```
Lines changed: 7 additions & 0 deletions

```yaml
- serve_id: vllm_model # Not in use
  engine_args:
    host: 0.0.0.0
    port: 5000
    model_variant: QwenGr00t
    model: /workspace/models/qwen_gr00t_train/checkpoints/last
    device: "cuda"
```
