feat(examples): add OSWorld GRPO training example by hehui0226 · Pull Request #1326 · areal-project/AReaL

hehui0226 · 2026-05-11T07:07:40Z

Description

Adds an end-to-end example (examples/osworld/) for GRPO training on OSWorld desktop-control tasks with Qwen3-VL-4B-Instruct. The example targets environments that lack Docker/KVM (typical GPU containers) by routing OSWorld VMs through a vendor-neutral remote sandbox cluster behind an HTTPS gateway.

Key pieces:

- `OSWorldWorkflow`: multi-turn rollout that drives the VLM↔desktop loop through `ArealOpenAI`, captures screenshots as `image_url` data URIs, and dispatches pyautogui actions to `env.step`.
- VL training tensor bridge (`_attach_vl_tensor_dicts`): re-runs the HF processor on each turn's prefix and writes `mm_token_type_ids` + `multi_modal_input` (`pixel_values`, `image_grid_thw`) into the cached training tensor dict, satisfying the `is_qwen_vl_model` branch of `FSDPEngine._prepare_mb_list` without engine changes.
- `gateway_sandbox.py`: `DesktopEnv` subclass that proxies controller + setup_controller calls through an HTTPS gateway. The vendor SDK is imported via try/except with a documented protocol so users can plug in any compatible cluster client; the bundled retry decorator parks on 429 quota errors instead of killing trajectories.
- `remote_desktop_env.py` + `remote_server.py`: alternative self-hosted bridge for users who can run OSWorld on a separate Docker host.
- `text_only` smoke ablation: strips screenshots from the workflow so a text-only base model (e.g. Qwen3-4B-Instruct) can drive the full PPO loop end-to-end without the VL training path.
- `run_train.sh`: stages (smoke / smoke-text / full) with a `NO_PROXY` allowlist for local SGLang health checks and required env vars (`OSWORLD_SANDBOX_ENDPOINT`, `OSWORLD_SANDBOX_TOKEN`).
- `apply_env_patches.sh`: idempotent SGLang JIT C++20 detection and pydrive→pydrive2 shim required by OSWorld imports.
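
The "park on 429 instead of killing the trajectory" behaviour can be pictured with a minimal sketch. This is not the decorator bundled in `gateway_sandbox.py`: `QuotaExceeded`, `park_on_quota`, the `status_code` attribute, and all parameter names and defaults are hypothetical, chosen only to illustrate the idea of waiting out a quota error with capped exponential backoff while letting every other error propagate.

```python
import functools
import time


class QuotaExceeded(Exception):
    """Hypothetical error a gateway client might raise on HTTP 429."""

    status_code = 429


def park_on_quota(max_wait_s=600.0, base_delay_s=1.0, max_delay_s=60.0,
                  sleep=time.sleep):
    """Retry the wrapped call while it fails with a 429-style error.

    All names and defaults here are assumptions made for illustration.
    """

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            waited = 0.0
            delay = base_delay_s
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Only quota errors are parked; anything else propagates.
                    if getattr(exc, "status_code", None) != 429:
                        raise
                    # Give up once the total wait budget would be exceeded.
                    if waited + delay > max_wait_s:
                        raise
                    sleep(delay)
                    waited += delay
                    delay = min(delay * 2, max_delay_s)

        return wrapper

    return decorator


# Demo: the first two calls hit the quota, the third succeeds.
calls = {"n": 0}
slept = []  # records requested delays instead of actually sleeping


@park_on_quota(max_wait_s=10.0, base_delay_s=1.0, sleep=slept.append)
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise QuotaExceeded("quota exhausted")
    return "ok"


result = flaky_step()
```

Injecting `sleep` keeps the sketch testable without real waiting; a production version would presumably also honour any retry hint the gateway returns, which this sketch does not model.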

Tested end-to-end on Qwen3-VL-4B-Instruct with one episode on a Chrome task, capped at 2 steps. The reward was 0.0: the agent did not solve the task, but the value came from the real evaluator. One PPO step completed with mm_token_type_ids in the batch (n_tokens=8726 including image tokens; behave_approx_kl/max=1.07, confirming a forward pass on multimodal input).

Related Issue

N/A — pure example addition; no issue to fix.

Type of Change

✨ New feature

Checklist

- I have read the Contributing Guide
- Pre-commit hooks pass (`pre-commit run --files <staged files>`; mdformat + ruff applied auto-fixes, second pass clean)
- Relevant tests pass; new tests added for new functionality (N/A: example requires GPU + a sandbox provider; no automated tests bundled)
- Documentation updated (new examples/osworld/README.md)
- Branch is up to date with main (feat/osworld-grpo-example branched off origin/main)
- Self-reviewed via /review-pr command (manual review done)
- This PR was created by a coding agent via /create-pr
- This PR is a breaking change

Additional Context

The example exercises a fork-readiness path that requires the _wait_for_fork_ready timeout in areal/infra/scheduler/local.py to be at least 600 s. Forked rollout and eval-rollout subprocesses re-import torch, sglang, megatron, and transformers from scratch, which takes 3+ minutes on slow shared filesystems, so the current 60 s default fires before the imports finish. A separate follow-up PR will raise that core default. Until then, the example's README documents the workaround in its troubleshooting section.
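
To make the timeout arithmetic concrete, here is a generic poll-with-deadline sketch. It is not the implementation of `_wait_for_fork_ready` in AReaL, whose internals are not shown in this PR; `wait_until_ready`, `FakeClock`, and the 200 s import time are hypothetical stand-ins (200 s is simply one value consistent with "3+ minutes"). A simulated clock is used so the example runs instantly.

```python
import time


def wait_until_ready(is_ready, timeout_s=600.0, poll_interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            raise TimeoutError(f"worker not ready after {timeout_s:.0f}s")
        sleep(poll_interval_s)


class FakeClock:
    """Simulated clock: sleeping just advances the current time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()


def worker_ready():
    # Assumption for illustration: heavy imports finish after 200 s.
    return clock.now >= 200.0


def outcome(timeout_s):
    """Return 'ready' or 'timeout' for a given timeout on the fake clock."""
    clock.now = 0.0
    try:
        wait_until_ready(worker_ready, timeout_s=timeout_s,
                         poll_interval_s=5.0,
                         clock=clock.time, sleep=clock.sleep)
        return "ready"
    except TimeoutError:
        return "timeout"


short_result = outcome(60.0)   # deadline passes before imports finish
long_result = outcome(600.0)   # imports finish well inside the deadline
```

With the simulated 200 s import, a 60 s limit times out while a 600 s limit succeeds, which mirrors the failure mode described above.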

Multi-turn vision-language GRPO/PPO training on OSWorld desktop-control tasks with Qwen3-VL-4B-Instruct, designed to run inside a GPU container that has no Docker/KVM access by routing OSWorld VMs through a vendor-neutral remote sandbox cluster behind an HTTPS gateway.

Components

- OSWorldWorkflow: multi-turn rollout that drives the VLM<->desktop loop through ArealOpenAI, captures screenshots as image_url data URIs, and dispatches pyautogui actions to env.step.
- VL training tensor bridge (_attach_vl_tensor_dicts): re-runs the HF processor on each turn's prefix and writes mm_token_type_ids + multi_modal_input (pixel_values, image_grid_thw) into the cached training tensor dict, satisfying FSDPEngine._prepare_mb_list's is_qwen_vl_model branch without needing changes to the engine.
- gateway_sandbox.py: DesktopEnv subclass that proxies controller + setup_controller calls through an HTTPS gateway. The vendor SDK is imported via try/except with a documented protocol so users can plug in any compatible cluster client; the bundled retry decorator parks on 429 quota errors instead of killing trajectories.
- remote_desktop_env.py + remote_server.py: alternative self-hosted bridge for users who can run OSWorld on a separate Docker host.
- text_only smoke ablation: strips screenshots from the workflow so a text-only base model (e.g. Qwen3-4B-Instruct) can drive the full PPO loop end-to-end without the VL training path.
- run_train.sh stages (smoke / smoke-text / full) with a NO_PROXY allowlist for local SGLang health checks and required env vars (OSWORLD_SANDBOX_ENDPOINT, OSWORLD_SANDBOX_TOKEN).
- apply_env_patches.sh: idempotent SGLang JIT C++20 detection and pydrive->pydrive2 shim required by OSWorld imports.

Tested end-to-end (Plan B): one episode on chrome/bb5e4c0d task, 2-step max, reward=0.0 (real evaluator signal), one PPO step done with mm_token_type_ids in the batch (n_tokens=8726 including image tokens, behave_approx_kl/max=1.07).

Note: the example exercises a fork-readiness path that requires _wait_for_fork_ready timeout in areal/infra/scheduler/local.py to be >=600s. A separate PR raises that core default.

hehui0226 requested review from CormickKneey, HwVanICI and PrometheusComing as code owners May 11, 2026 07:07

hehui0226 wants to merge 1 commit into areal-project:main from hehui0226:feat/osworld-grpo-example
