feat(examples): add OSWorld GRPO training example #1326

Open
hehui0226 wants to merge 1 commit into areal-project:main from hehui0226:feat/osworld-grpo-example

Description

Adds an end-to-end example (examples/osworld/) for GRPO training on OSWorld desktop-control tasks with Qwen3-VL-4B-Instruct. The example targets environments that lack Docker/KVM (typical GPU containers) by routing OSWorld VMs through a vendor-neutral remote sandbox cluster behind an HTTPS gateway.

Key pieces:

  • OSWorldWorkflow: multi-turn rollout that drives the VLM↔desktop loop through ArealOpenAI, captures screenshots as image_url data URIs, and dispatches pyautogui actions to env.step.
  • VL training tensor bridge (_attach_vl_tensor_dicts): re-runs the HF processor on each turn's prefix and writes mm_token_type_ids + multi_modal_input (pixel_values, image_grid_thw) into the cached training tensor dict, satisfying FSDPEngine._prepare_mb_list's is_qwen_vl_model branch without engine changes.
  • gateway_sandbox.py: DesktopEnv subclass that proxies controller + setup_controller calls through an HTTPS gateway. The vendor SDK is imported via try/except with a documented protocol so users can plug in any compatible cluster client; the bundled retry decorator parks on 429 quota errors instead of killing trajectories.
  • remote_desktop_env.py + remote_server.py: alternative self-hosted bridge for users who can run OSWorld on a separate Docker host.
  • text_only smoke ablation: strips screenshots from the workflow so a text-only base model (e.g. Qwen3-4B-Instruct) can drive the full PPO loop end-to-end without the VL training path.
  • run_train.sh stages (smoke / smoke-text / full) with a NO_PROXY allowlist for local SGLang health checks and required env vars (OSWORLD_SANDBOX_ENDPOINT, OSWORLD_SANDBOX_TOKEN).
  • apply_env_patches.sh: idempotent SGLang JIT C++20 detection and pydrive→pydrive2 shim required by OSWorld imports.
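
For readers unfamiliar with the "park on 429" behaviour described for gateway_sandbox.py, the idea can be sketched as below. This is a minimal illustration and not the decorator bundled in the example: the names `park_on_quota` and `QuotaExceeded`, the `status` attribute, and all default delays are assumptions made for this sketch. Quota errors are retried with capped exponential backoff until a wait budget is spent, while any other exception propagates immediately.

```python
import functools
import time


class QuotaExceeded(Exception):
    """Hypothetical error a gateway client might raise on HTTP 429."""

    status = 429


def park_on_quota(max_wait_s=600.0, base_delay_s=1.0, max_delay_s=60.0,
                  sleep=time.sleep):
    """Retry the wrapped call while it fails with a 429-style quota error.

    All names and defaults here are illustrative assumptions.
    """

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            waited = 0.0
            delay = base_delay_s
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Only quota errors are parked; anything else propagates.
                    if getattr(exc, "status", None) != 429:
                        raise
                    # Give up once the total wait budget has been spent.
                    if waited >= max_wait_s:
                        raise
                    sleep(delay)
                    waited += delay
                    # Exponential backoff, capped at max_delay_s.
                    delay = min(delay * 2, max_delay_s)

        return wrapper

    return decorator
```

Applied to a sandbox call such as a hypothetical `create_sandbox()`, a trajectory would wait for quota to free up rather than fail on the first 429.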

Tested end-to-end on Qwen3-VL-4B-Instruct: one episode on a Chrome task with a 2-step maximum. The reward was 0.0 (the agent did not solve the task, but the evaluator ran and returned a real number). One PPO step completed with mm_token_type_ids in the batch (n_tokens=8726 including image tokens; behave_approx_kl/max=1.07, confirming a forward pass on multimodal input).

Related Issue

N/A — pure example addition; no issue to fix.

Type of Change

  • ✨ New feature

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --files <staged files>; mdformat + ruff applied auto-fixes, second pass clean)
  • Relevant tests pass; new tests added for new functionality — N/A: example requires GPU + a sandbox provider; no automated tests bundled
  • Documentation updated (new examples/osworld/README.md)
  • Branch is up to date with main (feat/osworld-grpo-example branched off origin/main)
  • Self-reviewed via /review-pr command — manual review done
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Additional Context

The example exercises a fork-readiness path that requires the _wait_for_fork_ready timeout in areal/infra/scheduler/local.py to be at least 600s. Forked rollout/eval-rollout subprocesses re-import torch, sglang, megatron, and transformers from scratch, which takes 3+ minutes on slow shared filesystems, so the current 60s default fires before the imports finish. A separate follow-up PR will raise that core default. As an interim measure, the example documents the workaround in its README troubleshooting section.
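
To make the timing issue concrete, here is a generic deadline-polling sketch. It is not the code in areal/infra/scheduler/local.py: `wait_until_ready` and its parameters are hypothetical names chosen for illustration. It only shows why a readiness check that needs about three minutes to pass can never succeed under a 60s budget but does under a 600s one.

```python
import time


def wait_until_ready(is_ready, timeout_s, poll_interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Hypothetical helper for illustration; returns True if the check
    passed before the deadline and False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

With a subprocess that becomes ready only after roughly 180s of imports, `wait_until_ready(check, timeout_s=60)` returns False while `wait_until_ready(check, timeout_s=600)` returns True, which matches the failure mode described above.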

Multi-turn vision-language GRPO/PPO training on OSWorld desktop-control
tasks with Qwen3-VL-4B-Instruct, designed to run inside a GPU container
that has no Docker/KVM access by routing OSWorld VMs through a
vendor-neutral remote sandbox cluster behind an HTTPS gateway.

Components
- OSWorldWorkflow: multi-turn rollout that drives the VLM<->desktop loop
  through ArealOpenAI, captures screenshots as image_url data URIs, and
  dispatches pyautogui actions to env.step.
- VL training tensor bridge (_attach_vl_tensor_dicts): re-runs the HF
  processor on each turn's prefix and writes mm_token_type_ids +
  multi_modal_input (pixel_values, image_grid_thw) into the cached
  training tensor dict, satisfying FSDPEngine._prepare_mb_list's
  is_qwen_vl_model branch without needing changes to the engine.
- gateway_sandbox.py: DesktopEnv subclass that proxies controller +
  setup_controller calls through an HTTPS gateway. The vendor SDK is
  imported via try/except with a documented protocol so users can plug
  in any compatible cluster client; the bundled retry decorator parks
  on 429 quota errors instead of killing trajectories.
- remote_desktop_env.py + remote_server.py: alternative self-hosted
  bridge for users who can run OSWorld on a separate Docker host.
- text_only smoke ablation: strips screenshots from the workflow so a
  text-only base model (e.g. Qwen3-4B-Instruct) can drive the full PPO
  loop end-to-end without the VL training path.
- run_train.sh stages (smoke / smoke-text / full) with a NO_PROXY
  allowlist for local SGLang health checks and required env vars
  (OSWORLD_SANDBOX_ENDPOINT, OSWORLD_SANDBOX_TOKEN).
- apply_env_patches.sh: idempotent SGLang JIT C++20 detection and
  pydrive->pydrive2 shim required by OSWorld imports.

Tested end-to-end (Plan B): one episode on chrome/bb5e4c0d task,
2-step max, reward=0.0 (real evaluator signal), one PPO step done
with mm_token_type_ids in the batch (n_tokens=8726 including image
tokens, behave_approx_kl/max=1.07).

Note: the example exercises a fork-readiness path that requires
_wait_for_fork_ready timeout in areal/infra/scheduler/local.py to be
>=600s. A separate PR raises that core default.