feat(examples): add OSWorld GRPO training example #1326
Open
hehui0226 wants to merge 1 commit into
Multi-turn vision-language GRPO/PPO training on OSWorld desktop-control tasks with Qwen3-VL-4B-Instruct, designed to run inside a GPU container that has no Docker/KVM access by routing OSWorld VMs through a vendor-neutral remote sandbox cluster behind an HTTPS gateway.

Components
- OSWorldWorkflow: multi-turn rollout that drives the VLM<->desktop loop through ArealOpenAI, captures screenshots as image_url data URIs, and dispatches pyautogui actions to env.step.
- VL training tensor bridge (_attach_vl_tensor_dicts): re-runs the HF processor on each turn's prefix and writes mm_token_type_ids + multi_modal_input (pixel_values, image_grid_thw) into the cached training tensor dict, satisfying FSDPEngine._prepare_mb_list's is_qwen_vl_model branch without needing changes to the engine.
- gateway_sandbox.py: DesktopEnv subclass that proxies controller + setup_controller calls through an HTTPS gateway. The vendor SDK is imported via try/except with a documented protocol so users can plug in any compatible cluster client; the bundled retry decorator parks on 429 quota errors instead of killing trajectories.
- remote_desktop_env.py + remote_server.py: alternative self-hosted bridge for users who can run OSWorld on a separate Docker host.
- text_only smoke ablation: strips screenshots from the workflow so a text-only base model (e.g. Qwen3-4B-Instruct) can drive the full PPO loop end-to-end without the VL training path.
- run_train.sh stages (smoke / smoke-text / full) with a NO_PROXY allowlist for local SGLang health checks and required env vars (OSWORLD_SANDBOX_ENDPOINT, OSWORLD_SANDBOX_TOKEN).
- apply_env_patches.sh: idempotent SGLang JIT C++20 detection and pydrive->pydrive2 shim required by OSWorld imports.

Tested end-to-end (Plan B): one episode on chrome/bb5e4c0d task, 2-step max, reward=0.0 (real evaluator signal), one PPO step done with mm_token_type_ids in the batch (n_tokens=8726 including image tokens, behave_approx_kl/max=1.07).
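The "parks on 429 quota errors instead of killing trajectories" behaviour described for gateway_sandbox.py can be illustrated with a minimal sketch. This is not the decorator bundled in the PR: the names `park_on_quota` and `QuotaExceededError`, the `status_code` attribute, and the backoff parameters are all assumptions made for illustration. It shows one plausible shape: quota errors are retried with capped exponential backoff, while every other error propagates immediately.

```python
import functools
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for whatever a cluster client raises on HTTP 429."""

    status_code = 429


def park_on_quota(max_wait_s=600.0, base_delay_s=1.0, max_delay_s=60.0,
                  sleep=time.sleep):
    """Retry the wrapped call while it fails with a 429-style quota error.

    The delay doubles after each attempt (capped at max_delay_s). If the next
    sleep would push the total time slept past max_wait_s, the error is
    re-raised instead.
    """

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            waited = 0.0
            delay = base_delay_s
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Only quota errors are parked; anything else propagates.
                    if getattr(exc, "status_code", None) != 429:
                        raise
                    if waited + delay > max_wait_s:
                        raise
                    sleep(delay)
                    waited += delay
                    delay = min(delay * 2, max_delay_s)

        return wrapper

    return decorator


# Usage with a fake backend that rejects the first two calls.
calls = {"n": 0}
slept = []  # records requested delays instead of really sleeping


@park_on_quota(max_wait_s=10.0, base_delay_s=1.0, sleep=slept.append)
def create_sandbox():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise QuotaExceededError("quota exhausted")
    return "sandbox-ready"


result = create_sandbox()
```

Injecting `sleep` keeps the sketch testable without real waiting; a real client would likely also honour a `Retry-After` header if the gateway sends one.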
Note: the example exercises a fork-readiness path that requires _wait_for_fork_ready timeout in areal/infra/scheduler/local.py to be >=600s. A separate PR raises that core default.
Description
Adds an end-to-end example (`examples/osworld/`) for GRPO training on OSWorld desktop-control tasks with `Qwen3-VL-4B-Instruct`. The example targets environments that lack Docker/KVM (typical GPU containers) by routing OSWorld VMs through a vendor-neutral remote sandbox cluster behind an HTTPS gateway.

Key pieces:
- `OSWorldWorkflow`: multi-turn rollout that drives the VLM↔desktop loop through `ArealOpenAI`, captures screenshots as `image_url` data URIs, and dispatches `pyautogui` actions to `env.step`.
- VL training tensor bridge (`_attach_vl_tensor_dicts`): re-runs the HF processor on each turn's prefix and writes `mm_token_type_ids` + `multi_modal_input` (`pixel_values`, `image_grid_thw`) into the cached training tensor dict, satisfying `FSDPEngine._prepare_mb_list`'s `is_qwen_vl_model` branch without engine changes.
- `gateway_sandbox.py`: `DesktopEnv` subclass that proxies controller + setup_controller calls through an HTTPS gateway. The vendor SDK is imported via `try/except` with a documented protocol so users can plug in any compatible cluster client; the bundled retry decorator parks on 429 quota errors instead of killing trajectories.
- `remote_desktop_env.py` + `remote_server.py`: alternative self-hosted bridge for users who can run OSWorld on a separate Docker host.
- `text_only` smoke ablation: strips screenshots from the workflow so a text-only base model (e.g. `Qwen3-4B-Instruct`) can drive the full PPO loop end-to-end without the VL training path.
- `run_train.sh` stages (`smoke` / `smoke-text` / `full`) with a `NO_PROXY` allowlist for local SGLang health checks and required env vars (`OSWORLD_SANDBOX_ENDPOINT`, `OSWORLD_SANDBOX_TOKEN`).
- `apply_env_patches.sh`: idempotent SGLang JIT C++20 detection and `pydrive` → `pydrive2` shim required by OSWorld imports.

Tested end-to-end on `Qwen3-VL-4B-Instruct`: 1 episode on a Chrome task, 2-step max, real evaluator reward (0.0: the agent did not solve the task, but the evaluator returned a real number), 1 PPO step completed with `mm_token_type_ids` in the batch (`n_tokens=8726` including image tokens, `behave_approx_kl/max=1.07`, confirming a forward pass on multimodal input).

Related Issue
N/A — pure example addition; no issue to fix.
Type of Change
Checklist
- Formatting (`pre-commit run --files <staged files>`; mdformat + ruff applied auto-fixes, second pass clean)
- Documentation (`examples/osworld/README.md`)
- Branch relative to `main` (`feat/osworld-grpo-example` branched off `origin/main`)
- `/review-pr` command: manual review done
- `/create-pr`

Additional Context
The example exercises a fork-readiness path that requires the `_wait_for_fork_ready` timeout in `areal/infra/scheduler/local.py` to be >= 600 s. Forked rollout/eval-rollout subprocesses re-import `torch` + `sglang` + `megatron` + `transformers` from scratch, which takes 3+ minutes on slow shared filesystems; the current 60 s default fires before the imports finish. A separate follow-up PR will raise that core default. As an interim measure, the example documents the workaround in its README troubleshooting section.
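To make the timeout argument concrete, here is a minimal sketch of a deadline-based readiness poll. It is not the `_wait_for_fork_ready` implementation: `wait_until_ready`, `FakeClock`, the 5 s poll interval, and the 200 s import time (one reading of "3+ minutes") are assumptions chosen for illustration. A simulated clock is used so the example runs instantly.

```python
import time


def wait_until_ready(is_ready, timeout_s, poll_interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the target became ready, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


class FakeClock:
    """Simulated clock: sleeping just advances the current time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def simulate(timeout_s, import_time_s=200.0):
    """Pretend the forked subprocess needs import_time_s seconds of imports."""
    clock = FakeClock()
    return wait_until_ready(
        is_ready=lambda: clock.now >= import_time_s,
        timeout_s=timeout_s,
        poll_interval_s=5.0,
        clock=clock.time,
        sleep=clock.sleep,
    )


ready_with_60s = simulate(60.0)    # deadline passes while imports are still running
ready_with_600s = simulate(600.0)  # imports finish well inside the deadline
```

Under these assumed numbers the 60 s deadline expires long before the subprocess is ready, while 600 s leaves ample headroom, which matches the reasoning for raising the default.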