[examples] feat: add end-to-end SWE-bench RL training recipe (swe_agent) #52
Adds examples/swe_agent/, an end-to-end recipe for training a SWE-bench
coding agent with fully-async RL (Megatron actors + vLLM rollout) and
Modal swe-rex sandboxes. It stitches together the existing building
blocks (examples/data_preprocess/swe_rebench.py + swe_bench_verified.py
for data, uni_agent.reward.swe_rebench / swe_bench for reward,
uni_agent.agent_loop.UniAgentLoop for rollout) into a runnable launch
script + configs + README, mirroring the structure of
examples/search_agent/.
Reference config trains Qwen3-235B-A22B-Instruct-2507 with GRPO on a
12-node (8 train + 4 rollout) x 4-GPU topology, but everything is
env-overridable to scale down.
Files:
- train_qwen3_235b_swebench.sh : ray job submit + full GRPO / Megatron
/ vLLM config. Topology, paths, and addresses are env vars.
- agent_config.yaml : UniAgentLoop config (tools, Modal
deployment, rollout concurrency, reward).
- runtime_env.yaml : Ray runtime-env template (placeholders
for Modal / W&B tokens and checkout paths).
- README.md : dataset -> runtime_env -> launch ->
monitor, plus tuning notes.
The script header / README capture a few non-obvious settings learned
from running this at scale: max_response_length=128K (SWE-bench
trajectories are long), tool_parser=hermes for Qwen3-235B,
moe_token_dispatcher_type=alltoall, VLLM_USE_DEEP_GEMM=0 for vLLM 0.21
EP init, and the expandable_segments / CuMemAllocator incompatibility.
No secrets or environment-specific paths are committed; runtime_env.yaml
ships placeholders only.
This PR includes AI assistance (Claude Code). The submitting human
(@aoshen02) reviewed every line.
Signed-off-by: aoshen02 <aoshen@inferact.ai>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces an end-to-end recipe for training a SWE-bench coding agent using the Uni-Agent framework with fully-async RL and Modal sandboxes. It includes a README, agent loop configuration, Ray runtime environment template, and a launch script for training Qwen3-235B-A22B-Instruct-2507. The review feedback suggests correcting the default dataset paths in the launch script to include the _modal suffix, and double-quoting several variables and arguments (such as paths, addresses, and parameters containing square brackets) to prevent word splitting and globbing issues in Bash.
DATA_ROOT=${DATA_ROOT:-/path/to/data-root}
EXAMPLE_DIR=${EXAMPLE_DIR:-$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)}
# Qwen3-235B-A22B-Instruct-2507 snapshot (first/only snapshot under the HF cache).
MODEL_PATH=${MODEL_PATH:-$(ls -d ${DATA_ROOT}/hf-models/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507/snapshots/*/ 2>/dev/null | head -1)}
Double-quote ${DATA_ROOT} inside the command substitution to prevent word splitting if the path contains spaces.
Suggested change:
- MODEL_PATH=${MODEL_PATH:-$(ls -d ${DATA_ROOT}/hf-models/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507/snapshots/*/ 2>/dev/null | head -1)}
+ MODEL_PATH=${MODEL_PATH:-$(ls -d "${DATA_ROOT}"/hf-models/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507/snapshots/*/ 2>/dev/null | head -1)}
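Beyond quoting, the `ls -d ... | head -1` lookup could be replaced by letting the shell glob do the work, which avoids parsing `ls` output altogether. The following is only a minimal sketch of that idea, not what the PR ships: `first_snapshot` is a hypothetical helper name, and the demo directory layout is made up for illustration.

```shell
# Hypothetical helper (not in the PR): print the first snapshot directory
# under a model cache entry, or return 1 if there is none.
# Note: "dir" is a plain (global) variable to keep the sketch POSIX-compatible.
first_snapshot() {
  # $1: model cache directory that contains a snapshots/ subdirectory
  for dir in "$1"/snapshots/*/; do
    # An unmatched glob stays literal, so confirm the entry really exists.
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Demo on a throwaway directory whose path contains a space.
demo_root="$(mktemp -d)/data root"
mkdir -p "${demo_root}/models--demo/snapshots/abc123"

resolved=$(first_snapshot "${demo_root}/models--demo") || true
missing=$(first_snapshot "${demo_root}/does-not-exist") || true

echo "resolved: ${resolved}"
echo "missing:  '${missing}'"
```

In the launch script, such a helper could back the existing default, for example `MODEL_PATH=${MODEL_PATH:-$(first_snapshot "${DATA_ROOT}/hf-models/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507")}`, and an empty result could then be reported explicitly instead of surfacing later as a model-load error.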
TRAIN_FILE=${TRAIN_FILE:-${DATA_ROOT}/data/swe_agent/swe_rebench_filtered.parquet}
TEST_FILE=${TEST_FILE:-${DATA_ROOT}/data/swe_agent/swe_bench_verified.parquet}
The preprocessing scripts swe_rebench.py and swe_bench_verified.py append the deployment implementation suffix (e.g., _modal.parquet) to the output filenames. Since this training script is configured for Modal, the default paths should include the _modal suffix to avoid file-not-found errors out of the box.
Suggested change:
- TRAIN_FILE=${TRAIN_FILE:-${DATA_ROOT}/data/swe_agent/swe_rebench_filtered.parquet}
- TEST_FILE=${TEST_FILE:-${DATA_ROOT}/data/swe_agent/swe_bench_verified.parquet}
+ TRAIN_FILE=${TRAIN_FILE:-${DATA_ROOT}/data/swe_agent/swe_rebench_filtered_modal.parquet}
+ TEST_FILE=${TEST_FILE:-${DATA_ROOT}/data/swe_agent/swe_bench_verified_modal.parquet}
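A mismatch like this one only shows up once the job is already running on the cluster. A small preflight check before `ray job submit` would surface it immediately. The sketch below is an assumption about how that could look, not part of the PR: `require_file` is a hypothetical helper, and the demo uses an empty placeholder file in a temporary directory rather than a real dataset.

```shell
# Hypothetical preflight helper (not in the PR): fail early, with a clear
# message, if an expected dataset file is missing.
require_file() {
  if [ ! -f "$1" ]; then
    printf 'missing dataset: %s\n' "$1" >&2
    return 1
  fi
}

# Demo: only the _modal variant exists, mirroring the preprocessing output
# naming described in the comment above.
data_dir=$(mktemp -d)
: > "${data_dir}/swe_rebench_filtered_modal.parquet"

if require_file "${data_dir}/swe_rebench_filtered_modal.parquet"; then
  found_modal=yes
else
  found_modal=no
fi

if require_file "${data_dir}/swe_rebench_filtered.parquet" 2>/dev/null; then
  found_plain=yes
else
  found_plain=no
fi

echo "with _modal suffix: ${found_modal}; without suffix: ${found_plain}"
```

In the launch script the same helper could be called as `require_file "${TRAIN_FILE}" && require_file "${TEST_FILE}"` before submitting the job.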
actor_rollout_ref.model.mtp.enable_rollout=False
)

CHECKPOINT_CONTENTS=['model','hf_model','extra']
ray job submit --no-wait --address=$RAY_ADDRESS --runtime-env $RUNTIME_ENV \
Double-quote $RAY_ADDRESS and $RUNTIME_ENV to prevent word splitting and globbing if the paths or addresses contain spaces or special characters.
Suggested change:
- ray job submit --no-wait --address=$RAY_ADDRESS --runtime-env $RUNTIME_ENV \
+ ray job submit --no-wait --address="$RAY_ADDRESS" --runtime-env "$RUNTIME_ENV" \
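To make the word-splitting concern concrete, the sketch below counts how many arguments a command receives with and without quotes. The path is a made-up example containing a space, and `count_args` is a hypothetical stand-in for `ray job submit`.

```shell
# Stand-in for the real command: report how many arguments it received.
count_args() { echo "$#"; }

# Hypothetical value: a runtime-env file under a directory with a space.
RUNTIME_ENV="/tmp/my configs/runtime_env.yaml"

# Unquoted: the shell splits the value on the space, giving 3 arguments.
unquoted=$(count_args --runtime-env $RUNTIME_ENV)

# Quoted: the value is passed as a single argument, giving 2 arguments.
quoted=$(count_args --runtime-env "$RUNTIME_ENV")

echo "unquoted: ${unquoted} args, quoted: ${quoted} args"
```

With the unquoted form, the second half of the path would be interpreted as a separate positional argument by whatever command receives it.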
-- python3 -m verl.experimental.fully_async_policy.fully_async_main \
--config-path=config \
--config-name='fully_async_ppo_megatron_trainer.yaml' \
hydra.searchpath=[pkg://verl.trainer.config] \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path=${MODEL_PATH} \
actor_rollout_ref.actor.optim.clip_grad=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.checkpoint.async_save=False \
actor_rollout_ref.actor.checkpoint.save_contents=${CHECKPOINT_CONTENTS} \
actor_rollout_ref.rollout.multi_turn.enable=True \
actor_rollout_ref.rollout.multi_turn.max_parallel_calls=1 \
actor_rollout_ref.rollout.agent.num_workers=8 \
actor_rollout_ref.rollout.agent.agent_loop_config_path=${AGENT_CONFIG_PATH} \
algorithm.rollout_correction.rollout_is=${rollout_is} \
algorithm.rollout_correction.rollout_rs=${rollout_rs} \
algorithm.rollout_correction.rollout_rs_threshold=${rollout_rs_threshold} \
trainer.logger=['console','wandb'] \
The trainer.logger argument contains square brackets [ and ] which are globbing characters in bash. It should be double-quoted to prevent accidental pathname expansion if files matching the pattern exist in the working directory.
Suggested change:
- trainer.logger=['console','wandb'] \
+ trainer.logger="['console','wandb']" \
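The globbing risk can be reproduced in a scratch directory. In the sketch below, the unquoted argument is treated as a pattern whose bracket expression matches any single character from the set `c, o, n, s, l, e, w, a, d, b` or a comma, so a file that happens to be named `trainer.logger=c` replaces the argument. The file name and directories are contrived for the demonstration.

```shell
# Scratch directory containing a file the unquoted pattern happens to match.
glob_dir=$(mktemp -d)
: > "${glob_dir}/trainer.logger=c"

# Unquoted: pathname expansion replaces the argument with the matching file name.
unquoted=$(cd "$glob_dir" && printf '%s\n' trainer.logger=['console','wandb'])

# Quoted: the argument reaches the program exactly as written.
quoted=$(cd "$glob_dir" && printf '%s\n' "trainer.logger=['console','wandb']")

# Unquoted with no matching file: the pattern is left alone, but the shell
# still strips the single quotes before the program sees the argument.
empty_dir=$(mktemp -d)
no_match=$(cd "$empty_dir" && printf '%s\n' trainer.logger=['console','wandb'])

echo "unquoted, match present: ${unquoted}"
echo "unquoted, no match:      ${no_match}"
echo "quoted:                  ${quoted}"
```

The same reasoning applies to any other unquoted argument that contains `[` and `]`, and to unquoted expansions of variables whose values contain brackets.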
What does this PR do?
Adds examples/swe_agent/, an end-to-end recipe for training a
SWE-bench coding agent with fully-async RL (Megatron actors + vLLM
rollout on separate nodes) and Modal swe-rex sandboxes.
It stitches the existing building blocks into something runnable,
mirroring examples/search_agent/:
- Data: examples/data_preprocess/swe_rebench.py + swe_bench_verified.py
- Reward: uni_agent.reward.swe_rebench / swe_bench
- Rollout: uni_agent.agent_loop.UniAgentLoop (Modal swe-rex)
Reference config trains Qwen3-235B-A22B-Instruct-2507 with GRPO on a
12-node (8 train + 4 rollout) × 4-GPU topology; everything is
env-overridable to scale down.
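For readers unfamiliar with the env-override pattern, the sketch below shows how defaults of this kind are typically written in bash and how the reference topology adds up to 48 GPUs. The variable names `NNODES_TRAIN`, `NNODES_ROLLOUT`, and `NGPUS_PER_NODE` and the helper `topology_gpus` are assumptions for illustration; the script in this PR may use different names.

```shell
# Hypothetical helper: total GPUs for a given topology.
topology_gpus() {
  # $1: train nodes, $2: rollout nodes, $3: GPUs per node
  echo $(( ($1 + $2) * $3 ))
}

# Hypothetical variable names. Each keeps a caller-supplied value and falls
# back to the reference topology (8 train + 4 rollout nodes, 4 GPUs each).
NNODES_TRAIN=${NNODES_TRAIN:-8}
NNODES_ROLLOUT=${NNODES_ROLLOUT:-4}
NGPUS_PER_NODE=${NGPUS_PER_NODE:-4}

current_gpus=$(topology_gpus "$NNODES_TRAIN" "$NNODES_ROLLOUT" "$NGPUS_PER_NODE")
reference_gpus=$(topology_gpus 8 4 4)   # (8 + 4) * 4 = 48
scaled_down_gpus=$(topology_gpus 2 1 4) # (2 + 1) * 4 = 12

echo "current: ${current_gpus}, reference: ${reference_gpus}, scaled-down example: ${scaled_down_gpus}"
```

Under these assumed names, a smaller run would be launched by exporting the overrides before calling the script, for example `NNODES_TRAIN=2 NNODES_ROLLOUT=1 bash train_qwen3_235b_swebench.sh`.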
Checklist Before Starting
- gh pr list --repo verl-project/uni-agent --state open → no SWE-bench training example (PR #25, "init commit for external agent framework+gateway", is an unrelated agent-framework/gateway); no examples/swe* dir
- PR title: [examples] feat: ...
Test
This is a recipe (scripts + configs + docs), not library code:
- bash -n train_qwen3_235b_swebench.sh: OK
- python -c "import yaml; yaml.safe_load(...)" on both YAMLs: OK
- pre-commit run --files examples/swe_agent/*: pass (compile-all; ruff/mypy skip non-py)
- shellcheck: clean except style-only SC2206 on the hydra arg-array append, consistent with the repo's other launch scripts
Full end-to-end training was run internally on the reference topology;
the committed files are the scrubbed/generalized form of that setup
(no secrets or site-specific paths; runtime_env.yaml ships
placeholders only).
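It may help to spell out what the `bash -n` check in the list above does and does not cover. `bash -n` parses a script without executing it, so it catches syntax errors but not quoting or globbing problems of the kind raised in the review. The sketch below demonstrates this on two throwaway scripts; `syntax_ok` is a hypothetical wrapper, and the sketch assumes `bash` is on `PATH`.

```shell
# Hypothetical wrapper: succeed only if bash can parse the script.
syntax_ok() { bash -n "$1" 2>/dev/null; }

work_dir=$(mktemp -d)

# Parses fine, although the unquoted $1 is exactly what shellcheck would flag.
printf 'echo $1\n' > "${work_dir}/good.sh"

# Does not parse: the "if" is never closed with "fi".
printf 'if true; then\n  echo unterminated\n' > "${work_dir}/bad.sh"

if syntax_ok "${work_dir}/good.sh"; then good_result=pass; else good_result=fail; fi
if syntax_ok "${work_dir}/bad.sh"; then bad_result=pass; else bad_result=fail; fi

echo "good.sh: ${good_result}, bad.sh: ${bad_result}"
```

This is why the parse-only check and shellcheck are complementary: the first guards against a script that cannot run at all, the second against one that runs but misbehaves on unusual inputs.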
Files
- train_qwen3_235b_swebench.sh: ray job submit + full GRPO / Megatron / vLLM config; topology & paths are env vars
- agent_config.yaml: UniAgentLoop config (tools, Modal deployment, rollout concurrency, reward)
- runtime_env.yaml: Ray runtime-env template (placeholders for Modal / W&B tokens and checkout paths)
- README.md: dataset -> runtime_env -> launch -> monitor, plus tuning notes
Notes captured for reproducibility
Non-obvious settings learned running this at scale (documented in the
script header / README):
- max_response_length=128K: SWE-bench trajectories are long (mean ~70K tokens, ~90 turns); a 32K limit truncates roughly half of them
- tool_parser: hermes for Qwen3-235B (the wrong parser silently breaks tool calls)
- moe_token_dispatcher_type=alltoall: portable MoE dispatch
- VLLM_USE_DEEP_GEMM=0: vLLM 0.21 EP/CUTLASS init workaround
- expandable_segments:True is incompatible with the vLLM sleep-mode CuMemAllocator (pytorch#147851)
Checklist Before Submitting
- pre-commit run --files examples/swe_agent/* passed
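The expandable_segments note under "Notes captured for reproducibility" lends itself to a guard in the launch script. PyTorch reads this allocator option from the `PYTORCH_CUDA_ALLOC_CONF` environment variable, so a preflight check can refuse to launch when it is enabled. The following is a sketch under that assumption, not code from the PR; `alloc_conf_ok` is a hypothetical helper name.

```shell
# Hypothetical preflight helper (not in the PR): succeed only when the given
# allocator config string does not enable expandable segments.
alloc_conf_ok() {
  case "$1" in
    *expandable_segments:True*) return 1 ;;
    *) return 0 ;;
  esac
}

# Check the current environment; an unset variable counts as OK.
if alloc_conf_ok "${PYTORCH_CUDA_ALLOC_CONF:-}"; then
  echo "allocator config OK"
else
  echo "PYTORCH_CUDA_ALLOC_CONF enables expandable_segments; unset it before launching" >&2
fi
```

Note that a check in the submitting shell only covers that shell's environment; values set through the Ray runtime env would need to be checked separately.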