[examples] feat: add end-to-end SWE-bench RL training recipe (swe_agent)#52

Merged
yyDing1 merged 2 commits into verl-project:main from aoshen02:examples/swe-agent-recipe
Jun 1, 2026

@aoshen02

Collaborator

What does this PR do?

Adds examples/swe_agent/ — an end-to-end recipe for training a
SWE-bench coding agent with fully-async RL (Megatron actors + vLLM
rollout on separate nodes) and Modal swe-rex sandboxes.

It stitches the existing building blocks into something runnable,
mirroring examples/search_agent/:

  • data: examples/data_preprocess/swe_rebench.py + swe_bench_verified.py
  • reward: uni_agent.reward.swe_rebench / swe_bench
  • rollout: uni_agent.agent_loop.UniAgentLoop (Modal swe-rex)

Reference config trains Qwen3-235B-A22B-Instruct-2507 with GRPO on a
12-node (8 train + 4 rollout) × 4-GPU topology; everything is
env-overridable to scale down.
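
The env-override pattern mentioned above can be sketched roughly as follows. The variable names `NNODES_TRAIN`, `NNODES_ROLLOUT`, and `NGPUS_PER_NODE` are hypothetical placeholders (the launch script may use different names); only the default values are taken from the reference topology.

```shell
#!/usr/bin/env bash
# Hypothetical variable names; the real script may use different ones.
# ${VAR:-default} keeps a value exported by the caller and falls back otherwise.
NNODES_TRAIN=${NNODES_TRAIN:-8}
NNODES_ROLLOUT=${NNODES_ROLLOUT:-4}
NGPUS_PER_NODE=${NGPUS_PER_NODE:-4}

# Derived quantity: total GPUs across training and rollout nodes.
total_gpus=$(( (NNODES_TRAIN + NNODES_ROLLOUT) * NGPUS_PER_NODE ))
echo "train nodes=${NNODES_TRAIN} rollout nodes=${NNODES_ROLLOUT} total GPUs=${total_gpus}"
```

Under this pattern, exporting smaller values before invoking the script (for example `NNODES_TRAIN=1 NNODES_ROLLOUT=1`) would scale the run down without editing the file.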

Checklist Before Starting

  • Search for similar PRs/issues:
  • Format the PR title as [examples] feat: ...

Test

This is a recipe (scripts + configs + docs), not library code:

  • bash -n train_qwen3_235b_swebench.sh — OK
  • python -c "import yaml; yaml.safe_load(...)" on both YAMLs — OK
  • pre-commit run --files examples/swe_agent/* — pass (compile-all; ruff/mypy skip non-py)
  • shellcheck — clean except style-only SC2206 on the hydra arg-array append, consistent with the repo's other launch scripts
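
As an illustration of what the `bash -n` check in the list above does and does not catch, here is a minimal sketch. The helper name `check_shell_syntax` and the temporary files are made up for the demo; `bash -n` only parses a script and never executes it, so it finds syntax errors but not runtime failures.

```shell
#!/usr/bin/env bash
# check_shell_syntax FILE: parse FILE without executing it (bash -n).
check_shell_syntax() {
  bash -n "$1" 2>/dev/null
}

tmpdir=$(mktemp -d)
printf 'echo "ok"\n' > "${tmpdir}/good.sh"
printf 'if true; then\n' > "${tmpdir}/bad.sh"   # unterminated "if" block

check_shell_syntax "${tmpdir}/good.sh" && good_status=pass || good_status=fail
check_shell_syntax "${tmpdir}/bad.sh"  && bad_status=pass  || bad_status=fail
echo "good.sh: ${good_status}, bad.sh: ${bad_status}"
rm -rf "${tmpdir}"
```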

Full end-to-end training was run internally on the reference topology;
the committed files are the scrubbed/generalized form of that setup
(no secrets or site-specific paths — runtime_env.yaml ships
placeholders only).

Files

| File | Purpose |
| --- | --- |
| train_qwen3_235b_swebench.sh | ray job submit + full GRPO / Megatron / vLLM config; topology & paths are env vars |
| agent_config.yaml | UniAgentLoop config: tools, Modal deployment, rollout concurrency, reward |
| runtime_env.yaml | Ray runtime-env template (placeholders for Modal / W&B tokens + checkout paths) |
| README.md | dataset → runtime_env → launch → monitor + tuning notes |

Notes captured for reproducibility

Non-obvious settings learned running this at scale (documented in the
script header / README):

  • max_response_length=128K — SWE-bench trajectories are long (mean ~70K tokens, ~90 turns); 32K truncates ~half
  • tool_parser: hermes for Qwen3-235B (wrong parser silently breaks tool calls)
  • moe_token_dispatcher_type=alltoall — portable MoE dispatch
  • VLLM_USE_DEEP_GEMM=0 — vLLM 0.21 EP/CUTLASS init workaround
  • do not set expandable_segments:True (incompatible with vLLM sleep-mode CuMemAllocator, pytorch#147851)
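
Two of the settings above are environment-level and could be enforced with a small guard like the sketch below. The helper name `alloc_conf_is_safe` is hypothetical; `PYTORCH_CUDA_ALLOC_CONF` is the PyTorch allocator variable in which `expandable_segments:True` would normally be set, and whether the recipe performs such a check is an assumption.

```shell
#!/usr/bin/env bash
# Disable DeepGEMM in vLLM, as the notes above recommend.
export VLLM_USE_DEEP_GEMM=0

# alloc_conf_is_safe CONF: succeed unless CONF enables expandable_segments.
alloc_conf_is_safe() {
  case "$1" in
    *expandable_segments:True*) return 1 ;;
    *) return 0 ;;
  esac
}

# Warn (without aborting) if the caller's environment carries the bad setting.
if ! alloc_conf_is_safe "${PYTORCH_CUDA_ALLOC_CONF:-}"; then
  echo "warning: PYTORCH_CUDA_ALLOC_CONF enables expandable_segments; unset it for vLLM sleep mode" >&2
fi
```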

Checklist Before Submitting

  • Read the Contribute Guide
  • pre-commit run --files examples/swe_agent/* passed
  • No new library code → no unit tests; recipe validated via syntax/lint + internal end-to-end run
  • AI assistance was used (Claude Code); the submitting human (@aoshen02) reviewed every line
  • No secrets / site-specific paths committed

Adds examples/swe_agent/, an end-to-end recipe for training a SWE-bench
coding agent with fully-async RL (Megatron actors + vLLM rollout) and
Modal swe-rex sandboxes. It stitches together the existing building
blocks (examples/data_preprocess/swe_rebench.py + swe_bench_verified.py
for data, uni_agent.reward.swe_rebench / swe_bench for reward,
uni_agent.agent_loop.UniAgentLoop for rollout) into a runnable launch
script + configs + README, mirroring the structure of
examples/search_agent/.

Reference config trains Qwen3-235B-A22B-Instruct-2507 with GRPO on a
12-node (8 train + 4 rollout) x 4-GPU topology, but everything is
env-overridable to scale down.

Files:
  - train_qwen3_235b_swebench.sh : ray job submit + full GRPO / Megatron
      / vLLM config. Topology, paths, and addresses are env vars.
  - agent_config.yaml            : UniAgentLoop config (tools, Modal
      deployment, rollout concurrency, reward).
  - runtime_env.yaml             : Ray runtime-env template (placeholders
      for Modal / W&B tokens and checkout paths).
  - README.md                    : dataset -> runtime_env -> launch ->
      monitor, plus tuning notes.

The script header / README capture a few non-obvious settings learned
from running this at scale: max_response_length=128K (SWE-bench
trajectories are long), tool_parser=hermes for Qwen3-235B,
moe_token_dispatcher_type=alltoall, VLLM_USE_DEEP_GEMM=0 for vLLM 0.21
EP init, and the expandable_segments / CuMemAllocator incompatibility.

No secrets or environment-specific paths are committed; runtime_env.yaml
ships placeholders only.

This PR includes AI assistance (Claude Code). The submitting human
(@aoshen02) reviewed every line.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces an end-to-end recipe for training a SWE-bench coding agent using the Uni-Agent framework with fully-async RL and Modal sandboxes. It includes a README, agent loop configuration, Ray runtime environment template, and a launch script for training Qwen3-235B-A22B-Instruct-2507. The review feedback suggests correcting the default dataset paths in the launch script to include the _modal suffix, and double-quoting several variables and arguments (such as paths, addresses, and parameters containing square brackets) to prevent word splitting and globbing issues in Bash.

DATA_ROOT=${DATA_ROOT:-/path/to/data-root}
EXAMPLE_DIR=${EXAMPLE_DIR:-$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)}
# Qwen3-235B-A22B-Instruct-2507 snapshot (first/only snapshot under the HF cache).
MODEL_PATH=${MODEL_PATH:-$(ls -d ${DATA_ROOT}/hf-models/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507/snapshots/*/ 2>/dev/null | head -1)}


medium

Double-quote ${DATA_ROOT} inside the command substitution to prevent word splitting if the path contains spaces.

Suggested change
- MODEL_PATH=${MODEL_PATH:-$(ls -d ${DATA_ROOT}/hf-models/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507/snapshots/*/ 2>/dev/null | head -1)}
+ MODEL_PATH=${MODEL_PATH:-$(ls -d "${DATA_ROOT}"/hf-models/hub/models--Qwen--Qwen3-235B-A22B-Instruct-2507/snapshots/*/ 2>/dev/null | head -1)}
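
One alternative to parsing `ls` output is to let the shell expand the glob directly, which also handles spaces in the path. The following is only a sketch of that idea, not what the PR does: `first_snapshot` is a hypothetical helper, and the directory layout is a throwaway stand-in for the HF cache.

```shell
#!/usr/bin/env bash
# first_snapshot DIR: print the first entry under DIR/snapshots/, or nothing.
first_snapshot() {
  local d matches=()
  # Quoting "$1" keeps a path with spaces as one word; the trailing */ stays
  # unquoted so the shell still expands it.
  for d in "$1"/snapshots/*/; do
    [ -d "$d" ] && matches+=("$d")
  done
  [ "${#matches[@]}" -gt 0 ] && printf '%s\n' "${matches[0]}"
}

# Demo with a throwaway directory whose name contains a space.
root=$(mktemp -d)
mkdir -p "${root}/data root/model/snapshots/abc123"
found=$(first_snapshot "${root}/data root/model")
missing=$(first_snapshot "${root}/no such dir")
echo "found: ${found}"
echo "missing: '${missing}'"
rm -rf "${root}"
```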

Comment on lines +50 to +51
TRAIN_FILE=${TRAIN_FILE:-${DATA_ROOT}/data/swe_agent/swe_rebench_filtered.parquet}
TEST_FILE=${TEST_FILE:-${DATA_ROOT}/data/swe_agent/swe_bench_verified.parquet}


medium

The preprocessing scripts swe_rebench.py and swe_bench_verified.py append the deployment implementation suffix (e.g., _modal.parquet) to the output filenames. Since this training script is configured for Modal, the default paths should include the _modal suffix to avoid file-not-found errors out of the box.

Suggested change
- TRAIN_FILE=${TRAIN_FILE:-${DATA_ROOT}/data/swe_agent/swe_rebench_filtered.parquet}
- TEST_FILE=${TEST_FILE:-${DATA_ROOT}/data/swe_agent/swe_bench_verified.parquet}
+ TRAIN_FILE=${TRAIN_FILE:-${DATA_ROOT}/data/swe_agent/swe_rebench_filtered_modal.parquet}
+ TEST_FILE=${TEST_FILE:-${DATA_ROOT}/data/swe_agent/swe_bench_verified_modal.parquet}
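
Independently of which default is chosen, a launch script can fail fast when a dataset path does not exist instead of surfacing the error deep inside the job. A minimal sketch, assuming a hypothetical helper `require_file` and temporary placeholder files rather than the real datasets:

```shell
#!/usr/bin/env bash
# require_file PATH LABEL: fail with a readable message if PATH is not a regular file.
require_file() {
  if [ ! -f "$1" ]; then
    echo "error: $2 not found: $1" >&2
    return 1
  fi
}

# Demo on temporary files (placeholders, not the real datasets).
tmpdir=$(mktemp -d)
touch "${tmpdir}/swe_rebench_filtered_modal.parquet"
require_file "${tmpdir}/swe_rebench_filtered_modal.parquet" TRAIN_FILE \
  && with_suffix=found || with_suffix=missing
require_file "${tmpdir}/swe_rebench_filtered.parquet" TRAIN_FILE 2>/dev/null \
  && without_suffix=found || without_suffix=missing
echo "with suffix: ${with_suffix}, without suffix: ${without_suffix}"
rm -rf "${tmpdir}"
```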

actor_rollout_ref.model.mtp.enable_rollout=False
)

CHECKPOINT_CONTENTS=['model','hf_model','extra']


medium

The CHECKPOINT_CONTENTS variable contains square brackets [ and ] which are globbing characters in bash. It should be quoted to prevent accidental shell expansion.

Suggested change
- CHECKPOINT_CONTENTS=['model','hf_model','extra']
+ CHECKPOINT_CONTENTS="['model','hf_model','extra']"
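
A small experiment shows what the two forms actually store. Bash does not perform pathname expansion on the right-hand side of a plain assignment, so the practical difference here is quote removal: the unquoted form loses its inner single quotes. The globbing risk arises later, wherever the variable is expanded without quotes. The variable names below are illustrative only.

```shell
#!/usr/bin/env bash
# Same literal text, assigned with and without surrounding double quotes.
unquoted=['model','hf_model','extra']
quoted="['model','hf_model','extra']"

# Quote the expansions so the stored values are printed exactly as held.
printf 'unquoted: %s\n' "${unquoted}"
printf 'quoted:   %s\n' "${quoted}"
```

The first line prints `[model,hf_model,extra]` and the second prints `['model','hf_model','extra']`.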


CHECKPOINT_CONTENTS=['model','hf_model','extra']

ray job submit --no-wait --address=$RAY_ADDRESS --runtime-env $RUNTIME_ENV \


medium

Double-quote $RAY_ADDRESS and $RUNTIME_ENV to prevent word splitting and globbing if the paths or addresses contain spaces or special characters.

Suggested change
- ray job submit --no-wait --address=$RAY_ADDRESS --runtime-env $RUNTIME_ENV \
+ ray job submit --no-wait --address="$RAY_ADDRESS" --runtime-env "$RUNTIME_ENV" \
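
The effect of word splitting can be seen by counting the arguments a command receives. This sketch uses a hypothetical `count_args` helper and a made-up path containing a space; it does not invoke `ray`.

```shell
#!/usr/bin/env bash
# count_args ARGS...: print how many arguments were received.
count_args() { echo "$#"; }

RUNTIME_ENV="/tmp/my checkout/runtime_env.yaml"   # hypothetical path with a space

# Unquoted: the shell splits the value at the space into two words.
unquoted_count=$(count_args --runtime-env $RUNTIME_ENV)
# Quoted: the value is passed as a single argument.
quoted_count=$(count_args --runtime-env "$RUNTIME_ENV")
echo "unquoted: ${unquoted_count} args, quoted: ${quoted_count} args"
```

With the default `IFS`, this prints `unquoted: 3 args, quoted: 2 args`.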

-- python3 -m verl.experimental.fully_async_policy.fully_async_main \
--config-path=config \
--config-name='fully_async_ppo_megatron_trainer.yaml' \
hydra.searchpath=[pkg://verl.trainer.config] \


medium

The hydra.searchpath argument contains square brackets [ and ] which are globbing characters in bash. It should be double-quoted to prevent accidental pathname expansion.

Suggested change
- hydra.searchpath=[pkg://verl.trainer.config] \
+ hydra.searchpath="[pkg://verl.trainer.config]" \
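
To see how an unquoted bracket argument can be rewritten by pathname expansion, consider this generic sketch. It uses a made-up override `opt=[ab]` and a scratch directory rather than the real Hydra argument; the helper `count_and_show` is hypothetical.

```shell
#!/usr/bin/env bash
# count_and_show ARGS...: print the argument count and the arguments themselves.
count_and_show() { printf '%s -> %s\n' "$#" "$*"; }

orig_dir=$PWD
workdir=$(mktemp -d)
cd "${workdir}" || exit 1
touch "opt=a"              # a file that happens to match the pattern opt=[ab]

unquoted=$(count_and_show opt=[ab])     # glob matches the file and is replaced
quoted=$(count_and_show "opt=[ab]")     # quotes keep the text literal
echo "unquoted: ${unquoted}"
echo "quoted:   ${quoted}"

cd "${orig_dir}" && rm -rf "${workdir}"
```

When no file matches, bash passes the pattern through unchanged by default, which is why the problem tends to stay hidden until a matching file appears or options such as `nullglob` or `failglob` are enabled.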

actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path=${MODEL_PATH} \


medium

Double-quote ${MODEL_PATH} to prevent word splitting if the model path contains spaces or special characters.

Suggested change
- actor_rollout_ref.model.path=${MODEL_PATH} \
+ actor_rollout_ref.model.path="${MODEL_PATH}" \

actor_rollout_ref.actor.optim.clip_grad=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.checkpoint.async_save=False \
actor_rollout_ref.actor.checkpoint.save_contents=${CHECKPOINT_CONTENTS} \


medium

The ${CHECKPOINT_CONTENTS} variable should be double-quoted to prevent globbing and word splitting by the shell.

Suggested change
- actor_rollout_ref.actor.checkpoint.save_contents=${CHECKPOINT_CONTENTS} \
+ actor_rollout_ref.actor.checkpoint.save_contents="${CHECKPOINT_CONTENTS}" \

actor_rollout_ref.rollout.multi_turn.enable=True \
actor_rollout_ref.rollout.multi_turn.max_parallel_calls=1 \
actor_rollout_ref.rollout.agent.num_workers=8 \
actor_rollout_ref.rollout.agent.agent_loop_config_path=${AGENT_CONFIG_PATH} \


medium

Double-quote ${AGENT_CONFIG_PATH} to prevent word splitting if the path contains spaces or special characters.

Suggested change
- actor_rollout_ref.rollout.agent.agent_loop_config_path=${AGENT_CONFIG_PATH} \
+ actor_rollout_ref.rollout.agent.agent_loop_config_path="${AGENT_CONFIG_PATH}" \
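
One way to apply these quoting suggestions uniformly is to collect the overrides in a bash array with each element quoted, then expand it as `"${overrides[@]}"`. The sketch below is an assumption about how that could look, not the script's actual structure; the paths are placeholders and nothing is launched.

```shell
#!/usr/bin/env bash
# Placeholder paths; MODEL_PATH contains a space on purpose.
MODEL_PATH="/path/to/model dir"
AGENT_CONFIG_PATH="/path/to/agent_config.yaml"

# Each element is one quoted word, so spaces and brackets survive intact.
overrides=(
  "actor_rollout_ref.model.path=${MODEL_PATH}"
  "actor_rollout_ref.rollout.agent.agent_loop_config_path=${AGENT_CONFIG_PATH}"
  "trainer.logger=['console','wandb']"
)
# Appending later also uses a quoted element.
overrides+=("actor_rollout_ref.actor.checkpoint.async_save=False")

# Print one argument per line instead of launching anything.
printf '%s\n' "${overrides[@]}"
echo "argument count: ${#overrides[@]}"
```

A real invocation would pass `"${overrides[@]}"` after the module name, so every element reaches the program as exactly one argument.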

algorithm.rollout_correction.rollout_is=${rollout_is} \
algorithm.rollout_correction.rollout_rs=${rollout_rs} \
algorithm.rollout_correction.rollout_rs_threshold=${rollout_rs_threshold} \
trainer.logger=['console','wandb'] \


medium

The trainer.logger argument contains square brackets [ and ] which are globbing characters in bash. It should be double-quoted to prevent accidental pathname expansion if files matching the pattern exist in the working directory.

Suggested change
- trainer.logger=['console','wandb'] \
+ trainer.logger="['console','wandb']" \

@yyDing1 yyDing1 merged commit bdf8ae8 into verl-project:main Jun 1, 2026
3 checks passed
@yyDing1 yyDing1 deleted the examples/swe-agent-recipe branch June 1, 2026 09:26
