92 changes: 92 additions & 0 deletions examples/swe_agent_235b/README.md
@@ -0,0 +1,92 @@
# SWE-bench Agent Training Example

End-to-end recipe for training a **SWE-bench coding agent** with the **Uni-Agent** framework, using **fully-async RL** (Megatron actors + vLLM rollout replicas on separate nodes) and **Modal swe-rex sandboxes** for safe, parallel code execution during rollout.

The reference configuration trains **Qwen3-235B-A22B-Instruct-2507** with GRPO, but the launch script is fully parameterized — point it at a smaller model and shrink the topology to reproduce on fewer GPUs.

The agent solves each SWE-bench task by iteratively calling three tools inside a per-task Modal sandbox:

- `str_replace_editor` — view / edit repository files
- `execute_bash` — run shell commands (build, run tests, inspect)
- `submit` — submit the final patch for evaluation

The reward is computed by running the task's test suite against the submitted patch (`uni_agent.reward.swe_rebench` for training, `uni_agent.reward.swe_bench` for SWE-bench Verified).

---

## Prerequisites

- A Ray cluster with GPU nodes (the reference uses 12 nodes × 4 GPUs: 8 training nodes + 4 rollout nodes) and a working verl + Megatron + vLLM install on every node.
- A [Modal](https://modal.com) account and API token. The rollout spins up one swe-rex sandbox per in-flight trajectory, so size your concurrency against your Modal workspace's sandbox quota (see `agent_config.yaml`).
- A Weights & Biases account (or change `trainer.logger` in the launch script).

## Step 1: Prepare the datasets

Build the train (SWE-reBench) and validation (SWE-bench Verified) parquet files with the existing preprocessing scripts:

```bash
python examples/data_preprocess/swe_rebench.py --local-save-dir ~/data/swe_agent
python examples/data_preprocess/swe_bench_verified.py --local-save-dir ~/data/swe_agent
```

These write `swe_rebench_filtered_*.parquet` and `swe_bench_verified_*.parquet` into `--local-save-dir`. Make that directory reachable from every Ray node (shared filesystem or copied), then point `TRAIN_FILE` / `TEST_FILE` at the exact files produced (see Step 3).
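
Before launching, it can help to confirm that preprocessing actually produced files matching those two patterns. The following is a minimal sketch that assumes only the filename patterns stated above; `find_dataset_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_dataset_files(save_dir: str) -> dict[str, list[str]]:
    """Return the parquet files under save_dir, grouped by split.

    Raises FileNotFoundError if either split has no matching file.
    """
    root = Path(save_dir).expanduser()
    patterns = {
        "train": "swe_rebench_filtered_*.parquet",
        "test": "swe_bench_verified_*.parquet",
    }
    found = {
        split: sorted(str(path) for path in root.glob(pattern))
        for split, pattern in patterns.items()
    }
    missing = [split for split, paths in found.items() if not paths]
    if missing:
        raise FileNotFoundError(f"no parquet files for {missing} under {root}")
    return found
```

Calling `find_dataset_files("~/data/swe_agent")` on a node that should see the data returns the candidate paths for `TRAIN_FILE` and `TEST_FILE`, or fails loudly if the directory is not reachable from that node.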

## Step 2: Configure the runtime env

Copy `runtime_env.yaml` and fill in the placeholders:

- `working_dir` and `PYTHONPATH` → your uni-agent and verl checkouts.
- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` → your Modal token (`modal token new`, or `modal token set --profile=<team>` for a team workspace). Alternatively leave them unset and rely on `~/.modal.toml` on every node.
- `WANDB_API_KEY` → your W&B key (or run `wandb login` on the nodes and remove it).

The file also documents two settings worth keeping:

- **Do not** set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` — it is incompatible with vLLM's sleep-mode `CuMemAllocator` (pytorch/pytorch#147851).
- Set `CUDA_HOME` — some MoE expert-parallel kernels JIT-compile at runtime and need it to locate `nvcc`.
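
The checks above are easy to get wrong on a fresh cluster, so a small preflight function may be worth running before submitting the job. This is a sketch under stated assumptions: `check_runtime_env` is a hypothetical helper, it takes the `env_vars` mapping as a plain dict, and it treats any value containing `<...>` as an unfilled placeholder.

```python
import os
import shutil


def check_runtime_env(env_vars: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems found in env_vars."""
    problems = []

    # Unfilled template placeholders look like "<YOUR_MODAL_TOKEN_ID>".
    for key, value in env_vars.items():
        if "<" in value and ">" in value:
            problems.append(f"{key} still contains a placeholder: {value}")

    # expandable_segments conflicts with vLLM's sleep-mode allocator.
    alloc_conf = env_vars.get("PYTORCH_CUDA_ALLOC_CONF", "")
    if "expandable_segments:true" in alloc_conf.lower():
        problems.append("PYTORCH_CUDA_ALLOC_CONF enables expandable_segments")

    # JIT-compiled kernels look for nvcc under $CUDA_HOME/bin.
    cuda_home = env_vars.get("CUDA_HOME", "")
    if not cuda_home:
        problems.append("CUDA_HOME is unset")
    elif shutil.which("nvcc", path=os.path.join(cuda_home, "bin")) is None:
        problems.append(f"nvcc not found under {cuda_home}/bin")

    return problems
```

An empty list means none of these specific issues were detected. The `nvcc` lookup only reflects the machine where the function runs, so it is most useful when executed on a GPU node rather than on the submitting host.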

## Step 3: Launch training

```bash
export RAY_ADDRESS=http://<ray-head>:8265
export DATA_ROOT=/path/to/data-root # holds hf-models/ and data/swe_agent/
export TRAIN_FILE=$DATA_ROOT/data/swe_agent/swe_rebench_filtered_<impl>.parquet
export TEST_FILE=$DATA_ROOT/data/swe_agent/swe_bench_verified_<impl>.parquet

bash examples/swe_agent_235b/train_qwen3_235b_swebench.sh
```

Topology and parallelism are env-overridable, e.g.:

```bash
NNODES_TRAIN=8 NNODES_ROLLOUT=4 NGPUS_PER_NODE=4 \
ACTOR_TP=2 ACTOR_CP=2 ACTOR_PP=8 ACTOR_EP=4 ACTOR_ETP=1 \
INFER_TP=4 \
bash examples/swe_agent_235b/train_qwen3_235b_swebench.sh
```
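
When shrinking the topology, a quick arithmetic check of the parallel sizes can catch an invalid combination before the job is queued. The sketch below assumes the usual Megatron-style divisibility rules (world size divisible by TP × CP × PP for dense layers and by ETP × EP × PP for expert layers). `check_actor_parallelism` is a hypothetical helper, and the launch script or Megatron itself may enforce further constraints.

```python
def check_actor_parallelism(
    nnodes: int,
    gpus_per_node: int,
    tp: int,
    cp: int,
    pp: int,
    ep: int,
    etp: int,
) -> dict[str, int]:
    """Validate the actor parallel sizes against the training world size."""
    world_size = nnodes * gpus_per_node
    dense_group = tp * cp * pp    # GPUs consumed by one dense model replica
    expert_group = etp * ep * pp  # GPUs consumed by one expert-layer replica

    if world_size % dense_group != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*CP*PP={dense_group}"
        )
    if world_size % expert_group != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by ETP*EP*PP={expert_group}"
        )

    return {
        "world_size": world_size,
        "data_parallel": world_size // dense_group,
        "expert_data_parallel": world_size // expert_group,
    }
```

For the values shown above (8 training nodes × 4 GPUs, TP=2, CP=2, PP=8, EP=4, ETP=1), this yields a world size of 32 with data-parallel and expert-data-parallel sizes of 1.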

Notable settings baked into the script (see its header for the full rationale):

- `max_response_length=128K` — SWE-bench trajectories are long (empirically mean ~70K tokens, ~90 turns); a 32K cap truncates roughly half of them.
- `tool_parser: hermes` (in `agent_config.yaml`) — Qwen3-235B-A22B uses the Hermes tool-call template; the wrong parser silently breaks tool calls.
- `moe_token_dispatcher_type=alltoall` — portable MoE dispatch (no extra expert-parallel comm library required).
- `VLLM_USE_DEEP_GEMM=0` — works around a vLLM 0.21 EP/CUTLASS init issue.
- `performance_mode=interactivity` — favoured throughput in our rollout concurrency sweep for this model.
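
For orientation on the `tool_parser` setting: Hermes-style tool calls are emitted as a JSON object wrapped in `<tool_call>` tags. The sketch below is illustrative only. It is not the parser used by Uni-Agent or vLLM, and the `command` argument in the usage note is a hypothetical example.

```python
import json
import re

# Matches the body of each <tool_call>...</tool_call> block.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Extract Hermes-style tool calls from a model response."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # Skip blocks whose body is not valid JSON.
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {
                    "name": payload["name"],
                    "arguments": payload.get("arguments", {}),
                }
            )
    return calls
```

A response containing `<tool_call>{"name": "execute_bash", "arguments": {"command": "ls"}}</tool_call>` yields one call named `execute_bash`. A response that encodes the call in a different template yields an empty list, which is how a mismatched parser ends up dropping tool calls without raising an error.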

## Step 4: Monitor

- W&B: reward curve under the configured `project_name` / `experiment_name`.
- Optional Prometheus rollout metrics: set `ENABLE_PROMETHEUS_MONITORING=true` and `PROMETHEUS_CONFIG_FILE=...`; verl rewrites the scrape targets to the live vLLM replicas automatically.
- Per-trajectory agent logs: `log_dir` in `agent_config.yaml` (default `/tmp/swe_agent_rollout_logs/<run_id>/run.log`).

## Tuning notes

- **Rollout concurrency** (`concurrency` in `agent_config.yaml`) is the main throughput/stability knob. Setting it too high relative to the vLLM KV-cache budget causes a preemption cascade; setting it too high relative to your Modal sandbox quota causes sandbox-create failures. Start around `20 × (rollout replicas)` and ramp up once the run is steady with no preemption.
- **Checkpoint storage**: `save_freq=1` + `max_actor_ckpt_to_keep=2` keeps only the two most recent checkpoints; raise `save_freq` if I/O-bound.
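
The starting-point rule for concurrency can be written out as a small calculation. This sketch assumes that one rollout replica occupies `INFER_TP` GPUs (so the replica count is rollout GPUs divided by `INFER_TP`) and that the per-worker limit is a floor division of the total by the worker count. Both are assumptions for illustration, and the function names are hypothetical.

```python
def starting_concurrency(
    nnodes_rollout: int,
    gpus_per_node: int,
    infer_tp: int,
    per_replica: int = 20,
) -> int:
    """Suggested initial `concurrency`: per_replica x (rollout replicas)."""
    rollout_gpus = nnodes_rollout * gpus_per_node
    if rollout_gpus % infer_tp != 0:
        raise ValueError(
            f"{rollout_gpus} rollout GPUs are not divisible by INFER_TP={infer_tp}"
        )
    replicas = rollout_gpus // infer_tp
    return per_replica * replicas


def per_worker_limit(concurrency: int, num_workers: int) -> int:
    """In-flight trajectories per agent worker (floor division is assumed)."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return concurrency // num_workers
```

With the reference topology (4 rollout nodes × 4 GPUs, `INFER_TP=4`) this gives 4 replicas and a starting concurrency of 80, which matches the `concurrency: 80` value in `agent_config.yaml`.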

## Files

| File | Purpose |
|---|---|
| `train_qwen3_235b_swebench.sh` | Ray job submit + full GRPO / Megatron / vLLM config |
| `agent_config.yaml` | UniAgentLoop config: tools, Modal deployment, concurrency, reward |
| `runtime_env.yaml` | Ray runtime env template (fill in tokens / paths) |
52 changes: 52 additions & 0 deletions examples/swe_agent_235b/agent_config.yaml
@@ -0,0 +1,52 @@
# Agent-loop config for SWE-bench RL training with Modal swe-rex sandboxes.
#
# Referenced from the launch script via
# actor_rollout_ref.rollout.agent.agent_loop_config_path=<this file>
#
# The `concurrency` field is the total number of in-flight trajectories across
# the whole rollout fleet (it is divided by rollout.agent.num_workers to get a
# per-worker semaphore). Tune it against your rollout KV budget and Modal
# sandbox quota:
# - Too high relative to KV cache -> vLLM preemption cascade at high KV usage.
# - Too high relative to your Modal workspace cap -> sandbox-create failures.
# A safe starting point is ~20 x (number of rollout replicas); ramp up once the
# run is steady with no preemption. SWE-bench trajectories are long
# (max_response up to 128K), so leave generous headroom for the long tail.
- name: swe_agent

_target_: uni_agent.agent_loop.UniAgentLoop

concurrency: 80
log_dir: /tmp/swe_agent_rollout_logs
mask_abnormal_exit_traj: false
# Qwen3-235B-A22B uses the Hermes tool-call template. Using the wrong parser
# silently mis-parses tool calls and breaks training — match the parser to
# your model's chat template.
tool_parser: hermes

interaction:
action_timeout: 300
max_turns: 300

env:
deployment:
type: modal
startup_timeout: 300
runtime_timeout: 300
deployment_timeout: 3600
env_variables:
PIP_PROGRESS_BAR: "off"
PIP_CACHE_DIR: "~/.cache/pip"
PAGER: "cat"
MANPAGER: "cat"
LESS: "-R"
TQDM_DISABLE: "1"
GIT_PAGER: "cat"

tools:
- name: str_replace_editor
- name: execute_bash
- name: submit

reward:
eval_timeout: 300
51 changes: 51 additions & 0 deletions examples/swe_agent_235b/runtime_env.yaml
@@ -0,0 +1,51 @@
# Ray runtime env for the SWE-bench fully-async RL training example.
#
# This is a TEMPLATE. Fill in the placeholders (<...>) before launching, or
# export the corresponding variables in your shell and drop them here.
#
# `working_dir` is uploaded to every Ray worker, so point it at your uni-agent
# checkout. PYTHONPATH must include both your verl checkout and uni-agent so the
# `verl.experimental.fully_async_policy` entrypoint and `uni_agent.agent_loop`
# resolve on the workers.

working_dir: <PATH_TO_YOUR_UNI_AGENT_CHECKOUT>
excludes: ["/.git/", "/.venv/", "/__pycache__/"]

pip:
  - loguru
  - pydantic
  - pydantic_settings
  - swebench
  - modal
  - swe-rex
  - boto3
  - aiohttp

env_vars:
  PYTHONPATH: "<PATH_TO_YOUR_VERL_CHECKOUT>:<PATH_TO_YOUR_UNI_AGENT_CHECKOUT>"
  TORCH_NCCL_AVOID_RECORD_STREAMS: "1"
  CUDA_DEVICE_MAX_CONNECTIONS: "1"
  # Some MoE EP kernels JIT-compile at runtime and read $CUDA_HOME to locate
  # nvcc. If your container leaves it unset, the compile fails with
  # "/bin/nvcc not found"; set it explicitly.
  CUDA_HOME: "/usr/local/cuda"
  # IMPORTANT: do NOT set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
  # vLLM's CuMemAllocator (sleep mode / weight transfer) is incompatible with
  # expandable segments and will assert at startup
  # (vllm/device_allocator/cumem.py). See pytorch/pytorch#147851.
  #
  # NCCL settings below are tuned for NVLink-rich multi-GPU nodes (e.g. GB200
  # NVL72). Adjust or remove for other fabrics.
  NCCL_CUMEM_ENABLE: "1"
  NCCL_NVLS_ENABLE: "1"
  NCCL_MNNVL_ENABLE: "1"
  VLLM_USE_NCCL_SYMM_MEM: "1"
  # Modal credentials for the swe-rex sandboxes used by the SWE-bench rollout.
  # Create a token with `modal token new` (or `modal token set --profile=<name>`
  # for a team workspace) and paste the id/secret here, or leave these unset and
  # rely on ~/.modal.toml on every Ray node.
  MODAL_TOKEN_ID: "<YOUR_MODAL_TOKEN_ID>"
  MODAL_TOKEN_SECRET: "<YOUR_MODAL_TOKEN_SECRET>"
  # Weights & Biases. Prefer `wandb login` on the nodes; only set this if you
  # must pass the key through the runtime env.
  WANDB_API_KEY: "<YOUR_WANDB_API_KEY>"