feat: enable v2 training pipeline with controller parity #1327

Draft

garrett4wade wants to merge 2 commits into main from fw/rl3

Conversation

@garrett4wade (Collaborator)

Description

Bring GatewayTrainController and RolloutControllerV2 to full parity with v1 controllers, enabling the v2 training pipeline for RL training paths.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Key Changes

V2 Controller Parity

  • Route sglang_remote and vllm_remote to RolloutControllerV2 when config._version == "v2" (see the sketch after this list)
  • Add thread-safe version management, connect_engine guard address support, and clear_batches RTensor storage eviction to GatewayTrainController
  • Direct config_perf_tracer calls to individual workers instead of gateway relay
  • Pass staleness_manager to WorkflowExecutor in RolloutControllerV2
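
A minimal sketch of the version gate on the rollout path, assuming a factory-style helper. Only RolloutControllerV2 and config._version come from this PR; make_rollout_controller, the v1 class name, and the constructor calls are illustrative, and imports of the controller classes are omitted.

# Hypothetical factory sketch; RolloutControllerV2 / RolloutController stand in
# for the repository's controller classes (imports omitted).
def make_rollout_controller(config, backend: str):
    # sglang_remote / vllm_remote use the v2 controller only when the config
    # explicitly opts in through the private _version field.
    if backend in ("sglang_remote", "vllm_remote") and getattr(config, "_version", "v1") == "v2":
        return RolloutControllerV2(config)
    # All other cases keep the v1 behavior unchanged.
    return RolloutController(config)
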

AsyncRewardWrapper Lifecycle

  • Replace weakref finalization + instance counting with atexit shutdown for all shared executors (sketched after this list)
  • Simplify retry logic and executor recreation with compare-and-swap guard
  • Reuse AsyncRewardWrapper instances in math agent workflows instead of creating per-call
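
A rough sketch of the new executor lifecycle, assuming a module-level shared ThreadPoolExecutor. The helper names, lock, and worker count below are illustrative; only the atexit shutdown and the compare-and-swap-style recreation guard mirror what the PR describes.

import atexit
import threading
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()
_shared_executor = None  # created lazily, recreated if it was torn down

def _get_executor() -> ThreadPoolExecutor:
    """Return the shared executor, recreating it under a CAS-style guard."""
    global _shared_executor
    ex = _shared_executor
    if ex is not None:
        return ex
    with _lock:
        # Re-check under the lock so only one thread swaps in a new executor.
        if _shared_executor is None:
            _shared_executor = ThreadPoolExecutor(max_workers=4)
        return _shared_executor

@atexit.register
def _shutdown_executor() -> None:
    # A single process-exit hook replaces per-instance weakref finalizers.
    if _shared_executor is not None:
        _shared_executor.shutdown(wait=False)
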

HTTP Client Unification

  • Use create_httpx_client consistently in workflow_context.py
  • Add sock_connect/connect timeouts to aiohttp sessions (see the sketch after this list)
  • Unify HTTP client session usage across inference/training controllers
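
An illustration of the added connect budgets, assuming a plain aiohttp session; create_httpx_client is the repository helper named above and is not shown here, and the timeout values are made up.

import aiohttp

# Separate connect / sock_connect budgets make a hung TCP handshake fail fast
# instead of consuming the whole total timeout (values illustrative).
TIMEOUT = aiohttp.ClientTimeout(total=300, connect=10, sock_connect=10)

async def fetch_text(url: str) -> str:
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()
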

Example Configs

  • Add agent: section (mode: inline, export_style: individual, turn_discount: 1.0) to all example YAML configs (shown after this list)
  • Switch default workflow from RLVRWorkflow to MathAgent in gsm8k_rl.py
  • Add max_tokens to generation config
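
For reference, the new agent block rendered as the Python mapping it parses into. The three agent keys and values come from this PR's description; the max_tokens value and the surrounding structure are assumptions.

# What the new agent: section in the example YAML configs parses into;
# only these three keys are taken from the PR, the rest of the YAML is unchanged.
agent_section = {
    "mode": "inline",
    "export_style": "individual",
    "turn_discount": 1.0,
}

# Generation config gains an explicit cap (the number below is illustrative).
generation_section = {
    "max_tokens": 1024,
}
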

Cleanup

  • Remove obsolete get_custom_reward_fn and VALID_REWARD_FN from areal/reward/__init__.py
  • Remove gateway HTTP helper tests superseded by unified client

Risk Areas

  • Breaking: get_custom_reward_fn removed from the reward public API; callers using this function will need to import reward functions directly (migration sketch after this list)
  • Force push: Branch history rewritten (rebase + squash onto latest main)
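
A hedged migration sketch for the breaking change; the replacement module path and function name are illustrative, since the PR only states that reward functions must now be imported directly.

# Before (removed in this PR):
#     from areal.reward import get_custom_reward_fn
#     reward_fn = get_custom_reward_fn(name)
#
# After: import the concrete reward function from its own module.
# The path and name below are hypothetical, not the repository's actual layout.
from areal.reward.my_reward import my_reward_fn  # hypothetical

reward_fn = my_reward_fn
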

Checklist

  • Pre-commit hooks pass (pre-commit run --all-files)
  • New tests added (tests/test_async_reward_wrapper.py)
  • Branch is up to date with main
  • This PR was created by a coding agent via /create-pr

Test Commands

uv run pytest tests/test_async_reward_wrapper.py
uv run pytest tests/experimental/inference_service/test_controller_version.py
uv run pytest tests/test_examples.py

Skipped suites: GPU/distributed tests (tests/grpo/, tests/torchrun/), which require multi-GPU hardware not available locally.

Commits

Commit 1:

Bring GatewayTrainController and RolloutControllerV2 to full parity with v1 controllers for RL training paths.

Key changes:
- Route to RolloutControllerV2 when config._version == "v2"
- Add version management, connect_engine, clear_batches to GatewayTrainController
- Simplify AsyncRewardWrapper lifecycle with atexit shutdown
- Unify HTTP client sessions across inference/training controllers
- Switch default workflow to MathAgent in example configs
- Add agent config section to all example YAML files
- Remove obsolete get_custom_reward_fn from reward module
- Add async reward wrapper tests

Commit 2:

Partial groups produce inconsistent training data. Reject the entire group if any _run_one call raises, instead of silently returning the successful subset with 0.0 rewards for failures.
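
A sketch of that behavior, assuming _run_one is an async rollout coroutine invoked once per group member; the gather-and-raise structure is illustrative, not the repository's exact code.

import asyncio

async def run_group(items):
    # Collect exceptions instead of letting the first failure cancel the rest.
    results = await asyncio.gather(
        *(_run_one(x) for x in items), return_exceptions=True
    )
    failures = [r for r in results if isinstance(r, BaseException)]
    if failures:
        # Reject the whole group: padding failures with 0.0 rewards would
        # yield an inconsistent training batch.
        raise RuntimeError(f"{len(failures)}/{len(items)} rollouts failed") from failures[0]
    return results
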