MARL Phase 1: add per-agent reward group and examples#1129
Conversation
Code Review
This pull request introduces support for the MATH-500 and AMC12 datasets and enables multi-agent reinforcement learning (MARL) by incorporating normalization group logic into the inference engine. It also adds specialized reward functions and verifiers for mathematical tasks, including a new multiple-choice verifier. Review feedback identifies several critical improvements for the verifiers, such as catching TimeoutException to prevent crashes, fixing case-sensitivity bugs in regex-based answer extraction, and ensuring the multiple-choice verifier dynamically uses the configured choice set rather than hardcoding values.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days. Please add a comment or push new commits to keep it active. Thank you for your contribution!
Description
This PR implements Phase 1 of the Reasoning & MARL Infrastructure roadmap as outlined in #1114. It establishes the data pipelines and specialized verifiers for math reasoning tasks, and provides infrastructure support for Multi-Agent Reinforcement Learning (MARL) workflows where agents share a single inference backend.
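As a rough illustration of the per-agent reward grouping named in the title, the sketch below normalizes rewards within each agent's own group, so agents sharing one inference backend are normalized independently of each other. The function name, the `(agent_id, reward)` sample shape, and the z-score normalization are assumptions for illustration, not the PR's actual inference-engine code:

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_rewards_per_agent(samples):
    """Group raw rewards by agent id and z-score them within each group.

    `samples` is a list of (agent_id, reward) pairs. Normalizing per
    group prevents one agent's reward scale from dominating another's
    when all agents share a single inference backend.
    """
    groups = defaultdict(list)
    for agent_id, reward in samples:
        groups[agent_id].append(reward)
    # Per-group mean and std; fall back to 1.0 to avoid division by zero
    # when a group's rewards are all identical.
    stats = {
        agent_id: (mean(rs), pstdev(rs) or 1.0)
        for agent_id, rs in groups.items()
    }
    return [
        (agent_id, (reward - stats[agent_id][0]) / stats[agent_id][1])
        for agent_id, reward in samples
    ]
```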
Key Changes:
Dataset Integration & RL Support:
- Added the `MATH-500` and `amc12` datasets in `areal/dataset` (supporting the `\boxed{}` output format).

Infrastructure for MARL:
- Incorporated per-agent reward normalization group logic into the inference engine so that multiple agents can share a single backend.
Math Verification Support:
- `MathMultipleChoiceVerifyWorker`: a verifier for multiple-choice datasets (like AMC12) that uses regex for LaTeX extraction and fallback string matching.
- `MathVerifyWorker`: added `verify_for_math500` with automated canonicalization for `\boxed` answers and `<think>` tag stripping.

Reference Implementations:
- Example scripts (`train_math_marti_shared.py` and `train_math_marti_single.py`) demonstrating a Generator-Verifier-Refiner (Marti) multi-agent reasoning loop.

Setup
Hardware Environment: 1x Node equipped with 16x Huawei Ascend 910C NPUs.
Methods (CoA vs. Single-Agent):
Benchmarking Datasets:
How to Run
Training (GSM8K):
Training (MATH-500):
Training (AMC12):
Evaluation Results
On the GSM8K dataset, the Chain-of-Agents (CoA) approach is significantly more stable, while the Single-Agent baseline frequently suffers reward collapse during training on the sample sets.
On the MATH-500 dataset, CoA outperforms the Single-Agent baseline in reward on both training and evaluation samples.
Rewards on training samples: (figure)

Rewards on evaluation samples: (figure)
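The Generator-Verifier-Refiner (Marti) loop evaluated above can be sketched roughly as follows. The callables, their signatures, and the round budget are illustrative assumptions, not the actual interfaces of the example scripts:

```python
def marti_loop(problem, generate, verify, refine, max_rounds=3):
    """One Chain-of-Agents episode.

    The generator proposes an answer, the verifier checks it, and the
    refiner revises it until verification passes or the round budget is
    exhausted. Returns the final answer and whether it verified.
    """
    answer = generate(problem)
    for _ in range(max_rounds):
        if verify(problem, answer):
            return answer, True
        answer = refine(problem, answer)
    # Budget exhausted: report the last refinement and its final verdict.
    return answer, verify(problem, answer)
```

In an RL setting each role would be a separate agent sharing the inference backend, with the verifier's verdict feeding the reward signal.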
Related Issue
Fixes #1114
Type of Change
Checklist
- Code passes lint checks (`pre-commit run --all-files`).
- Documentation builds successfully (`./docs/build_all.sh`).
- Branch is up to date with `main`.
- Used the `/review-pr` command and the `/create-pr` command.

Breaking Change Details (if applicable):
N/A; the change is backward compatible.
Additional Context
Phase 1 focuses on establishing the evaluation and data pipeline for reasoning tasks. Future phases will build upon this infrastructure for heterogeneous MARL.