MARL Phase 1: add per-agent reward group and examples#1129
Conversation
Code Review
This pull request introduces support for the MATH-500 and AMC12 datasets and enables multi-agent reinforcement learning (MARL) by incorporating normalization group logic into the inference engine. It also adds specialized reward functions and verifiers for mathematical tasks, including a new multiple-choice verifier. Review feedback identifies several critical improvements for the verifiers, such as catching TimeoutException to prevent crashes, fixing case-sensitivity bugs in regex-based answer extraction, and ensuring the multiple-choice verifier dynamically uses the configured choice set rather than hardcoding values.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This pull request has been automatically marked as stale because it has not had recent activity within the last 14 days. Please add a comment or push new commits to keep it active. Thank you for your contribution!
Description
This PR implements Phase 1 of the Reasoning & MARL Infrastructure roadmap as outlined in #1114. It establishes the data pipelines and specialized verifiers for math reasoning tasks, and provides infrastructure support for Multi-Agent Reinforcement Learning (MARL) workflows where agents share a single inference backend.
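As a rough illustration of the per-agent reward grouping named in the title, the sketch below normalizes rewards within each agent's own group, so agents sharing one inference backend are normalized independently of each other. The function name, the `(agent_id, reward)` sample shape, and the z-score normalization are assumptions for illustration, not the PR's actual inference-engine code:

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_rewards_per_agent(samples):
    """Group raw rewards by agent id and z-score them within each group.

    `samples` is a list of (agent_id, reward) pairs. Normalizing per
    group prevents one agent's reward scale from dominating another's
    when all agents share a single inference backend.
    """
    groups = defaultdict(list)
    for agent_id, reward in samples:
        groups[agent_id].append(reward)
    # Per-group mean and std; fall back to 1.0 to avoid division by zero
    # when a group's rewards are all identical.
    stats = {
        agent_id: (mean(rs), pstdev(rs) or 1.0)
        for agent_id, rs in groups.items()
    }
    return [
        (agent_id, (reward - stats[agent_id][0]) / stats[agent_id][1])
        for agent_id, reward in samples
    ]
```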
Key Changes:
Dataset Integration & RL Support:
- Added the `MATH-500` and `amc12` datasets in `areal/dataset` (supporting the `\boxed{}` output format).

Infrastructure for MARL:
- Incorporated per-agent reward normalization group logic into the inference engine so that multiple agents can share a single backend.
Math Verification Support:
- `MathMultipleChoiceVerifyWorker`: a verifier for multiple-choice datasets (like AMC12) that uses regex for LaTeX extraction and fallback string matching.
- `MathVerifyWorker`: added `verify_for_math500` with automated canonicalization for `\boxed` answers and `<think>` tag stripping.

Reference Implementations:
- Example scripts (`train_math_marti_shared.py` and `train_math_marti_single.py`) demonstrating a Generator-Verifier-Refiner (Marti) multi-agent reasoning loop.

Setup
Hardware Environment: 1x Node equipped with 16x Huawei Ascend 910C NPUs.
Methods (CoA vs. Single-Agent):
Benchmarking Datasets:
How to Run
Training (GSM8K):
Training (MATH-500):
Training (AMC12):
Evaluation Results
On the GSM8K dataset, the Chain-of-Agents (CoA) approach is significantly more stable, while the Single-Agent baseline frequently suffers reward collapse during training on the sample sets.
On the MATH-500 dataset, CoA outperforms the Single-Agent baseline in reward on both training and evaluation samples.
Rewards on training samples: (figure)

Rewards on evaluation samples: (figure)
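The Generator-Verifier-Refiner (Marti) loop evaluated above can be sketched roughly as follows. The callables, their signatures, and the round budget are illustrative assumptions, not the actual interfaces of the example scripts:

```python
def marti_loop(problem, generate, verify, refine, max_rounds=3):
    """One Chain-of-Agents episode.

    The generator proposes an answer, the verifier checks it, and the
    refiner revises it until verification passes or the round budget is
    exhausted. Returns the final answer and whether it verified.
    """
    answer = generate(problem)
    for _ in range(max_rounds):
        if verify(problem, answer):
            return answer, True
        answer = refine(problem, answer)
    # Budget exhausted: report the last refinement and its final verdict.
    return answer, verify(problem, answer)
```

In an RL setting each role would be a separate agent sharing the inference backend, with the verifier's verdict feeding the reward signal.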
Related Issue
Fixes #1114
Type of Change
Checklist
- Code passes lint checks (`pre-commit run --all-files`).
- Documentation builds successfully (`./docs/build_all.sh`).
- Branch is up to date with `main`.
- Used the `/review-pr` command and the `/create-pr` command.

Breaking Change Details (if applicable):
N/A; the change is backward compatible.
Additional Context
Phase 1 focuses on establishing the evaluation and data pipeline for reasoning tasks. Future phases will build upon this infrastructure for heterogeneous MARL.