This repository serves as the official implementation of the paper "Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification".
Reinforcement Learning with Verifiable Rewards (RLVR) often suffers from Recursive Space Contraction (RSC), where the policy irreversibly collapses into narrow reasoning paths, sacrificing diversity (Pass@K) for efficiency. Standard KL regularization fails to address this effectively due to rigid "Shape Matching" constraints.
Anchored Policy Optimization (APO) introduces a remedy: shifting from Shape Matching to Support Coverage.
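For context, the "Shape Matching" constraint refers to the standard KL-regularized RLVR objective; a minimal sketch in our own notation (not necessarily the paper's) is

$$\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta_{\mathrm{KL}}\,\mathbb{E}_{x\sim\mathcal{D}}\Big[D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],$$

where the KL term penalizes any deviation from the reference distribution's full shape, even deviations that merely down-weight wrong tokens. Support Coverage instead only asks that $\pi_{\theta}$ keep its probability mass inside $\pi_{\mathrm{ref}}$'s high-confidence (Top-K) support: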
- Safe Manifold: We utilize the reference model's high-confidence support (Top-K) as a "Safe Manifold".
- Dual-Force Mechanism (sketched in the code below):
  - Push ($\lambda$): Aggressively suppresses incorrect responses.
  - Pull ($\beta$): Selectively pulls the policy back towards the Safe Manifold when errors occur, enabling Elastic Recovery.
This approach allows APO to break the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
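To make the mechanism concrete, here is a minimal sketch of the push/pull idea for a single incorrect response. All names and the exact functional form are illustrative assumptions, not the repository's `apo_ratio` loss (which additionally applies Ratio Rectification inside the PPO trust region):

```python
import torch
import torch.nn.functional as F

def dual_force_loss_sketch(
    policy_logits: torch.Tensor,   # [T, V] current-policy logits for one incorrect response
    ref_logits: torch.Tensor,      # [T, V] frozen reference-model logits over the same prefix
    sampled_tokens: torch.Tensor,  # [T]    token ids that were actually sampled (long dtype)
    advantages: torch.Tensor,      # [T]    (negative) advantages of the erroneous response
    push_coeff: float = 1.05,      # lambda: intensity of error suppression
    pull_coeff: float = 0.1,       # beta: intensity of the restoring anchor force
    top_k: int = 8,                # K: size of the reference Top-K "Safe Manifold"
    exclude_sampled: bool = True,  # exclusive anchoring: drop the error token from the anchor set
) -> torch.Tensor:
    # Schematic only: this is not the repository's implementation of the APO objective.
    logp = F.log_softmax(policy_logits, dim=-1)                              # [T, V]
    logp_sampled = logp.gather(-1, sampled_tokens[:, None]).squeeze(-1)      # [T]

    # Push: amplified policy-gradient suppression of the sampled (wrong) tokens.
    push_loss = push_coeff * (-(advantages * logp_sampled)).mean()

    # Safe Manifold: the reference model's Top-K tokens at every step.
    topk_ids = ref_logits.topk(top_k, dim=-1).indices                        # [T, K]
    anchor_mask = torch.zeros_like(logp)
    anchor_mask.scatter_(-1, topk_ids, 1.0)
    if exclude_sampled:
        anchor_mask.scatter_(-1, sampled_tokens[:, None], 0.0)

    # Pull: raise the policy's total mass inside the Safe Manifold (support
    # coverage), rather than matching the reference shape token-by-token.
    mass_in_manifold = (logp.exp() * anchor_mask).sum(-1).clamp_min(1e-8)    # [T]
    pull_loss = pull_coeff * (-mass_in_manifold.log()).mean()

    return push_loss + pull_loss
```

The pull term only depends on the total probability assigned to the reference Top-K set, which is the practical difference between Support Coverage and a full KL Shape Matching penalty.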
- Exclusive APO Implementation: Full support for the APO objective with Ratio Rectification.
- Efficient & Stable: Designed to work within the PPO Trust Region without destabilizing training.
- Better Baselines: Includes implementations of KL-on-Wrong (Conditional KL) and standard GRPO for rigorous comparison.
- Scalable: Built on top of verl, supporting FSDP and Megatron for training large-scale models.
Option 1: Automated Setup (Recommended)

```bash
bash uv_verl.sh
```

Option 2: Manual Setup

```bash
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
# Install package in editable mode
pip install --no-deps -e .
# Add project root to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

To enable APO, set `loss_mode="apo_ratio"` in your training configuration. The core hyperparameters control the mechanism described in the paper:
```bash
# [CRITICAL] Enable APO by setting loss_mode="apo_ratio".
#
# APO hyperparameters (paper defaults):
#   apo_push_coeff       Push coefficient (lambda): intensity of error suppression. Default: 1.05
#   apo_pull_coeff       Pull coefficient (beta): intensity of the restoring anchor force. Default: 0.1
#   apo_topk             Safe Manifold size (K): size of the reference Top-K support. Default: 8
#   apo_exclude_sampled  Exclusive anchoring: exclude the error token itself from the anchor set.
#
# Add your other basic settings (rollout, optimizer, logging, ...) to the same command.
python -m verl.trainer.main_ppo \
    data.train_files="['data/DAPO/dapo-math-17k_dedup.parquet']" \
    actor_rollout_ref.model.path="Qwen/Qwen2.5-7B" \
    actor_rollout_ref.actor.policy_loss.loss_mode="apo_ratio" \
    actor_rollout_ref.actor.policy_loss.apo_push_coeff=1.05 \
    actor_rollout_ref.actor.policy_loss.apo_pull_coeff=0.1 \
    actor_rollout_ref.actor.policy_loss.apo_topk=8 \
    actor_rollout_ref.actor.policy_loss.apo_exclude_sampled=true
```

We provide implementations for key baselines to allow for ablation studies:
| Algorithm | Configuration | Description |
|---|---|---|
| APO (Ours) | `loss_mode="apo_ratio"` | Dual-force optimization (Push + Pull). |
| GRPO (Vanilla) | `loss_mode="vanilla"` | Standard Group Relative Policy Optimization. |
| KL on Wrong | `use_kl_loss_on_wrong=True` | Conditional KL penalty applied only on negative advantages. |
| Standard KL | `use_kl_loss=True` | Global KL penalty (Shape Matching constraint). |
We provide pre-configured scripts for reproducing the paper's main results:

```bash
# Train Qwen2.5-7B with APO
bash run_scripts/run_apo_7B.sh
# Train Llama-3 models
bash run_scripts/run_apo_llama.sh
# Run the KL-on-Wrong Baseline
bash run_scripts/run_kl_on_wrong_7B.sh
```

Evaluate your trained models using the provided scripts:

```bash
# Full Model Evaluation
bash scripts/eval_model.sh outputs/<experiment_name>
# Pass@K Analysis
bash scripts/eval_pass_at_k.sh outputs/<experiment_name>
```
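For reference, Pass@K is conventionally reported with the unbiased estimator computed from n generations of which c are correct; the sketch below (function name ours, and the evaluation script may differ in details) shows the calculation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n generations of which c are correct.

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of
    k generations drawn without replacement from the n samples is correct.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so a correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 generations per problem, 3 of them correct.
print(pass_at_k(n=16, c=3, k=1))  # 0.1875
print(pass_at_k(n=16, c=3, k=8))  # the diversity-sensitive regime APO targets
```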
If you find APO useful for your research, please cite our paper:

```bibtex
@misc{wang2026anchoredpolicyoptimizationmitigating,
title={Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification},
author={Tianyi Wang and Long Li and Hongcan Guo and Yibiao Chen and Yixia Li and Yong Wang and Yun Chen and Guanhua Chen},
year={2026},
eprint={2602.05717},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.05717},
}
```