
Anchored Policy Optimization (APO)

arXiv: https://arxiv.org/abs/2602.05717

This repository serves as the official implementation of the paper "Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification".

💡 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) often suffers from Recursive Space Contraction (RSC), where the policy irreversibly collapses into narrow reasoning paths, sacrificing diversity (Pass@K) for efficiency. Standard KL regularization fails to address this effectively due to rigid "Shape Matching" constraints.

Anchored Policy Optimization (APO) introduces a remedy: shifting from Shape Matching to Support Coverage.

  • Safe Manifold: We utilize the reference model's high-confidence support (Top-K) as a "Safe Manifold".
  • Dual-Force Mechanism:
    • Push ($\lambda$): Aggressively suppresses incorrect responses.
    • Pull ($\beta$): Selectively pulls the policy back towards the Safe Manifold when errors occur, enabling Elastic Recovery.

This approach allows APO to break the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
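
For intuition, here is a minimal single-token sketch of the dual-force idea. It is illustrative only, not the repository's loss: the function and variable names are hypothetical, and the actual apo_ratio objective applies Ratio Rectification on importance ratios within the PPO trust region (see the configuration section below for the corresponding hyperparameters).

import torch
import torch.nn.functional as F

def dual_force_loss(policy_logits, ref_logits, sampled_token, is_wrong,
                    push_coeff=1.05, pull_coeff=0.1, topk=8, exclude_sampled=True):
    """Illustrative per-token push/pull loss for a single vocabulary position."""
    log_probs = F.log_softmax(policy_logits, dim=-1)

    if not is_wrong:
        # Correct response: ordinary reinforcement of the sampled token.
        return -log_probs[sampled_token]

    # Push (lambda): suppress the sampled token of an incorrect response.
    push = push_coeff * log_probs[sampled_token]

    # Pull (beta): pull probability mass back toward the reference model's
    # Top-K support (the "Safe Manifold"), optionally excluding the error token.
    anchor = torch.topk(ref_logits, k=topk).indices
    if exclude_sampled:
        anchor = anchor[anchor != sampled_token]
    anchor_mass = log_probs[anchor].exp().sum().clamp_min(1e-8)
    pull = -pull_coeff * anchor_mass.log()

    return push + pull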

🚀 Key Features

  • Exclusive APO Implementation: Full support for the APO objective with Ratio Rectification.
  • Efficient & Stable: Designed to work within the PPO Trust Region without destabilizing training.
  • Baselines Included: Implementations of KL-on-Wrong (Conditional KL) and standard GRPO for rigorous comparison.
  • Scalable: Built on top of verl, supporting FSDP and Megatron for training large-scale models.

🛠️ Quick Start

Installation

Option 1: Automated Setup (Recommended)

bash uv_verl.sh

Option 2: Manual Setup

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip

# Install package in editable mode
pip install --no-deps -e .

# Add project root to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd)
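
To confirm the editable install is visible on the path, a quick check (this assumes the package is importable as verl, as in upstream verl; the file name is hypothetical):

# check_install.py -- sanity check, not a script shipped with this repo
import verl
print("verl imported from:", verl.__file__)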

⚙️ Configuration & Usage

1. Training with APO

To enable APO, you must set the loss_mode to apo_ratio in your training configuration. The core hyperparameters control the mechanism described in the paper:

# [CRITICAL] APO is enabled by setting policy_loss.loss_mode="apo_ratio".
#
# APO hyperparameters:
#   apo_push_coeff       Push coefficient (lambda): intensity of error suppression (paper default: 1.05)
#   apo_pull_coeff       Pull coefficient (beta): intensity of the restoring anchor force (paper default: 0.1)
#   apo_topk             Safe Manifold size (K): size of the reference Top-K support (paper default: 8)
#   apo_exclude_sampled  Exclusive Anchoring: exclude the error token itself from the anchor set
#
# Add the remaining basic settings (rollout, optimizer, trainer, ...) as further overrides.
python -m verl.trainer.main_ppo \
    data.train_files="['data/DAPO/dapo-math-17k_dedup.parquet']" \
    actor_rollout_ref.model.path="Qwen/Qwen2.5-7B" \
    actor_rollout_ref.actor.policy_loss.loss_mode="apo_ratio" \
    actor_rollout_ref.actor.policy_loss.apo_push_coeff=1.05 \
    actor_rollout_ref.actor.policy_loss.apo_pull_coeff=0.1 \
    actor_rollout_ref.actor.policy_loss.apo_topk=8 \
    actor_rollout_ref.actor.policy_loss.apo_exclude_sampled=true

2. Supported Algorithms (Baselines)

We provide implementations for key baselines to allow for ablation studies:

Algorithm        Configuration                  Description
APO (Ours)       loss_mode="apo_ratio"          Dual-force optimization (Push + Pull).
GRPO (Vanilla)   loss_mode="vanilla"            Standard Group Relative Policy Optimization.
KL on Wrong      use_kl_loss_on_wrong=True      Conditional KL penalty applied only on negative advantages.
Standard KL      use_kl_loss=True               Global KL penalty (Shape Matching constraint).
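
When scripting ablations over these baselines, the switches above can be collected in one place. A minimal sketch, assuming the same Hydra-style override prefixes as the APO command above; the exact config paths for the two KL switches are assumptions, not repo API:

# Hypothetical mapping from baseline name to command-line overrides.
BASELINE_OVERRIDES = {
    "apo":         ['actor_rollout_ref.actor.policy_loss.loss_mode=apo_ratio'],
    "grpo":        ['actor_rollout_ref.actor.policy_loss.loss_mode=vanilla'],
    "kl_on_wrong": ['actor_rollout_ref.actor.use_kl_loss_on_wrong=True'],    # assumed prefix
    "standard_kl": ['actor_rollout_ref.actor.use_kl_loss=True'],             # assumed prefix
}

def launch_command(baseline: str) -> str:
    """Assemble a training command for the chosen baseline."""
    return " ".join(["python -m verl.trainer.main_ppo", *BASELINE_OVERRIDES[baseline]])

print(launch_command("kl_on_wrong"))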

3. Running Experiments

We provide pre-configured scripts for reproducing the paper's main results:

# Train Qwen2.5-7B with APO
bash run_scripts/run_apo_7B.sh

# Train Llama-3 models
bash run_scripts/run_apo_llama.sh

# Run the KL-on-Wrong Baseline
bash run_scripts/run_kl_on_wrong_7B.sh

📊 Evaluation

Evaluate your trained models using the provided scripts:

# Full Model Evaluation
bash scripts/eval_model.sh outputs/<experiment_name>

# Pass@K Analysis
bash scripts/eval_pass_at_k.sh outputs/<experiment_name>
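
Pass@K is typically reported with the standard unbiased estimator over n sampled completions per problem, of which c pass the verifier. A minimal sketch, assuming the evaluation script exposes such per-problem counts:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:  # fewer than k incorrect samples: every k-subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 3 correct, report Pass@8
print(pass_at_k(n=16, c=3, k=8))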

📜 Citation

If you find APO useful for your research, please cite our paper:

@misc{wang2026anchoredpolicyoptimizationmitigating,
      title={Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification}, 
      author={Tianyi Wang and Long Li and Hongcan Guo and Yibiao Chen and Yixia Li and Yong Wang and Yun Chen and Guanhua Chen},
      year={2026},
      eprint={2602.05717},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.05717}, 
}
