This repository serves as the official implementation of the paper "Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification".
Reinforcement Learning with Verifiable Rewards (RLVR) often suffers from Recursive Space Contraction (RSC), where the policy irreversibly collapses into narrow reasoning paths, sacrificing diversity (Pass@K) for efficiency. Standard KL regularization fails to address this effectively due to rigid "Shape Matching" constraints.
Anchored Policy Optimization (APO) introduces a remedy: shifting from Shape Matching to Support Coverage.
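For context, the "Shape Matching" constraint refers to the standard KL-regularized RLVR objective; a minimal sketch in our own notation (not necessarily the paper's) is

$$\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta_{\mathrm{KL}}\,\mathbb{E}_{x\sim\mathcal{D}}\Big[D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)\Big],$$

where the KL term penalizes any deviation from the reference distribution's full shape, even deviations that merely down-weight wrong tokens. Support Coverage instead only asks that $\pi_{\theta}$ keep its probability mass inside $\pi_{\mathrm{ref}}$'s high-confidence (Top-K) support: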
- Safe Manifold: We utilize the reference model's high-confidence support (Top-K) as a "Safe Manifold".
- Dual-Force Mechanism (sketched in the code below):
  - Push ($\lambda$): Aggressively suppresses incorrect responses.
  - Pull ($\beta$): Selectively pulls the policy back towards the Safe Manifold when errors occur, enabling Elastic Recovery.
This approach allows APO to break the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
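To make the mechanism concrete, here is a minimal sketch of the push/pull idea for a single incorrect response. All names and the exact functional form are illustrative assumptions, not the repository's `apo_ratio` loss (which additionally applies Ratio Rectification inside the PPO trust region):

```python
import torch
import torch.nn.functional as F

def dual_force_loss_sketch(
    policy_logits: torch.Tensor,   # [T, V] current-policy logits for one incorrect response
    ref_logits: torch.Tensor,      # [T, V] frozen reference-model logits over the same prefix
    sampled_tokens: torch.Tensor,  # [T]    token ids that were actually sampled (long dtype)
    advantages: torch.Tensor,      # [T]    (negative) advantages of the erroneous response
    push_coeff: float = 1.05,      # lambda: intensity of error suppression
    pull_coeff: float = 0.1,       # beta: intensity of the restoring anchor force
    top_k: int = 8,                # K: size of the reference Top-K "Safe Manifold"
    exclude_sampled: bool = True,  # exclusive anchoring: drop the error token from the anchor set
) -> torch.Tensor:
    # Schematic only: this is not the repository's implementation of the APO objective.
    logp = F.log_softmax(policy_logits, dim=-1)                              # [T, V]
    logp_sampled = logp.gather(-1, sampled_tokens[:, None]).squeeze(-1)      # [T]

    # Push: amplified policy-gradient suppression of the sampled (wrong) tokens.
    push_loss = push_coeff * (-(advantages * logp_sampled)).mean()

    # Safe Manifold: the reference model's Top-K tokens at every step.
    topk_ids = ref_logits.topk(top_k, dim=-1).indices                        # [T, K]
    anchor_mask = torch.zeros_like(logp)
    anchor_mask.scatter_(-1, topk_ids, 1.0)
    if exclude_sampled:
        anchor_mask.scatter_(-1, sampled_tokens[:, None], 0.0)

    # Pull: raise the policy's total mass inside the Safe Manifold (support
    # coverage), rather than matching the reference shape token-by-token.
    mass_in_manifold = (logp.exp() * anchor_mask).sum(-1).clamp_min(1e-8)    # [T]
    pull_loss = pull_coeff * (-mass_in_manifold.log()).mean()

    return push_loss + pull_loss
```

The pull term only depends on the total probability assigned to the reference Top-K set, which is the practical difference between Support Coverage and a full KL Shape Matching penalty.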
- Exclusive APO Implementation: Full support for the APO objective with Ratio Rectification.
- Efficient & Stable: Designed to work within the PPO Trust Region without destabilizing training.
- Better Baselines: Includes implementations of KL-on-Wrong (Conditional KL) and standard GRPO for rigorous comparison.
- Scalable: Built on top of verl, supporting FSDP and Megatron for training large-scale models.
Option 1: Automated Setup (Recommended)

```bash
bash uv_verl.sh
```

Option 2: Manual Setup

```bash
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
# Install package in editable mode
pip install --no-deps -e .
# Add project root to PYTHONPATH
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

To enable APO, set `loss_mode="apo_ratio"` in your training configuration. The core hyperparameters control the mechanism described in the paper:
```bash
# [CRITICAL] Enable APO by setting loss_mode="apo_ratio".
#
# APO hyperparameters (paper defaults):
#   apo_push_coeff       Push coefficient (lambda): intensity of error suppression. Default: 1.05
#   apo_pull_coeff       Pull coefficient (beta): intensity of the restoring anchor force. Default: 0.1
#   apo_topk             Safe Manifold size (K): size of the reference Top-K support. Default: 8
#   apo_exclude_sampled  Exclusive anchoring: exclude the error token itself from the anchor set.
#
# Add your other basic settings (rollout, optimizer, logging, ...) to the same command.
python -m verl.trainer.main_ppo \
    data.train_files="['data/DAPO/dapo-math-17k_dedup.parquet']" \
    actor_rollout_ref.model.path="Qwen/Qwen2.5-7B" \
    actor_rollout_ref.actor.policy_loss.loss_mode="apo_ratio" \
    actor_rollout_ref.actor.policy_loss.apo_push_coeff=1.05 \
    actor_rollout_ref.actor.policy_loss.apo_pull_coeff=0.1 \
    actor_rollout_ref.actor.policy_loss.apo_topk=8 \
    actor_rollout_ref.actor.policy_loss.apo_exclude_sampled=true
```

We provide implementations for key baselines to allow for ablation studies:
| Algorithm | Configuration | Description |
|---|---|---|
| APO (Ours) | `loss_mode="apo_ratio"` | Dual-force optimization (Push + Pull). |
| GRPO (Vanilla) | `loss_mode="vanilla"` | Standard Group Relative Policy Optimization. |
| KL on Wrong | `use_kl_loss_on_wrong=True` | Conditional KL penalty applied only on negative advantages. |
| Standard KL | `use_kl_loss=True` | Global KL penalty (Shape Matching constraint). |
We provide pre-configured scripts for reproducing the paper's main results:

```bash
# Train Qwen2.5-7B with APO
bash run_scripts/run_apo_7B.sh
# Train Llama-3 models
bash run_scripts/run_apo_llama.sh
# Run the KL-on-Wrong Baseline
bash run_scripts/run_kl_on_wrong_7B.sh
```

Evaluate your trained models using the provided scripts:

```bash
# Full Model Evaluation
bash scripts/eval_model.sh outputs/<experiment_name>
# Pass@K Analysis
bash scripts/eval_pass_at_k.sh outputs/<experiment_name>
```
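For reference, Pass@K is conventionally reported with the unbiased estimator computed from n generations of which c are correct; the sketch below (function name ours, and the evaluation script may differ in details) shows the calculation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n generations of which c are correct.

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of
    k generations drawn without replacement from the n samples is correct.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so a correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 generations per problem, 3 of them correct.
print(pass_at_k(n=16, c=3, k=1))  # 0.1875
print(pass_at_k(n=16, c=3, k=8))  # the diversity-sensitive regime APO targets
```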
If you find APO useful for your research, please cite our paper:

```bibtex
@misc{wang2026anchoredpolicyoptimizationmitigating,
title={Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification},
author={Tianyi Wang and Long Li and Hongcan Guo and Yibiao Chen and Yixia Li and Yong Wang and Yun Chen and Guanhua Chen},
year={2026},
eprint={2602.05717},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.05717},
}
```