Defeating the Training-Inference Mismatch via FP16

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

Paper | GitHub

Overview

Figure 1: Training reward comparison between BF16 and FP16. We evaluate across diverse settings: our sanity test with various algorithms (GRPO, GSPO, TIS, MIS, PG); different model families (R1D, Qwen, and OctoThinker); an alternative fine-tuning method, LoRA; and larger-scale models (Dense-14B, MoE). Results are validated on two independent frameworks (VeRL and Oat).

Figure 2: Evaluation comparisons between BF16 and FP16 across various frameworks, algorithms, datasets and training regimes.

Figure 3: Simply switching from BF16 to FP16 stabilizes and prolongs RL training. The basic importance-weighted policy gradient algorithm in FP16 outperforms all baselines in BF16.

Figure 4: Comparison of various algorithms, all trained in FP16.

Figure 5: FP16 significantly reduces the training-inference mismatch. The left two plots show the token-level probability distribution, and the right two plots present the distribution of sequence-level log probability ratio between the inference policy ($\mu$) and the training policy ($\pi$).

Reproduce the Sanity Test 🎯

OAT

Find the related code in the oat folder.

VeRL

We provide a minimal patch for VeRL to enable FP16 training.

Find the related code in the Precision-RL-verl folder to reproduce our experiments.

Data

The sanity test dataset for DeepSeek-R1-Distill-Qwen-1.5B is included in this folder. The data processing script will be released soon.

Citation

If you find our work useful for your research, please consider citing:

@article{qi2025precisionrl,
  title={Defeating the Training-Inference Mismatch via FP16},
  author={Qi, Penghui and Liu, Zichen and Zhou, Xiangxin and Pang, Tianyu and Du, Chao and Lee, Wee Sun and Lin, Min},
  journal={arXiv preprint arXiv:2510.26788},
  year={2025}
}
