Figure 1: Training reward comparison between BF16 and FP16. We evaluate across diverse settings: our sanity test with various algorithms (GRPO, GSPO, TIS, MIS, PG); different model families (R1D, Qwen, and OctoThinker); an alternative fine-tuning method (LoRA); and larger-scale models (Dense-14B, MoE). Results are validated on two independent frameworks (VeRL and Oat).
Figure 2: Evaluation comparisons between BF16 and FP16 across various frameworks, algorithms, datasets and training regimes.
Figure 3: Simply switching from BF16 to FP16 stabilizes and prolongs RL training. The basic importance-weighted policy gradient algorithm in FP16 outperforms all baselines in BF16.
Figure 4: Comparison of various algorithms trained with FP16.
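As a reference for Figures 3 and 4, here is a minimal sketch of a token-level importance-weighted policy gradient loss. The function and tensor names are our own illustrative assumptions, not the exact implementation used in this repo.

```python
import torch

def iw_pg_loss(logp_train, logp_infer, advantages, mask):
    """Token-level importance-weighted policy gradient loss (sketch).

    logp_train: [B, T] log-probs of sampled tokens under the training policy pi (with grad).
    logp_infer: [B, T] log-probs of the same tokens under the inference policy mu (no grad).
    advantages: [B, T] per-token advantages (e.g., a sequence reward broadcast over tokens).
    mask:       [B, T] 1 for response tokens, 0 for prompt/padding.
    """
    # Importance ratio pi(a|s) / mu(a|s); the denominator carries no gradient.
    ratio = torch.exp(logp_train - logp_infer.detach())
    # Surrogate whose gradient is -(pi/mu) * grad(log pi) * A.
    per_token = -ratio * advantages
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```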
Figure 5: FP16 significantly reduces the training-inference mismatch. The left two plots show the token-level probability distribution, and the right two plots present the distribution of sequence-level log probability ratio between the inference policy ($\mu$) and the training policy ($\pi$).
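The mismatch shown in Figure 5 can be quantified by scoring the same sampled tokens under both policies and comparing their sequence log-probabilities. Below is a minimal sketch of that measurement; the function and tensor names are illustrative assumptions.

```python
import torch

def sequence_log_ratio(logp_infer, logp_train, mask):
    """Sequence-level log-probability ratio log mu(y|x) - log pi(y|x) (sketch).

    logp_infer: [B, T] token log-probs under the inference policy mu.
    logp_train: [B, T] token log-probs of the same tokens under the training policy pi.
    mask:       [B, T] 1 for response tokens, 0 elsewhere.
    """
    # Sum token log-probs over each response, then take the difference;
    # values far from zero indicate a large training-inference mismatch.
    seq_logp_infer = (logp_infer * mask).sum(dim=-1)
    seq_logp_train = (logp_train * mask).sum(dim=-1)
    return seq_logp_infer - seq_logp_train
```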
Find the related code in the `oat` folder.
We provide a minimal patch for VeRL to enable FP16 training.
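Independent of that patch, the standard PyTorch recipe for FP16 training pairs autocast with dynamic loss scaling. The sketch below illustrates the idea only; `model`, `optimizer`, and `batch` are placeholders, not VeRL's actual API.

```python
import torch

# Generic FP16 step with dynamic loss scaling; `model`, `optimizer`, and
# `batch` are placeholders for illustration, not VeRL's actual API.
scaler = torch.cuda.amp.GradScaler()

def fp16_train_step(model, optimizer, batch):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss      # forward pass in FP16
    scaler.scale(loss).backward()       # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)              # unscales grads, then applies the update
    scaler.update()                     # adapt the scale factor dynamically
    return loss.detach()
```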
Find the related code in the `Precision-RL-verl` folder to reproduce our experiments.
The sanity test dataset for DeepSeek-R1-Distill-Qwen-1.5B is included in this folder. The data processing script will be released soon.
If you find our work useful for your research, please consider citing:
@article{qi2025precisionrl,
  title   = {Defeating the Training-Inference Mismatch via FP16},
  author  = {Qi, Penghui and Liu, Zichen and Zhou, Xiangxin and Pang, Tianyu and Du, Chao and Lee, Wee Sun and Lin, Min},
  journal = {arXiv preprint arXiv:2510.26788},
  year    = {2025}
}