Figure 1: Training reward comparison between BF16 and FP16. We evaluate across diverse settings: our sanity test with various algorithms (GRPO, GSPO, TIS, MIS, PG); different model families (R1D, Qwen, and OctoThinker); an alternative fine-tuning method (LoRA); and larger-scale models (Dense-14B, MoE). Results are validated on two independent frameworks (VeRL and Oat).
Figure 2: Evaluation comparisons between BF16 and FP16 across various frameworks, algorithms, datasets and training regimes.
Figure 3: Simply switching from BF16 to FP16 stabilizes and prolongs RL training. The basic importance-weighted policy gradient algorithm in FP16 outperforms all baselines in BF16.
Figure 4: Comparison of various algorithms trained with FP16.
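As a reference for Figures 3 and 4, here is a minimal sketch of a token-level importance-weighted policy gradient loss. The function and tensor names are our own illustrative assumptions, not the exact implementation used in this repo.

```python
import torch

def iw_pg_loss(logp_train, logp_infer, advantages, mask):
    """Token-level importance-weighted policy gradient loss (sketch).

    logp_train: [B, T] log-probs of sampled tokens under the training policy pi (with grad).
    logp_infer: [B, T] log-probs of the same tokens under the inference policy mu (no grad).
    advantages: [B, T] per-token advantages (e.g., a sequence reward broadcast over tokens).
    mask:       [B, T] 1 for response tokens, 0 for prompt/padding.
    """
    # Importance ratio pi(a|s) / mu(a|s); the denominator carries no gradient.
    ratio = torch.exp(logp_train - logp_infer.detach())
    # Surrogate whose gradient is -(pi/mu) * grad(log pi) * A.
    per_token = -ratio * advantages
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```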
Figure 5: FP16 significantly reduces the training-inference mismatch. The left two plots show the token-level probability distribution, and the right two plots present the distribution of sequence-level log probability ratio between the inference policy ($\mu$) and the training policy ($\pi$).
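The mismatch shown in Figure 5 can be quantified by scoring the same sampled tokens under both policies and comparing their sequence log-probabilities. Below is a minimal sketch of that measurement; the function and tensor names are illustrative assumptions.

```python
import torch

def sequence_log_ratio(logp_infer, logp_train, mask):
    """Sequence-level log-probability ratio log mu(y|x) - log pi(y|x) (sketch).

    logp_infer: [B, T] token log-probs under the inference policy mu.
    logp_train: [B, T] token log-probs of the same tokens under the training policy pi.
    mask:       [B, T] 1 for response tokens, 0 elsewhere.
    """
    # Sum token log-probs over each response, then take the difference;
    # values far from zero indicate a large training-inference mismatch.
    seq_logp_infer = (logp_infer * mask).sum(dim=-1)
    seq_logp_train = (logp_train * mask).sum(dim=-1)
    return seq_logp_infer - seq_logp_train
```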
Find the related code in the `oat` folder.
We provide a minimal patch for VeRL to enable FP16 training.
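Independent of that patch, the standard PyTorch recipe for FP16 training pairs autocast with dynamic loss scaling. The sketch below illustrates the idea only; `model`, `optimizer`, and `batch` are placeholders, not VeRL's actual API.

```python
import torch

# Generic FP16 step with dynamic loss scaling; `model`, `optimizer`, and
# `batch` are placeholders for illustration, not VeRL's actual API.
scaler = torch.cuda.amp.GradScaler()

def fp16_train_step(model, optimizer, batch):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss      # forward pass in FP16
    scaler.scale(loss).backward()       # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)              # unscales grads, then applies the update
    scaler.update()                     # adapt the scale factor dynamically
    return loss.detach()
```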
Find the related code in the `Precision-RL-verl` folder to reproduce our experiments.
The sanity test dataset for DeepSeek-R1-Distill-Qwen-1.5B is included in this folder. The data processing script will be released soon.
If you find our work useful for your research, please consider citing:
@article{qi2025precisionrl,
  title   = {Defeating the Training-Inference Mismatch via FP16},
  author  = {Qi, Penghui and Liu, Zichen and Zhou, Xiangxin and Pang, Tianyu and Du, Chao and Lee, Wee Sun and Lin, Min},
  journal = {arXiv preprint arXiv:2510.26788},
  year    = {2025}
}