[reward] feat: deterministic reward inference and VLM reward model support#115
[reward] feat: deterministic reward inference and VLM reward model support#115KaisennHu wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for deterministic reward model inference by overriding sampling parameters and environment variables when deterministic mode is enabled. It also adds a reproducibility test suite for the deterministic reward manager. The review feedback highlights a critical maintainability hazard due to duplicated server-launching logic, which can be avoided by dynamically proxying the server class options. Additionally, a robustness improvement is suggested to wrap configuration modifications in OmegaConf.open_dict to prevent potential struct-lock errors.
| self.reward_model_tokenizer = reward_model_tokenizer | ||
|
|
||
| deterministic = config.reward.reward_model.get("deterministic", False) | ||
| self._genrm_sampling_params = {"temperature": 0.0, "top_p": 1.0} if deterministic else None |
There was a problem hiding this comment.
I think we can simply set "temperature": 0.0, "top_p": 1.0 in the shell scripts to enable reward determinism. It will looks clear.
WDYT?
There was a problem hiding this comment.
hi @zhtmike, shell scripts can set sampling server defaults for name=vllm RM servers, but can't cover:
- Per-request sampling — vllm_omni RM servers ignore override_generation_config; only sampling_params kwarg works.
- Env vars + VLM scoring auto-set — must be injected at Ray actor launch time, can't be done from CLI args.
deterministic toggle covers all three levels.
There was a problem hiding this comment.
OK, btw why do we use vllm_omni for VLM scoring? it should be vllm for VLM?
There was a problem hiding this comment.
You're right. VLM RM scoring uses name=vllm, not vllm_omni. For name=vllm, override_generation_config already works. The sampling_params per-request override is an extra safeguard, and the only effective path for name=vllm_omni RM servers (future scenario). Shell scripts still can't scope env vars to RM-only actors or auto-set VLM scoring — both require the deterministic toggle.
There was a problem hiding this comment.
Understood. Then let us focus on vLLM VLM now, pls drop unnecessary file change
70b4d76 to
e696dba
Compare
…pport Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
e696dba to
8ca8bcc
Compare
SamitHuang
left a comment
There was a problem hiding this comment.
Looks clear to me! Pls supplement the test results for 1) multiple runs with same seed, 2) multiple runs with different seeds
|
It's a good starting point for deterministic flowgrpo. It's better to add a doc to illustarte how to config and enable determinism. |
What does this PR do?
Add
deterministicandseedfields underreward_modelin YAML config, ensuring VLM reward models produce reproducible scores for identical inputs. Three gaps addressed:temperature=0.7, top_p=0.8in every/v1/chat/completionsrequest, with no seed control.When
deterministic=true, full determinism is applied to RM actors viaenable_full_determinism(seed)+PYTHONHASHSEEDinjected viaruntime_env;seedpropagates to GenRM HTTP requests. Sampling params (temperature/do_sample/top_k/top_p) are entirely user-configured in YAML. Rollout actors remain unaffected.Related Issues
Closes #97 and #111
Test
Added
test_deterministic_reward_reproducibility— callscompute_rm_scoretwice on same data withdeterministic=true, assertsrm_scoresexactly equal andgenrm_responseidentical across both runs.API and Usage Example
When RM enabled + no
custom_reward_function.path,compute_score_ocris auto-set for VLM visual scoring.Design & Propagation Chain
1. Floating-point determinism — Two enforcement levels work together.
PYTHONHASHSEEDmust be set before Python process startup, so it is injected into RM actorruntime_envvia_DeterministicServerProxy._DeterministicRMHttpServer(vLLMHttpServer)subclass calls verl'senable_full_determinism(seed)in_post_init, covering all in-process determinism: env vars, Python/NumPy/PyTorch seeds, deterministic algorithms, cuDNN config. Rollout actors unaffected (is_reward_modelguard).PYTHONHASHSEED=str(seed)runtime_env(before process startup)CUBLAS_WORKSPACE_CONFIG=:16:8enable_full_determinism(in-process)FLASH_ATTENTION_DETERMINISTIC=1enable_full_determinism(in-process)VLLM_BATCH_INVARIANT=1enable_full_determinism(in-process)NCCL_LAUNCH_MODE=GROUPenable_full_determinism(in-process)NCCL_PROTO=Simpleenable_full_determinism(in-process)random/np/torch seedenable_full_determinism(in-process)torch deterministic algorithmsenable_full_determinism(in-process)cudnn deterministic configenable_full_determinism(in-process)2. GenRM sampling + seed —
VisualRewardManagerpasses RM rollout config sampling params tocompute_score_ocrviasampling_paramskwarg, replacing hardcodedDEFAULT_SAMPLING_PARAMS. Whendeterministic=true,seedis included; whendeterministic=false,seedis excluded.3. VLM scoring path — When RM enabled + no
custom_reward_function.path, auto-set path tocompute_score_ocr.verl_omni/trainer/config/reward/reward.yamldeterministic: falseandseed: 42underreward_modelverl_omni/trainer/diffusion/ray_diffusion_trainer.py_DeterministicRMHttpServer+ monkey-patchvLLMReplicaverl_omni/workers/rollout/vllm_rollout/vllm_omni_async_server.py_DETERMINISTIC_RM_ENV_VARSverl_omni/reward_loop/reward_manager/visual.pysampling_params; include/excludeseedverl_omni/utils/reward_score/genrm_ocr.pysampling_paramsparamtests/reward_loop/test_visual_reward_manager.pytest_deterministic_reward_reproducibilityChecklist Before Submitting
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always