[reward] feat: deterministic reward inference and VLM reward model support by KaisennHu · Pull Request #115 · verl-project/verl-omni

KaisennHu · 2026-05-28T03:29:45Z

What does this PR do?

Add deterministic and seed fields under reward_model in YAML config, ensuring VLM reward models produce reproducible scores for identical inputs. Three gaps addressed:

No floating-point determinism — flash attention, CUDA matmul, NCCL all-reduce introduce nondeterminism across RM inference runs.
No configurable GenRM sampling — hardcoded temperature=0.7, top_p=0.8 in every /v1/chat/completions request, with no seed control.
No VLM scoring path — no built-in path for visual inputs without a custom reward function.

When deterministic=true, full determinism is applied to RM actors via enable_full_determinism(seed) + PYTHONHASHSEED injected via runtime_env; seed propagates to GenRM HTTP requests. Sampling params (temperature/do_sample/top_k/top_p) are entirely user-configured in YAML. Rollout actors remain unaffected.

Related Issues

Closes #97 and #111

Test

Added test_deterministic_reward_reproducibility — calls compute_rm_score twice on same data with deterministic=true, asserts rm_scores exactly equal and genrm_response identical across both runs.
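The shape of such a reproducibility check can be sketched as follows. This is only an illustration under stated assumptions: `fake_compute_rm_score` is a hypothetical stand-in that mimics a seeded scorer with Python's `random` module, not the real `compute_rm_score`, and the batch contents are made up.

```python
import random


def fake_compute_rm_score(batch, *, deterministic, seed):
    """Hypothetical stand-in for compute_rm_score (not the real API)."""
    # A seeded generator models deterministic=true; an unseeded one models
    # the default, non-reproducible behaviour.
    rng = random.Random(seed) if deterministic else random.Random()
    results = []
    for _ in batch:
        score = rng.random()
        results.append({"rm_score": score, "genrm_response": f"score={score:.6f}"})
    return results


def check_deterministic_reward_reproducibility():
    batch = [{"prompt": "read the sign"}, {"prompt": "read the label"}]
    first = fake_compute_rm_score(batch, deterministic=True, seed=42)
    second = fake_compute_rm_score(batch, deterministic=True, seed=42)
    # Exact equality, not approximate: determinism means bitwise-identical scores.
    assert [r["rm_score"] for r in first] == [r["rm_score"] for r in second]
    assert [r["genrm_response"] for r in first] == [r["genrm_response"] for r in second]
    return first, second


first_run, second_run = check_deterministic_reward_reproducibility()
```

The comparison is deliberately exact rather than tolerance-based, matching the "exactly equal" wording above.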

API and Usage Example

reward:
  reward_model:
    enable: True
    deterministic: true
    seed: 42               # propagated to GenRM when deterministic=true
    model_path: ~/models/qwen3-vl
    rollout:
      name: vllm
      temperature: 1.0     # user-configured sampling (defaults from rollout config)
      top_p: 1
      do_sample: true
      top_k: -1
      tensor_model_parallel_size: 2

When RM enabled + no custom_reward_function.path, compute_score_ocr is auto-set for VLM visual scoring.

Design & Propagation Chain

reward.yaml (deterministic: true/false, seed: 42, rollout.temperature/do_sample/top_k/top_p)
  → DiffusionRayTrainer._init_online_rollout_stack()
      ├─ deterministic=true: monkey-patch vLLMReplica → _DeterministicServerProxy
      │     ├─ inject PYTHONHASHSEED into runtime_env
      │     └─ swap server_class → _DeterministicRMHttpServer (calls enable_full_determinism in _post_init)
      └─ RM enabled + no custom_reward_function: auto-set → compute_score_ocr
  → VisualRewardManager
      ├─ sampling_params from rollout config (temperature, do_sample, top_k, top_p)
      ├─ deterministic=true: include seed from reward_model.seed
      ├─ deterministic=false: exclude seed
      → compute_score_ocr(sampling_params=...)
          → HTTP /v1/chat/completions request uses user-configured params

1. Floating-point determinism — Two enforcement levels work together. PYTHONHASHSEED must be set before Python process startup, so it is injected into RM actor runtime_env via _DeterministicServerProxy. _DeterministicRMHttpServer(vLLMHttpServer) subclass calls verl's enable_full_determinism(seed) in _post_init, covering all in-process determinism: env vars, Python/NumPy/PyTorch seeds, deterministic algorithms, cuDNN config. Rollout actors unaffected (is_reward_model guard).
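Why PYTHONHASHSEED has to arrive through the launch environment can be seen in a small standalone sketch: string hashing is randomized per interpreter process, and the variable is only read at startup. The helper name `string_hash_in_child` is hypothetical; the `runtime_env` dict at the end shows the `env_vars` shape Ray accepts, as an assumption about how the injection might look rather than the PR's exact code.

```python
import os
import subprocess
import sys


def string_hash_in_child(hash_seed: str) -> str:
    """Start a fresh interpreter with PYTHONHASHSEED set and return hash('reward')."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('reward'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Two separate processes launched with the same seed agree on the hash value.
same_a = string_hash_in_child("42")
same_b = string_hash_in_child("42")

# Assumed shape of the per-actor environment injection (Ray runtime_env).
seed = 42
runtime_env = {"env_vars": {"PYTHONHASHSEED": str(seed)}}
```

Setting `os.environ["PYTHONHASHSEED"]` inside an already running process would not change its hashing, which is why the in-process call alone is not enough.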

| Setting | Set by |
| --- | --- |
| `PYTHONHASHSEED=str(seed)` | `runtime_env` (before process startup) |
| `CUBLAS_WORKSPACE_CONFIG=:16:8` | `enable_full_determinism` (in-process) |
| `FLASH_ATTENTION_DETERMINISTIC=1` | `enable_full_determinism` (in-process) |
| `VLLM_BATCH_INVARIANT=1` | `enable_full_determinism` (in-process) |
| `NCCL_LAUNCH_MODE=GROUP` | `enable_full_determinism` (in-process) |
| `NCCL_PROTO=Simple` | `enable_full_determinism` (in-process) |
| `random/np/torch seed` | `enable_full_determinism` (in-process) |
| `torch deterministic algorithms` | `enable_full_determinism` (in-process) |
| `cudnn deterministic config` | `enable_full_determinism` (in-process) |
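For readers unfamiliar with what the in-process settings listed above amount to, here is a rough sketch. It is not verl's `enable_full_determinism`; the function name `apply_determinism_sketch` is hypothetical, the environment values are copied from the settings above, and NumPy and PyTorch are treated as optional so the snippet runs without them.

```python
import os
import random

# Values copied from the settings listed above.
DETERMINISM_ENV = {
    "CUBLAS_WORKSPACE_CONFIG": ":16:8",
    "FLASH_ATTENTION_DETERMINISTIC": "1",
    "VLLM_BATCH_INVARIANT": "1",
    "NCCL_LAUNCH_MODE": "GROUP",
    "NCCL_PROTO": "Simple",
}


def apply_determinism_sketch(seed: int) -> None:
    """Illustrative approximation only; the real helper lives in verl."""
    os.environ.update(DETERMINISM_ENV)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; skip the framework-level switches.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # Safe to call without a GPU.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


apply_determinism_sketch(42)
draw_a = random.random()
apply_determinism_sketch(42)
draw_b = random.random()
```

Reseeding with the same value reproduces the same draw, which is the property the RM actors rely on.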

2. GenRM sampling + seed — VisualRewardManager passes RM rollout config sampling params to compute_score_ocr via sampling_params kwarg, replacing hardcoded DEFAULT_SAMPLING_PARAMS. When deterministic=true, seed is included; when deterministic=false, seed is excluded.
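The include/exclude rule for `seed` can be sketched as a small pure function. The name `build_sampling_params` and the plain-dict config are assumptions for illustration; the actual `VisualRewardManager` code may be structured differently.

```python
from typing import Any, Dict, Optional

# Sampling keys named in the PR description.
SAMPLING_KEYS = ("temperature", "do_sample", "top_k", "top_p")


def build_sampling_params(
    rollout_cfg: Dict[str, Any], deterministic: bool, seed: Optional[int]
) -> Dict[str, Any]:
    """Copy user-configured sampling params; attach seed only in deterministic mode."""
    params = {key: rollout_cfg[key] for key in SAMPLING_KEYS if key in rollout_cfg}
    if deterministic and seed is not None:
        params["seed"] = seed
    return params


rollout_cfg = {"name": "vllm", "temperature": 1.0, "top_p": 1, "do_sample": True, "top_k": -1}
with_seed = build_sampling_params(rollout_cfg, deterministic=True, seed=42)
without_seed = build_sampling_params(rollout_cfg, deterministic=False, seed=42)
```

Keeping the sampling values user-configured and varying only the presence of `seed` mirrors the split described above: the toggle controls reproducibility, not the sampling distribution.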

3. VLM scoring path — When RM enabled + no custom_reward_function.path, auto-set path to compute_score_ocr.
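A minimal sketch of the auto-set rule, using plain dicts in place of the real config object. The function name `resolve_reward_function`, the `name` key, and the module path used as the default are assumptions (the path is taken from the file list in this description); only the condition itself comes from the text above.

```python
from typing import Any, Dict

# Assumed default location of compute_score_ocr, per the changed-files list.
DEFAULT_SCORER_PATH = "verl_omni/utils/reward_score/genrm_ocr.py"
DEFAULT_SCORER_NAME = "compute_score_ocr"


def resolve_reward_function(reward_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of reward_cfg with the VLM scorer filled in when appropriate."""
    rm_enabled = bool(reward_cfg.get("reward_model", {}).get("enable", False))
    custom = dict(reward_cfg.get("custom_reward_function") or {})
    if rm_enabled and not custom.get("path"):
        # RM is on and the user supplied no scorer: fall back to the OCR scorer.
        custom["path"] = DEFAULT_SCORER_PATH
        custom["name"] = DEFAULT_SCORER_NAME
    return {**reward_cfg, "custom_reward_function": custom}


auto = resolve_reward_function({"reward_model": {"enable": True}})
kept = resolve_reward_function(
    {"reward_model": {"enable": True}, "custom_reward_function": {"path": "my_reward.py"}}
)
disabled = resolve_reward_function({"reward_model": {"enable": False}})
```

A user-supplied path always wins, so existing custom reward functions are unaffected.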

| File | Change |
| --- | --- |
| `verl_omni/trainer/config/reward/reward.yaml` | Add `deterministic: false` and `seed: 42` under `reward_model` |
| `verl_omni/trainer/diffusion/ray_diffusion_trainer.py` | `_DeterministicRMHttpServer` + monkey-patch `vLLMReplica` |
| `verl_omni/workers/rollout/vllm_rollout/vllm_omni_async_server.py` | Remove `_DETERMINISTIC_RM_ENV_VARS` |
| `verl_omni/reward_loop/reward_manager/visual.py` | Pass `sampling_params`; include/exclude `seed` |
| `verl_omni/utils/reward_score/genrm_ocr.py` | Add `sampling_params` param |
| `tests/reward_loop/test_visual_reward_manager.py` | `test_deterministic_reward_reproducibility` |

Checklist Before Submitting

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code.

gemini-code-assist

Code Review

This pull request introduces support for deterministic reward model inference by overriding sampling parameters and environment variables when deterministic mode is enabled. It also adds a reproducibility test suite for the deterministic reward manager. The review feedback highlights a critical maintainability hazard due to duplicated server-launching logic, which can be avoided by dynamically proxying the server class options. Additionally, a robustness improvement is suggested to wrap configuration modifications in OmegaConf.open_dict to prevent potential struct-lock errors.

zhtmike · 2026-05-28T06:00:19Z

        self.reward_model_tokenizer = reward_model_tokenizer

+        deterministic = config.reward.reward_model.get("deterministic", False)
+        self._genrm_sampling_params = {"temperature": 0.0, "top_p": 1.0} if deterministic else None


I think we can simply set "temperature": 0.0, "top_p": 1.0 in the shell scripts to enable reward determinism. That would look clearer.

WDYT?

Hi @zhtmike, shell scripts can set sampling server defaults for name=vllm RM servers, but they can't cover:

Per-request sampling — vllm_omni RM servers ignore override_generation_config; only sampling_params kwarg works.

Env vars + VLM scoring auto-set — must be injected at Ray actor launch time, can't be done from CLI args.

The deterministic toggle covers all three levels.

OK. By the way, why do we use vllm_omni for VLM scoring? Shouldn't it be vllm for a VLM?

You're right. VLM RM scoring uses name=vllm, not vllm_omni. For name=vllm, override_generation_config already works. The sampling_params per-request override is an extra safeguard, and the only effective path for name=vllm_omni RM servers (future scenario). Shell scripts still can't scope env vars to RM-only actors or auto-set VLM scoring — both require the deterministic toggle.

Understood. Then let us focus on the vLLM VLM path for now; please drop the unnecessary file changes.


SamitHuang

Looks clear to me! Please add test results for 1) multiple runs with the same seed and 2) multiple runs with different seeds.

SamitHuang · 2026-06-02T03:59:35Z

It's a good starting point for deterministic flowgrpo. It would be better to add a doc illustrating how to configure and enable determinism.

KaisennHu requested review from SamitHuang and zhtmike as code owners May 28, 2026 03:29

gemini-code-assist Bot reviewed May 28, 2026


Comment thread verl_omni/workers/rollout/vllm_rollout/vllm_omni_async_server.py Outdated

Comment thread verl_omni/trainer/diffusion/ray_diffusion_trainer.py Outdated

KaisennHu changed the title ~~[reward] feat: deterministic reward inference and VLM reward model su…~~ [reward] feat: deterministic reward inference and VLM reward model support May 28, 2026

zhtmike reviewed May 28, 2026


KaisennHu force-pushed the feat/deterministic-reward branch 5 times, most recently from 70b4d76 to e696dba Compare June 2, 2026 03:09

[reward] feat: deterministic reward inference and VLM reward model su…

8ca8bcc

…pport Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

KaisennHu force-pushed the feat/deterministic-reward branch from e696dba to 8ca8bcc Compare June 2, 2026 03:19

SamitHuang reviewed Jun 2, 2026

KaisennHu wants to merge 1 commit into verl-project:main from KaisennHu:feat/deterministic-reward
