[reward] feat: deterministic reward inference and VLM reward model support #115

Open: KaisennHu wants to merge 1 commit into verl-project:main from KaisennHu:feat/deterministic-reward

@KaisennHu KaisennHu commented May 28, 2026

What does this PR do?

Adds deterministic and seed fields under reward_model in the YAML config so that VLM reward models produce reproducible scores for identical inputs. It addresses three gaps:

  1. No floating-point determinism — flash attention, CUDA matmul, NCCL all-reduce introduce nondeterminism across RM inference runs.
  2. No configurable GenRM sampling — hardcoded temperature=0.7, top_p=0.8 in every /v1/chat/completions request, with no seed control.
  3. No VLM scoring path — no built-in path for visual inputs without a custom reward function.

When deterministic=true, full determinism is applied to RM actors via enable_full_determinism(seed) + PYTHONHASHSEED injected via runtime_env; seed propagates to GenRM HTTP requests. Sampling params (temperature/do_sample/top_k/top_p) are entirely user-configured in YAML. Rollout actors remain unaffected.

Related Issues

Closes #97 and #111

Test

Added test_deterministic_reward_reproducibility — calls compute_rm_score twice on same data with deterministic=true, asserts rm_scores exactly equal and genrm_response identical across both runs.
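
The shape of such a reproducibility check can be sketched as follows. This is a minimal, self-contained illustration and not the PR's actual test: fake_compute_rm_score is a hypothetical stand-in for compute_rm_score whose randomness comes only from the seed, so two calls on the same inputs must agree exactly.

```python
import random


def fake_compute_rm_score(prompts, seed=None):
    """Hypothetical stand-in for compute_rm_score (not the real API).

    With a seed, a private RNG makes the output a pure function of
    (prompts, seed); with seed=None the RNG is seeded from the OS.
    """
    rng = random.Random(seed)
    scores = [round(rng.random(), 6) for _ in prompts]
    responses = [f"score={s}" for s in scores]
    return {"rm_scores": scores, "genrm_response": responses}


def check_reproducible(prompts, seed):
    """Run the scorer twice and require exact equality of both outputs."""
    first = fake_compute_rm_score(prompts, seed=seed)
    second = fake_compute_rm_score(prompts, seed=seed)
    # Exact comparison on purpose: determinism means identical results,
    # not results that merely agree within a tolerance.
    return (
        first["rm_scores"] == second["rm_scores"]
        and first["genrm_response"] == second["genrm_response"]
    )
```

Comparing with == rather than an approximate tolerance is the point of the test: any residual floating-point nondeterminism should make it fail.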

API and Usage Example

reward:
  reward_model:
    enable: True
    deterministic: true
    seed: 42               # propagated to GenRM when deterministic=true
    model_path: ~/models/qwen3-vl
    rollout:
      name: vllm
      temperature: 1.0     # user-configured sampling (defaults from rollout config)
      top_p: 1
      do_sample: true
      top_k: -1
      tensor_model_parallel_size: 2

When RM enabled + no custom_reward_function.path, compute_score_ocr is auto-set for VLM visual scoring.

Design & Propagation Chain

reward.yaml (deterministic: true/false, seed: 42, rollout.temperature/do_sample/top_k/top_p)
  → DiffusionRayTrainer._init_online_rollout_stack()
      ├─ deterministic=true: monkey-patch vLLMReplica → _DeterministicServerProxy
      │     ├─ inject PYTHONHASHSEED into runtime_env
      │     └─ swap server_class → _DeterministicRMHttpServer (calls enable_full_determinism in _post_init)
      ├─ RM enabled + no custom_reward_function: auto-set → compute_score_ocr
  → VisualRewardManager
      ├─ sampling_params from rollout config (temperature, do_sample, top_k, top_p)
      ├─ deterministic=true: include seed from reward_model.seed
      ├─ deterministic=false: exclude seed
      → compute_score_ocr(sampling_params=...)
          → HTTP /v1/chat/completions request uses user-configured params
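
The RM-only scoping in the chain above can be sketched in plain Python. The helper name build_actor_runtime_env and its arguments are assumptions made for illustration; in the PR this step is described as happening inside _DeterministicServerProxy. The only external fact relied on is that a Ray runtime_env accepts an env_vars mapping.

```python
import copy


def build_actor_runtime_env(base_runtime_env, *, is_reward_model, deterministic, seed):
    """Return a runtime_env dict for an actor (hypothetical helper).

    PYTHONHASHSEED is added only for reward-model actors running in
    deterministic mode; rollout actors get an unchanged copy.
    """
    runtime_env = copy.deepcopy(base_runtime_env or {})
    if is_reward_model and deterministic:
        env_vars = runtime_env.setdefault("env_vars", {})
        # Environment variable values must be strings.
        env_vars["PYTHONHASHSEED"] = str(seed)
    return runtime_env
```

Copying the base dict keeps the caller's configuration untouched, which matters when the same base runtime_env is shared between rollout and reward-model actors.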

1. Floating-point determinism — Two enforcement levels work together. PYTHONHASHSEED must be set before Python process startup, so it is injected into RM actor runtime_env via _DeterministicServerProxy. _DeterministicRMHttpServer(vLLMHttpServer) subclass calls verl's enable_full_determinism(seed) in _post_init, covering all in-process determinism: env vars, Python/NumPy/PyTorch seeds, deterministic algorithms, cuDNN config. Rollout actors unaffected (is_reward_model guard).
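
Why PYTHONHASHSEED has to be in place before the interpreter starts can be checked with a small standard-library experiment. The helper below is illustrative only and is not part of the PR: it launches a fresh Python process with the variable set and reads back the hash of a fixed string.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed):
    """Start a fresh interpreter with PYTHONHASHSEED set and return hash('reward')."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('reward'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

Two children started with the same seed report the same hash. Assigning os.environ["PYTHONHASHSEED"] inside an already running process does not change that process's string hashing, which is why the value is injected through runtime_env instead.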

| Setting | Set by |
| --- | --- |
| PYTHONHASHSEED=str(seed) | runtime_env (before process startup) |
| CUBLAS_WORKSPACE_CONFIG=:16:8 | enable_full_determinism (in-process) |
| FLASH_ATTENTION_DETERMINISTIC=1 | enable_full_determinism (in-process) |
| VLLM_BATCH_INVARIANT=1 | enable_full_determinism (in-process) |
| NCCL_LAUNCH_MODE=GROUP | enable_full_determinism (in-process) |
| NCCL_PROTO=Simple | enable_full_determinism (in-process) |
| random/np/torch seed | enable_full_determinism (in-process) |
| torch deterministic algorithms | enable_full_determinism (in-process) |
| cudnn deterministic config | enable_full_determinism (in-process) |
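
For readers unfamiliar with these settings, the in-process part might look roughly like the sketch below. It is not verl's enable_full_determinism; the function name enable_determinism_sketch is hypothetical, and NumPy and PyTorch are treated as optional so the sketch also runs where they are not installed.

```python
import os
import random


def enable_determinism_sketch(seed):
    """Illustrative only; not verl's enable_full_determinism."""
    # Environment variables from the table above. Libraries generally read
    # them at initialization, so they should be set as early as possible.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    os.environ["FLASH_ATTENTION_DETERMINISTIC"] = "1"
    os.environ["VLLM_BATCH_INVARIANT"] = "1"
    os.environ["NCCL_LAUNCH_MODE"] = "GROUP"
    os.environ["NCCL_PROTO"] = "Simple"

    # Seed Python's global RNG.
    random.seed(seed)

    # Seed NumPy's legacy global RNG if NumPy is available.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    # Seed PyTorch and request deterministic kernels if PyTorch is available.
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling it once early in the actor's lifetime, before any model code runs, matches the _post_init placement described above.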

2. GenRM sampling + seed: VisualRewardManager passes the RM rollout config's sampling params to compute_score_ocr via the sampling_params kwarg, replacing the hardcoded DEFAULT_SAMPLING_PARAMS. When deterministic=true, seed is included; when deterministic=false, seed is excluded.
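
The include/exclude rule for seed can be written down in a few lines. This is a sketch under stated assumptions: build_sampling_params is a hypothetical helper, and the rollout config is modeled as a plain dict rather than the actual config object used by VisualRewardManager.

```python
def build_sampling_params(rollout_cfg, *, deterministic, seed):
    """Collect user-configured sampling params; add seed only in deterministic mode.

    Hypothetical helper. rollout_cfg is a plain dict standing in for the
    reward_model.rollout section of the YAML config.
    """
    keys = ("temperature", "top_p", "top_k", "do_sample")
    # Copy only the keys the user actually configured; nothing is hardcoded.
    params = {key: rollout_cfg[key] for key in keys if key in rollout_cfg}
    if deterministic:
        params["seed"] = seed
    return params
```

With the rollout values from the usage example and deterministic=true, the result carries seed=42 alongside the configured temperature, top_p, top_k and do_sample; with deterministic=false the seed key is simply absent.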

3. VLM scoring path — When RM enabled + no custom_reward_function.path, auto-set path to compute_score_ocr.
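
The fallback rule can be sketched as a small resolver. The function resolve_reward_function and the dict-shaped config are assumptions for illustration, and the returned value is only the function name given in this description, not the real path the trainer sets.

```python
DEFAULT_VLM_SCORE_FN = "compute_score_ocr"  # name taken from the PR description


def resolve_reward_function(reward_cfg):
    """Pick the scoring function for a reward config (hypothetical helper).

    A user-supplied custom_reward_function.path always wins; the VLM
    default applies only when the RM is enabled and no path is given.
    """
    rm_enabled = reward_cfg.get("reward_model", {}).get("enable", False)
    custom_path = reward_cfg.get("custom_reward_function", {}).get("path")
    if custom_path:
        return custom_path
    if rm_enabled:
        return DEFAULT_VLM_SCORE_FN
    return None
```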

| File | Change |
| --- | --- |
| verl_omni/trainer/config/reward/reward.yaml | Add deterministic: false and seed: 42 under reward_model |
| verl_omni/trainer/diffusion/ray_diffusion_trainer.py | _DeterministicRMHttpServer + monkey-patch vLLMReplica |
| verl_omni/workers/rollout/vllm_rollout/vllm_omni_async_server.py | Remove _DETERMINISTIC_RM_ENV_VARS |
| verl_omni/reward_loop/reward_manager/visual.py | Pass sampling_params; include/exclude seed |
| verl_omni/utils/reward_score/genrm_ocr.py | Add sampling_params param |
| tests/reward_loop/test_visual_reward_manager.py | test_deterministic_reward_reproducibility |

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for deterministic reward model inference by overriding sampling parameters and environment variables when deterministic mode is enabled. It also adds a reproducibility test suite for the deterministic reward manager. The review feedback highlights a critical maintainability hazard due to duplicated server-launching logic, which can be avoided by dynamically proxying the server class options. Additionally, a robustness improvement is suggested to wrap configuration modifications in OmegaConf.open_dict to prevent potential struct-lock errors.

Comment thread verl_omni/workers/rollout/vllm_rollout/vllm_omni_async_server.py Outdated
Comment thread verl_omni/trainer/diffusion/ray_diffusion_trainer.py Outdated
@KaisennHu KaisennHu changed the title [reward] feat: deterministic reward inference and VLM reward model su… [reward] feat: deterministic reward inference and VLM reward model support May 28, 2026
self.reward_model_tokenizer = reward_model_tokenizer

deterministic = config.reward.reward_model.get("deterministic", False)
self._genrm_sampling_params = {"temperature": 0.0, "top_p": 1.0} if deterministic else None

@zhtmike zhtmike May 28, 2026

Collaborator

I think we can simply set "temperature": 0.0, "top_p": 1.0 in the shell scripts to enable reward determinism. That would look clearer.

WDYT?

Author

hi @zhtmike, shell scripts can set sampling server defaults for name=vllm RM servers, but can't cover:

  1. Per-request sampling — vllm_omni RM servers ignore override_generation_config; only sampling_params kwarg works.
  2. Env vars + VLM scoring auto-set — must be injected at Ray actor launch time, can't be done from CLI args.

The deterministic toggle covers all three levels.

Collaborator

OK. By the way, why do we use vllm_omni for VLM scoring? Shouldn't it be vllm for a VLM?

Author

You're right. VLM RM scoring uses name=vllm, not vllm_omni. For name=vllm, override_generation_config already works. The sampling_params per-request override is an extra safeguard, and the only effective path for name=vllm_omni RM servers (future scenario). Shell scripts still can't scope env vars to RM-only actors or auto-set VLM scoring — both require the deterministic toggle.

Collaborator

Understood. Then let's focus on the vLLM VLM path for now; please drop the unnecessary file changes.

@KaisennHu KaisennHu force-pushed the feat/deterministic-reward branch 5 times, most recently from 70b4d76 to e696dba Compare June 2, 2026 03:09
…pport

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
@KaisennHu KaisennHu force-pushed the feat/deterministic-reward branch from e696dba to 8ca8bcc Compare June 2, 2026 03:19

@SamitHuang SamitHuang left a comment

Collaborator

Looks clear to me! Please add test results for 1) multiple runs with the same seed and 2) multiple runs with different seeds.

@SamitHuang

Collaborator

It's a good starting point for deterministic flowgrpo. It would be better to add a doc that illustrates how to configure and enable determinism.


Development

Successfully merging this pull request may close these issues.

[RFC] Q3 Road Map (Looking for requests and feedbacks)

3 participants