Skip to content

feat: update default loss to dppo+kl#2187

Open
mikasenghaas wants to merge 2 commits intomainfrom
feat/dppo-kl-default-loss
Open

feat: update default loss to dppo+kl#2187
mikasenghaas wants to merge 2 commits intomainfrom
feat/dppo-kl-default-loss

Conversation

@mikasenghaas
Copy link
Copy Markdown
Member

@mikasenghaas mikasenghaas commented Apr 2, 2026

Summary

  • Condition the trust region mask on advantage sign (DPPO-style) instead of masking both directions unconditionally
  • Rename ipo_mask_{low,high}dppo_mask_{low,high} in config and internals -- this is BREAKING
  • Based on hendrycks-sanity ablation results showing DPPO+KL outperforms IPO
Screenshot 2026-04-02 at 3 51 25 PM Screenshot 2026-04-02 at 3 52 26 PM

🤖 Generated with Claude Code


Note

Medium Risk
Changes the core RL objective by altering the default trust-region masking behavior and renaming config fields, which can materially impact training dynamics and is a breaking config change.

Overview
Updates the default RL loss to a DPPO+KL-style objective by conditioning the trust-region token mask on the sign of the advantage instead of masking both probability-increase and probability-decrease violations unconditionally.

Renames loss config thresholds from ipo_mask_{low,high} to dppo_mask_{low,high} (breaking for existing configs) and updates unit tests to use the new fields.

Written by Cursor Bugbot for commit 8595ae4. This will update automatically on new commits. Configure here.

Condition the trust region mask on advantage sign instead of masking
both directions unconditionally. Rename ipo_mask -> dppo_mask in config
and internal variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@mikasenghaas mikasenghaas changed the title switch default loss to DPPO+KL feat: update default loss to dppo+kl Apr 2, 2026
@mikasenghaas mikasenghaas marked this pull request as ready for review April 2, 2026 14:56
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

ipo_mask_low: Annotated[float, Field(ge=0, description="The low threshold for masking tokens.")] = 0.2
ipo_mask_high: Annotated[float, Field(ge=0, description="The high threshold for masking tokens.")] = 0.2
dppo_mask_low: Annotated[float, Field(ge=0, description="The low threshold for masking tokens.")] = 0.2
dppo_mask_high: Annotated[float, Field(ge=0, description="The high threshold for masking tokens.")] = 0.2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHANGELOG not updated for breaking config rename

Low Severity

The ipo_mask_low and ipo_mask_high fields in DefaultLossConfig were renamed to dppo_mask_low and dppo_mask_high — a breaking config change — but CHANGELOG.md was not updated. The existing entry at line 127 still references the old ipo_mask_* names. This violates the rule requiring any PR that modifies configuration structures in src/prime_rl/*/config.py to update the changelog.

Fix in Cursor Fix in Web

Triggered by project rule: BugBot Instructions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants