feat: reward-gated length shaping for correct rollouts#2169

Draft
hallerite wants to merge 2 commits into main from feat/reward-gated-length-shaping

@hallerite

Summary

  • Replace GR³ length scaling with a simpler correctness-gated approach: correct rollout rewards are attenuated by L_min / L_i, where L_min is the shortest correct completion per problem
  • Shortest correct rollout keeps full reward (1.0), longer correct ones get proportionally less, incorrect rollouts are untouched
  • Remove length_shaping_alpha: float in favor of length_shaping: bool (no hyperparameter needed) and drop the online difficulty filtering requirement
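
The attenuation described above can be sketched for a single problem's group of rollouts as follows. This is a minimal illustration, not the PR's actual code: the function name `brevity_bonus`, the plain-list inputs, and the rule that a rollout counts as correct when its reward is greater than 0 are all assumptions made for the example.

```python
from typing import Sequence


def brevity_bonus(rewards: Sequence[float], lengths: Sequence[int]) -> list[float]:
    """Scale each correct rollout's reward by L_min / L_i within one problem.

    Hypothetical helper: `rewards` and `lengths` hold one entry per rollout
    of the same problem. Lengths are assumed to be positive token counts.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same length")

    # Assumption: a rollout is "correct" when its reward is strictly positive.
    correct = [i for i, r in enumerate(rewards) if r > 0]
    if not correct:
        # No correct rollout, so there is no L_min and nothing to shape.
        return list(rewards)

    # L_min: the shortest completion among the correct rollouts only.
    l_min = min(lengths[i] for i in correct)

    shaped = list(rewards)
    for i in correct:
        # The shortest correct rollout gets factor 1; longer ones get less.
        shaped[i] = rewards[i] * l_min / lengths[i]
    return shaped


# Three correct rollouts (100, 200, 400 tokens) and one incorrect (50 tokens).
print(brevity_bonus([1.0, 1.0, 0.0, 1.0], [100, 200, 50, 400]))
# [1.0, 0.5, 0.0, 0.25]
```

Note that the incorrect rollout is the shortest in the group but does not set L_min, because only correct completions are considered.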

Breaking changes

  • orchestrator.advantage.length_shaping_alpha → orchestrator.advantage.length_shaping (float|None → bool)
  • length_shaping no longer requires online_difficulty_filtering = true

🤖 Generated with Claude Code

Replace GR³ multiplicative length scaling (applied to all rollouts,
gated on online difficulty filtering) with a simpler approach: attenuate
correct rollout rewards by L_min/L_i, where L_min is the shortest
correct completion per problem. Incorrect rollouts are untouched.

- No hyperparameter needed (was `length_shaping_alpha`, now `length_shaping: bool`)
- No dependency on online difficulty filtering
- Shortest correct rollout keeps reward=1, longer ones get proportionally less

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

hallerite force-pushed the feat/reward-gated-length-shaping branch 2 times, most recently from ba1aef7 to fac1a2c (April 1, 2026 17:41)

…", "gr3")

Both length shaping strategies are now available via a single config field:
- length_shaping = "off" (default): no length shaping
- length_shaping = "brevity_bonus": reward-gated L_min/L_i (correct rollouts only)
- length_shaping = "gr3": GR³ multiplicative (1 + alpha * L_i/L_mean)^-1 (all rollouts)

length_shaping_alpha controls the GR³ coefficient (default 0.33, only used with "gr3").

Replaces the old length_shaping_alpha-only API and removes the ODF coupling.
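
For comparison with "brevity_bonus", the "gr3" factor (1 + alpha * L_i/L_mean)^-1 can be sketched as below. This is an illustrative sketch only: the function name `gr3_scale` is hypothetical, and taking L_mean as the mean completion length over all rollouts of one problem is an assumption, since the commit message does not say how L_mean is computed.

```python
from typing import Sequence


def gr3_scale(
    rewards: Sequence[float],
    lengths: Sequence[int],
    alpha: float = 0.33,  # default stated for length_shaping_alpha
) -> list[float]:
    """Multiply every rollout's reward by (1 + alpha * L_i / L_mean)^-1.

    Hypothetical helper. Unlike the brevity bonus, the factor is applied to
    all rollouts, correct or not.
    """
    if len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must have the same length")
    if not lengths:
        return []

    # Assumption: L_mean is the mean length across this problem's rollouts.
    l_mean = sum(lengths) / len(lengths)
    return [r / (1.0 + alpha * l / l_mean) for r, l in zip(rewards, lengths)]


# With equal lengths, every rollout is scaled by the same 1 / (1 + alpha).
print(gr3_scale([1.0, 1.0], [100, 100], alpha=1.0))
# [0.5, 0.5]
```

Under this reading, a rollout of average length is always scaled by 1 / (1 + alpha), so no rollout keeps its full reward unless alpha is 0. With "brevity_bonus", by contrast, the shortest correct rollout is left unchanged.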

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

hallerite force-pushed the feat/reward-gated-length-shaping branch from fac1a2c to c8ebf0c (April 1, 2026 17:41)