feat: add option to filter zero advantage samples#2192
Conversation
Add config option to filter out training samples with zero advantage. This is a simple postprocessing step in the orchestrator that filters samples before sending them to the trainer.
Keep simple filter_zero_advantages config option but use the filter abstraction internally. ZeroAdvantageFilter is created and applied after advantages are computed, masking zero-advantage rollouts.
bd0a0ae to
fd012a6
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
|
|
||
| zero_advantage_filter = ZeroAdvantageFilter(name="zero_advantage", enforce=True) | ||
| zero_advantage_metrics = apply_filters([zero_advantage_filter], train_rollouts) | ||
| filter_metrics.update(zero_advantage_metrics) |
There was a problem hiding this comment.
Aggregate filter metrics overwritten by second apply_filters call
Medium Severity
The filter_metrics.update(zero_advantage_metrics) call overwrites the filter/total_detected_rate and filter/total_enforced_rate keys that were set by the first apply_filters call (for gibberish/repetition filters). Since apply_filters always emits these two aggregate keys, the second invocation's values silently replace the originals, making the logged totals reflect only zero-advantage detections rather than the combined total across all filters.
Additional Locations (1)
| Field( | ||
| description="Whether to filter out training samples with zero advantage. If True, samples with advantage == 0.0 are not sent to the trainer.", | ||
| ), | ||
| ] = True |
There was a problem hiding this comment.
New config field defaults to True, changing existing behavior
Medium Severity
filter_zero_advantages defaults to True, which silently changes training behavior for all existing deployments that don't set this field. Every existing config will now have zero-advantage samples masked out, altering training dynamics. Since the PR title describes this as an "option," defaulting to False would be the safer, non-breaking choice.
| Field( | ||
| description="Whether to filter out training samples with zero advantage. If True, samples with advantage == 0.0 are not sent to the trainer.", | ||
| ), | ||
| ] = True |
There was a problem hiding this comment.
Missing CHANGELOG entry for new config field
Low Severity
A new config field filter_zero_advantages was added to src/prime_rl/configs/orchestrator.py but CHANGELOG.md was not updated. Per the project rules, any PR that modifies configuration structures in config files must include a corresponding changelog entry.
Triggered by project rule: BugBot Instructions


Summary
filter_zero_advantagesconfig option to filter out training samples with zero advantageNote
Medium Risk
Changes the training data pipeline by default-filtering rollouts with
advantage == 0.0, which can alter effective batch size and learning dynamics if advantages are frequently zero or computed with floating-point edge cases.Overview
Adds a new
filter_zero_advantagesorchestrator config (default enabled) to drop rollouts that haveadvantage == 0.0before they are converted intoTrainingSamples.Implements a
ZeroAdvantageFilterand updates the orchestrator loop to persist computed advantages onto each rollout, then run the filter via the existingapply_filtersmechanism so filtered rollouts have their completion masks cleared and are tracked in the filter metrics.Written by Cursor Bugbot for commit 64148b0. This will update automatically on new commits. Configure here.