-
Notifications
You must be signed in to change notification settings - Fork 251
feat: add option to filter zero advantage samples #2192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -754,6 +754,14 @@ class OrchestratorConfig(BaseConfig): | |
| # The advantage configuration | ||
| advantage: AdvantageConfig | None = DefaultAdvantageConfig() | ||
|
|
||
| # Filter zero advantages | ||
| filter_zero_advantages: Annotated[ | ||
| bool, | ||
| Field( | ||
| description="Whether to filter out training samples with zero advantage. If True, samples with advantage == 0.0 are not sent to the trainer.", | ||
| ), | ||
| ] = True | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Missing CHANGELOG entry for new config fieldLow Severity A new config field Triggered by project rule: BugBot Instructions |
||
|
|
||
| # Rollout filters (monitor by default, enforce optionally) | ||
| filters: list[FilterConfig] = [GibberishFilterConfig(), RepetitionFilterConfig()] | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -564,6 +564,18 @@ def _cleanup_env_processes(): | |
| config.advantage, | ||
| ) | ||
|
|
||
| # Store advantages on rollouts for filtering | ||
| for rollout, advantage in zip(train_rollouts, advantages): | ||
| rollout["advantage"] = advantage | ||
|
Comment on lines
+567
to
+569
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should make the |
||
|
|
||
| # Apply zero advantage filter if configured | ||
| if config.filter_zero_advantages: | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. let's make the config the same as the other filters. then we can just |
||
| from prime_rl.orchestrator.filters import ZeroAdvantageFilter | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. import at top |
||
|
|
||
| zero_advantage_filter = ZeroAdvantageFilter(name="zero_advantage", enforce=True) | ||
| zero_advantage_metrics = apply_filters([zero_advantage_filter], train_rollouts) | ||
| filter_metrics.update(zero_advantage_metrics) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Aggregate filter metrics overwritten by second apply_filters callMedium Severity The Additional Locations (1) |
||
|
|
||
| # Convert rollouts to training samples | ||
| parallel_preprocess_start = time.perf_counter() | ||
|
|
||
|
|
||


There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New config field defaults to True, changing existing behavior
Medium Severity
filter_zero_advantagesdefaults toTrue, which silently changes training behavior for all existing deployments that don't set this field. Every existing config will now have zero-advantage samples masked out, altering training dynamics. Since the PR title describes this as an "option," defaulting toFalsewould be the safer, non-breaking choice.