Add explicit RL micro-batch token cap and fix RL token accounting#2183
Open
taivu1998 wants to merge 1 commit into PrimeIntellect-ai:main from
Conversation
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| "optimizer-step or checkpoint semantics." | ||
| ), | ||
| ), | ||
| ] = None |
There was a problem hiding this comment.
Missing CHANGELOG entry for new config field
Low Severity
A new config field trainer.micro_batch_max_tokens was added to src/prime_rl/configs/trainer.py (with validation in validate_micro_batch_max_tokens) but CHANGELOG.md has no corresponding entry. Per project rules, any PR modifying configuration structures in src/prime_rl/*/config.py or src/prime_rl/configs/trainer.py must update CHANGELOG.md.
Triggered by project rule: BugBot Instructions


Summary
Closes #1514.
This adds an explicit RL trainer control for local micro-batch packing without reintroducing the old SFT-style `micro_batch_size` semantics. The new API is `trainer.micro_batch_max_tokens`, which caps how many text tokens the RL trainer packs into each local micro batch while preserving the existing RL step semantics (the fake-data path driven by `trainer.data.fake.batch_size` is unchanged).
Motivation
Issue #1514 asks for RL-side gradient accumulation control similar to other training paths. In practice, RL already accumulates gradients implicitly across the packed local micro batches that make up one trainer step, but that behavior was only controlled indirectly by packing against `model.seq_len`. This change makes that control explicit and safer by:
- not reintroducing SFT-style `micro_batch_size`
- keeping model capacity (`model.seq_len`) separate from local packing capacity (`micro_batch_max_tokens`)

What Changed
Config
- New field `trainer.micro_batch_max_tokens: int | None`
- Defaults to `None`, which preserves current behavior by falling back to `model.seq_len`
- Validated to not exceed `model.seq_len` and rejected when combined with fake data (`data.fake.batch_size`)
RL batching and packing
- Threads `micro_batch_max_tokens` through the real RL data loader and packers
- Adds `sample_count` metadata on packed micro batches so sample accounting remains correct after packing and padding
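A minimal sketch of token-capped packing with `sample_count` tracking (the repo's actual packers and `MicroBatch` type carry tensors and more metadata; the greedy first-fit strategy and field names here are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class MicroBatch:
    """Hypothetical stand-in for the repo's MicroBatch, reduced to packing bookkeeping."""
    token_lens: list[int] = field(default_factory=list)
    sample_count: int = 0  # samples packed in, tracked before any dummy padding

    @property
    def token_count(self) -> int:
        return sum(self.token_lens)

def pack_samples(sample_lens: list[int], max_tokens: int) -> list[MicroBatch]:
    """Greedily pack variable-length samples under a per-micro-batch token cap."""
    batches: list[MicroBatch] = [MicroBatch()]
    for n in sample_lens:
        if n > max_tokens:
            raise ValueError(f"sample of {n} tokens exceeds cap of {max_tokens}")
        if batches[-1].token_count + n > max_tokens:
            batches.append(MicroBatch())  # cap reached: start a new micro batch
        batches[-1].token_lens.append(n)
        batches[-1].sample_count += 1
    return batches
```

Because `sample_count` is recorded at pack time, later dummy padding of a micro batch up to the cap cannot inflate the per-step sample totals.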
Trainer accounting and logging
- Computes throughput and progress from the micro batches actually packed under `micro_batch_max_tokens`, and logs the new metrics
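The accounting fix amounts to summing counters over the micro batches that actually ran, instead of deriving them from nominal batch shapes. A sketch under assumed field names (`tokens`, `loss_tokens`, `sample_count` are illustrative, not the repo's exact keys):

```python
def step_metrics(micro_batches: list[dict]) -> dict:
    """Aggregate progress/throughput counters for one trainer step from the
    packed micro batches themselves. 'loss_tokens' means tokens that contribute
    to the loss, i.e. excluding dummy padding added to fill a micro batch."""
    return {
        "num_micro_batches": len(micro_batches),
        "tokens": sum(mb["tokens"] for mb in micro_batches),
        "loss_tokens": sum(mb["loss_tokens"] for mb in micro_batches),
        "samples": sum(mb["sample_count"] for mb in micro_batches),
    }
```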
Docs and tests
- Updates docs that referenced `micro-batch-size` flags to document `micro_batch_max_tokens` instead
- Adds unit tests for the new config and packing behavior

Design Notes
A key goal here is to stay aligned with the current RL architecture instead of importing SFT semantics wholesale.
This PR intentionally does not:
- reintroduce SFT-style `micro_batch_size`

That keeps the implementation small and makes the new knob do exactly one thing: lower per-forward memory pressure by reducing the token budget of each local RL micro batch.
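A back-of-envelope way to see the effect of the knob, ignoring packing fragmentation (the function below is purely illustrative, not part of the PR): shrinking the cap grows the number of accumulated forward/backward passes per step, while the optimizer step and the set of samples in it stay the same.

```python
import math

def min_micro_batches(total_step_tokens: int, micro_batch_max_tokens: int) -> int:
    """Lower bound on micro batches per trainer step under a token cap.
    Halving the cap roughly doubles the accumulation depth, and with it
    halves the per-forward activation memory."""
    return math.ceil(total_step_tokens / micro_batch_max_tokens)
```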
Verification
I added focused unit coverage for the new config and packing behavior.
Local command attempts:
On this machine, the repo lockfile only supports Linux environments, so `uv`-based test execution was blocked on macOS. I still verified the patch with:
- `python3 -m py_compile` on all changed Python files
- `git diff --check`

Note
Medium Risk
Touches RL trainer batching/packing and progress/throughput accounting; mistakes could skew metrics or change effective gradient accumulation, though validation and unit tests reduce risk.
Overview
Adds an explicit RL configuration knob,
trainer.micro_batch_max_tokens, to cap tokens packed into each local micro-batch (defaulting tomodel.seq_len) and validates it can’t exceedmodel.seq_lenand can’t be used with fake RL data.Threads this cap through real RL packing (
prepare_batch/packers) while tracking a newsample_countonMicroBatchto keep sample/token accounting correct across packing and dummy padding. Updates the RL training loop to compute throughput/progress from the actual packed micro-batches (tokens, loss tokens, samples, micro-batch count) and logs these new metrics; docs and unit tests are updated accordingly.Written by Cursor Bugbot for commit dcd22f8. This will update automatically on new commits. Configure here.