
chore: refactor env group config into Env wrapper classes#2193

Open
mikasenghaas wants to merge 48 commits into main from chore/env-group-refactor

Conversation


mikasenghaas commented Apr 3, 2026

Motivation

  • Env setup spread across ~200 lines of procedural orchestrator code (parallel lists, dicts, helper functions)
  • EnvGroup concatenated all env datasets at init, requiring matching column names across envs — can easily break with heterogeneous env mixes
  • Group scoring bolted on via "deferred scoring" mode — this executes environment scoring locally on the orchestrator (not on the external env server)

This PR deprecates EnvGroup in favor of per-env Env wrapper classes that own their state and lifecycle, and flattens the config so each environment is configured independently.

Summary

  • Introduces Env, TrainEnv, EvalEnv wrapper classes that encapsulate per-env state (config, spawn/connect/shutdown, rollout generation, sampling args, group scoring detection)
  • TrainEnvs / EvalEnvs are typed generic containers with bulk spawn/connect/shutdown that deprecate vf.EnvGroup. This also means we stop hijacking the vf task field and instead use a resolved env name as the primary env identifier
  • Flattens env group config: moves buffer.env_ratios to per-env ratio field on EnvConfig
  • Auto-detects group rubric envs and uses run_group instead of deferred group scoring — this moves group scoring back onto the env server instead of executing it locally on the orchestrator
  • Removes temperature scheduling (sampling args can now be computed once at init for train/eval envs)
  • Removes ValConfig and validation loop — users can register train envs as eval envs to get the same functionality
  • Removes the utils/envs.py and trainer/envs.py env-var modules — inlines os.environ reads into World (mainly to free up the envs.py module name on the orchestrator, and to remove some unnecessary bloat)
  • Removes some dead code (e.g. parse_num_completion_tokens and parse_is_truncated_completions helpers)
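
The wrapper layout the summary describes can be sketched roughly as below. This is a hypothetical minimal sketch, not the real prime-rl API: the EnvConfig fields, constructor signatures, and the way group-scoring detection is passed in are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class EnvConfig:
    # Hypothetical config shape; `ratio` is the per-env relative sampling
    # weight that replaces the old buffer.env_ratios dict.
    id: str
    ratio: float = 1.0
    max_retries: int = 3


class Env:
    """Owns one environment's config, lifecycle, and rollout generation."""

    def __init__(self, config: EnvConfig, has_group_rubric: bool = False):
        self.config = config
        self.name = config.id  # resolved env name, the primary identifier
        self.requires_group_scoring = has_group_rubric
        self.connected = False

    def start(self) -> None:
        # In the real class this spawns a server (if needed) and connects;
        # here we only record the state transition.
        self.connected = True

    def shutdown(self) -> None:
        self.connected = False


class TrainEnv(Env):
    pass


class EvalEnv(Env):
    def __init__(self, config: EnvConfig, num_examples: int,
                 rollouts_per_example: int, **kw):
        super().__init__(config, **kw)
        self.num_examples = num_examples
        self.rollouts_per_example = rollouts_per_example


class TrainEnvs:
    """Typed container with bulk lifecycle operations over TrainEnv."""

    def __init__(self, envs: list[TrainEnv]):
        self.envs = envs

    def start(self) -> None:
        for env in self.envs:
            env.start()

    @property
    def names(self) -> list[str]:
        return [e.name for e in self.envs]
```

The point of the shape is that all per-env state (lifecycle, sampling args, group-scoring detection) lives on the wrapper instance, so the orchestrator no longer needs parallel lists and dicts keyed by task name.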

Breaking changes

  • [orchestrator.val] config section removed
  • buffer.env_ratios config field removed — use per-env ratio field instead
  • max_concurrent config field removed
  • Buffer checkpoint hash_keys default changed from ["task", "prompt"] to ["env_name", "prompt"]
  • Temperature scheduling no longer runs per-step
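
A config migration for the env_ratios change might look like the following. The exact section and key names are assumptions based on the field names mentioned above, not verified against the repo's config schema:

```toml
# Before (removed): group-level ratios on the buffer
[orchestrator.buffer]
env_ratios = { reverse_text = 2.0, math = 1.0 }

# After: per-env ratio field on each EnvConfig.
# Ratios are relative weights and need not sum to 1.
[[orchestrator.env]]
id = "reverse_text"
ratio = 2.0

[[orchestrator.env]]
id = "math"
ratio = 1.0
```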

🤖 Generated with Claude Code


Note

High Risk
Large refactor of the RL orchestrator’s core env lifecycle, sampling/buffer semantics, and scheduler request handling, plus multiple breaking config removals; regressions could affect rollout generation, checkpoint resume, and eval metrics.

Overview
Refactors the orchestrator to stop using vf.EnvGroup and instead manage per-environment wrapper objects (Env/TrainEnv/EvalEnv) with container classes (TrainEnvs/EvalEnvs) that handle spawn/connect/shutdown, sampling args, and group-scoring detection.

Updates config and runtime behavior to match the new model: removes [orchestrator.val], orchestrator.max_concurrent, orchestrator.verification, and temperature scheduling; moves buffer.env_ratios to per-env ratio; requires per-eval-env num_examples/rollouts_per_example; changes buffer checkpoint hashing default to ['env_name','prompt']; and changes eval/*/failed_rollouts to log a ratio.

Reworks buffer and scheduler logic around env_name (no longer overloading task) with per-env datasets/pools, ratio-aware sampling, updated checkpoint save/load, and new group-scoring execution via run_group; updates monitors/W&B tables to include env_name, and removes the env-var helper modules in favor of direct os.environ reads in World.

Reviewed by Cursor Bugbot for commit 6b01f9d. Bugbot is set up for automated code reviews on this repo. Configure here.

mikasenghaas and others added 30 commits March 31, 2026 11:13
…field

- Remove ValConfig and all val-related orchestrator code (not part of env group refactor)
- Move env_ratios from BufferConfig to per-env ratio field on EnvConfig
- Fix merge artifacts: train_env_names -> train_envs.names, env_cfg.server.address -> env_cfg.address
- Add validator ensuring env ratios are all-or-nothing across envs
- Update tests and example configs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract spawn/connect/shutdown into Envs class with atexit cleanup
- Auto-detect group rubric envs and use run_group instead of deferred scoring
- Remove deferred_group_scoring_tasks, max_concurrent from EnvConfig
- Fix stale eval references (eval_env_set, eval_env_names, config.eval.num_examples)
- Use Sequence[EnvConfig] for covariant type compatibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Env wraps vf.Environment + config with per-env state (group scoring,
  max_retries, ratio, spawn/connect/shutdown, generate_rollout/generate_group)
- EvalEnv extends Env with num_examples, rollouts_per_example, and evaluate()
- Envs is now a thin container over Env instances
- Scheduler uses env.uses_group_scoring and env.generate_* directly instead
  of maintaining parallel dicts/sets
- Buffer iterates Env instances directly for datasets and ratios
- evaluate_and_log replaces evaluate_env, taking an EvalEnv directly
- Tests use make_env/make_envs helpers for cleaner Env construction

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- sampling_args set once at env init via Envs.set_sampling_args()
- Remove per-step compute_temperature + set_sampling_args from training loop
- get_sampling_args no longer takes temperature param
- Add TrainEnv/EvalEnv subclasses with distinct get_dataset behavior
- EvalEnv.get_dataset calls get_eval_dataset, base Env uses get_dataset
- Buffer no longer needs eval flag — uses env.get_dataset() polymorphically
- Raise on connect if address not configured

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TrainEnvs takes SamplingConfig and computes train sampling args once
- EvalEnvs takes EvalSamplingConfig and computes eval sampling args once
- Remove set_sampling_args, get_eval_sampling_args calls from orchestrator loop
- evaluate_and_log uses eval_env.sampling_args directly
- Access max_retries and ratio through config instead of duplicated attrs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Inline run_rollout/run_group logic into Env.generate_rollout/generate_group
- Make uses_group_scoring a property on Env (reads rubric directly)
- Rename get_sampling_args -> get_train_sampling_args
- Move get_eval_sampling_args from eval_utils to utils (next to train variant)
- Rename vf_env -> _env

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_group

- Remove sampling config from TrainEnv/EvalEnv constructors
- Orchestrator computes sampling args and sets via Envs.set_sampling_args()
- Rename generate_rollout -> run_rollout, generate_group -> run_group

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove dead vf_utils functions: run_rollout, spawn_env_server,
  setup_env_client, wait_for_env_servers, task_uses_group_scoring
- Inline spawn logic (ZMQEnvServer.run_server) directly into Env.spawn()
- Inline connect logic (ZMQEnvClient) directly into Env.connect()
- Convert name and uses_group_scoring from properties to attributes
- Clean up unused imports from vf_utils

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- EvalEnv.evaluate() now contains the full eval pipeline (get inputs,
  round-robin clients, gather groups, collect results)
- Remove run_group, generate, evaluate functions from vf_utils
- Remove dead constants (DEFAULT_RETRIES, DEFAULT_STATE_COLUMNS, REQUIRED_STATE_COLUMNS) from vf_utils

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- EvalEnv.evaluate() now handles rollout generation, metrics, and logging
- Support group scoring in evaluate() via requires_group_scoring branch
- Rename uses_group_scoring -> requires_group_scoring
- Remove semaphore (set_semaphore, get_semaphore) and max_concurrent config
- Remove log_eval_results, evaluate_and_log — eval_utils down to ckpt_step + pass@k
- Remove strip_env_version, add EnvConfig.stripped_id property
- Use model_name consistently

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Resolve train env num_workers in OrchestratorConfig from max_inflight_rollouts
- Resolve eval env num_workers in EvalEnvConfig from num_examples * rollouts_per_example
- Remove resolve_num_workers from vf_utils, max_concurrent from Env.spawn()
- Rename by_task -> by_env in scheduler
- Add EnvConfig.stripped_id property, remove strip_env_version function
- Use debug-level logs for spawn/connect
- Use prime-rl logger in process utils instead of loguru
- Fix eval_envs unbound error when no eval config
- Add multi_reverse_text test config without external server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nv_name

- Remove [orchestrator.val] sections from all config files (ValConfig removed)
- Fix test_scheduler: task= -> env_name= in InflightRolloutInfo
- Warn when eval env defaults to 4 workers due to num_examples=-1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Env.run() dispatches to run_rollout or run_group based on
  requires_group_scoring and rollouts_per_example, always returns list
- Scheduler and EvalEnv.evaluate() both use env.run() — no more branching
- Unified result handling in scheduler (single path for both modes)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
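
The unified dispatch this commit describes can be sketched as below. The rollout bodies are stubs, and the method signatures are assumptions; only the dispatch shape (always return a list, branch on requires_group_scoring) follows the commit message.

```python
class Env:
    """Sketch of Env.run() dispatching to run_rollout or run_group."""

    def __init__(self, requires_group_scoring: bool):
        self.requires_group_scoring = requires_group_scoring

    def run_rollout(self, example):
        # Stub: the real method calls the env server for one rollout.
        return {"example": example, "scored": "individually"}

    def run_group(self, example, rollouts_per_example):
        # Stub: group-scoring rubrics need all rollouts scored together.
        return [{"example": example, "scored": "as_group"}] * rollouts_per_example

    def run(self, example, rollouts_per_example: int = 1):
        """Single entry point: always returns a list of rollout outputs,
        so the scheduler and EvalEnv.evaluate() need no mode branching."""
        if self.requires_group_scoring:
            return self.run_group(example, rollouts_per_example)
        return [self.run_rollout(example) for _ in range(rollouts_per_example)]
```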
…ainEnvs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…internal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove example["task"] = env.name override in buffer
- Change default hash_keys from ["task", "prompt"] to ["env_name", "prompt"]
- Add env_name column to wandb and prime monitor logging
- task field now purely vf-internal (set by verifiers, not overridden)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… cleanups

- Restore `if eval_ckpt_step is not None` guard so evals run at interval steps only
- Use `self.verification.enabled and env.score_rollouts` so global flag is respected
- Fix stray `)` in spawn log message
- Fix `update_pools` return type annotation (str, not list[str])
- Use `get_logger().warning()` instead of `warnings.warn()` in EvalEnvConfig

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unused parse_num_completion_tokens and parse_is_truncated_completions
- Fix ratio description: ratios are relative weights, not required to sum to 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Delete utils/envs.py and trainer/envs.py (97 lines of env-var parsing ceremony)
- Inline os.environ.get() calls directly in World.__init__
- Remove dead parse helpers and fix ratio field description

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mikasenghaas and others added 10 commits April 5, 2026 20:27
- Delete TemperatureSchedulerConfig and temp_scheduling module
- Make SamplingConfig.temperature a plain float (default 1.0)
- Use tuple(dict) + random.choice for buffer sampling instead of
  materializing a list of values on every call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
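
The buffer-sampling change can be sketched as below. This is a simplified illustration (ratio weighting omitted, class and field names hypothetical): the key point is that `tuple(dict)` iterates the dict's keys once at init, so sampling no longer materializes a list of values on every call.

```python
import random


class Buffer:
    """Sketch of per-env example sampling with a cached key tuple."""

    def __init__(self, pools: dict[str, list]):
        self.pools = pools
        # tuple(dict) yields the keys; built once, indexable by random.choice.
        self._env_names = tuple(pools)

    def sample(self):
        # No list(...) allocation per call.
        env_name = random.choice(self._env_names)
        return env_name, random.choice(self.pools[env_name])
```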
…spawn

- Env.start() spawns (if needed) and connects in one call
- Envs.start() does the same for all envs in parallel
- Add requires_env_server_spawn property for readability
- Move spawn logic to private _spawn()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Env.run_rollout() returns a single RolloutOutput
- Env.run_group() returns list[RolloutOutput] for group-scoring envs
- EvalEnv.evaluate() dispatches correctly: n*k individual run_rollout
  calls for non-group envs, n run_group(k) calls for group-scoring envs
- Replace _get_eval_inputs with own logic (get_dataset + flat repeat)
- Scheduler normalizes single/list results via isinstance check
- Progress bar tracks n*k rollouts in both eval modes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mikasenghaas marked this pull request as ready for review April 5, 2026 21:08

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: Scheduler passes raw dict instead of vf.RolloutInput
    • Scheduler now converts examples to vf.RolloutInput (stripping env_name) before calling env.run_rollout/run_group.
  • ✅ Fixed: Missing CHANGELOG entry for new score_rollouts config field
    • Added CHANGELOG entry documenting the new orchestrator.env.score_rollouts field and default.
  • ✅ Fixed: Group-scoring envs re-schedule individual rollouts incorrectly
    • Group-scoring now discards partial groups and re-runs the full group if any rollout fails, avoiding mixed scoring contexts.


Or push these changes by commenting:

@cursor push c42003423a
Preview (c42003423a)
diff --git a/CHANGELOG.md b/CHANGELOG.md
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,7 @@
 - **`orchestrator.max_concurrent` removed**: Concurrency limiting via `max_concurrent` and the global semaphore have been removed. Existing configs must delete this field. (2026-04-05)
 - **`orchestrator.buffer.hash_keys` default changed**: Default changed from `["task", "prompt"]` to `["env_name", "prompt"]`. The `task` field is no longer overridden by the orchestrator for env identification; `env_name` is used instead. Buffer checkpoints using the old default may not resume correctly. (2026-04-05)
 - **`orchestrator.eval.env[].num_examples` / `rollouts_per_example` no longer fall through**: `num_examples` and `rollouts_per_example` are now required per eval env and no longer inherit from the top-level `orchestrator.eval` section. (2026-04-05)
+- **`orchestrator.env.score_rollouts` (NEW)**: Added per-env boolean to control rubric scoring on the env server (default: `true`). Also gated by `orchestrator.verification.enabled`. (2026-04-05)
 - **`orchestrator.eval.env[].failed_rollouts` metric is now a ratio**: The `eval/{name}/failed_rollouts` metric now reports a ratio (0.0–1.0) instead of a raw count. Dashboards keying on this metric should be updated. (2026-04-05)
 - **`orchestrator.sampling.temp_scheduler` removed**: Temperature scheduling (`TemperatureSchedulerConfig`) has been removed. `sampling.temperature` is now a required `float` (default `1.0`). Existing configs using `temp_scheduler` must replace it with a fixed `temperature` value. (2026-04-05)
 - **`log.file` and `log.env_worker_logs` removed**: Removed `log.file` (from `LogConfig` and `SharedLogConfig`) and `log.env_worker_logs` (from `LogConfig`). Python file logging is replaced by deployment-level capture. Existing configs using these fields must delete them. Log paths unified: `.stdout` files renamed to `.log`, SLURM logs moved from `slurm/` to `logs/`. (2026-03-31)

diff --git a/src/prime_rl/orchestrator/scheduler.py b/src/prime_rl/orchestrator/scheduler.py
--- a/src/prime_rl/orchestrator/scheduler.py
+++ b/src/prime_rl/orchestrator/scheduler.py
@@ -196,20 +196,25 @@
         if env.requires_group_scoring:
             rollouts_per_example = group.rollouts_to_schedule
             group.rollouts_to_schedule = 0
+            # Convert example dict to vf.RolloutInput, stripping scheduler-internal keys
+            clean_example_dict = {k: v for k, v in group.example.items() if k != "env_name"}
+            rollout_input = vf.RolloutInput(**clean_example_dict)
             task = asyncio.create_task(
                 env.run_group(
                     client=client_config,
-                    example=group.example,
+                    example=rollout_input,
                     model_name=self.model_name,
                     rollouts_per_example=rollouts_per_example,
                 )
             )
         else:
             group.rollouts_to_schedule -= 1
+            clean_example_dict = {k: v for k, v in group.example.items() if k != "env_name"}
+            rollout_input = vf.RolloutInput(**clean_example_dict)
             task = asyncio.create_task(
                 env.run_rollout(
                     client=client_config,
-                    example=group.example,
+                    example=rollout_input,
                     model_name=self.model_name,
                 )
             )
@@ -410,17 +415,23 @@
 
                     # Check for empty/errored rollouts and reschedule
                     valid_rollouts = []
+                    had_invalid = False
+                    requires_group_scoring = self.envs.get(env_name).requires_group_scoring
                     for rollout in rollouts:
                         if len(rollout["trajectory"]) == 0:
                             self.empty_rollouts_by_env[env_name] += 1
-                            group.rollouts_to_schedule += 1
+                            had_invalid = True
+                            if not requires_group_scoring:
+                                group.rollouts_to_schedule += 1
                             self.logger.warning(
                                 f"Empty trajectory in group {group_id} ({env_name}), re-scheduling "
                                 f"({len(group.completed_rollouts)}/{self.rollouts_per_example} complete)"
                             )
                         elif rollout["error"] is not None:
                             self.errored_rollouts_by_env[env_name] += 1
-                            group.rollouts_to_schedule += 1
+                            had_invalid = True
+                            if not requires_group_scoring:
+                                group.rollouts_to_schedule += 1
                             self.logger.warning(
                                 f"Rollout error in group {group_id} ({env_name}), re-scheduling "
                                 f"({len(group.completed_rollouts)}/{self.rollouts_per_example} complete): "
@@ -430,6 +441,13 @@
                             rollout["env_name"] = env_name
                             valid_rollouts.append(rollout)
 
+                    # For group-scoring envs, treat each run_group attempt atomically:
+                    # if any rollout in the group fails, discard the entire attempt and re-run full group.
+                    if requires_group_scoring and had_invalid:
+                        group.completed_rollouts.clear()
+                        group.rollouts_to_schedule = self.rollouts_per_example
+                        continue
+
                     group.completed_rollouts.extend(valid_rollouts)
                     if len(group.completed_rollouts) < self.rollouts_per_example:
                         continue


mikasenghaas and others added 3 commits April 5, 2026 21:23
…dule

- Remove score_rollouts field from EnvConfig (was internal, not user-facing)
- Fix group-scoring envs: discard entire group on partial failure and
  reschedule from scratch (group rewards require all rollouts together)
- Fix "Eval environments ready" -> "Eval environment(s) ready"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Delete VerificationConfig class and verification field from OrchestratorConfig
- Remove validate_verification_config validator
- Stop setting score_rollouts in extra_env_kwargs (defaults to True in verifiers)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
run_rollout returns a single RolloutOutput, not a list. Without
wrapping, iterating over the dict yields keys instead of the rollout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
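
The bug this commit fixes — iterating a dict yields its keys, not the rollout — suggests a small normalization helper where single and list results meet. A hypothetical sketch (the helper name is an assumption; the commit only describes the wrapping fix):

```python
def normalize_result(result):
    """run_rollout returns a single RolloutOutput dict; run_group returns a
    list of them. Wrap singles so downstream code can always iterate rollouts
    rather than accidentally iterating a dict's keys."""
    return result if isinstance(result, list) else [result]
```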
mikasenghaas and others added 3 commits April 5, 2026 21:40
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Convert from NamedTuple to dataclass for direct mutation of
off_policy_steps, and rename to better reflect what it represents.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
inflight_rollout_count, cancel_inflight_rollouts, and drop_group all
counted tasks (1 per group request) instead of actual rollouts. This
caused the capacity check to allow up to rollouts_per_example × more
concurrent rollouts than max_inflight_rollouts intended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.




max_retries=self.config.max_retries,
state_columns=REQUIRED_STATE_COLUMNS,
env_client=self.env_client,
)


Shared mutable dict passed to concurrent rollouts

Medium Severity

Env.run_group builds the input list as [example for _ in range(rollouts_per_example)], passing N references to the same mutable dict. Similarly, Env.run_rollout passes group.example directly from the scheduler. The old wrappers in vf_utils.py created a fresh shallow copy via vf.RolloutInput(**example) for each rollout/group element. If vf.Environment.run_rollout or run_group mutates the input dict (e.g. setting task), all concurrent rollouts in a group share and clobber the same object. The eval path correctly creates copies with vf.RolloutInput(**example), making this inconsistency more evident.
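
The aliasing hazard Bugbot describes is a general Python pitfall, demonstrable without the real classes. The dict contents below are made up; only the aliased-vs-copied pattern matters:

```python
example = {"prompt": "2+2", "env_name": "math"}

# Buggy pattern: N references to ONE mutable dict. A mutation made during
# any rollout (e.g. the env setting "task") is visible to all of them.
aliased = [example for _ in range(4)]
aliased[0]["task"] = "clobbered"
assert all(e.get("task") == "clobbered" for e in aliased)

# Safe pattern: a fresh shallow copy per rollout, mirroring what the old
# vf.RolloutInput(**example) wrappers effectively did for each element.
example = {"prompt": "2+2", "env_name": "math"}
copies = [dict(example) for _ in range(4)]
copies[0]["task"] = "only_here"
assert "task" not in copies[1]
```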



def inflight_sample_count(self) -> int:
    pending = sum(g.rollouts_to_schedule for g in self.groups.values())
    return self.inflight_rollout_count + pending


Group scheduling can overshoot max inflight rollouts limit

Low Severity

_schedule_next_request checks remaining_capacity > 0 before calling schedule_rollout, but for group-scoring envs, schedule_rollout consumes all rollouts_to_schedule at once (setting rollout_count = group.rollouts_to_schedule). When remaining capacity is less than rollouts_per_example, this overshoots max_inflight_rollouts by up to rollouts_per_example - 1. The old code scheduled one rollout per task, respecting the capacity precisely.
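
A capacity check that avoids the overshoot would reserve room for the whole group before scheduling it, since a group-scoring env consumes all of its rollouts in one request. A hedged sketch (function and parameter names are hypothetical, not the scheduler's actual API):

```python
def can_schedule_group(inflight: int, pending: int,
                       group_size: int, max_inflight: int) -> bool:
    """Return True only if the ENTIRE group fits under the inflight limit.

    Checking merely `remaining_capacity > 0` before scheduling a group of
    `group_size` rollouts can overshoot max_inflight by up to
    group_size - 1; requiring full headroom prevents that.
    """
    return max_inflight - (inflight + pending) >= group_size
```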


