Changes from 43 commits
48 commits
5a2da30
ckpt
mikasenghaas Mar 31, 2026
9761e0b
Merge branch 'main' into chore/env-group-refactor
mikasenghaas Apr 3, 2026
3eef697
chore: flatten env group config and move env_ratios to per-env ratio …
mikasenghaas Apr 3, 2026
1956976
chore: simplify orchestrator env setup and auto-detect group scoring
mikasenghaas Apr 3, 2026
ffd7325
chore: introduce Env wrapper class and simplify orchestrator plumbing
mikasenghaas Apr 3, 2026
5c2decc
chore: store sampling_args on Env, remove temp scheduling from orch
mikasenghaas Apr 3, 2026
18b8336
chore: type-narrow TrainEnv constructor to TrainEnvConfig
mikasenghaas Apr 3, 2026
776321d
chore: set sampling args at env init, compute once per orchestrator
mikasenghaas Apr 3, 2026
578b9c6
chore: inline vf_utils into Env, rename get_sampling_args
mikasenghaas Apr 3, 2026
f45ae9c
chore: set sampling_args from orchestrator, rename to run_rollout/run…
mikasenghaas Apr 3, 2026
c6d34f9
chore: remove dead code, inline spawn/connect into Env
mikasenghaas Apr 3, 2026
40e444a
chore: revert math_group config changes
mikasenghaas Apr 3, 2026
53508bd
chore: remove commented-out TrainEnvConfig stub
mikasenghaas Apr 3, 2026
36f15d1
chore: inline evaluate/generate/run_group into EvalEnv
mikasenghaas Apr 5, 2026
0df5d3e
chore: inline eval pipeline into EvalEnv, remove semaphore and dead code
mikasenghaas Apr 5, 2026
5130f5c
chore: resolve num_workers in configs, cleanup naming and logging
mikasenghaas Apr 5, 2026
051517f
chore: make Envs generic over env type for correct type narrowing
mikasenghaas Apr 5, 2026
c844d34
chore: fix tests — remove val config from all TOML files, fix task->e…
mikasenghaas Apr 5, 2026
c4ab1f3
chore: unify rollout dispatch with Env.run() helper
mikasenghaas Apr 5, 2026
c4b0c32
chore: rename EnvBuffer -> Buffer, BufferSet -> Buffers, narrow to Tr…
mikasenghaas Apr 5, 2026
2b79f91
chore: Buffer takes TrainEnv directly, gets its own dataset
mikasenghaas Apr 5, 2026
ce8509b
chore: merge Buffer naming — Buffer is the public API, _EnvBuffer is …
mikasenghaas Apr 5, 2026
7315036
chore: stop hijacking vf task field for env identification
mikasenghaas Apr 5, 2026
021bbde
fix: restore eval interval gate, verification.enabled flag, and minor…
mikasenghaas Apr 5, 2026
ce74ea9
Merge remote-tracking branch 'origin/main' into chore/env-group-refactor
mikasenghaas Apr 5, 2026
1db0424
chore: add changelog entries for env group refactor breaking changes
mikasenghaas Apr 5, 2026
7bf7876
chore: improve EnvConfig and EvalEnvConfig field descriptions
mikasenghaas Apr 5, 2026
20af2df
chore: silence noisy eval num_workers warning in config validator
mikasenghaas Apr 5, 2026
35f1b43
chore: remove dead parse helpers, fix ratio field description
mikasenghaas Apr 5, 2026
731fc07
chore: remove env-var modules, inline os.environ into World
mikasenghaas Apr 5, 2026
f735482
chore: remove temp scheduler, avoid list alloc in buffer sampling
mikasenghaas Apr 5, 2026
ec603e7
chore: merge spawn+connect into Env.start(), add requires_env_server_…
mikasenghaas Apr 5, 2026
d98832b
chore: split run into run_rollout/run_group, fix eval group scoring
mikasenghaas Apr 5, 2026
bf19583
chore: simplify evaluate() dispatch into single run_with_progress
mikasenghaas Apr 5, 2026
395a376
chore: define run_with_progress per branch to avoid misleading param
mikasenghaas Apr 5, 2026
fadb7d1
chore: move get_dataset to TrainEnv, inline get_eval_dataset in evaluate
mikasenghaas Apr 5, 2026
cf8e771
chore: load eval dataset once in EvalEnv constructor
mikasenghaas Apr 5, 2026
f1fe16a
chore: remove redundant total_count variable in evaluate
mikasenghaas Apr 5, 2026
edf343a
mini
mikasenghaas Apr 5, 2026
fe7e35e
update config
mikasenghaas Apr 5, 2026
c3163c5
chore: remove score_rollouts from EnvConfig, fix group scoring resche…
mikasenghaas Apr 5, 2026
a70d485
chore: remove VerificationConfig, rollout scoring is always enabled
mikasenghaas Apr 5, 2026
c949c11
fix: wrap single run_rollout result in list in scheduler
mikasenghaas Apr 5, 2026
0ed00b9
chore: rename n/k to descriptive names in evaluate, fix k shadowing
mikasenghaas Apr 5, 2026
9f19477
chore: remove unused num_rollouts param from update_pools
mikasenghaas Apr 5, 2026
7f595cb
chore: rename InflightRolloutInfo to InflightRequest dataclass
mikasenghaas Apr 5, 2026
469ef6c
fix: count rollouts not tasks for group-scoring capacity and metrics
mikasenghaas Apr 5, 2026
6b01f9d
chore: document skip_first removal in changelog
mikasenghaas Apr 5, 2026
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,14 @@

Documenting changes which affect configuration usage patterns (added/moved/removed/renamed fields, notable logic changes).

+- **`orchestrator.buffer.env_ratios` → per-env `orchestrator.env[].ratio`**: `buffer.env_ratios` has been removed. Set `ratio` on each `[[orchestrator.env]]` entry instead. Ratios are all-or-nothing across envs: either every env sets a ratio or none does. (2026-04-05)
+- **`orchestrator.val` removed**: The `[orchestrator.val]` config section (`ValConfig`) has been removed. Validation was a separate concern from the env group refactor. Existing configs must delete this section. (2026-04-05)
+- **`orchestrator.max_concurrent` removed**: Concurrency limiting via `max_concurrent` and the global semaphore has been removed. Existing configs must delete this field. (2026-04-05)
+- **`orchestrator.buffer.hash_keys` default changed**: Default changed from `["task", "prompt"]` to `["env_name", "prompt"]`. The `task` field is no longer overridden by the orchestrator for env identification; `env_name` is used instead. Buffer checkpoints using the old default may not resume correctly. (2026-04-05)
+- **`orchestrator.eval.env[].num_examples` / `rollouts_per_example` no longer fall through**: `num_examples` and `rollouts_per_example` are now required per eval env and no longer inherit from the top-level `orchestrator.eval` section. (2026-04-05)
+- **`orchestrator.eval.env[].failed_rollouts` metric is now a ratio**: The `eval/{name}/failed_rollouts` metric now reports a ratio (0.0–1.0) instead of a raw count. Dashboards keying on this metric should be updated. (2026-04-05)
+- **`orchestrator.sampling.temp_scheduler` removed**: Temperature scheduling (`TemperatureSchedulerConfig`) has been removed. `sampling.temperature` is now a non-optional `float` (default `1.0`). Existing configs using `temp_scheduler` must replace it with a fixed `temperature` value. (2026-04-05)
+- **`orchestrator.verification` removed**: The `[orchestrator.verification]` config section (`VerificationConfig`) has been removed. Rollout scoring is now always enabled. Existing configs must delete this section. (2026-04-05)
- **`log.file` and `log.env_worker_logs` removed**: Removed `log.file` (from `LogConfig` and `SharedLogConfig`) and `log.env_worker_logs` (from `LogConfig`). Python file logging is replaced by deployment-level capture. Existing configs using these fields must delete them. Log paths unified: `.stdout` files renamed to `.log`, SLURM logs moved from `slurm/` to `logs/`. (2026-03-31)
- **`trainer.log.ranks_filter` (NEW)**: Added `ranks_filter: list[int]` to `TrainerLogConfig` (default: `[0]`). Controls which ranks appear in trainer console output via torchrun's `--local-ranks-filter`. (2026-03-31)
- **`wandb.log_extras.sample_ratio` / monitor sample logging defaults**: `wandb.log_extras.sample_ratio` is now actually applied to W&B sample-table logging via the shared monitor sampler (it was previously a no-op for WandB). Separately, the orchestrator no longer hard-caps sample logging to 8 rollouts before monitor-level sampling runs, so when monitor `sample_ratio` is `None`, monitors now receive and may log the full rollout batch for a step instead of at most 8 rollouts. This affects both W&B and Prime monitor sample logging behavior. (2026-03-27)
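
The 2026-04-05 entries above imply a mechanical rewrite of existing orchestrator configs. A minimal sketch of that rewrite follows, operating on a config that has already been parsed into a plain dict. The helper `migrate_orchestrator_config` is hypothetical and not part of this PR. It covers only the removed sections, the `env_ratios` move, and the all-or-nothing ratio rule, and it assumes `env_ratios` entries line up positionally with the `[[orchestrator.env]]` entries.

```python
from copy import deepcopy

# Sections/fields under [orchestrator] that the changelog lists as removed.
REMOVED_ORCHESTRATOR_KEYS = ("val", "max_concurrent", "verification")


def migrate_orchestrator_config(config: dict) -> dict:
    """Return a copy of `config` rewritten for the post-refactor layout.

    Hypothetical helper for illustration; the input dict is left untouched.
    """
    new = deepcopy(config)
    orch = new.get("orchestrator", {})

    # Drop sections and fields that no longer exist.
    for key in REMOVED_ORCHESTRATOR_KEYS:
        orch.pop(key, None)

    # A schedule cannot be converted automatically: the user has to pick
    # a fixed sampling.temperature by hand.
    if "temp_scheduler" in orch.get("sampling", {}):
        raise ValueError(
            "replace sampling.temp_scheduler with a fixed sampling.temperature"
        )

    # Move buffer.env_ratios onto the env entries (assumed positional match).
    envs = orch.get("env", [])
    env_ratios = orch.get("buffer", {}).pop("env_ratios", None)
    if env_ratios is not None:
        if len(env_ratios) != len(envs):
            raise ValueError("env_ratios needs exactly one entry per env")
        for env, ratio in zip(envs, env_ratios):
            env["ratio"] = ratio

    # All-or-nothing rule: every env sets a ratio, or none does.
    with_ratio = sum("ratio" in env for env in envs)
    if with_ratio not in (0, len(envs)):
        raise ValueError("either every env sets `ratio` or none does")

    return new


old_config = {
    "orchestrator": {
        "max_concurrent": 64,  # illustrative value, not taken from the diff
        "val": {"interval": 5, "num_examples": 128},
        "env": [
            {"id": "math-env", "name": "hendrycks-math"},
            {"id": "math-env", "name": "acereason-math"},
        ],
        "buffer": {"env_ratios": [0.5, 0.5], "easy_threshold": 1.0},
    }
}
new_config = migrate_orchestrator_config(old_config)
```

The sketch raises on `temp_scheduler` instead of guessing a temperature, since the changelog only says a fixed value must be supplied. The `hash_keys` default change is not handled here because it affects buffer checkpoints rather than the config file.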
4 changes: 0 additions & 4 deletions configs/ci/nightly/multimodal_color_codeword.toml
@@ -24,10 +24,6 @@ max_tokens = 64
id = "color-codeword"
args = { images_per_turn = 1, max_turns = 3, num_examples = 1000, seed = 42 }

-[orchestrator.val]
-interval = 5
-num_examples = 100
-
[trainer]

[trainer.model]
4 changes: 0 additions & 4 deletions configs/hendrycks_math/rl.toml
@@ -24,10 +24,6 @@ args = { dataset_name = "PrimeIntellect/Hendrycks-Math", dataset_subset = "defau
easy_threshold = 1.0
hard_threshold = 0.0

-[orchestrator.val]
-interval = 5
-num_examples = 128
-
[orchestrator.eval]
interval = 10

11 changes: 4 additions & 7 deletions configs/math_group/rl.toml
@@ -8,28 +8,25 @@ name = "math-group"
name = "Qwen/Qwen3-4B-Instruct-2507"

[orchestrator]
-batch_size = 512
-rollouts_per_example = 16
-oversampling_factor = 1.5
+batch_size = 256
+rollouts_per_example = 8

[[orchestrator.env]]
id = "math-env"
name = "hendrycks-math"
args = { dataset_name = "PrimeIntellect/Hendrycks-Math", dataset_subset = "default" }
+ratio = 0.5

[[orchestrator.env]]
id = "math-env"
name = "acereason-math"
args = { dataset_name = "nvidia/AceReason-Math", dataset_subset = "default", question_key = "problem" }
+ratio = 0.5

[orchestrator.buffer]
easy_threshold = 1.0
hard_threshold = 0.0

-[orchestrator.val]
-interval = 5
-num_examples = 128
-
[orchestrator.eval]
interval = 50

5 changes: 0 additions & 5 deletions configs/multi_reverse_text/rl.toml
@@ -13,16 +13,11 @@ max_tokens = 128

[[orchestrator.env]]
id = "reverse-text"
-address = "tcp://127.0.0.1:5000" # requires: uv run env-server --env.id reverse-text --env.address tcp://127.0.0.1:5000

[[orchestrator.env]]
id = "reverse-text"
name = "reverse-text-2"

-[orchestrator.val]
-interval = 1
-num_examples = 16
-
[orchestrator.eval]
interval = 5

4 changes: 0 additions & 4 deletions configs/multimodal/rl_color_codeword.toml
@@ -20,10 +20,6 @@ max_tokens = 64
id = "color-codeword"
args = { images_per_turn = 1, max_turns = 3, num_examples = 1000, seed = 42 }

-[orchestrator.val]
-interval = 1
-num_examples = 100
-
[trainer]

[trainer.model]
4 changes: 0 additions & 4 deletions configs/nemotron_4node/rl.toml
@@ -60,10 +60,6 @@ args = { dataset_name = "PrimeIntellect/Hendrycks-Math", dataset_subset = "defau
easy_threshold = 1.0
hard_threshold = 0.0

-[orchestrator.val]
-interval = 5
-num_examples = 128
-
[orchestrator.eval]
interval = 10

4 changes: 0 additions & 4 deletions configs/nemotron_debug/rl.toml
@@ -59,10 +59,6 @@ args = { dataset_name = "PrimeIntellect/Hendrycks-Math", dataset_subset = "defau
easy_threshold = 1.0
hard_threshold = 0.0

-[orchestrator.val]
-interval = 5
-num_examples = 128
-
[orchestrator.eval]
interval = 10

8 changes: 6 additions & 2 deletions examples/Intellect-3.1/rl.toml
@@ -50,30 +50,34 @@ oversampling_factor = 2
[[orchestrator.env]]
id = "mini-swe-agent-plus"
name = "swe"
+ratio = 0.3
args = { max_turns = 200, cpu_cores = 2, memory_gb = 4, disk_size_gb = 4, labels = ["mini-swe-agent-plus"], total_timeout_minutes = 720, sandbox_client_max_workers = 256, max_command_timeouts = 3, sandbox_command_timeout = 30}

[[orchestrator.env]]
id = "deepdive"
name = "deepdive"
+ratio = 0.2
args = { finish_with_tool = true, open_max_workers = 128, cache_dir = "/tmp/i3_deepdive_cache_train" }

[[orchestrator.env]]
id = "math-env"
name = "math"
+ratio = 0.3
args = { min_avg_reward = 0.0, max_avg_reward = 0.874}

[[orchestrator.env]]
id = "logic-env"
-args = { min_avg_reward = 0.0, max_avg_reward = 0.874 }
name = "logic"
+ratio = 0.2
+args = { min_avg_reward = 0.0, max_avg_reward = 0.874 }

[[orchestrator.env]]
id = "code-env"
name = "code"
+ratio = 0.2
args = { pool_size = 512 }

[orchestrator.buffer]
-env_ratios = [0.3, 0.2, 0.3, 0.2, 0.2]
easy_threshold = 1.0
online_difficulty_filtering = true
seed = 42
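
The five per-env ratios in this hunk sum to 1.2 rather than 1.0, the same as the `env_ratios` list they replace. The diff does not show how the orchestrator turns ratios into sampling probabilities. The sketch below assumes they are treated as relative weights and normalized; the `normalize_ratios` helper is hypothetical and only illustrates that assumption.

```python
def normalize_ratios(ratios: dict[str, float]) -> dict[str, float]:
    """Scale per-env ratios so they sum to 1.0.

    Hypothetical helper: assumes ratios act as relative sampling weights.
    """
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    return {name: ratio / total for name, ratio in ratios.items()}


# Env names and ratios copied from the hunk above.
intellect_ratios = {
    "swe": 0.3,
    "deepdive": 0.2,
    "math": 0.3,
    "logic": 0.2,
    "code": 0.2,
}
weights = normalize_ratios(intellect_ratios)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Under this assumption the relative proportions are preserved: `swe` and `math` are each sampled 1.5 times as often as `deepdive`, `logic`, or `code`.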