Releases: PrimeIntellect-ai/prime-rl

v0.5.0 release

30 Mar 11:48
63331ad


1. Disaggregated Prefill-Decode Inference

Added support for disaggregated prefill-decode inference with multi-replica support. This architecture separates prefill and decode phases across dedicated GPU pools, improving throughput and latency for large-scale RL training. Includes vLLM router integration for intelligent request routing across replicas.

  • Separate prefill and decode workers for optimal GPU utilization
  • Multi-replica support for horizontal scaling
  • Router URL support for elastic inference pool integration

#2030 — Disaggregated prefill-decode inference with multi-replica support
#2049 — Add router_url support for elastic inference pool
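As a rough sketch of the routing idea (illustrative Python only — the actual routing lives in the vLLM router integration, and these class names and URLs are invented for the example): each request is assigned a prefill replica and a decode replica from separate pools, so the compute-bound prefill phase and the latency-bound decode phase scale independently.

```python
from itertools import cycle

class DisaggregatedRouter:
    """Toy router: round-robins requests across separate prefill and decode pools."""

    def __init__(self, prefill_urls, decode_urls):
        self._prefill = cycle(prefill_urls)
        self._decode = cycle(decode_urls)

    def route(self, request_id):
        # Each pool is cycled independently, so pool sizes can differ.
        return {
            "request": request_id,
            "prefill": next(self._prefill),
            "decode": next(self._decode),
        }

router = DisaggregatedRouter(["gpu-a:8000", "gpu-b:8000"], ["gpu-c:8000"])
print(router.route(1)["prefill"])  # gpu-a:8000
```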

2. New Model Support

GLM5

Added support for GLM5 models, including deep-gemm dependency for FP8 operations and AFMoE flash attention.

#1827 — GLM5 support
#1884 — Add deep_gemm as dependency for GLM5
#1766 — Add flash attention for AFMoE

Qwen3.5 MoE

Added Qwen3.5 MoE model support including EP, VLM weight broadcast, hybrid context parallelism (DeltaNet + attention), and LoRA compatibility.

#1946 — Qwen3.5 support
#2026 — Qwen3.5 MoE model support with EP and VLM weight broadcast
#2080 — Hybrid CP for Qwen3.5 MoE (DeltaNet + attention)
#2019 — Monkeypatch for Qwen3.5 LoRA
#2027 — Patch sharded LoRA slice_lora_a for Qwen3.5 MoE

MiniMax M2.1

Added MiniMax M2.1 MoE model support with LoRA compatibility fixes for vLLM inference.

#1773 — MiniMax M2.1 MoE model support
#1831 — Fix MiniMax M2 LoRA compatibility in vLLM inference

Nemotron-H

Added support for NemotronH (Nemotron-3-Super-120B-A12B), a hybrid Mamba-Transformer model, including tool call parser integration and dimension mismatch fixes for non-120B variants.

#2046 — NemotronH model support
#2089 — Add NemotronH to tool call parser map
#2110 — Fix Nemotron-H Mamba layer dimension mismatch for non-120B models

GPT-OSS

Added GPT-OSS model support with LoRA and DP fixes.

#2073 — GPT-OSS support with LoRA + DP fixes

3. Multi Env Worker

New multi-environment worker architecture that isolates environment execution from scheduling logic. Environment servers now run as separate processes, improving fault isolation and enabling independent scaling of env execution.

#1714 — Env worker integration
#2083 — Multi env worker
#1816 — Env server recovery
#1939 — Fix env server task cancellation
#2028 — Cleanup env subprocesses on orchestrator crash

4. SFT Improvements

SFT LoRA

Added LoRA support for supervised fine-tuning, enabling parameter-efficient training.

#1849 — SFT LoRA support

SFT Distillation

Bug fixes, VLM support, and pretokenization optimization for SFT distillation.

#2053 — SFT distillation: bug fixes, VLM support, and pretokenization optimization

Fused Linear Cross-Entropy Loss

Memory-efficient fused linear cross-entropy loss for SFT, reducing peak memory usage.

#1958 — Fused linear cross-entropy loss for SFT

SFT Validation

Added validation eval with val_data support for monitoring training quality.

#1850 — SFT validation eval with val_data

5. Performance

Selective Activation Checkpointing

Checkpoint activations selectively per layer, reducing memory while skipping recomputation for layers that don't need it.

#2055 — Selective activation checkpointing

FP8 Weight Transfer

Integrated FP8 weight transfer format for faster model weight synchronization between trainer and inference.

#2038 — Integrate FP8 weight transfer format

Sequence-Chunked Fused LM Head

Switched fused LM head from token chunking to sequence chunking for better memory efficiency.

#1987 — Switch fused LM head to sequence chunking

AFMoE Ring Attention

Added ring attention support for AFMoE (Alternating Full MoE) architectures.

#1848 — AFMoE ring attention support

VLM Performance

Multiple optimizations for vision-language model training: parallel image preprocessing, image deduplication, disk-backed offloading, and async preprocessing.

#1951 — Parallelize VLM image preprocessing across threads
#1935 — Deduplicate VLM image bytes in orchestrator cache
#1923 — Serialize VLM pixel_values as raw bytes for 8-10x faster preprocessing
#2006 — Disk-backed image offloading for VLM memory
#2065 — Run VLM image preprocessing in thread to unblock event loop

Quack Integration

Integrated Quack kernels for RMS norm and SFT loss.

#2102 — Use Quack for RMS norm and SFT loss

6. Infrastructure & Deployment

SLURM Entrypoint

Unified SLURM entrypoint for both RL and SFT with Jinja2 template-based sbatch generation. Supports single-node deployments and configurable pre-run commands.

#1774 — SLURM entrypoint with Jinja2 template
#1832 — SFT SLURM entrypoint
#1859 — Unify RL and SFT SLURM entrypoints
#1988 — Add pre_run_command and common SLURM scheduling options
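To illustrate template-based sbatch generation (prime-rl uses Jinja2; this dependency-free sketch substitutes stdlib `string.Template`, and the field names and `uv run` entrypoint are assumptions, not the real template):

```python
from string import Template

# Minimal stand-in for the Jinja2 sbatch template.
SBATCH = Template("""\
#!/bin/bash
#SBATCH --nodes=$nodes
#SBATCH --gpus-per-node=$gpus
$pre_run_command
uv run $entrypoint $config
""")

# Render one sbatch script from a run configuration.
script = SBATCH.substitute(
    nodes=1,
    gpus=8,
    pre_run_command="module load cuda",  # configurable pre-run command
    entrypoint="rl",
    config="--config my_run.toml",
)
print(script.splitlines()[1])  # #SBATCH --nodes=1
```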

Arm64 Support

Added aarch64 (arm64) support to Docker builds and dependencies.

#1933 — Arm64 support for Dockerfile and deps

Config Migration to pydantic-config

Migrated the configuration system to pydantic-config with JSON dict CLI support, cleaner discriminated unions, and consolidated config modules.

#1915 — Migrate config system to pydantic_config
#1871 — Consolidate config modules into prime_rl.configs
#1878 — Replace callable discriminator with field values

Platform Integration

Added platform integration for centralized management.

#1896 — Platform integration

Docs Revamp

Revamped documentation and README.

#2116 — Revamp docs and readme

7. Other Improvements

  • Individual Rollouts: Schedule and track rollouts individually for finer-grained control. #1865
  • TITO Default: Text-In-Text-Out (TITO) is now the default inference endpoint. #1851
  • Group Relative Reward Rescaling: Added GRRR as a length penalty option. #2029
  • DP-Rank Routing: Route rollouts based on data-parallel rank. #1940
  • Router Replay: Replay routing decisions for debugging and reproducibility. #1807
  • Gibberish & Repetition Filtering: Filter out low-quality rollouts. #1746
  • Weights-Only Checkpointing: Save just model weights without optimizer state. #2033
  • Per-Env Metrics: Logging and metrics broken down per training environment. #1989, #2070
  • Eval Metrics: Log eval/{env}/failed_rollouts to W&B and add eval/samples table. #2123, #2124
  • EP Inference Support: Expert parallelism at inference time. #1860
  • Multi-Node EP: Support for expert parallelism across multiple nodes. #1894
  • IPO Default Algorithm: Changed default RL algorithm to IPO. [#1930](https://github.com/PrimeIntellect-ai/prime...

v0.4.0 release

06 Feb 22:29
f870f3c


1. Bring Your Own Algorithms

Researchers can now plug in custom loss functions and advantage functions without modifying the core training code. Define your own RL objectives and advantage estimators, configure them via TOML, and experiment freely.

  • Custom Loss: provide a per-sequence loss function via LossInputs / LossOutputs dataclasses
  • Custom Advantage: provide a per-problem advantage function via AdvantageInputs / AdvantageOutputs dataclasses
  • Configure everything in your TOML config with type = "custom", import_path and kwargs
```toml
# Custom loss
[loss]
type = "custom"
import_path = "my_module.ppo_clip_loss"
kwargs = { clip_eps = 0.2 }

# Custom advantage
[advantage]
type = "custom"
import_path = "my_module.normalized_advantage"
kwargs = { eps = 1e-8 }
```

See docs/bring-your-own-algorithms.md for full documentation.

#1715 — Bring your own algorithms
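A custom loss like the `my_module.ppo_clip_loss` referenced in the TOML above might look roughly like this. Plain-Python sketch only: the real hook receives the `LossInputs` / `LossOutputs` dataclasses with tensor fields, whose exact shape is documented in docs/bring-your-own-algorithms.md; the list-based stand-in below is for clarity.

```python
import math
from dataclasses import dataclass
from typing import List

# Illustrative stand-in for prime-rl's LossInputs dataclass.
@dataclass
class LossInputs:
    logprobs: List[float]      # token log-probs under the current policy
    old_logprobs: List[float]  # token log-probs under the behavior policy
    advantages: List[float]    # per-token advantage estimates

def ppo_clip_loss(inputs: LossInputs, clip_eps: float = 0.2) -> float:
    losses = []
    for lp, old_lp, adv in zip(inputs.logprobs, inputs.old_logprobs, inputs.advantages):
        ratio = math.exp(lp - old_lp)                         # importance ratio
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)  # clamp to [1-eps, 1+eps]
        # PPO's pessimistic objective, negated so lower is better.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```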

2. Multimodal RL Training

Added experimental support for multimodal reinforcement learning training, enabling RL fine-tuning of vision-language models (VLMs). This opens up new possibilities for training models that can reason over both text and images using reinforcement learning.

Key capabilities:

  • Train VLMs with the same GRPO/PPO algorithms used for text-only models
  • Multi-turn conversation support for multi-modal interactions, allowing complex dialogue flows with interleaved images and text
  • Compatible with existing reward functions and verifiers

#1680 — Add multimodal training (experimental)
#1703 — Add multi-turn support for multi-modal RL

3. Performance & Parallelism

Expert Parallelism (EP)

Added support for Expert Parallelism, a distributed training strategy for Mixture of Experts (MoE) models.

#1595 — Expert Parallelism support
#1614 — Add CP and EP to benchmarks

Flash Attention 4

Added FA4 support for fast attention on Blackwell.

#1726 — Flash Attention 4

FA3 Ring-Attention Kernel

Previously our ring attention algorithm was still using the Flash Attention 2 kernel. We now allow using FA3 instead for significant speedup on long context training.

#1727 — Add FA3 ring-attention kernel wrapper and benchmark coverage

Optimizer State CPU Offload

Offload optimizer states (e.g. Adam first and second moments) to CPU memory. Particularly useful to reduce memory usage when doing RL experiments at smaller scale, allowing large MoE models to fit on a couple of training nodes. The performance reduction is negligible in RL because large batch sizes mean many gradient accumulation steps, and the cost of offloading weights to CPU is amortized.

#1694 — Add optimizer state CPU offload

3-Stage Chunked LM Head Loss

Improved memory efficiency for the language model head loss computation via a 3-stage chunked approach. Instead of materializing the full logit tensor, the loss is computed in chunks, reducing peak memory usage. This is especially beneficial for large-vocabulary models where the logit tensor can be a major memory bottleneck during the backward pass.

#1649 — 3-stage logic for chunked lm head loss
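The chunking idea can be sketched as follows (toy pure-Python version on lists; the real implementation operates on GPU tensors and fuses the LM-head projection with the loss). The key point is that the per-chunk logit rows are discarded after use, so the full [seq, vocab] logit tensor is never materialized.

```python
import math

def chunked_lm_loss(hidden, weight, targets, chunk_size=2):
    """Mean next-token NLL, computed one sequence chunk at a time.

    hidden:  [seq][d]   final hidden states
    weight:  [vocab][d] LM-head weight rows
    targets: [seq]      gold token ids
    """
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        chunk_h = hidden[start:start + chunk_size]
        chunk_t = targets[start:start + chunk_size]
        for h, t in zip(chunk_h, chunk_t):
            # Project this position only; the logit row is freed after use.
            logits = [sum(hi * wi for hi, wi in zip(h, w)) for w in weight]
            m = max(logits)  # stabilized log-sum-exp
            logsumexp = m + math.log(sum(math.exp(l - m) for l in logits))
            total += logsumexp - logits[t]  # negative log-likelihood
    return total / len(hidden)
```

Chunked and unchunked evaluation give identical losses; only peak memory differs.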

4. Other Improvements

  • Elastic Inference Pool: New elastic inference pool with DNS-based service discovery for dynamic scaling of inference servers at runtime. Add or remove servers without restarting the training loop, with automatic health checking and failover. #1617, #1704
  • Temperature Scheduler: Control sampling temperature throughout training with various scheduling strategies, enabling curriculum-style exploration. #1624
  • JSON Structured Logging: JSON structured logging for easier log aggregation and analysis in production. #1681
  • Gemma3 Support: Added native support for Gemma3 models. #1648
  • Worker Rate Limiting: Rate limiting for worker job submissions to control dispatch pace. #1711
  • K8s Health Probes: Health probes for inference and trainer, plus parallel pod management for faster scaling. #1719, #1718
  • Multi-run Checkpointing: Checkpoint support for multiple concurrent training runs. #1593, #1632
  • RunsManager Refactor: Renamed Runs → RunsManager with hook cleanup, and ability to evict runs with bad batches. #1619, #1634

Breaking Changes

  • vLLM upgraded to 0.14: Upgraded vLLM dependency to version 0.14. This may require updating your environment. Token chat preprocessing has been aligned with vLLM 0.14 behavior. #1625, #1637

  • Liger kernel model deprecated: The Liger kernel model implementation has been deprecated. #1691


Bug Fixes

#1717 — Fix race condition
#1725 — Fix int64 JSON serialization in Chinese character metrics
#1720 — Handle empty completion_temperatures in prepare_sample
#1712 — Use stable checkpoints for orchestrator resume
#1702 — Fix eval watcher only picks up checkpoints in increasing order
#1693 — Fix NCCL update
#1690 — Don't create config dir on trainer during config validation
#1686 — Make NCCL broadcast compatible with DP
#1683 — Fix bug where hosted RL rollouts were missing final message
#1670 — Zombie guard on checkpoint
#1678 — Only master clean weight
#1665 — Fix support for NCCL mode when resuming from checkpoint
#1650 — Fix KL mismatch by resetting prefix cache
#1644 — Fix weight update when enforce_eager=True
#1642 — Use discovery in eval
#1636 — Fix CPU offloading
#1630 — Make search for line more robust
#1612 — Fix timeout overcounting
#1609 — Auto-restart env workers on unexpected death
#1596 — Fix trainer crash when all rollouts in a batch fail
#1613 — Use step change instead of batch size to demarcate when to update

Misc

#1722 — Add AMD Instinct MI300X/MI325X peak FLOPS for MFU calculation
#1724 — Strip @Version suffix from env IDs before loading as Python modules
#1700 — Track Chinese characters
#1677 — Wandb async RL inflight
#1671 — Cancel all rollout eval
#1640 — Add mismatch-KL stability checks for nightly math runs
#1635 — Weights reload configuration
#1638 — Add INFO log when orchestrator resumes after checkpoint wait
#1631 — Ensure eval results upload before existing subprocess
#1629 — Assert when only trainer or orchestrator wandb is configured
#1622 — Add retry with exponential backoff for empty training batches
#1601 — Add health endpoint for worker nodes in multi-node training
#1604 — Check for current step based on progress to know what is valid for this step
[#1543](https://githu...


v0.3.0 release

16 Jan 02:36
32adbba


Highlights

1. Fused LM head / chunking (logits + loss)

We introduced a fused LM head with selective logprobs, significantly decreasing the peak VRAM required by the RL loss function. This is now enabled by default and should greatly reduce the VRAM requirements for RL training.

Example on Qwen/Qwen3-0.6B at 16384 sequence length, where peak VRAM dropped from 44.2 GiB to 3.3 GiB.
With the previous implementation:

                                                            Benchmark                                                            
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃    Step ┃             MFU              ┃           Throughput            ┃            Step Time            ┃   Peak Memory    ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│       1 │            7.85%             │             12.12K              │             21.63s              │     44.2 GiB     │
│       2 │            7.89%             │             12.19K              │             21.39s              │     44.2 GiB     │
│       3 │            7.83%             │             12.10K              │             21.99s              │     44.2 GiB     │
│         │                              │                                 │                                 │                  │
│ Overall │ 7.86% ± 0.03% [7.83%, 7.89%] │ 12.13K ± 46.45 [12.10K, 12.19K] │ 21.67s ± 0.30s [21.39s, 21.99s] │ 44.2 GiB (93.1%) │
└─────────┴──────────────────────────────┴─────────────────────────────────┴─────────────────────────────────┴──────────────────┘

With the fused chunked LM head:

                                                           Benchmark                                                           
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃    Step ┃             MFU              ┃           Throughput            ┃            Step Time            ┃    Peak Memory   ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│       1 │            8.07%             │             12.47K              │             21.02s              │      3.3 GiB     │
│       2 │            8.10%             │             12.51K              │             20.90s              │      3.3 GiB     │
│       3 │            8.14%             │             12.56K              │             20.67s              │      3.3 GiB     │
│         │                              │                                 │                                 │                  │
│ Overall │ 8.10% ± 0.03% [8.07%, 8.14%] │ 12.51K ± 48.19 [12.47K, 12.56K] │ 20.86s ± 0.18s [20.67s, 21.02s] │   3.3 GiB (6.9%) │
└─────────┴──────────────────────────────┴─────────────────────────────────┴─────────────────────────────────┴──────────────────┘

#1525 — Fused LM Head implementation.
#1544 — Default fused_lm_head_chunk_size=2048 for RL.
#1545 — Enable loss chunking for non-custom HF models.

2. On-Policy Distillation

Added on-policy distillation: the student learns from a teacher on the student's own rollouts, so it stays on-policy while still receiving dense, step-by-step guidance. Compared to standard (off-policy) distillation, this reduces the mismatch from teacher-only states and helps the model learn to recover from its own mistakes instead of only imitating perfect trajectories.

Quickstart docs at https://github.com/PrimeIntellect-ai/prime-rl/blob/main/docs/on_policy_distillation.md

#1458 - Add support for On Policy distillation
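The per-position objective can be sketched as follows (illustrative only: the names and the choice of reverse KL are assumptions for the example, not the exact prime-rl implementation). The student samples its own rollout, and at each sampled position the loss compares the student's next-token distribution against the teacher's.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) over one next-token distribution."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def distill_loss(student_dists, teacher_dists):
    # Average the dense per-token guidance over the student's own rollout.
    kls = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(kls) / len(kls)
```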

3. Advanced Multi-LoRA Support

LoRAs are now first-class citizens in prime-rl. This release adds preliminary support for training multiple separate LoRAs from different runs on the same trainer and inference deployment. We also now support training LoRAs for MoE experts.

#1571 — Update LoRA default alpha to 32.
#1567 — Change LoRA alpha default to 32.
#1526 — MoE LoRA support.
#1520 — Retry load_lora_adapter (NFS delays).

4. New model support

We now natively support AFMoE!

#1515 — AFMoE support

5. Trainer observability / metrics

The prime-rl RL trainer can now optionally expose metrics through a Prometheus metrics server.

#1547 — Prometheus metrics server for trainer.

6. Environment Logging Refactor

Logs from a given environment can now be redirected to a dedicated log file, and verifier logger output is intercepted into the prime-rl logging format.

#1594
#1561


Breaking changes

  • Config rename: ckpt.keep → ckpt.keep_last (and new ckpt.keep_interval). Update configs that still set ckpt.keep. (2025-12-31)

  • Behavior change / defaults: MultiLoRAMoE / QwenMoE now enables training expert LoRAs by default via target_modules changes.

  • Behavior change (RL defaults): RL training auto-sets model.fused_lm_head_chunk_size=2048 when unset (except impl='liger_kernel'). This can change memory/throughput characteristics vs v0.2. (2026-01-05)

  • Default change: model.lora.alpha default changed 16.0 → 32.0 (impacts effective LoRA scaling if you relied on the old default). (2026-01-10)


Bug fixes

#1568 — Unique rollout request IDs to avoid collisions.
#1546 — Detect dead worker process in collect_responses.
#1563 — Fix orchestrator null-batch handling.
#1537 — Fix checkpoint cleanup on resume + cancelled rollout metric.
#1531 — Fix NCCL handshake.
#1529 — Fix W&B integration.
#1520 — Retry load_lora_adapter for NFS delays.

Misc

#1551 — TrainingSample reward: adds reward to TrainingSample for logging/consumption in training pipelines.
#1521 — Checkpoint retention policy: adds keep_interval to keep periodic checkpoints in addition to “last N”.
#1536 — Blackwell kernels: enables grouped_mm on Blackwell GPUs.
#1557 — Cumsum dtype: switches multilinear cumsum dtype to int32 (avoids wider dtype overhead).
#1571 — Update LoRA layer default alpha from 16.0 to 32.0.
#1567 — Change LoRA alpha default to 32.
#1538 — Add step param to Monitor.log() interface.
#1518 — Refactor online eval.
#1533 — README updates.
#1550 — Remove PR template.
#1522 — Docs/changelog entry for ckpt.keep_last + ckpt.keep_interval.
#1506 — Inference readiness handshake (later reverted).
#1528 — Revert inference readiness handshake from #1506.
#1516 — Remove step usage in W&B monitor (later reverted).
#1530 — Revert W&B monitor “step” removal from #1516.
#1580 — add fa3 dependency

v0.2

31 Dec 08:16
017b3fe


Second major release of prime-rl.

This release includes the major redesign of the library that was used to train INTELLECT-3.

prime-rl is entering a more stable phase: we have validated most of our design at scale and believe it is maintainable in the long run. We have also adopted nightly tests that run a diverse set of training runs for many hours, covering single-turn, multi-turn, and agentic workflows. This lets us catch regressions in performance or convergence.

prime-rl will now adopt a regular release schedule.

v0.1

11 Jul 19:48
8c44bb8


Pre-v1 refactor release. The first and last release before the v1 betas.