Skip to content

[misc] fix: synthesize zero routed_experts for empty trajectories under rout…#57

Merged
yyDing1 merged 3 commits into
mainfrom
fix/routed-experts-empty-output
Jun 7, 2026
Merged

[misc] fix: synthesize zero routed_experts for empty trajectories under rout…#57
yyDing1 merged 3 commits into
mainfrom
fix/routed-experts-empty-output

Conversation

@yyDing1

@yyDing1 yyDing1 commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator

When enable_rollout_routing_replay (R3) is on, successful trajectories carry a routed_experts tensor while failed/empty ones returned None. verl's _postprocess decides inclusion from inputs[0] only and torch.cat's the whole batch, so a mixed None/tensor batch either crashes or silently drops replay data. Emit a shape-consistent zero routed_experts (num_hidden_layers, num_experts_per_tok, int64) for empty outputs; the row is traj_masked so zeros never reach the loss.

Also add examples/agent_train/train_qwen3_moe_rc.sh: Qwen3 MoE decoupled-PPO + rollout-correction training script.

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs or issues and paste at least one relevant link here: ...
  • Format the PR title as [{modules}] {type}: {description} (checked by CI)
    • {modules} may include core, interaction, model, env, tools, deployment, reward, dashboard, docs, examples, data, train, ci, build, deps, misc
    • If this PR involves multiple modules, separate them with , like [interaction, tools, docs]
    • {type} must be one of feat, fix, refactor, chore, test
    • If this PR breaks an API, config contract, workflow, or other compatibility boundary, add [BREAKING] to the beginning of the title
    • For a stacked PR series, you may prepend a progress marker such as [1/N]
    • Example: [BREAKING][deployment, docs] feat: simplify runtime env configuration

Test

List the checks you ran. If CI coverage is not practical for this change, describe the manual validation or experiment results.

API and Usage Example

Show any public interface changes or updated usage examples if relevant.

# Add a short example here when the PR changes public behavior

Design & Code Changes

Summarize the approach for non-trivial changes and call out important implementation details or trade-offs.

Checklist Before Submitting

  • Read the Contribute Guide
  • Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add or update docs/examples for user-facing changes
  • Add tests or explain why tests are not practical
  • Confirm the PR title matches the required format
  • Confirm the placeholder text in this template has been replaced with real content

yyDing1 and others added 2 commits June 8, 2026 04:05
…er replay

When enable_rollout_routing_replay (R3) is on, successful trajectories carry a
routed_experts tensor while failed/empty ones returned None. verl's _postprocess
decides inclusion from inputs[0] only and torch.cat's the whole batch, so a mixed
None/tensor batch either crashes or silently drops replay data. Emit a
shape-consistent zero routed_experts (num_hidden_layers, num_experts_per_tok,
int64) for empty outputs; the row is traj_masked so zeros never reach the loss.

Also add examples/agent_train/train_qwen3_moe_rc.sh: Qwen3 MoE decoupled-PPO +
rollout-correction training script.

Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a training script for Qwen3 MoE with rollout correction and updates the agent loop to support router replay (R3) by synthesizing zero-filled routed experts for failed trajectories. It also includes the reward score in the saved interaction results. The feedback highlights a critical performance issue where a blocking synchronous I/O call (AutoConfig.from_pretrained) is executed inside the asyncio event loop, which should be offloaded to a separate thread. Additionally, there are redundant and conflicting variable definitions for RAY_DATA_HOME, NNODES_ROLLOUT, and NNODES_TRAIN in the training script that need to be cleaned up.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

I am having trouble creating individual review comments. Click here to see my feedback.

uni_agent/agent_loop.py (169-210)

high

The method _get_routing_replay_shape performs a synchronous, blocking network/disk I/O call (AutoConfig.from_pretrained) inside the asyncio event loop. This blocks the entire event loop thread, which can cause timeouts, heartbeat failures, or severe performance degradation for other concurrent agent loops running on the same worker process.

To resolve this, make both _synth_failed_routed_experts and _get_routing_replay_shape asynchronous, and run the blocking AutoConfig.from_pretrained call in a separate thread using asyncio.to_thread.

            routed_experts=await self._synth_failed_routed_experts(dummy_response_length),
            multi_modal_data={},
            reward_score=0,
            num_turns=0,
            metrics={},
            extra_fields=extra_fields,
        )

    async def _synth_failed_routed_experts(self, length: int) -> np.ndarray | None:
        """Synthesize a zero ``routed_experts`` of shape ``(length, num_layers, top_k)``."""
        shape = await self._get_routing_replay_shape()
        if shape is None:
            return None
        num_layers, top_k = shape
        return np.zeros((length, num_layers, top_k), dtype=np.int64)

    async def _get_routing_replay_shape(self) -> tuple[int, int] | None:
        """Resolve and cache ``(num_hidden_layers, num_experts_per_tok)`` for the rollout
        model. Returns ``None`` if rollout routing replay is off or the model has no
        experts. The HF config is loaded at most once per worker process."""
        rollout_cfg = self.config.actor_rollout_ref.rollout
        if not bool(getattr(rollout_cfg, "enable_rollout_routing_replay", False)):
            return None
        cls = UniAgentLoop
        if not cls._routing_replay_resolved:
            from transformers import AutoConfig

            model_path = self.config.actor_rollout_ref.model.path
            model_cfg = await asyncio.to_thread(
                AutoConfig.from_pretrained, model_path, trust_remote_code=True
            )
            # Newer Qwen3 nests MoE fields under ``text_config``; older configs keep them
            # at the top level. ``... or 0`` guards against fields explicitly set to None.
            text_cfg = getattr(model_cfg, "text_config", None) or model_cfg
            num_layers = int(getattr(text_cfg, "num_hidden_layers", 0) or 0) or int(
                getattr(model_cfg, "num_hidden_layers", 0) or 0
            )
            top_k = int(getattr(text_cfg, "num_experts_per_tok", 0) or 0) or int(
                getattr(model_cfg, "num_experts_per_tok", 0) or 0
            )
            cls._routing_replay_shape = (num_layers, top_k) if num_layers > 0 and top_k > 0 else None
            cls._routing_replay_resolved = True
            self.logger.info(f"routed_experts replay shape resolved: {cls._routing_replay_shape}")
        return cls._routing_replay_shape

examples/agent_train/train_qwen3_moe_rc.sh (4-12)

medium

The variable RAY_DATA_HOME is unconditionally assigned on line 4 (RAY_DATA_HOME=/mnt/hdfs/yyding), which makes the conditional assignment on line 12 (RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}) dead code. If you want to allow overriding RAY_DATA_HOME via environment variables, you should use conditional assignment (${RAY_DATA_HOME:-...}) from the start.

RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
NNODES_ROLLOUT=${NNODES_ROLLOUT:-16}
NNODES_TRAIN=${NNODES_TRAIN:-4}
GEN_TP=${GEN_TP:-2}

project_name=${PROJECT_NAME:-'Uni-Agent-Qwen3-Coder-30B-megatron'}
exp_name=${EXP_NAME:-"$(date +%Y%m%d%H)_exp"}

examples/agent_train/train_qwen3_moe_rc.sh (73-75)

medium

The variables NNODES_ROLLOUT and NNODES_TRAIN are already unconditionally defined at the top of the script. These duplicate conditional assignments are redundant and can be removed to avoid confusion about the default values (e.g., 16 vs 12 for NNODES_ROLLOUT).

NGPUS_PER_NODE=${NGPUS_PER_NODE:-8}

@yyDing1 yyDing1 merged commit 75788ab into main Jun 7, 2026
3 checks passed
@yyDing1 yyDing1 deleted the fix/routed-experts-empty-output branch June 7, 2026 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant