Changes from all commits
89 commits
61fece9
feat(phase7): LLaDA2.0 Real Model with MoE + Block Attention + GPU Tests
AlonKellner-RedHat May 5, 2026
5d1bf7e
fix: add trust_remote_code=True for LLaDA2.0 model loading
AlonKellner-RedHat May 5, 2026
26d7e94
fix: register HuggingFace architecture name LLaDA2MoeModelLM
AlonKellner-RedHat May 5, 2026
bfa7174
fix: correct AttentionMetadata import path for vLLM compatibility
AlonKellner-RedHat May 5, 2026
9549ab5
fix: add supported_runners to LLaDA2ForCausalLM
AlonKellner-RedHat May 5, 2026
be20bda
fix: remove trust_remote_code from vLLM model loading
AlonKellner-RedHat May 5, 2026
511dad2
fix: add trust_remote_code=True for LLaDA2.0 config loading
AlonKellner-RedHat May 5, 2026
12383c7
fix: set VLLM_PLUGINS env var before importing vllm
AlonKellner-RedHat May 5, 2026
a9b69a4
fix: disable trust_remote_code to use ModelRegistry model
AlonKellner-RedHat May 5, 2026
9531737
feat: add local LLaDA2.0-mini fixture without auto_map
AlonKellner-RedHat May 5, 2026
3ea5db7
fix: use model_type='mistral' for Transformers compatibility
AlonKellner-RedHat May 5, 2026
a3ec6ef
fix: explicitly call register_dllm() in test module
AlonKellner-RedHat May 5, 2026
3088dac
debug: add print statements to trace register_dllm() execution
AlonKellner-RedHat May 5, 2026
3312ec3
fix: remove model_type from config to force architectures lookup
AlonKellner-RedHat May 5, 2026
675f986
fix: add model_type='llama' for config loading, rely on architectures…
AlonKellner-RedHat May 5, 2026
a7d9514
workaround: monkeypatch ModelConfig validation for LLaDA2 architectures
AlonKellner-RedHat May 5, 2026
8c683d9
fix: monkeypatch signature and logic for ModelConfig validation bypass
AlonKellner-RedHat May 5, 2026
b1abcc2
debug: add diagnostics to monkeypatch to see why it's not executing
AlonKellner-RedHat May 5, 2026
9f3b643
fix: check model path instead of architectures field in monkeypatch
AlonKellner-RedHat May 5, 2026
431f144
fix: call __post_init__ but catch and suppress validation error
AlonKellner-RedHat May 5, 2026
17fa358
try: use model_impl parameter to bypass validation and force plugin m…
AlonKellner-RedHat May 5, 2026
ea20315
fix: patch _verify_runner_supported instead of __post_init__ to compl…
AlonKellner-RedHat May 5, 2026
48653a5
try: remove monkeypatch, rely only on model_impl parameter
AlonKellner-RedHat May 5, 2026
943a356
try: use trust_remote_code=True with model_impl parameter
AlonKellner-RedHat May 5, 2026
b650d4b
fix: add embed_input_ids method to satisfy VllmModel protocol
AlonKellner-RedHat May 5, 2026
e270bbf
fix: accept LLADA2_HF_ARCHITECTURE_NAME in stack validation
AlonKellner-RedHat May 5, 2026
031719d
fix: update attention layer API for vLLM 0.20.0
AlonKellner-RedHat May 5, 2026
c07050d
fix: download real LLaDA2.0-mini weights from HuggingFace
AlonKellner-RedHat May 5, 2026
630bdcd
feat: add Phase 7 A100 GPU test Helm values
AlonKellner-RedHat May 5, 2026
0fec484
fix: add A100-40 node toleration for Phase 7 GPU test
AlonKellner-RedHat May 5, 2026
2771ecb
fix: implement weight tying for lm_head in LLaDA2.0 model
AlonKellner-RedHat May 5, 2026
a79e5d5
fix: implement proper weight tying for LLaDA2.0 lm_head
AlonKellner-RedHat May 5, 2026
a838a4c
fix: handle HuggingFace checkpoint naming in LLaDA2.0 weight loading
AlonKellner-RedHat May 5, 2026
6bd1f23
fix: correct loaded_params tracking in expert weight collection
AlonKellner-RedHat May 5, 2026
5162b4d
debug: add detailed weight loading diagnostics
AlonKellner-RedHat May 5, 2026
01471f0
fix: implement expert weight stacking for FusedMoE
AlonKellner-RedHat May 5, 2026
5e0cf41
fix: add num_experts attribute to LLaDA2ForCausalLM
AlonKellner-RedHat May 5, 2026
32339da
fix: add weight name mappings for attention and shared expert
AlonKellner-RedHat May 5, 2026
391fc34
fix: handle plural shared_experts in checkpoint naming
AlonKellner-RedHat May 5, 2026
472f946
feat: add QKV projection and output layers to attention
AlonKellner-RedHat May 5, 2026
11ff410
fix: map mlp.gate.expert_bias to mlp.gate.bias
AlonKellner-RedHat May 5, 2026
432f414
fix: load expert weights into existing Parameter.data
AlonKellner-RedHat May 5, 2026
b621a5c
debug: add logging for expert weight stacking
AlonKellner-RedHat May 5, 2026
7136786
debug: check if FusedMoE creates w13/w2 Parameters
AlonKellner-RedHat May 5, 2026
4e21a51
fix: use weight_loader for stacked expert weights
AlonKellner-RedHat May 5, 2026
83fe8bd
fix: use .data.copy_() for stacked expert weights
AlonKellner-RedHat May 5, 2026
6b9b372
fix: return loaded_params instead of unloaded in load_weights
AlonKellner-RedHat May 5, 2026
a1bc587
debug: add layer 0 MLP checkpoint names output
AlonKellner-RedHat May 5, 2026
1cfa29c
feat: support dense-only layers (first_k_dense_replace)
AlonKellner-RedHat May 5, 2026
aeb19c0
fix: make forward() args optional for profiling
AlonKellner-RedHat May 5, 2026
c6cc893
fix: make compute_logits sampling_metadata optional
AlonKellner-RedHat May 5, 2026
252cfb7
fix: pass lm_head module instead of weight to LogitsProcessor
AlonKellner-RedHat May 5, 2026
de55019
fix: pass positions to attention and create default if None
AlonKellner-RedHat May 5, 2026
fd9aab6
debug: add shape and NaN/Inf validation in forward pass
AlonKellner-RedHat May 5, 2026
9b36f46
debug: add detailed logging in compute_logits
AlonKellner-RedHat May 5, 2026
58db841
debug: enable CUDA_LAUNCH_BLOCKING for synchronous error reporting
AlonKellner-RedHat May 5, 2026
1417a3d
debug: add expert weight shape validation for Triton MoE kernel
AlonKellner-RedHat May 5, 2026
f5be9a7
debug: add logits shape and vocab_size logging
AlonKellner-RedHat May 5, 2026
30ea409
debug: add extensive logging before combine_sampled_and_draft_tokens
AlonKellner-RedHat May 5, 2026
2066014
debug: add detailed draft tokens logging
AlonKellner-RedHat May 5, 2026
573a7da
fix: populate req_states.draft_tokens from scheduler output
AlonKellner-RedHat May 5, 2026
7c1545c
fix: add LLaDA2MoeModelLM to dllm_architecture_match
AlonKellner-RedHat May 5, 2026
4e36a68
fix: extract only draft token logits for dLLM remasking
AlonKellner-RedHat May 5, 2026
6392de4
fix: add num_sampled argument to SamplerOutput
AlonKellner-RedHat May 5, 2026
c7ca86f
chore: remove debug logging from Phase 7 implementation
AlonKellner-RedHat May 5, 2026
54f8176
feat: add LLaDA2.0 benchmark tests with TPS, TTFT, ITL, E2E metrics
AlonKellner-RedHat May 6, 2026
e9144dc
feat: add GuideLLM benchmark test for LLaDA2.0-mini
AlonKellner-RedHat May 6, 2026
5b8f13e
fix: benchmark test trust_remote_code and async-scheduling args
AlonKellner-RedHat May 6, 2026
3efb0c9
fix: correct GuideLLM CLI command structure
AlonKellner-RedHat May 6, 2026
9042a9e
fix: use correct GuideLLM synthetic data format
AlonKellner-RedHat May 6, 2026
9956fe7
fix: add processor argument for GuideLLM synthetic data
AlonKellner-RedHat May 6, 2026
20ffcb6
fix(scheduler): skip spec decode metrics to prevent assertion failures
AlonKellner-RedHat May 6, 2026
102c248
feat(phase8): add GPU capability detection and torch.compile optimiza…
AlonKellner-RedHat May 6, 2026
76e4e1f
feat(phase8): use vLLM-native torch.compile via @support_torch_compile
AlonKellner-RedHat May 6, 2026
8c92dc5
chore: clean up root directory and document Phase 8 benchmarks
AlonKellner-RedHat May 6, 2026
74d110b
fix: address code review feedback (P0-P1 issues)
AlonKellner-RedHat May 6, 2026
182ba3c
feat(phase7): implement virtual batch pattern for block-style attenti…
AlonKellner-RedHat May 6, 2026
a621f51
fix: resolve linting issues in virtual_batches.py
AlonKellner-RedHat May 7, 2026
302de1c
feat(phase7): wire num_prefix_tokens through call stack and activate …
AlonKellner-RedHat May 7, 2026
836a25c
feat(phase7): extract num_prefix_tokens from scheduler to model runner
AlonKellner-RedHat May 7, 2026
92ee989
feat(phase7): override _model_forward to inject num_prefix_tokens
AlonKellner-RedHat May 7, 2026
44c06c2
feat(scripts): add LLaDA2 server and benchmark scripts
AlonKellner-RedHat May 7, 2026
a22975e
fix(phase7): address PR #38 code review feedback (P0/P1 issues)
AlonKellner-RedHat May 7, 2026
06f36da
feat(phase7): query KV cache block size from cache_config
AlonKellner-RedHat May 7, 2026
aefaeec
fix(typing): resolve all 27 typing diagnostics in dllm-plugin
AlonKellner-RedHat May 7, 2026
c801f68
feat: vLLM type safety improvements for Phase 7
AlonKellner-RedHat May 7, 2026
dcf7fe1
feat: add vLLM server pod manifest for testing
AlonKellner-RedHat May 7, 2026
315a3b2
fix(phase7): address P0 blocking issues from PR #38 review
AlonKellner-RedHat May 7, 2026
76f6cd7
docs(phase8): add P0-2 A/B benchmark results - torch.compile neutral
AlonKellner-RedHat May 7, 2026
9 changes: 9 additions & 0 deletions .gitignore
@@ -25,3 +25,12 @@ htmlcov/
.vscode/
*.swp
.DS_Store

# Benchmarks (results should not be tracked)
benchmarks/
benchmarks.csv
benchmarks.json

# Temporary values files
*-values.yaml
values.yaml
91 changes: 76 additions & 15 deletions dllm_plugin/__init__.py
@@ -51,13 +51,18 @@ def __getattr__(name: str):
def register_dllm() -> None:
"""Entry point for ``vllm.general_plugins`` (``dllm``).

When ``vllm`` is importable, registers **two** architecture names with
``ModelRegistry``, both pointing at the same **mock** implementation for
Phases 2–6 stack testing (issues #5 and #24):
**Phase 7 update:** Registers architecture names with ``ModelRegistry``:

* :data:`~dllm_plugin.config.LLADA2_ARCHITECTURE_NAME` — placeholder
until the real HF-mapped module ships (issue #12 / Phase 7).
* :data:`~dllm_plugin.config.DLLM_MOCK_STACK_MODEL_ID` — explicit test id.
* :data:`~dllm_plugin.config.LLADA2_ARCHITECTURE_NAME` — Production LLaDA2.0
model with MoE and block-style attention (Phase 7 / issue #12). Use
``VLLM_DLLM_USE_MOCK_MODEL=1`` to override with mock for testing.
* :data:`~dllm_plugin.config.DLLM_MOCK_STACK_MODEL_ID` — Explicit mock model
for Phases 2–6 stack testing (always uses mock implementation).

Environment variables:
* ``VLLM_DLLM_USE_MOCK_MODEL``: If set to ``1``/``true``/``yes``/``on``,
registers LLADA2_ARCHITECTURE_NAME to the mock model instead of real model.
Useful for testing Phases 2-6 behavior with real model disabled.

Uses lazy ``"<module>:<Class>"`` registration so importing this package does
not pull ``torch``/CUDA until the model class is needed.
@@ -74,11 +74,16 @@ def register_dllm() -> None:
helper (no-op), not by omitting the call—so with both envs set, ``apply_*``
still runs and returns without patching.
"""
_logger.debug("dLLM plugin: register_dllm() called")

if importlib.util.find_spec("vllm") is None:
_logger.debug("dLLM plugin: vllm not found, skipping registration")
return

try:
from vllm import ModelRegistry

_logger.debug("dLLM plugin: ModelRegistry imported successfully")
except ImportError:
_logger.debug(
"vllm-dllm-plugin (dllm): vLLM spec found but import failed; "
Expand All @@ -91,22 +101,73 @@ def register_dllm() -> None:
DLLM_MOCK_MODEL_CLASS_FQCN,
DLLM_MOCK_STACK_MODEL_ID,
LLADA2_ARCHITECTURE_NAME,
LLADA2_HF_ARCHITECTURE_NAME,
LLADA2_REAL_MODEL_CLASS_FQCN,
)

# Determine which model to use for LLADA2_ARCHITECTURE_NAME
use_mock_raw = os.environ.get("VLLM_DLLM_USE_MOCK_MODEL", "").strip().lower()
use_mock_model = use_mock_raw in {"1", "true", "yes", "on"}

if use_mock_model:
llada2_model_class = DLLM_MOCK_MODEL_CLASS_FQCN
_logger.info(
"dLLM plugin: VLLM_DLLM_USE_MOCK_MODEL=1, using mock model for %s",
LLADA2_ARCHITECTURE_NAME,
)
else:
llada2_model_class = LLADA2_REAL_MODEL_CLASS_FQCN
_logger.info(
"dLLM plugin: Using real LLaDA2.0 model for %s (Phase 7)",
LLADA2_ARCHITECTURE_NAME,
)

supported = ModelRegistry.get_supported_archs()
for arch in (LLADA2_ARCHITECTURE_NAME, DLLM_MOCK_STACK_MODEL_ID):
if arch in supported:
_logger.debug(
"dLLM plugin: architecture %r already registered, skipping",
arch,
)
continue
ModelRegistry.register_model(arch, DLLM_MOCK_MODEL_CLASS_FQCN)

# Register LLADA2_ARCHITECTURE_NAME (real or mock based on env var)
if LLADA2_ARCHITECTURE_NAME not in supported:
ModelRegistry.register_model(LLADA2_ARCHITECTURE_NAME, llada2_model_class)
_logger.debug(
"dLLM plugin: registered architecture %r -> %s",
LLADA2_ARCHITECTURE_NAME,
llada2_model_class,
)
else:
_logger.debug(
"dLLM plugin: architecture %r already registered, skipping",
LLADA2_ARCHITECTURE_NAME,
)

# Register LLADA2_HF_ARCHITECTURE_NAME (HuggingFace naming convention)
# Points to same implementation as LLADA2_ARCHITECTURE_NAME
if LLADA2_HF_ARCHITECTURE_NAME not in supported:
ModelRegistry.register_model(LLADA2_HF_ARCHITECTURE_NAME, llada2_model_class)
_logger.debug(
"dLLM plugin: registered HF architecture %r -> %s",
LLADA2_HF_ARCHITECTURE_NAME,
llada2_model_class,
)
else:
_logger.debug(
"dLLM plugin: HF architecture %r already registered, skipping",
LLADA2_HF_ARCHITECTURE_NAME,
)

# Register DLLM_MOCK_STACK_MODEL_ID (always mock)
if DLLM_MOCK_STACK_MODEL_ID not in supported:
ModelRegistry.register_model(
DLLM_MOCK_STACK_MODEL_ID, DLLM_MOCK_MODEL_CLASS_FQCN
)
_logger.debug(
"dLLM plugin: registered architecture %r -> %s",
arch,
DLLM_MOCK_STACK_MODEL_ID,
DLLM_MOCK_MODEL_CLASS_FQCN,
)
else:
_logger.debug(
"dLLM plugin: architecture %r already registered, skipping",
DLLM_MOCK_STACK_MODEL_ID,
)

from dllm_plugin.config import DLLM_APPLY_ENGINE_CORE_DRAFT_HOOK_ENV_VAR

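Below is a minimal, hedged usage sketch of the new toggle (not part of the diff). It assumes vLLM is installed and that the flag is exported before `register_dllm()` runs, since the value is read at registration time.

```python
# Sketch: force the mock implementation for stack-style testing.
# Assumes vllm is importable; the flag is read when register_dllm() executes.
import os

os.environ["VLLM_DLLM_USE_MOCK_MODEL"] = "1"  # accepted values: 1/true/yes/on

from dllm_plugin import register_dllm

register_dllm()  # idempotent: already-registered architectures are skipped

from vllm import ModelRegistry

# Both "LLaDA2ForCausalLM" and "LLaDA2MoeModelLM" now resolve to the mock class;
# leave the variable unset (or empty) to get the real Phase 7 model instead.
print("LLaDA2ForCausalLM" in ModelRegistry.get_supported_archs())  # True
```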
Empty file.
135 changes: 135 additions & 0 deletions dllm_plugin/attention/virtual_batches.py
@@ -0,0 +1,135 @@
"""Virtual batch decomposition for block-style attention.

Following vLLM's chunked_local_attention pattern, this module transforms
CommonAttentionMetadata into virtual batches for the prefix and block attention chunks.

Reference: vllm/model_executor/layers/attention/chunked_local_attention.py
"""

from __future__ import annotations

import torch

# vLLM imports (centralized in vllm_compat for version handling)
from dllm_plugin.vllm_compat import CommonAttentionMetadata


def make_block_attention_virtual_batches(
attn_metadata: CommonAttentionMetadata,
num_prefix_tokens: int,
block_size: int,
kv_cache_block_size: int = 16,
) -> tuple[CommonAttentionMetadata | None, CommonAttentionMetadata]:
"""Transform metadata for block-style dual-chunk attention.

Creates two virtual batches per request:
1. Prefix chunk: Q=current_block (block_size tokens), KV=prefix (num_prefix_tokens)
2. Block chunk: Q=current_block (block_size tokens), KV=current_block (block_size)

Each virtual batch gets its own:
- seq_lens: Length of KV for that chunk
- block_table: KV cache pages accessible to that chunk
- query_start_loc: Position offsets in the query tensor

Args:
attn_metadata: Original CommonAttentionMetadata from vLLM
num_prefix_tokens: Number of committed tokens (prefix length)
block_size: Size of current generation block (typically 32)
kv_cache_block_size: KV cache block size (default 16, should be
queried from cache_config in future)

Returns:
(prefix_metadata, block_metadata): Transformed metadata for each chunk
prefix_metadata is None if num_prefix_tokens == 0

Raises:
NotImplementedError: If num_reqs > 1 (multi-request batching not yet supported)
"""
device = attn_metadata.query_start_loc.device
num_reqs = attn_metadata.num_reqs
total_query_tokens = attn_metadata.num_actual_tokens

# MVP limitation: Only single-request batches supported
# Multi-request batching with heterogeneous prefix lengths requires
# per-request virtual batch transformation (deferred to Phase 7.1)
if num_reqs > 1:
raise NotImplementedError(
"LLaDA2.0 virtual batch attention does not support multi-request "
"batching in this release (MVP Phase 7). Use max_num_seqs=1 or "
"wait for Phase 7.1 update. See docs/OPERATOR_LLaDA2.md for details."
)

# Edge case: First block (no prefix)
if num_prefix_tokens == 0:
# Only block self-attention, no prefix chunk
block_metadata = CommonAttentionMetadata(
query_start_loc=attn_metadata.query_start_loc,
query_start_loc_cpu=attn_metadata.query_start_loc_cpu,
seq_lens=torch.full(
(num_reqs,), block_size, dtype=torch.int32, device=device
),
num_reqs=num_reqs,
num_actual_tokens=total_query_tokens,
max_query_len=block_size,
max_seq_len=block_size,
block_table_tensor=attn_metadata.block_table_tensor,
slot_mapping=attn_metadata.slot_mapping,
causal=False, # Non-causal (bidirectional within block)
)
return None, block_metadata

# Calculate how many KV cache pages we need for prefix and block
# Assuming block_table has shape [num_reqs, max_num_blocks_per_seq]
# We need to slice it to get only the pages for prefix vs block

# Calculate blocks needed for prefix using configured KV cache block size
num_prefix_blocks = (
num_prefix_tokens + kv_cache_block_size - 1
) // kv_cache_block_size

# Slice block_table for each chunk
prefix_block_table = attn_metadata.block_table_tensor[:, :num_prefix_blocks]
block_start_idx = num_prefix_blocks
num_block_blocks = (block_size + kv_cache_block_size - 1) // kv_cache_block_size
block_end_idx = block_start_idx + num_block_blocks
block_block_table = attn_metadata.block_table_tensor[
:, block_start_idx:block_end_idx
]

# --- Virtual Batch 1: Prefix chunk ---
# Query: current block (block_size tokens)
# KV: prefix (num_prefix_tokens)

prefix_metadata = CommonAttentionMetadata(
query_start_loc=attn_metadata.query_start_loc,
query_start_loc_cpu=attn_metadata.query_start_loc_cpu,
seq_lens=torch.full(
(num_reqs,), num_prefix_tokens, dtype=torch.int32, device=device
),
num_reqs=num_reqs,
num_actual_tokens=total_query_tokens,
max_query_len=block_size,
max_seq_len=num_prefix_tokens,
block_table_tensor=prefix_block_table,
slot_mapping=attn_metadata.slot_mapping, # May need adjustment
causal=False, # Non-causal (all queries attend to all prefix keys)
)

# --- Virtual Batch 2: Block chunk ---
# Query: current block (block_size tokens)
# KV: current block (block_size tokens)

block_metadata = CommonAttentionMetadata(
query_start_loc=attn_metadata.query_start_loc,
query_start_loc_cpu=attn_metadata.query_start_loc_cpu,
seq_lens=torch.full((num_reqs,), block_size, dtype=torch.int32, device=device),
num_reqs=num_reqs,
num_actual_tokens=total_query_tokens,
max_query_len=block_size,
max_seq_len=block_size,
block_table_tensor=block_block_table,
slot_mapping=attn_metadata.slot_mapping, # May need adjustment
causal=False, # Non-causal (bidirectional within block)
)

return prefix_metadata, block_metadata
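For reference, a hedged single-request sketch of calling the helper above (not part of the diff; the `CommonAttentionMetadata` keyword arguments mirror the ones this file itself uses, and the tensor values are illustrative only):

```python
# Illustrative single-request example (shapes only; values are made up).
import torch

from dllm_plugin.attention.virtual_batches import make_block_attention_virtual_batches
from dllm_plugin.vllm_compat import CommonAttentionMetadata

block_size = 32            # current generation block
num_prefix_tokens = 64     # committed prefix tokens
kv_block = 16              # KV cache page size

qsl = torch.tensor([0, block_size], dtype=torch.int32)
meta = CommonAttentionMetadata(
    query_start_loc=qsl,
    query_start_loc_cpu=qsl,
    seq_lens=torch.tensor([num_prefix_tokens + block_size], dtype=torch.int32),
    num_reqs=1,
    num_actual_tokens=block_size,
    max_query_len=block_size,
    max_seq_len=num_prefix_tokens + block_size,
    # 4 pages hold the 64-token prefix, 2 more hold the 32-token block.
    block_table_tensor=torch.arange(6, dtype=torch.int32).unsqueeze(0),
    slot_mapping=torch.arange(num_prefix_tokens, num_prefix_tokens + block_size),
    causal=False,
)

prefix_meta, block_meta = make_block_attention_virtual_batches(
    meta,
    num_prefix_tokens=num_prefix_tokens,
    block_size=block_size,
    kv_cache_block_size=kv_block,
)
# prefix_meta: Q = current block, KV = 64-token prefix (pages 0..3)
# block_meta:  Q = current block, KV = current block    (pages 4..5)
```

With `num_prefix_tokens=0` the prefix chunk is skipped entirely and only the block metadata is returned, as in the first-block edge case above.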
35 changes: 35 additions & 0 deletions dllm_plugin/config.py
@@ -42,6 +42,10 @@ def _read_draft_size() -> int:
#: Exact registry string may be refined when ``register()`` lands (issue #5).
LLADA2_ARCHITECTURE_NAME: Final[str] = "LLaDA2ForCausalLM"

#: HuggingFace architecture name used by inclusionAI/LLaDA2.0-mini model config.
#: Registered alongside LLADA2_ARCHITECTURE_NAME to support both naming conventions.
LLADA2_HF_ARCHITECTURE_NAME: Final[str] = "LLaDA2MoeModelLM"

#: Registered model id for the **mock / stub** forward used in Phases 2-6 stack
#: testing (deterministic outputs; see milestone issue #24).
DLLM_MOCK_STACK_MODEL_ID: Final[str] = "DllmMockLlada2StackTest"
@@ -107,3 +107,34 @@ def resolve_strict_stack_validation(explicit: bool | None) -> bool:
#: (zeros + ``1.0`` at index ``0``, ``docs/MOCK_STACK_MODEL.md``) commit under
#: default settings for stack tests.
LLADA2_DEFAULT_COMMIT_CONFIDENCE_THRESHOLD: Final[float] = 0.01

# Phase 7: Real LLaDA2.0 model configuration (issue #12)

#: Lazy import target for real LLaDA2.0 vLLM model (``<module>:<Class>``).
#: Phase 7 adds production-ready model with MoE weight loading and
#: block-style attention.
LLADA2_REAL_MODEL_CLASS_FQCN: Final[str] = "dllm_plugin.models.llada2:LLaDA2ForCausalLM"

#: Default number of experts per MoE layer (from HuggingFace LLaDA2.0 config).
LLADA2_DEFAULT_NUM_EXPERTS: Final[int] = 256

#: Default number of experts activated per token (top-k routing).
LLADA2_DEFAULT_NUM_EXPERTS_PER_TOK: Final[int] = 8

#: Default number of shared experts (always active, not routed).
LLADA2_DEFAULT_NUM_SHARED_EXPERTS: Final[int] = 1

#: Default MoE intermediate size (FFN hidden dimension per expert).
LLADA2_DEFAULT_MOE_INTERMEDIATE_SIZE: Final[int] = 512

#: Default number of expert groups for group-limited routing.
#: LLaDA2.0 uses 8 groups for two-stage expert selection.
LLADA2_DEFAULT_N_GROUP: Final[int] = 8

#: Default number of groups to select in group-limited routing.
#: First selects top-4 groups from 8, then top-k experts from selected groups.
LLADA2_DEFAULT_TOPK_GROUP: Final[int] = 4

#: Default scaling factor applied to routed expert output.
#: LLaDA2.0 uses 2.5x scaling on routed experts before adding shared expert output.
LLADA2_DEFAULT_ROUTED_SCALING_FACTOR: Final[float] = 2.5
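Read together, the routing constants above describe a two-stage selection: score 8 groups, keep the top 4, then pick the top 8 experts within them. A hedged reference sketch follows (not part of the diff; the real model routes through vLLM's FusedMoE, and the exact scoring/normalization here is an assumption):

```python
# Reference sketch of two-stage, group-limited routing:
# 256 experts split into 8 groups -> keep the top-4 groups -> pick top-8 experts.
import torch


def group_limited_topk(
    router_logits: torch.Tensor,  # [num_tokens, 256]
    n_group: int = 8,             # LLADA2_DEFAULT_N_GROUP
    topk_group: int = 4,          # LLADA2_DEFAULT_TOPK_GROUP
    top_k: int = 8,               # LLADA2_DEFAULT_NUM_EXPERTS_PER_TOK
) -> tuple[torch.Tensor, torch.Tensor]:
    num_tokens, num_experts = router_logits.shape
    per_group = num_experts // n_group
    scores = router_logits.softmax(dim=-1)

    # Stage 1: score each group by its best expert and keep the top groups.
    group_scores = scores.view(num_tokens, n_group, per_group).amax(dim=-1)
    top_groups = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, top_groups, 1.0)

    # Stage 2: pick the top-k experts, restricted to the selected groups.
    masked = (scores.view(num_tokens, n_group, per_group) * group_mask.unsqueeze(-1)).flatten(1)
    weights, expert_ids = masked.topk(top_k, dim=-1)
    return weights, expert_ids


weights, expert_ids = group_limited_topk(torch.randn(2, 256))
# Routed expert output is then scaled by 2.5 (LLADA2_DEFAULT_ROUTED_SCALING_FACTOR)
# before the shared expert's output is added.
```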