Summary
This issue is the orchestration source of truth for MVP LLaDA2.0.
It now includes:
phase gates (entry/exit criteria),
explicit dependency semantics (HARD vs SOFT),
an issue-by-issue deep dive in phase order,
PR sequencing and review contract,
milestone definition of done,
mock-first MVP (Phases 0–6): real LLaDA2 HF mapping (#12) and attention work (#11) are deferred to Phase 7; the mock stack uses #24 only (no attention spike in Phase 2); real-model integration is #25 (nothing in Phases 2–6 depends on #25),
GPU Model Runner v2 expectations (#10, VLLM_USE_V2_MODEL_RUNNER) and worker–runner risk mitigations.
Scope reference: DESIGN_MVP.md, ROADMAP.md, upstream vllm#36155.
Delivery status (mock-stack train)
2026-04-28: PR #30 merged. It closed #8 — Phase 4 scheduler/runtime path (initial decode wiring).
2026-04-29: PR #33 merged (squash 41042b4768dba20ad6dc1940b4c5ee42dee1408e). It closed #4 , #14 , #16 , #17 , #31 , #32 — strict validation, operator doc, unit/integration confidence, logits/remask policy follow-up, and CPU EngineArgs smoke with concrete vLLM objects.
2026-04-30: PR #34 merged. It closed #9 and #10 — grammar frontier / dLLM block safety and worker / v2 model-runner integration.
2026-05-04: PR #36 merged. It closed #35 — dLLM semantics tests, runtime EngineCore draft-hook alignment (opt-in VLLM_DLLM_APPLY_ENGINE_CORE_DRAFT_HOOK), vllm serve + curl HTTP smoke, and Helm GPU job wiring.
Next toward Phases 7–9: ship #12 (real LLaDA2 HF model), #11 (attention), then #25 (real-model integration evidence). Phase 8 (PR #38) delivers torch.compile optimization and benchmarking infrastructure. Phase 9 (#39) validates output correctness against reference implementations. Keep #2 current for pin / upstream companion plumbing. Phase 0 housekeeping (#18, #20) as needed.
Canonical phase map
| Phase | Goal | Issues |
| --- | --- | --- |
| 0 | Tracking, orchestration, contributor process | #18, #2, #20 |
| 1 | Shared constants/interfaces/docs contracts | #3, #6, #15 |
| 2 | Mock/register model surface only (#5, #24); no real LLaDA2, no #11 | #5, #24 |
| 3 | Remasking behavior + model-policy wiring | #7, #13 |
| 4 | Runtime scheduler/worker decode path | #8, #9, #10 |
| 5 | Strict stack validation | #4 |
| 6 | Tests/docs/integration release confidence (mock stack) | #16, #14, #17 |
| 7 | Real implementation + real-model integration | #12, #11, #25 (integration evidence on real weights, after Phase 6; no downstream issue depends on #25) |
| 8 | Benchmarking and runtime optimizations | #38 |
| 9 | Output correctness validation | #39 (orchestration), #42 (Phase 9.1 - Numerical), #43 (Phase 9.2 - lm-eval) |
Phase 4 (#8, #9, #10) closed via PR #30 (2026-04-28) and PR #34 (2026-04-30). Phase 5 (#4), Phase 6 (#14, #16, #17), and follow-ups #31 and #32 closed via PR #33 (2026-04-29). #35 closed via PR #36 (2026-05-04), which added extended semantics tests and the HTTP serve smoke on GPU Helm.
Phase gates (entry/exit criteria)
| Phase | Entry criteria | Exit criteria |
| --- | --- | --- |
| 0 | Milestone issues exist and are labeled mvp-llada2 | Upstream tracker active, roadmap links synchronized, PR checklist issue created |
| 1 | Phase 0 tracking stable | Config constants + remasking interface + field map merged or stable enough for dependent stubs |
| 2 | Phase 1 contracts available | Mock model (#24) registered and usable for downstream wiring; no Phase 2 #11 attention work; no requirement for real HF LLaDA2 forward (#12 + #11 → Phase 7) |
| 3 | Model and interface surfaces exist | Default policy implemented and model->policy handoff fields stabilized (mock tensors / #24) |
| 4 | Phase 2/3 interfaces stable enough to integrate | Scheduler-worker one-block path works with explicit grammar constraints; DllmWorker validated with model runner v2 where applicable; worker–runner overrides stay minimal (see #10) |
| 5 | Runtime baseline from Phase 4 exists | Invalid stack combinations fail fast with actionable errors |
| 6 | Runtime + validation are available | Unit/doc/integration evidence complete and reproducible for the mock plugin stack (v2 runner per #10 where applicable) |
| 7 | Phase 6 mock MVP exit criteria met (or parallel planning) | #12 + #11 shipped; #25 complete (extends #17 / real-weights proof; optional small PRs to #10/#13 only if gaps found) |
| 8 | Phase 7 real model complete (#12, #11, #25 closed) | Performance baselines documented; torch.compile integration verified; GPU capability detection operational |
| 9.1 | Phase 8 benchmarks available | Layer-by-layer numerical validation passes; tolerance bounds documented; router precision (FP32 vs BF16) validated; expert load balancing analyzed |
| 9.2 | Phase 9.1 numerical validation complete | lm-eval integration working; all benchmark tasks evaluated; results match HF/SGlang within tolerance (±1-2% categorical, ±2-3% generation); correctness test suite established |
Dependency semantics
flowchart LR
hardUp[HardDependency] -->|"must merge first"| hardDown[DownstreamComplete]
softUp[SoftDependency] -.->|"stub or flag allowed"| softDown[DownstreamCanProgress]
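The HARD/SOFT distinction in the flowchart can be made concrete with a small sketch. This is an illustration only, not code from the plugin: the `DepKind`, `Dependency`, `can_complete`, and `pending_soft` names are hypothetical, and the example edges are the ones drawn into #10 in the dependency graph of this issue.

```python
from dataclasses import dataclass
from enum import Enum


class DepKind(Enum):
    HARD = "hard"  # upstream must merge before downstream can complete
    SOFT = "soft"  # downstream may progress behind a stub or flag


@dataclass(frozen=True)
class Dependency:
    upstream: str
    downstream: str
    kind: DepKind


def can_complete(issue: str, deps: list[Dependency], merged: set[str]) -> bool:
    """True when every HARD upstream of `issue` is already merged."""
    return all(
        d.upstream in merged
        for d in deps
        if d.downstream == issue and d.kind is DepKind.HARD
    )


def pending_soft(issue: str, deps: list[Dependency], merged: set[str]) -> list[str]:
    """SOFT upstreams not merged yet; these need a stub or feature flag."""
    return [
        d.upstream
        for d in deps
        if d.downstream == issue
        and d.kind is DepKind.SOFT
        and d.upstream not in merged
    ]


# Edges into #10, copied from the dependency graph (solid = HARD, dotted = SOFT).
DEPS_FOR_10 = [
    Dependency("#3", "#10", DepKind.HARD),
    Dependency("#8", "#10", DepKind.HARD),
    Dependency("#24", "#10", DepKind.HARD),
    Dependency("#13", "#10", DepKind.SOFT),
    Dependency("#2", "#10", DepKind.SOFT),
]
```

With #3, #8, and #24 merged, `can_complete("#10", ...)` holds even though #13 and #2 are still open; `pending_soft` then lists exactly the upstreams that must be covered by a stub or flag.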
Milestone timeline (high-level)
timeline
title MVP LLaDA2.0 phase sequence
section Phase0
TrackAndOrchestrate : #18 roadmap links
: #2 upstream hook and min vLLM
: #20 PR checklist template
section Phase1
ContractFoundation : #3 config constants
: #6 remasking interface
: #15 field mapping table
section Phase2
MockOnly : #5 registration
: #24 mock model
section Phase3
RemaskingWiring : #7 default policy
: #13 model to policy bridge
section Phase4
RuntimePath : #8 scheduler
: #9 grammar safety
: #10 worker integration
section Phase5
Validation : #4 strict stack checks
section Phase6
ShipConfidence : #16 unit tests
: #14 operator docs
: #17 integration checklist mock stack
section Phase7
RealLLaDA2 : #12 real HF model + forward
: #11 attention spike + implementation
Integration : #25 real-model integration
section Phase8
Benchmarking : PR #38 torch.compile + GuideLLM
: GPU capability detection
: Performance baselines
section Phase9
Correctness : #39 numerical validation
: lm-eval integration
: SGlang + HF comparison
Dependency graph (phase-aligned)
flowchart TB
subgraph phase0 [Phase0]
I2["#2"]
I18["#18"]
I20["#20"]
end
subgraph phase1 [Phase1]
I3["#3"]
I6["#6"]
I15["#15"]
end
subgraph phase2 [Phase2]
I5["#5"]
I24["#24"]
end
subgraph phase7 [Phase7]
I12["#12"]
I11["#11"]
I25["#25"]
end
subgraph phase3 [Phase3]
I7["#7"]
I13["#13"]
end
subgraph phase4 [Phase4]
I8["#8"]
I9["#9"]
I10["#10"]
end
subgraph phase5 [Phase5]
I4["#4"]
end
subgraph phase6 [Phase6]
I16["#16"]
I14["#14"]
I17["#17"]
end
subgraph phase8 [Phase8]
PR38["PR #38"]
end
subgraph phase9 [Phase9]
I39["#39 (orchestration)"]
I42["#42 (Phase 9.1)"]
I43["#43 (Phase 9.2)"]
end
I6 --> I7
I15 --> I16
I5 --> I24
I24 --> I13
I7 --> I13
I3 --> I8
I3 --> I10
I8 --> I9
I8 --> I10
I24 --> I10
I13 -.-> I10
I2 -.-> I8
I2 -.-> I10
I8 --> I4
I10 --> I4
I6 --> I16
I8 --> I17
I10 --> I17
I14 --> I17
I24 -.-> I12
I17 --> I25
I12 --> I25
I11 --> I25
I25 --> PR38
I12 --> PR38
I11 --> PR38
I2 -.-> PR38
PR38 --> I42
I42 --> I43
I43 --> I39
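One way to sanity-check the graph above is to derive a merge order from it mechanically. The sketch below copies the 30 edges from the flowchart and uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+). It treats dotted (SOFT) edges as ordering constraints too, which is a conservative assumption; `EDGES` and `merge_order` are illustrative names, not part of the repository. #18 and #20 have no edges in the graph, so they do not appear in the result.

```python
from graphlib import TopologicalSorter

# (upstream, downstream) pairs copied from the flowchart above.
EDGES = [
    ("#6", "#7"), ("#15", "#16"), ("#5", "#24"), ("#24", "#13"), ("#7", "#13"),
    ("#3", "#8"), ("#3", "#10"), ("#8", "#9"), ("#8", "#10"), ("#24", "#10"),
    ("#13", "#10"), ("#2", "#8"), ("#2", "#10"), ("#8", "#4"), ("#10", "#4"),
    ("#6", "#16"), ("#8", "#17"), ("#10", "#17"), ("#14", "#17"), ("#24", "#12"),
    ("#17", "#25"), ("#12", "#25"), ("#11", "#25"), ("#25", "PR #38"),
    ("#12", "PR #38"), ("#11", "PR #38"), ("#2", "PR #38"), ("PR #38", "#42"),
    ("#42", "#43"), ("#43", "#39"),
]


def merge_order(edges: list[tuple[str, str]]) -> list[str]:
    """Return one valid merge order; raises graphlib.CycleError on a cycle."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): downstream depends on upstream.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


if __name__ == "__main__":
    print(" -> ".join(merge_order(EDGES)))
```

Because `TopologicalSorter` raises `CycleError` on circular dependencies, running this after editing the graph also catches accidental cycles.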
Deep dive by phase
Phase 0 - tracking and process enablement
Goal: Keep milestone discoverable, upstream-aligned, and contributor-friendly before heavy implementation.
#18 Role: roadmap link maintenance.
#2 Role: upstream dependency tracker.
#20 Role: contributor PR checklist template.
Phase 0 exit: Upstream tracker active, roadmap linked, and PR checklist workflow available.
Phase 1 - shared contracts
Goal: Define canonical names, constants, and interfaces used by all runtime work.
#3 Role: source-of-truth config constants.
#6 Role: remasking interface contract.
#15 Role: field map reference.
Phase 1 exit: constants + interface + field map stabilized and referenced.
Phase 2 - mock model surface only
Goal: Register the mock/stub model (#24) through the registration wiring in #5. #11 (attention) is not in Phase 2; it is deferred to Phase 7 together with #12 (real model).
#5 Role: registration wiring.
#24 Role: mock registered model.
Phase 2 exit: mock path is registered and usable; no attention spike (#11) and no real HF mapping in Phase 2.
Phase 3 - remasking behavior and wiring
Goal: Implement default policy and stabilize model->policy handoff.
#7 Role: default remasking implementation.
#13 Role: wiring contract bridge.
Phase 3 exit: remasking behavior and handoff fields are explicit and testable (mock tensors first).
Phase 4 - runtime integration
Delivered: #8 closed — PR #30 (2026-04-28). #9 , #10 closed — PR #34 (2026-04-30).
Goal: Deliver one-block decode execution path via scheduler + worker.
#8 Role: scheduler semantics owner.
#9 Role: grammar safety overlay.
#10 Role: worker execution owner.
Phase 4 exit: scheduler-worker path works with explicit grammar behavior and no silent field mismatch on the mock model; v2 expectations documented/tested per #10; overrides preserve upstream model_runner benefits where possible.
Phase 5 - strict validation
Delivered (2026-04-29): #4 closed — PR #33 .
Goal: Convert stack misuse into immediate actionable errors.
#4 Role: compatibility gatekeeper.
Phase 5 exit: invalid stack combinations fail fast with deterministic messages and test coverage.
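The "fail fast with deterministic messages" exit criterion could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of validation.py: `StackConfig`, `StackValidationError`, `validate_stack`, and the `MockDllmModel` architecture name used in the usage note are hypothetical, and only the `DllmScheduler` / `DllmWorker` names come from this issue.

```python
from dataclasses import dataclass

# Component names taken from the issue titles; treating them as the only
# accepted values is an assumption made for this sketch.
EXPECTED_SCHEDULER = "DllmScheduler"
EXPECTED_WORKER = "DllmWorker"


class StackValidationError(ValueError):
    """Raised when scheduler, worker, and model do not form a valid stack."""


@dataclass(frozen=True)
class StackConfig:
    scheduler: str
    worker: str
    model_arch: str


def validate_stack(cfg: StackConfig, supported_archs: set[str]) -> None:
    """Collect every mismatch, then raise once with a stable message."""
    problems = []
    if cfg.scheduler != EXPECTED_SCHEDULER:
        problems.append(
            f"scheduler={cfg.scheduler!r}: expected {EXPECTED_SCHEDULER!r}"
        )
    if cfg.worker != EXPECTED_WORKER:
        problems.append(f"worker={cfg.worker!r}: expected {EXPECTED_WORKER!r}")
    if cfg.model_arch not in supported_archs:
        # sorted() keeps the message deterministic regardless of set order.
        problems.append(
            f"model_arch={cfg.model_arch!r}: expected one of {sorted(supported_archs)}"
        )
    if problems:
        raise StackValidationError("incompatible dLLM stack: " + "; ".join(problems))
```

For example, `validate_stack(StackConfig("DllmScheduler", "DllmWorker", "MockDllmModel"), {"MockDllmModel"})` returns silently, while a config with the wrong scheduler raises before any engine start-up work happens. Reporting all mismatches in one error, in a fixed order, is what makes the message both actionable and easy to assert on in tests.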
Phase 6 - ship confidence
Delivered (2026-04-29): #14 , #16 , #17 , #31 , #32 closed — PR #33 . Full GPU LLM.generate integration remains CUDA-gated/off-runner; CPU PR CI covers EngineArgs + strict validation.
Extended (2026-05-04): #35 closed — PR #36 : dLLM semantics regression tests, optional runtime EngineCore draft-hook patch for vllm serve on legacy wheels, tools/e2e/serve_http_smoke.sh, and Helm GPU job running HTTP smoke after pytest.
Goal: lock contract confidence and operator reproducibility.
#16 Role: contract unit tests.
#14 Role: operator runbook.
#17 Role: final integration evidence.
Phase 6 exit: reproducible evidence confirms mock-stack MVP end-to-end viability (real LLaDA2 is optional here and tracked in Phase 7).
Phase 7 - real LLaDA2 model, attention, and #25
Goal: Ship #12 and #11, then close #25 (real-model integration on real weights). #10 / #13 / #17 do not depend on #25; rather, #25 depends on them (as the mock baseline) plus #12 and #11.
Implementation (can overlap in time):
#12 Role: real LLaDA2 model surface.
#11 Role: attention spike + implementation (deferred from Phase 2).
#25 Role: real-model integration (checklist / evidence owner).
Phase 7 exit: #12 + #11 complete; #25 complete with real-weights integration evidence.
Note: Phase 7 completion and Phase 8 initial delivery occurred together via PR #38 , which combined real model implementation with torch.compile optimization and benchmarking infrastructure.
Phase 8 - benchmarking and runtime optimizations
Delivered: PR #38 (combined with Phase 7 delivery)
Goal: Establish performance baselines and optimize runtime with vLLM-native compilation.
PR #38 Role: torch.compile integration + benchmarking infrastructure.
Scope: vLLM-native @support_torch_compile decorator; GPU capability detection (A100/H100/B200); GuideLLM benchmark harness; baseline metrics documentation (TTFT, ITL, TPS).
Delivered metrics: 346 tokens/sec median, 522ms TTFT, 2.4ms ITL on A100-SXM4-40GB (vLLM 0.20.1).
Documentation: PHASE8_BENCHMARKS.md, tools/benchmark_optimization.sh, GPU capability detection via dllm_plugin/gpu_capability.py.
Dependencies:
Phase 8 exit: Baseline performance documented; torch.compile verified operational; GPU detection working; benchmarking reproducible via tools.
Future iterations (post-MVP):
Phase 8.2: Single-pass attention (+10-20% TTFT target)
Phase 8.3: CUTLASS FusedMoE (+15-30% TPS on A100)
Phase 8.4: FlashInfer fused topk (+20-40% TPS on H100+)
Phase 9 - output correctness validation
Goal: Validate numerical correctness and benchmark performance against reference implementations.
Sub-phases:
Phase 9.1: Numerical Validation (Incremental Layer-by-Layer)
#42 Role: Layer-by-layer numerical correctness validation owner.
Scope: 8 incremental validation points: Embedding → Attention (QKV, norms, computation) → MoE (router, group-limited routing, experts, scaling) → Decoder → Transformer stack → Final norm → LM head → E2E tokens→logits.
Critical tests:
Router precision: FP32 (default) vs BF16 (experimental VLLM_LLADA2_BF16_ROUTER=1)
Group-limited routing: 256 experts → 8 groups → top-4 groups → top-k experts
Routed scaling factor (2.5x) validation
Expert load balancing (no pathological bias)
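To make the group-limited routing test concrete, here is a minimal pure-Python sketch of the selection step: experts are split into equal groups, the best groups are kept, and the final top-k is taken only from experts in kept groups. Two points are assumptions rather than facts from this issue: a group is scored by the maximum score of its experts (the real model may use a different rule), and the function name and signature are hypothetical. Weight normalization and the 2.5x routed scaling factor are deliberately left out.

```python
def group_limited_topk(
    scores: list[float], n_groups: int, topk_groups: int, top_k: int
) -> list[int]:
    """Pick top_k expert indices, restricted to the topk_groups best groups."""
    if len(scores) % n_groups != 0:
        raise ValueError("number of experts must be divisible by n_groups")
    group_size = len(scores) // n_groups

    # Assumption: a group's score is the max score among its experts.
    group_scores = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_groups)
    ]
    kept_groups = sorted(
        range(n_groups), key=lambda g: group_scores[g], reverse=True
    )[:topk_groups]

    # Only experts inside the kept groups are eligible for the final top-k.
    candidates = [
        e
        for g in kept_groups
        for e in range(g * group_size, (g + 1) * group_size)
    ]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return sorted(chosen)
```

For the configuration named in the test list this would be called with 256 scores, `n_groups=8`, and `topk_groups=4`. A useful property to assert in a validation test is that every returned index falls inside one of the kept groups, even when a higher-scoring expert exists in a dropped group.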
Validation methodology: Extract intermediate tensors from HF/SGlang, compare with tolerance bounds (FP32: atol=1e-5/rtol=1e-4, BF16: atol=1e-3/rtol=1e-2)
Deliverables: Test suite tests/test_llada2_numerical_validation.py, tolerance framework dllm_plugin/validation_utils.py, documentation docs/PHASE9.1_NUMERICAL_VALIDATION.md
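The tolerance bounds in the validation methodology can be read as the usual combined absolute/relative check, |actual − expected| ≤ atol + rtol·|expected|, which is the rule used by `numpy.allclose` and `torch.allclose`. The sketch below applies it to plain Python lists so it runs without extra dependencies; it is not the contents of dllm_plugin/validation_utils.py, and `TOLERANCES`, `within_tolerance`, and `max_abs_diff` are hypothetical names.

```python
# (atol, rtol) per dtype, as stated in the validation methodology above.
TOLERANCES = {
    "fp32": (1e-5, 1e-4),
    "bf16": (1e-3, 1e-2),
}


def within_tolerance(
    actual: list[float], expected: list[float], dtype: str
) -> bool:
    """Elementwise check: |a - e| <= atol + rtol * |e| for every pair."""
    atol, rtol = TOLERANCES[dtype]
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


def max_abs_diff(actual: list[float], expected: list[float]) -> float:
    """Largest absolute elementwise difference, useful in failure reports."""
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)
```

In a layer-by-layer test, `expected` would be the flattened reference tensor extracted from HF or SGlang and `actual` the plugin's tensor at the same validation point; reporting `max_abs_diff` alongside the pass/fail result makes it easier to document how close each layer sits to its bound.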
Dependencies:
Phase 9.2: E2E Evaluation Validation (lm-eval Integration)
#43 Role: E2E lm-eval integration and benchmark validation owner.
Scope: 5 standard benchmark tasks via lm-evaluation-harness:
Categorical: MMLU (5-shot, 57 subjects), HellaSwag (10-shot), ARC-Challenge (25-shot)
Generation: GSM8K (8-shot chain-of-thought), TruthfulQA (0-shot)
Evaluation strategy:
Phase 1: Sanity subset (100-500 examples, ~30 min/task)
Phase 2: Full dataset (complete, ~4-8 hours total)
Comparison: dllm-plugin vLLM vs HuggingFace baseline vs SGlang (optional)
Tolerance bounds: ±1-2% categorical accuracy, ±2-3% generation exact match
Deliverables: lm-eval integration, test suite tests/test_lm_eval_integration.py, comparison tooling tools/compare_lm_eval_results.py, documentation docs/PHASE9.2_LMEVAL_RESULTS.md
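The tolerance bounds above suggest a simple per-task comparison between two result sets. The sketch below is a hedged illustration, not tools/compare_lm_eval_results.py: it assumes scores are percentages, interprets the bounds as absolute percentage points, and uses the upper end of each range (2 points categorical, 3 points generation). The task keys are labels chosen for this sketch and are not guaranteed to match lm-evaluation-harness task names.

```python
# Assumed interpretation: absolute percentage points, upper end of each range.
TOLERANCE_POINTS = {"categorical": 2.0, "generation": 3.0}

# Task grouping as listed in the Phase 9.2 scope; keys are illustrative labels.
TASK_KIND = {
    "mmlu": "categorical",
    "hellaswag": "categorical",
    "arc_challenge": "categorical",
    "gsm8k": "generation",
    "truthfulqa": "generation",
}


def compare_scores(
    candidate: dict[str, float], baseline: dict[str, float]
) -> dict[str, tuple[float, bool]]:
    """Map each task to (candidate - baseline, within_tolerance)."""
    report = {}
    for task, kind in TASK_KIND.items():
        delta = candidate[task] - baseline[task]
        report[task] = (delta, abs(delta) <= TOLERANCE_POINTS[kind])
    return report


def failing_tasks(report: dict[str, tuple[float, bool]]) -> list[str]:
    """Tasks whose delta falls outside the tolerance, in declaration order."""
    return [task for task, (_, ok) in report.items() if not ok]
```

Here `candidate` would hold the dllm-plugin vLLM scores and `baseline` the HuggingFace (or SGlang) scores for the same tasks; any task returned by `failing_tasks` is a discrepancy to document under the Phase 9 exit criteria.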
Dependencies:
Phase 9 exit: All numerical validation passes (Phase 9.1); all lm-eval tasks within tolerance (Phase 9.2); discrepancies documented; correctness test suite established.
PR sequencing and review contract
Preferred merge order (default)
Phase 0 process/trackers (#18 "ROADMAP: link milestone issues when created", #2 "Track vLLM draft-token hook; pin minimum vLLM for MVP", #20 "Contributor PR checklist template for milestone") and Phase 1 contracts (#3 "Add config.py: DRAFT_SIZE, model IDs, feature flags", #6 "Define RemaskingPolicy protocol/ABC in remasking/base.py", #15 "Document field-mapping table for contributors (copy-friendly)").
Phase 2 mock model + registration (#5 "Wire register() to register LLaDA2.0 architecture (bart-style)", #24 "Mock/stub registered model for stack testing (deterministic forward)"); no #11 ("Attention path spike: FlexAttention / non-causal virtual chunks for LLaDA2.0") in Phase 2.
Phase 3 remasking (#7 "Implement remasking/llada2_default.py (MVP default policy)", #13 "Connect model forward outputs to RemaskingPolicy inputs").
Phase 4 runtime (#8 "DllmScheduler: spec_token_ids, DRAFT_SIZE, commit-0 rollback" -> #9 "Scheduler: draft grammar must not break dLLM blocks" and #10 "DllmWorker (WorkerBase): batch build, forward one block, take_draft_token_ids").
Phase 5 validation (#4 "Implement validation.py: assert compatible scheduler/worker/model stack").
Phase 6 tests/docs/integration on the mock stack (#16 "Unit tests: field mapping / remask contract (no vLLM required)", #14 "Operator doc: VLLM_PLUGINS, CLI flags, first-block spec_token_ids init", #17 "Integration test or manual checklist: serve LLaDA2.0 with plugin stack"). Core #4, #14, #16, #17 plus follow-ups #31 ("[Phase 4 Follow-up]: replace synthesized-logit remask bridge with model-score handoff") and #32 ("[Phase 6 Follow-up]: add runtime-adapter integration smoke with concrete vLLM objects") landed in PR #33 (2026-04-29); extended semantics + HTTP smoke landed in PR #36 (2026-05-04).
Phase 7: #12 ("LLaDA2.0 HF mapping + real vLLM model module (models/llada2.py)") + #11, then #25 (after Phase 6 mock evidence).
Phase 8: PR #38, "feat(phase7+8): Production LLaDA2.0 model + vLLM-native torch.compile optimization" (torch.compile + benchmarking), delivered alongside Phase 7.
Phase 9: #39, "Phase 9: Output correctness validation", after Phase 8 baselines exist.
Milestone PR checklist (required in PR description)
Closes #<n>; Phase <0-9> (use Phase 7 for #12 / #11 / #25, not Phase 2)
Milestone definition of done
Every milestone issue has explicit role/scope/dependencies/acceptance checks.
Phase names are consistent across phase map, timeline, dependency graph, and deep-dive (0–9).
Runtime path (Phase 4: #8, #9, #10, including v2 runner expectations per #10) is complete and coherent for the mock MVP (#24); delivered in PR #30 + PR #34.
Validation path (#4) is complete for the mock MVP per PR #33. #12 / Phase 7 is tracked separately for real LLaDA2.
Operator docs (#14), unit tests (#16), integration evidence (#17), and follow-ups (#31, #32) for the mock stack were delivered in PR #33; extended semantics / vllm serve HTTP smoke were delivered in PR #36 (#35). Phase 7 completes #12, #11, and #25 (real-weights proof).
Phase 8 benchmarking complete: torch.compile operational, baselines documented in PHASE8_BENCHMARKS.md, GPU detection verified
Phase 9.1 numerical validation complete: all 8 validation points pass, tolerance bounds documented, router precision validated
Phase 9.2 lm-eval integration complete: all 5 tasks evaluated, results within tolerance (±1-2% categorical, ±2-3% generation), comparison reports merged
Upstream dependency tracker (#2 ) reflects current minimum-version confidence for MVP.
Maintenance note
When adding, splitting, or closing milestone issues, update this issue's canonical phase map, dependency graph, and deep-dive in the same change to avoid drift.