[MVP LLaDA2.0] Milestone orchestration: timeline, dependency graph, phased plan #19

@AlonKellner-RedHat

Summary

This issue is the orchestration source of truth for MVP LLaDA2.0.

It now includes:

  • phase gates (entry/exit criteria),
  • explicit dependency semantics (HARD vs SOFT),
  • an issue-by-issue deep dive in phase order,
  • PR sequencing and review contract,
  • milestone definition of done,
  • mock-first MVP (phases 0–6): real LLaDA2 HF mapping (#12) and attention work (#11) are deferred to Phase 7; the mock stack uses #24 only (no attention spike in Phase 2); real-model integration is #25 (nothing in phases 2–6 depends on #25),
  • GPU Model Runner v2 expectations (#10, VLLM_USE_V2_MODEL_RUNNER) and worker–runner risk mitigations.

Scope reference: DESIGN_MVP.md, ROADMAP.md, upstream vllm#36155.

Delivery status (mock-stack train)

2026-04-28: PR #30 merged. It closed #8 — Phase 4 scheduler/runtime path (initial decode wiring).

2026-04-29: PR #33 merged (squash 41042b4768dba20ad6dc1940b4c5ee42dee1408e). It closed #4, #14, #16, #17, #31, #32 — strict validation, operator doc, unit/integration confidence, logits/remask policy follow-up, and CPU EngineArgs smoke with concrete vLLM objects.

2026-04-30: PR #34 merged. It closed #9 and #10 — grammar frontier / dLLM block safety and worker / v2 model-runner integration.

2026-05-04: PR #36 merged. It closed #35 — dLLM semantics tests, runtime EngineCore draft-hook alignment (opt-in VLLM_DLLM_APPLY_ENGINE_CORE_DRAFT_HOOK), vllm serve + curl HTTP smoke, and Helm GPU job wiring.

Next toward Phase 7–9: ship #12 (real LLaDA2 HF model), #11 (attention), then #25 (real-model integration evidence). Phase 8 (PR #38) delivers torch.compile optimization and benchmarking infrastructure. Phase 9 (#39) validates output correctness against reference implementations. Keep #2 current for pin / upstream companion plumbing. Phase 0 housekeeping (#18, #20) as needed.


Canonical phase map

| Phase | Goal | Issues |
| --- | --- | --- |
| 0 | Tracking, orchestration, contributor process | #18, #2, #20 |
| 1 | Shared constants/interfaces/docs contracts | #3, #6, #15 |
| 2 | Mock/register model surface only (#5, #24); no real LLaDA2, no #11 | #5, #24 |
| 3 | Remasking behavior + model-policy wiring | #7, #13 |
| 4 | Runtime scheduler/worker decode path | #8, #9, #10 |
| 5 | Strict stack validation | #4 |
| 6 | Tests/docs/integration release confidence (mock stack) | #16, #14, #17 |
| 7 | Real implementation + real-model integration | #12, #11, #25 (integration evidence on real weights, after Phase 6; no downstream issue depends on #25) |
| 8 | Benchmarking and runtime optimizations | #38 |
| 9 | Output correctness validation | #39 (orchestration), #42 (Phase 9.1 - Numerical), #43 (Phase 9.2 - lm-eval) |

Phase 4 (#8, #9, #10) closed via PR #30 (2026-04-28) and PR #34 (2026-04-30). Phase 5 (#4), Phase 6 (#14, #16, #17), and follow-ups #31 and #32 closed via PR #33 (2026-04-29). #35 closed via PR #36 (2026-05-04), which added extended semantics tests and the HTTP serve smoke on the GPU Helm job.


Phase gates (entry/exit criteria)

| Phase | Entry criteria | Exit criteria |
| --- | --- | --- |
| 0 | Milestone issues exist and are labeled mvp-llada2 | Upstream tracker active, roadmap links synchronized, PR checklist issue created |
| 1 | Phase 0 tracking stable | Config constants + remasking interface + field map merged or stable enough for dependent stubs |
| 2 | Phase 1 contracts available | Mock model (#24) registered and usable for downstream wiring; no Phase 2 #11 attention work; no requirement for real HF LLaDA2 forward (#12 + #11 → Phase 7) |
| 3 | Model and interface surfaces exist | Default policy implemented and model->policy handoff fields stabilized (mock tensors / #24) |
| 4 | Phase 2/3 interfaces stable enough to integrate | Scheduler-worker one-block path works with explicit grammar constraints; DllmWorker validated with model runner v2 where applicable; worker–runner overrides stay minimal (see #10) |
| 5 | Runtime baseline from Phase 4 exists | Invalid stack combinations fail fast with actionable errors |
| 6 | Runtime + validation are available | Unit/doc/integration evidence complete and reproducible for the mock plugin stack (v2 runner per #10 where applicable) |
| 7 | Phase 6 mock MVP exit criteria met (or parallel planning) | #12 + #11 shipped; #25 complete (extends #17 / real-weights proof; optional small PRs to #10/#13 only if gaps found) |
| 8 | Phase 7 real model complete (#12, #11, #25 closed) | Performance baselines documented; torch.compile integration verified; GPU capability detection operational |
| 9.1 | Phase 8 benchmarks available | Layer-by-layer numerical validation passes; tolerance bounds documented; router precision (FP32 vs BF16) validated; expert load balancing analyzed |
| 9.2 | Phase 9.1 numerical validation complete | lm-eval integration working; all benchmark tasks evaluated; results match HF/SGlang within tolerance (±1-2% categorical, ±2-3% generation); correctness test suite established |

Dependency semantics

flowchart LR
    hardUp[HardDependency] -->|"must merge first"| hardDown[DownstreamComplete]
    softUp[SoftDependency] -.->|"stub or flag allowed"| softDown[DownstreamCanProgress]
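
The HARD/SOFT distinction can be expressed as a small check: only HARD upstreams block a downstream issue from completing, while SOFT upstreams may be replaced by a stub or flag. The sketch below is illustrative only. `DepKind`, `Dependency`, `blocking_upstreams`, and `can_complete` are hypothetical names, and the two edges are a subset of the dependency graph in this issue.

```python
from dataclasses import dataclass
from enum import Enum


class DepKind(Enum):
    HARD = "hard"  # upstream must merge before the downstream can complete
    SOFT = "soft"  # downstream may progress against a stub or feature flag


@dataclass(frozen=True)
class Dependency:
    upstream: str
    downstream: str
    kind: DepKind


def blocking_upstreams(issue, deps, merged):
    """Return the HARD upstreams of `issue` that have not merged yet."""
    return sorted(
        d.upstream
        for d in deps
        if d.downstream == issue
        and d.kind is DepKind.HARD
        and d.upstream not in merged
    )


def can_complete(issue, deps, merged):
    """An issue can complete once no HARD upstream is still open."""
    return not blocking_upstreams(issue, deps, merged)


# Subset of the graph: #8 --> #10 is HARD, #13 -.-> #10 is SOFT.
deps = [
    Dependency("#8", "#10", DepKind.HARD),
    Dependency("#13", "#10", DepKind.SOFT),
]
```

With nothing merged, `blocking_upstreams("#10", deps, set())` reports only `#8`; the SOFT edge from `#13` never blocks.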

Milestone timeline (high-level)

timeline
    title MVP LLaDA2.0 phase sequence
    section Phase0
        TrackAndOrchestrate : #18 roadmap links
                           : #2 upstream hook and min vLLM
                           : #20 PR checklist template
    section Phase1
        ContractFoundation : #3 config constants
                           : #6 remasking interface
                           : #15 field mapping table
    section Phase2
        MockOnly : #5 registration
                 : #24 mock model
    section Phase3
        RemaskingWiring : #7 default policy
                        : #13 model to policy bridge
    section Phase4
        RuntimePath : #8 scheduler
                    : #9 grammar safety
                    : #10 worker integration
    section Phase5
        Validation : #4 strict stack checks
    section Phase6
        ShipConfidence : #16 unit tests
                       : #14 operator docs
                       : #17 integration checklist mock stack
    section Phase7
        RealLLaDA2 : #12 real HF model + forward
                   : #11 attention spike + implementation
        Integration : #25 real-model integration
    section Phase8
        Benchmarking : PR #38 torch.compile + GuideLLM
                     : GPU capability detection
                     : Performance baselines
    section Phase9
        Correctness : #39 numerical validation
                    : lm-eval integration
                    : SGlang + HF comparison

Dependency graph (phase-aligned)

flowchart TB
    subgraph phase0 [Phase0]
        I2["#2"]
        I18["#18"]
        I20["#20"]
    end

    subgraph phase1 [Phase1]
        I3["#3"]
        I6["#6"]
        I15["#15"]
    end

    subgraph phase2 [Phase2]
        I5["#5"]
        I24["#24"]
    end

    subgraph phase7 [Phase7]
        I12["#12"]
        I11["#11"]
        I25["#25"]
    end

    subgraph phase3 [Phase3]
        I7["#7"]
        I13["#13"]
    end

    subgraph phase4 [Phase4]
        I8["#8"]
        I9["#9"]
        I10["#10"]
    end

    subgraph phase5 [Phase5]
        I4["#4"]
    end

    subgraph phase6 [Phase6]
        I16["#16"]
        I14["#14"]
        I17["#17"]
    end

    subgraph phase8 [Phase8]
        PR38["PR #38"]
    end

    subgraph phase9 [Phase9]
        I39["#39 (orchestration)"]
        I42["#42 (Phase 9.1)"]
        I43["#43 (Phase 9.2)"]
    end

    I6 --> I7
    I15 --> I16
    I5 --> I24
    I24 --> I13
    I7 --> I13
    I3 --> I8
    I3 --> I10
    I8 --> I9
    I8 --> I10
    I24 --> I10
    I13 -.-> I10
    I2 -.-> I8
    I2 -.-> I10
    I8 --> I4
    I10 --> I4
    I6 --> I16
    I8 --> I17
    I10 --> I17
    I14 --> I17
    I24 -.-> I12
    I17 --> I25
    I12 --> I25
    I11 --> I25
    I25 --> PR38
    I12 --> PR38
    I11 --> PR38
    I2 -.-> PR38
    PR38 --> I42
    I42 --> I43
    I43 --> I39
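
One way to sanity-check a proposed merge order against this graph is to topologically sort the solid (HARD) edges and ignore the dotted (SOFT) ones. The following is a sketch using Python's standard `graphlib` module; `HARD_EDGES` is transcribed from the solid arrows above, and `merge_order` is a hypothetical helper, not part of the plugin.

```python
from graphlib import TopologicalSorter

# (upstream, downstream) pairs copied from the solid arrows in the graph.
HARD_EDGES = [
    ("#6", "#7"), ("#15", "#16"), ("#5", "#24"), ("#24", "#13"),
    ("#7", "#13"), ("#3", "#8"), ("#3", "#10"), ("#8", "#9"),
    ("#8", "#10"), ("#24", "#10"), ("#8", "#4"), ("#10", "#4"),
    ("#6", "#16"), ("#8", "#17"), ("#10", "#17"), ("#14", "#17"),
    ("#17", "#25"), ("#12", "#25"), ("#11", "#25"), ("#25", "PR #38"),
    ("#12", "PR #38"), ("#11", "PR #38"), ("PR #38", "#42"),
    ("#42", "#43"), ("#43", "#39"),
]


def merge_order(hard_edges):
    """Return one valid merge order; raises graphlib.CycleError on a cycle."""
    sorter = TopologicalSorter()
    for upstream, downstream in hard_edges:
        # add(node, *predecessors): downstream depends on upstream.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


order = merge_order(HARD_EDGES)
```

Any order returned this way places `#8` before `#10` and `#25` before `PR #38`. Issues that appear only on SOFT edges (such as `#2`) are absent from the result because they never hard-block anything.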

Deep dive by phase

Phase 0 - tracking and process enablement

Goal: Keep milestone discoverable, upstream-aligned, and contributor-friendly before heavy implementation.

Phase 0 exit: Upstream tracker active, roadmap linked, and PR checklist workflow available.

Phase 1 - shared contracts

Goal: Define canonical names, constants, and interfaces used by all runtime work.

Phase 1 exit: constants + interface + field map stabilized and referenced.
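
As an illustration of what a shared constants module might look like, here is a minimal sketch. The flag names are the environment variables mentioned in this issue; the `DRAFT_SIZE` value, the truthy-string set, and the `env_flag` helper are assumptions for the example and may not match the real `config.py` or how vLLM parses its own variables.

```python
import os

# Placeholder value for illustration only; the real constant lives in config.py (#3).
DRAFT_SIZE = 32

# Environment variable names taken from this issue.
FLAG_V2_MODEL_RUNNER = "VLLM_USE_V2_MODEL_RUNNER"
FLAG_ENGINE_CORE_DRAFT_HOOK = "VLLM_DLLM_APPLY_ENGINE_CORE_DRAFT_HOOK"
FLAG_BF16_ROUTER = "VLLM_LLADA2_BF16_ROUTER"

# Assumed convention: these strings (case-insensitive) enable a flag.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read an opt-in boolean flag; unset variables fall back to `default`."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY
```

Passing `environ` explicitly keeps the helper testable without mutating the process environment, for example `env_flag(FLAG_BF16_ROUTER, environ={"VLLM_LLADA2_BF16_ROUTER": "1"})`.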

Phase 2 - mock model surface only

Goal: Register the mock/stub model (#24) through the registration hook (#5). Attention work (#11) is not in Phase 2; it is deferred to Phase 7 alongside #12 (real model).

Phase 2 exit: mock path is registered and usable; no attention spike (#11) and no real HF mapping in Phase 2.

Phase 3 - remasking behavior and wiring

Goal: Implement default policy and stabilize model->policy handoff.

Phase 3 exit: remasking behavior and handoff fields are explicit and testable (mock tensors first).
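
To make the model->policy handoff concrete, the sketch below shows one possible shape for a remasking contract on mock data. `RemaskingPolicy` is the name used by #6, but the method signature, `MASK_ID`, and the `KeepTopKPolicy` confidence-based rule are assumptions for illustration, not the plugin's actual interface or its default policy.

```python
from typing import Protocol, Sequence

MASK_ID = -1  # hypothetical mask token id for this example


class RemaskingPolicy(Protocol):
    """Assumed contract: given draft tokens and scores, return the remasked block."""

    def remask(
        self, token_ids: Sequence[int], confidences: Sequence[float]
    ) -> list[int]: ...


class KeepTopKPolicy:
    """Keep the `keep` most confident tokens and mask the rest."""

    def __init__(self, keep: int, mask_id: int = MASK_ID) -> None:
        self.keep = keep
        self.mask_id = mask_id

    def remask(self, token_ids, confidences):
        if len(token_ids) != len(confidences):
            raise ValueError("token_ids and confidences must have equal length")
        # Rank positions by confidence, highest first.
        ranked = sorted(
            range(len(token_ids)), key=lambda i: confidences[i], reverse=True
        )
        kept = set(ranked[: self.keep])
        return [t if i in kept else self.mask_id for i, t in enumerate(token_ids)]


block = KeepTopKPolicy(keep=2).remask([10, 11, 12, 13], [0.9, 0.1, 0.8, 0.3])
# block == [10, -1, 12, -1]
```

Because the policy takes plain sequences, it can be exercised with mock tensors converted to lists, which matches the "mock tensors first" exit criterion.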

Phase 4 - runtime integration

Delivered: #8 closed — PR #30 (2026-04-28). #9, #10 closed — PR #34 (2026-04-30).

Goal: Deliver one-block decode execution path via scheduler + worker.

Phase 4 exit: scheduler-worker path works with explicit grammar behavior and no silent field mismatch on the mock model; v2 expectations documented/tested per #10; overrides preserve upstream model_runner benefits where possible.

Phase 5 - strict validation

Delivered (2026-04-29): #4 closed — PR #33.

Goal: Convert stack misuse into immediate actionable errors.

Phase 5 exit: invalid stack combinations fail fast with deterministic messages and test coverage.
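
A fail-fast check of this kind can be sketched as a function that collects every mismatch in a fixed order and raises a single error. `DllmScheduler` and `DllmWorker` are names from this milestone; `StackValidationError`, `validate_stack`, the string-based comparison, and the `MockLLaDA2` architecture name are hypothetical and do not describe the real `validation.py`.

```python
class StackValidationError(ValueError):
    """Raised when scheduler, worker, and model do not form a supported stack."""


EXPECTED_SCHEDULER = "DllmScheduler"
EXPECTED_WORKER = "DllmWorker"
SUPPORTED_ARCHS = ("MockLLaDA2",)  # placeholder architecture name


def validate_stack(scheduler: str, worker: str, model_arch: str) -> None:
    """Raise one error listing every mismatch, in a fixed order."""
    problems = []
    if scheduler != EXPECTED_SCHEDULER:
        problems.append(
            f"scheduler is {scheduler!r}; expected {EXPECTED_SCHEDULER!r}"
        )
    if worker != EXPECTED_WORKER:
        problems.append(f"worker is {worker!r}; expected {EXPECTED_WORKER!r}")
    if model_arch not in SUPPORTED_ARCHS:
        problems.append(
            f"model architecture {model_arch!r} is not one of {SUPPORTED_ARCHS!r}"
        )
    if problems:
        raise StackValidationError("Incompatible dLLM stack: " + "; ".join(problems))
```

Checking in a fixed order and joining all findings keeps the message deterministic, so tests can assert on its exact text.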

Phase 6 - ship confidence

Delivered (2026-04-29): #14, #16, #17, #31, #32 closed — PR #33. Full GPU LLM.generate integration remains CUDA-gated/off-runner; CPU PR CI covers EngineArgs + strict validation.

Extended (2026-05-04): #35 closed — PR #36: dLLM semantics regression tests, optional runtime EngineCore draft-hook patch for vllm serve on legacy wheels, tools/e2e/serve_http_smoke.sh, and Helm GPU job running HTTP smoke after pytest.

Goal: lock contract confidence and operator reproducibility.

Phase 6 exit: reproducible evidence confirms mock-stack MVP end-to-end viability (real LLaDA2 optional Phase 7).

Phase 7 - real LLaDA2 model, attention, and #25

Goal: Ship #12 and #11, then close #25 (real-model integration on real weights). #10 / #13 / #17 do not depend on #25; #25 depends on them (mock baseline) plus #12/#11.

Implementation (can overlap in time):

Phase 7 exit: #12 + #11 complete; #25 complete with real-weights integration evidence.

Note: Phase 7 completion and Phase 8 initial delivery occurred together via PR #38, which combined real model implementation with torch.compile optimization and benchmarking infrastructure.

Phase 8 - benchmarking and runtime optimizations

Delivered: PR #38 (combined with Phase 7 delivery)

Goal: Establish performance baselines and optimize runtime with vLLM-native compilation.

Phase 8 exit: Baseline performance documented; torch.compile verified operational; GPU detection working; benchmarking reproducible via tools.

Future iterations (post-MVP):

  • Phase 8.2: Single-pass attention (+10-20% TTFT target)
  • Phase 8.3: CUTLASS FusedMoE (+15-30% TPS on A100)
  • Phase 8.4: FlashInfer fused topk (+20-40% TPS on H100+)

Phase 9 - output correctness validation

Goal: Validate numerical correctness and benchmark performance against reference implementations.

Sub-phases:

Phase 9.1: Numerical Validation (Incremental Layer-by-Layer)

  • #42 Role: Layer-by-layer numerical correctness validation owner.
    • Scope: 8 incremental validation points: Embedding → Attention (QKV, norms, computation) → MoE (router, group-limited routing, experts, scaling) → Decoder → Transformer stack → Final norm → LM head → E2E tokens→logits.
    • Critical tests:
      • Router precision: FP32 (default) vs BF16 (experimental VLLM_LLADA2_BF16_ROUTER=1)
      • Group-limited routing: 256 experts → 8 groups → top-4 groups → top-k experts
      • Routed scaling factor (2.5x) validation
      • Expert load balancing (no pathological bias)
    • Validation methodology: Extract intermediate tensors from HF/SGlang, compare with tolerance bounds (FP32: atol=1e-5/rtol=1e-4, BF16: atol=1e-3/rtol=1e-2)
    • Deliverables: Test suite tests/test_llada2_numerical_validation.py, tolerance framework dllm_plugin/validation_utils.py, documentation docs/PHASE9.1_NUMERICAL_VALIDATION.md
    • Dependencies:
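
The tolerance bounds in the validation methodology above can be read as the usual combined absolute/relative criterion `|actual - expected| <= atol + rtol * |expected|` (the same rule `torch.allclose` applies). The sketch below implements it in plain Python on flat lists; `TOLERANCES` and `within_tolerance` are hypothetical names and not the contents of `dllm_plugin/validation_utils.py`.

```python
# (atol, rtol) per dtype, taken from the bounds stated in this issue.
TOLERANCES = {
    "fp32": (1e-5, 1e-4),
    "bf16": (1e-3, 1e-2),
}


def within_tolerance(actual, expected, dtype):
    """True if every element satisfies |a - e| <= atol + rtol * |e|."""
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have equal length")
    atol, rtol = TOLERANCES[dtype]
    return all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


reference = [1.0, -2.0, 0.0]
candidate = [1.0005, -2.0, 0.0]
bf16_ok = within_tolerance(candidate, reference, "bf16")
fp32_ok = within_tolerance(candidate, reference, "fp32")
```

Here the 5e-4 deviation on the first element passes the BF16 bound but fails the FP32 bound, which is the behaviour a per-dtype tolerance framework is meant to expose.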

Phase 9.2: E2E Evaluation Validation (lm-eval Integration)

  • #43 Role: E2E lm-eval integration and benchmark validation owner.
    • Scope: 5 standard benchmark tasks via lm-evaluation-harness:
      • Categorical: MMLU (5-shot, 57 subjects), HellaSwag (10-shot), ARC-Challenge (25-shot)
      • Generation: GSM8K (8-shot chain-of-thought), TruthfulQA (0-shot)
    • Evaluation strategy:
      • Phase 1: Sanity subset (100-500 examples, ~30 min/task)
      • Phase 2: Full dataset (complete, ~4-8 hours total)
    • Comparison: dllm-plugin vLLM vs HuggingFace baseline vs SGlang (optional)
    • Tolerance bounds: ±1-2% categorical accuracy, ±2-3% generation exact match
    • Deliverables: lm-eval integration, test suite tests/test_lm_eval_integration.py, comparison tooling tools/compare_lm_eval_results.py, documentation docs/PHASE9.2_LMEVAL_RESULTS.md
    • Dependencies:

Phase 9 exit: All numerical validation passes (Phase 9.1); all lm-eval tasks within tolerance (Phase 9.2); discrepancies documented; correctness test suite established.
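
The Phase 9.2 tolerance check can be sketched as a per-task comparison of scores. This example assumes the bounds are absolute percentage points and uses the upper end of each stated band (2 for categorical, 3 for generation); the task keys, `TASK_KIND`, and `compare_scores` are hypothetical and unrelated to the actual `tools/compare_lm_eval_results.py`.

```python
# Upper end of the stated bands, interpreted as absolute percentage points.
TOLERANCE_PP = {"categorical": 2.0, "generation": 3.0}

# Hypothetical task keys mapped to the two categories named in this issue.
TASK_KIND = {
    "mmlu": "categorical",
    "hellaswag": "categorical",
    "arc_challenge": "categorical",
    "gsm8k": "generation",
    "truthfulqa": "generation",
}


def compare_scores(candidate, baseline):
    """Map each shared task to (delta in percentage points, within tolerance)."""
    report = {}
    for task in sorted(set(candidate) & set(baseline)):
        delta_pp = (candidate[task] - baseline[task]) * 100.0
        limit = TOLERANCE_PP[TASK_KIND[task]]
        report[task] = (delta_pp, abs(delta_pp) <= limit)
    return report


# Scores are fractions in [0, 1]; the numbers are made up for illustration.
report = compare_scores(
    {"mmlu": 0.70, "gsm8k": 0.55},
    {"mmlu": 0.71, "gsm8k": 0.60},
)
```

In this made-up example, `mmlu` differs by about one point and passes, while `gsm8k` differs by about five points and would be flagged as a discrepancy to document.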


PR sequencing and review contract

Preferred merge order (default)

  1. Phase 0 process/trackers (#18 "ROADMAP: link milestone issues when created", #2 "Track vLLM draft-token hook; pin minimum vLLM for MVP", #20 "Contributor PR checklist template for milestone") and Phase 1 contracts (#3 "Add config.py: DRAFT_SIZE, model IDs, feature flags", #6 "Define RemaskingPolicy protocol/ABC in remasking/base.py", #15 "Document field-mapping table for contributors (copy-friendly)").
  2. Phase 2 mock model + registration (#5 "Wire register() to register LLaDA2.0 architecture (bart-style)", #24 "Mock/stub registered model for stack testing (deterministic forward)"); no #11 "Attention path spike: FlexAttention / non-causal virtual chunks for LLaDA2.0" in Phase 2.
  3. Phase 3 remasking (#7 "Implement remasking/llada2_default.py (MVP default policy)", #13 "Connect model forward outputs to RemaskingPolicy inputs").
  4. Phase 4 runtime (#8 "DllmScheduler: spec_token_ids, DRAFT_SIZE, commit-0 rollback" -> #9 "Scheduler: draft grammar must not break dLLM blocks" and #10 "DllmWorker (WorkerBase): batch build, forward one block, take_draft_token_ids").
  5. Phase 5 validation (#4 "Implement validation.py: assert compatible scheduler/worker/model stack").
  6. Phase 6 tests/docs/integration on the mock stack (#16 "Unit tests: field mapping / remask contract (no vLLM required)", #14 "Operator doc: VLLM_PLUGINS, CLI flags, first-block spec_token_ids init", #17 "Integration test or manual checklist: serve LLaDA2.0 with plugin stack"). Core #4, #14, #16, #17 plus #31 "[Phase 4 Follow-up]: replace synthesized-logit remask bridge with model-score handoff" and #32 "[Phase 6 Follow-up]: add runtime-adapter integration smoke with concrete vLLM objects" landed in PR #33 (2026-04-29); extended semantics + HTTP smoke landed in PR #36 (2026-05-04).
  7. Phase 7 (#12 "LLaDA2.0 HF mapping + real vLLM model module (models/llada2.py)" + #11 "Attention path spike: FlexAttention / non-causal virtual chunks for LLaDA2.0"), then #25 (after Phase 6 mock evidence).
  8. Phase 8 (PR #38 "feat(phase7+8): Production LLaDA2.0 model + vLLM-native torch.compile optimization": torch.compile + benchmarking), delivered alongside Phase 7.
  9. Phase 9 (#39 "Phase 9: Output correctness validation"), after Phase 8 baselines exist.

Milestone PR checklist (required in PR description)


Milestone definition of done


Maintenance note

When adding, splitting, or closing milestone issues, update this issue's canonical phase map, dependency graph, and deep-dive in the same change to avoid drift.
