[MVP LLaDA2.0] Milestone orchestration: timeline, dependency graph, phased plan

## Summary

This issue is the orchestration source of truth for **[MVP LLaDA2.0](https://github.com/vllm-project/dllm-plugin/milestone/1)**.

It now includes:
- phase gates (entry/exit criteria),
- explicit dependency semantics (`HARD` vs `SOFT`),
- an issue-by-issue deep dive in phase order,
- PR sequencing and review contract,
- milestone definition of done,
- **mock-first MVP (phases 0–6):** **real LLaDA2 HF mapping** ([#12](https://github.com/vllm-project/dllm-plugin/issues/12)) and **attention work ([#11](https://github.com/vllm-project/dllm-plugin/issues/11))** are deferred to **Phase 7**—mock stack uses **[#24](https://github.com/vllm-project/dllm-plugin/issues/24)** only (no attention spike in Phase 2); **real-model integration** is **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)** (**nothing** in 2–6 **depends on** #25),
- **GPU Model Runner v2** expectations ([#10](https://github.com/vllm-project/dllm-plugin/issues/10), `VLLM_USE_V2_MODEL_RUNNER`) and **worker–runner** risk mitigations.

**Scope reference:** [DESIGN_MVP.md](https://github.com/vllm-project/dllm-plugin/blob/main/docs/DESIGN_MVP.md), [ROADMAP.md](https://github.com/vllm-project/dllm-plugin/blob/main/docs/ROADMAP.md), upstream [vllm#36155](https://github.com/vllm-project/vllm/issues/36155).

### Delivery status (mock-stack train)

**2026-04-28:** [PR #30](https://github.com/vllm-project/dllm-plugin/pull/30) merged. It **closed** [#8](https://github.com/vllm-project/dllm-plugin/issues/8) — Phase 4 scheduler/runtime path (initial decode wiring).

**2026-04-29:** [PR #33](https://github.com/vllm-project/dllm-plugin/pull/33) merged (squash `41042b4768dba20ad6dc1940b4c5ee42dee1408e`). It **closed** [#4](https://github.com/vllm-project/dllm-plugin/issues/4), [#14](https://github.com/vllm-project/dllm-plugin/issues/14), [#16](https://github.com/vllm-project/dllm-plugin/issues/16), [#17](https://github.com/vllm-project/dllm-plugin/issues/17), [#31](https://github.com/vllm-project/dllm-plugin/issues/31), [#32](https://github.com/vllm-project/dllm-plugin/issues/32) — strict validation, operator doc, unit/integration confidence, logits/remask policy follow-up, and CPU `EngineArgs` smoke with concrete vLLM objects.

**2026-04-30:** [PR #34](https://github.com/vllm-project/dllm-plugin/pull/34) merged. It **closed** [#9](https://github.com/vllm-project/dllm-plugin/issues/9) and [#10](https://github.com/vllm-project/dllm-plugin/issues/10) — grammar frontier / dLLM block safety and worker / v2 model-runner integration.

**2026-05-04:** [PR #36](https://github.com/vllm-project/dllm-plugin/pull/36) merged. It **closed** [#35](https://github.com/vllm-project/dllm-plugin/issues/35) — dLLM semantics tests, runtime `EngineCore` draft-hook alignment (opt-in `VLLM_DLLM_APPLY_ENGINE_CORE_DRAFT_HOOK`), `vllm serve` + `curl` HTTP smoke, and Helm GPU job wiring.

**Next toward Phase 7–9:** ship **[#12](https://github.com/vllm-project/dllm-plugin/issues/12)** (real LLaDA2 HF model), **[#11](https://github.com/vllm-project/dllm-plugin/issues/11)** (attention), then **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)** (real-model integration evidence). Phase **8** ([PR #38](https://github.com/vllm-project/dllm-plugin/pull/38)) delivers torch.compile optimization and benchmarking infrastructure. Phase **9** ([#39](https://github.com/vllm-project/dllm-plugin/issues/39)) validates output correctness against reference implementations. Keep **[#2](https://github.com/vllm-project/dllm-plugin/issues/2)** current for pin / upstream companion plumbing. Handle Phase **0** housekeeping ([#18](https://github.com/vllm-project/dllm-plugin/issues/18), [#20](https://github.com/vllm-project/dllm-plugin/issues/20)) as needed.

---

## Canonical phase map

| Phase | Goal | Issues |
|---|---|---|
| **0** | Tracking, orchestration, contributor process | [#18](https://github.com/vllm-project/dllm-plugin/issues/18), [#2](https://github.com/vllm-project/dllm-plugin/issues/2), [#20](https://github.com/vllm-project/dllm-plugin/issues/20) |
| **1** | Shared constants/interfaces/docs contracts | [#3](https://github.com/vllm-project/dllm-plugin/issues/3), [#6](https://github.com/vllm-project/dllm-plugin/issues/6), [#15](https://github.com/vllm-project/dllm-plugin/issues/15) |
| **2** | Mock/register model surface only (**#5**, **#24**) — no real LLaDA2, **no #11** | [#5](https://github.com/vllm-project/dllm-plugin/issues/5), [#24](https://github.com/vllm-project/dllm-plugin/issues/24) |
| **3** | Remasking behavior + model-policy wiring | [#7](https://github.com/vllm-project/dllm-plugin/issues/7), [#13](https://github.com/vllm-project/dllm-plugin/issues/13) |
| **4** | Runtime scheduler/worker decode path | [#8](https://github.com/vllm-project/dllm-plugin/issues/8), [#9](https://github.com/vllm-project/dllm-plugin/issues/9), [#10](https://github.com/vllm-project/dllm-plugin/issues/10) |
| **5** | Strict stack validation | [#4](https://github.com/vllm-project/dllm-plugin/issues/4) |
| **6** | Tests/docs/integration release confidence (**mock** stack) | [#16](https://github.com/vllm-project/dllm-plugin/issues/16), [#14](https://github.com/vllm-project/dllm-plugin/issues/14), [#17](https://github.com/vllm-project/dllm-plugin/issues/17) |
| **7** | Real **implementation** + **real-model integration** | [#12](https://github.com/vllm-project/dllm-plugin/issues/12), [#11](https://github.com/vllm-project/dllm-plugin/issues/11), **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)** (integration evidence on real weights—**after** Phase 6; **no downstream issue depends on #25**) |
| **8** | Benchmarking and runtime optimizations | [PR #38](https://github.com/vllm-project/dllm-plugin/pull/38) |
| **9** | Output correctness validation | [#39](https://github.com/vllm-project/dllm-plugin/issues/39) (orchestration), [#42](https://github.com/vllm-project/dllm-plugin/issues/42) (Phase 9.1 - Numerical), [#43](https://github.com/vllm-project/dllm-plugin/issues/43) (Phase 9.2 - lm-eval) |

*Phase **4** ([#8](https://github.com/vllm-project/dllm-plugin/issues/8), [#9](https://github.com/vllm-project/dllm-plugin/issues/9), [#10](https://github.com/vllm-project/dllm-plugin/issues/10)) closed via [PR #30](https://github.com/vllm-project/dllm-plugin/pull/30) (2026-04-28) and [PR #34](https://github.com/vllm-project/dllm-plugin/pull/34) (2026-04-30). Phase **5** [#4](https://github.com/vllm-project/dllm-plugin/issues/4) and Phase **6** [#14](https://github.com/vllm-project/dllm-plugin/issues/14), [#16](https://github.com/vllm-project/dllm-plugin/issues/16), [#17](https://github.com/vllm-project/dllm-plugin/issues/17); follow-ups [#31](https://github.com/vllm-project/dllm-plugin/issues/31), [#32](https://github.com/vllm-project/dllm-plugin/issues/32) closed via [PR #33](https://github.com/vllm-project/dllm-plugin/pull/33) (2026-04-29). [#35](https://github.com/vllm-project/dllm-plugin/issues/35) closed via [PR #36](https://github.com/vllm-project/dllm-plugin/pull/36) (2026-05-04) (extended semantics + HTTP serve smoke on GPU Helm).*

---

## Phase gates (entry/exit criteria)

| Phase | Entry criteria | Exit criteria |
|---|---|---|
| **0** | Milestone issues exist and are labeled `mvp-llada2` | Upstream tracker active, roadmap links synchronized, PR checklist issue created |
| **1** | Phase 0 tracking stable | Config constants + remasking interface + field map merged or stable enough for dependent stubs |
| **2** | Phase 1 contracts available | **Mock model (#24)** registered and usable for downstream wiring; **no** Phase 2 **#11** attention work; **no** requirement for real HF LLaDA2 forward (**#12 + #11 → Phase 7**) |
| **3** | Model and interface surfaces exist | Default policy implemented and model->policy handoff fields stabilized (mock tensors / **#24**) |
| **4** | Phase 2/3 interfaces stable enough to integrate | Scheduler-worker one-block path works with explicit grammar constraints; **DllmWorker** validated with **model runner v2** where applicable; worker–runner overrides stay minimal (see #10) |
| **5** | Runtime baseline from Phase 4 exists | Invalid stack combinations fail fast with actionable errors |
| **6** | Runtime + validation are available | Unit/doc/integration evidence complete and reproducible for the **mock** plugin stack (v2 runner per #10 where applicable) |
| **7** | Phase 6 mock MVP exit criteria met (or parallel planning) | **[#12](https://github.com/vllm-project/dllm-plugin/issues/12)** + **#11** shipped; **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)** complete (extends **#17** / real-weights proof; optional small PRs to **#10**/**#13** only if gaps found) |
| **8** | Phase 7 real model complete (#12, #11, #25 closed) | Performance baselines documented; torch.compile integration verified; GPU capability detection operational |
| **9.1** | Phase 8 benchmarks available | Layer-by-layer numerical validation passes; tolerance bounds documented; router precision (FP32 vs BF16) validated; expert load balancing analyzed |
| **9.2** | Phase 9.1 numerical validation complete | lm-eval integration working; all benchmark tasks evaluated; results match HF/SGLang within tolerance (±1-2% categorical, ±2-3% generation); correctness test suite established |

---

## Dependency semantics

- **HARD dependency**: downstream issue is not complete without upstream merge.
- **SOFT dependency**: downstream may proceed behind stubs/flags but must reconcile before phase exit.
- **Special case**: [#2](https://github.com/vllm-project/dllm-plugin/issues/2) is long-running and acts as a confidence gate for runtime finalization.
- **Phase flow (no "Phase 7 → Phase 4" completion deps):** Phases **2–6** complete the **mock** stack (**#24**, then #10/#13/#17, etc.). **Phase 4** issues (e.g. **#10**) are **done for MVP** when the **mock** path is validated—**not** when real weights exist. **Phase 7** adds **(1)** implementations **#12** / **#11** and **(2)** **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)** (real-model integration)—a **separate issue** that **depends on** mock evidence + **#12**/**#11** and **does not** block **#10**/**#13**/**#17** (nothing in phases 2–6 **depends on** #25).

```mermaid
flowchart LR
    hardUp[HardDependency] -->|"must merge first"| hardDown[DownstreamComplete]
    softUp[SoftDependency] -.->|"stub or flag allowed"| softDown[DownstreamCanProgress]
```
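
The two dependency kinds can be modeled as a small check. This is a minimal sketch, not plugin code: the `Dep` enum, the `DEPS` table, and both helper functions are hypothetical names, and it assumes a SOFT dependency counts as "reconciled" once its upstream issue is merged. The edges for #10 are taken from the Phase 4 deep dive in this issue.

```python
from enum import Enum


class Dep(Enum):
    HARD = "HARD"  # downstream is not complete without the upstream merge
    SOFT = "SOFT"  # downstream may proceed behind stubs/flags


# Hypothetical table: downstream issue -> {upstream issue: dependency kind}.
# The entries for #10 mirror the Phase 4 deep dive (HARD #8 and #24; SOFT #13, #2).
DEPS = {
    10: {8: Dep.HARD, 24: Dep.HARD, 13: Dep.SOFT, 2: Dep.SOFT},
}


def can_complete(issue, merged):
    """True when every HARD upstream of `issue` is in the `merged` set."""
    return all(
        upstream in merged
        for upstream, kind in DEPS.get(issue, {}).items()
        if kind is Dep.HARD
    )


def unreconciled_soft(issue, merged):
    """SOFT upstreams that still need reconciling before phase exit."""
    return sorted(
        upstream
        for upstream, kind in DEPS.get(issue, {}).items()
        if kind is Dep.SOFT and upstream not in merged
    )
```

With only #8 and #24 merged, `can_complete(10, {8, 24})` holds, while `unreconciled_soft(10, {8, 24})` still lists #2 and #13 as items to reconcile before the phase exits.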

---

## Milestone timeline (high-level)

```mermaid
timeline
    title MVP LLaDA2.0 phase sequence
    section Phase0
        TrackAndOrchestrate : #18 roadmap links
                           : #2 upstream hook and min vLLM
                           : #20 PR checklist template
    section Phase1
        ContractFoundation : #3 config constants
                           : #6 remasking interface
                           : #15 field mapping table
    section Phase2
        MockOnly : #5 registration
                 : #24 mock model
    section Phase3
        RemaskingWiring : #7 default policy
                        : #13 model to policy bridge
    section Phase4
        RuntimePath : #8 scheduler
                    : #9 grammar safety
                    : #10 worker integration
    section Phase5
        Validation : #4 strict stack checks
    section Phase6
        ShipConfidence : #16 unit tests
                       : #14 operator docs
                       : #17 integration checklist mock stack
    section Phase7
        RealLLaDA2 : #12 real HF model + forward
                   : #11 attention spike + implementation
        Integration : #25 real-model integration
    section Phase8
        Benchmarking : PR #38 torch.compile + GuideLLM
                     : GPU capability detection
                     : Performance baselines
    section Phase9
        Correctness : #39 orchestration
                    : #42 numerical validation
                    : #43 lm-eval integration
                    : SGLang + HF comparison
```

---

## Dependency graph (phase-aligned)

```mermaid
flowchart TB
    subgraph phase0 [Phase0]
        I2["#2"]
        I18["#18"]
        I20["#20"]
    end

    subgraph phase1 [Phase1]
        I3["#3"]
        I6["#6"]
        I15["#15"]
    end

    subgraph phase2 [Phase2]
        I5["#5"]
        I24["#24"]
    end

    subgraph phase7 [Phase7]
        I12["#12"]
        I11["#11"]
        I25["#25"]
    end

    subgraph phase3 [Phase3]
        I7["#7"]
        I13["#13"]
    end

    subgraph phase4 [Phase4]
        I8["#8"]
        I9["#9"]
        I10["#10"]
    end

    subgraph phase5 [Phase5]
        I4["#4"]
    end

    subgraph phase6 [Phase6]
        I16["#16"]
        I14["#14"]
        I17["#17"]
    end

    subgraph phase8 [Phase8]
        PR38["PR #38"]
    end

    subgraph phase9 [Phase9]
        I39["#39 (orchestration)"]
        I42["#42 (Phase 9.1)"]
        I43["#43 (Phase 9.2)"]
    end

    I6 --> I7
    I15 --> I16
    I5 --> I24
    I24 --> I13
    I7 --> I13
    I3 --> I8
    I3 --> I10
    I8 --> I9
    I8 --> I10
    I24 --> I10
    I13 -.-> I10
    I2 -.-> I8
    I2 -.-> I10
    I8 --> I4
    I10 --> I4
    I6 --> I16
    I8 --> I17
    I10 --> I17
    I14 --> I17
    I24 -.-> I12
    I17 --> I25
    I12 --> I25
    I11 --> I25
    I25 --> PR38
    I12 --> PR38
    I11 --> PR38
    I2 -.-> PR38
    PR38 --> I42
    I42 --> I43
    I43 --> I39
```
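
One way to sanity-check the graph is to transcribe its solid (HARD) arrows and let the standard library derive a valid merge order. The sketch below is illustrative only: the edge list is a hand transcription of the solid arrows in the diagram above (dashed SOFT arrows are left out), and `predecessors` and `upstream_closure` are hypothetical helpers. It also checks the claim that no Phase 2–6 issue depends on #25.

```python
from graphlib import TopologicalSorter

# (upstream, downstream) pairs transcribed from the solid arrows above.
HARD_EDGES = [
    ("#6", "#7"), ("#15", "#16"), ("#5", "#24"), ("#24", "#13"), ("#7", "#13"),
    ("#3", "#8"), ("#3", "#10"), ("#8", "#9"), ("#8", "#10"), ("#24", "#10"),
    ("#8", "#4"), ("#10", "#4"), ("#6", "#16"), ("#8", "#17"), ("#10", "#17"),
    ("#14", "#17"), ("#17", "#25"), ("#12", "#25"), ("#11", "#25"),
    ("#25", "PR #38"), ("#12", "PR #38"), ("#11", "PR #38"),
    ("PR #38", "#42"), ("#42", "#43"), ("#43", "#39"),
]


def predecessors(edges):
    """Map every node to the set of nodes it directly depends on."""
    preds = {}
    for upstream, downstream in edges:
        preds.setdefault(upstream, set())
        preds.setdefault(downstream, set()).add(upstream)
    return preds


def upstream_closure(node, preds):
    """All direct and transitive HARD upstreams of `node`."""
    seen, stack = set(), [node]
    while stack:
        for upstream in preds[stack.pop()]:
            if upstream not in seen:
                seen.add(upstream)
                stack.append(upstream)
    return seen


PREDS = predecessors(HARD_EDGES)

# Any topological order of the HARD edges is an acceptable merge order.
MERGE_ORDER = list(TopologicalSorter(PREDS).static_order())

# Phase 2-6 issues must never have #25 upstream of them.
MOCK_PHASE_ISSUES = ["#5", "#24", "#7", "#13", "#8", "#9", "#10",
                     "#4", "#16", "#14", "#17"]
assert all("#25" not in upstream_closure(i, PREDS) for i in MOCK_PHASE_ISSUES)
```

`TopologicalSorter` raises `CycleError` if an edit to the edge list introduces a cycle, which makes this a cheap guard when the graph changes.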

---

## Deep dive by phase

### Phase 0 - tracking and process enablement

**Goal:** Keep milestone discoverable, upstream-aligned, and contributor-friendly before heavy implementation.

- **[#18](https://github.com/vllm-project/dllm-plugin/issues/18) Role:** roadmap link maintenance.
  - **Scope:** keep ROADMAP references synced to active milestone issues.
  - **Dependencies:** HARD none; SOFT updates to #19 (this orchestration issue).
- **[#2](https://github.com/vllm-project/dllm-plugin/issues/2) Role:** upstream dependency tracker.
  - **Scope:** track hook landing and minimum vLLM pin text across docs/config.
  - **Dependencies:** HARD none; SOFT confidence gate for #8/#10/#17.
- **[#20](https://github.com/vllm-project/dllm-plugin/issues/20) Role:** contributor PR checklist template.
  - **Scope:** add milestone PR checklist structure and link from orchestration.
  - **Dependencies:** HARD #19 (this orchestration issue) exists; SOFT none.

**Phase 0 exit:** Upstream tracker active, roadmap linked, and PR checklist workflow available.

### Phase 1 - shared contracts

**Goal:** Define canonical names, constants, and interfaces used by all runtime work.

- **[#3](https://github.com/vllm-project/dllm-plugin/issues/3) Role:** source-of-truth config constants.
  - **Scope:** DRAFT_SIZE/model IDs/flags centralization.
  - **Dependencies:** HARD none; SOFT consumed by #6/#8/#10/#4.
- **[#6](https://github.com/vllm-project/dllm-plugin/issues/6) Role:** remasking interface contract.
  - **Scope:** protocol/ABC and invariants only.
  - **Dependencies:** HARD #3; SOFT #7/#16.
- **[#15](https://github.com/vllm-project/dllm-plugin/issues/15) Role:** field map reference.
  - **Scope:** contributor table for cross-component fields.
  - **Dependencies:** HARD none; SOFT #16 and #13.

**Phase 1 exit:** constants + interface + field map stabilized and referenced.

### Phase 2 - mock model surface only

**Goal:** Register **mock/stub** via **[#24](https://github.com/vllm-project/dllm-plugin/issues/24)** with **#5**. **#11 (attention)** is **not** in Phase 2—it is deferred to **Phase 7** with **[#12](https://github.com/vllm-project/dllm-plugin/issues/12)** (real model).

- **[#5](https://github.com/vllm-project/dllm-plugin/issues/5) Role:** registration wiring.
  - **Scope:** architecture registration (mock test id + eventual LLaDA2 id, or single id with flag), optional-vLLM safe behavior.
  - **Dependencies:** HARD #3; SOFT #4 alignment.
- **[#24](https://github.com/vllm-project/dllm-plugin/issues/24) Role:** mock registered model.
  - **Scope:** mock forward / deterministic outputs for #10/#13/#16. **Real HF LLaDA2** is **[#12](https://github.com/vllm-project/dllm-plugin/issues/12)** (Phase 7).
  - **Dependencies:** HARD #5; SOFT #13/#10.

**Phase 2 exit:** mock path is registered and usable; **no** attention spike (#11) and **no** real HF mapping in Phase 2.

### Phase 3 - remasking behavior and wiring

**Goal:** Implement default policy and stabilize model->policy handoff.

- **[#7](https://github.com/vllm-project/dllm-plugin/issues/7) Role:** default remasking implementation.
  - **Scope:** MVP default policy with tunable knobs.
  - **Dependencies:** HARD #6; SOFT #16.
- **[#13](https://github.com/vllm-project/dllm-plugin/issues/13) Role:** wiring contract bridge.
  - **Scope:** connect forward outputs to policy inputs; validate on **#24** in MVP phases; **#12** re-validation in **Phase 7**.
  - **Dependencies:** HARD **#24** and #7; SOFT #10.

**Phase 3 exit:** remasking behavior and handoff fields are explicit and testable (mock tensors first).

### Phase 4 - runtime integration

> **Delivered:** [#8](https://github.com/vllm-project/dllm-plugin/issues/8) closed — [PR #30](https://github.com/vllm-project/dllm-plugin/pull/30) (2026-04-28). [#9](https://github.com/vllm-project/dllm-plugin/issues/9), [#10](https://github.com/vllm-project/dllm-plugin/issues/10) closed — [PR #34](https://github.com/vllm-project/dllm-plugin/pull/34) (2026-04-30).

**Goal:** Deliver one-block decode execution path via scheduler + worker.

- **[#8](https://github.com/vllm-project/dllm-plugin/issues/8) Role:** scheduler semantics owner.
  - **Scope:** spec_token_ids handling, commit-0 rollback, first-block semantics.
  - **Dependencies:** HARD #3; SOFT #2, #9, #10.
- **[#9](https://github.com/vllm-project/dllm-plugin/issues/9) Role:** grammar safety overlay.
  - **Scope:** ensure grammar path cannot corrupt dLLM blocks.
  - **Dependencies:** HARD #8; SOFT #17.
- **[#10](https://github.com/vllm-project/dllm-plugin/issues/10) Role:** worker execution owner.
  - **Scope:** batch build, one-block forward, draft token path; **GPU Model Runner v2** (`VLLM_USE_V2_MODEL_RUNNER`); thin subclass of `Worker`; **worker–runner** risk mitigations (avoid large `execute_model` forks).
  - **Dependencies (Phase 4 MVP):** HARD #8 and **#24** (mock registered model); SOFT #13, #2. **Do not** treat **#12** / **#11** as Phase 4 completion criteria—those land in **Phase 7** with a **separate real-model integration** pass (see Phase 7 below).

**Phase 4 exit:** scheduler-worker path works with explicit grammar behavior and no silent field mismatch **on the mock model**; **v2** expectations documented/tested per #10; overrides preserve upstream `model_runner` benefits where possible.

### Phase 5 - strict validation

> **Delivered (2026-04-29):** [#4](https://github.com/vllm-project/dllm-plugin/issues/4) closed — [PR #33](https://github.com/vllm-project/dllm-plugin/pull/33).

**Goal:** Convert stack misuse into immediate actionable errors.

- **[#4](https://github.com/vllm-project/dllm-plugin/issues/4) Role:** compatibility gatekeeper.
  - **Scope:** reject invalid scheduler/worker/model combinations; validate **allowed** combinations including **test-only mock** registrations without weakening production safety.
  - **Dependencies:** HARD #8 and #10 readiness; SOFT stubs allowed earlier.

**Phase 5 exit:** invalid stack combinations fail fast with deterministic messages and test coverage.
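
The fail-fast behavior can be illustrated with a small allow-list check. This is a sketch under stated assumptions, not the validation shipped for #4: `StackValidationError`, `ALLOWED_STACKS`, `TEST_ONLY_MODELS`, `validate_stack`, and every component name except `DllmWorker` are hypothetical.

```python
class StackValidationError(ValueError):
    """Raised when a scheduler/worker/model combination is not supported."""


# Hypothetical identifiers; only `DllmWorker` is named in this issue.
ALLOWED_STACKS = {
    ("DllmScheduler", "DllmWorker", "llada2"),
    ("DllmScheduler", "DllmWorker", "mock-llada2"),
}
TEST_ONLY_MODELS = {"mock-llada2"}


def validate_stack(scheduler, worker, model, allow_test_models=False):
    """Raise immediately, with an actionable message, on an invalid stack."""
    if model in TEST_ONLY_MODELS and not allow_test_models:
        raise StackValidationError(
            f"model {model!r} is test-only; pass allow_test_models=True "
            "in test setups or select a production model"
        )
    if (scheduler, worker, model) not in ALLOWED_STACKS:
        allowed = ", ".join(sorted("/".join(s) for s in ALLOWED_STACKS))
        raise StackValidationError(
            f"unsupported stack {scheduler}/{worker}/{model}; "
            f"allowed combinations: {allowed}"
        )
```

Keeping the test-only gate separate from the allow-list lets mock registrations stay valid in tests without loosening the check for production configurations.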

### Phase 6 - ship confidence

> **Delivered (2026-04-29):** [#14](https://github.com/vllm-project/dllm-plugin/issues/14), [#16](https://github.com/vllm-project/dllm-plugin/issues/16), [#17](https://github.com/vllm-project/dllm-plugin/issues/17), [#31](https://github.com/vllm-project/dllm-plugin/issues/31), [#32](https://github.com/vllm-project/dllm-plugin/issues/32) closed — [PR #33](https://github.com/vllm-project/dllm-plugin/pull/33). Full GPU `LLM.generate` integration remains CUDA-gated/off-runner; CPU PR CI covers `EngineArgs` + strict validation.

> **Extended (2026-05-04):** [#35](https://github.com/vllm-project/dllm-plugin/issues/35) closed — [PR #36](https://github.com/vllm-project/dllm-plugin/pull/36): dLLM semantics regression tests, optional runtime `EngineCore` draft-hook patch for `vllm serve` on legacy wheels, `tools/e2e/serve_http_smoke.sh`, and Helm GPU job running HTTP smoke after pytest.

**Goal:** lock contract confidence and operator reproducibility.

- **[#16](https://github.com/vllm-project/dllm-plugin/issues/16) Role:** contract unit tests.
  - **Scope:** field mapping and remask behavior tests without full vLLM; **worker–runner** contract assumptions with mock data; **v1 vs v2** runner expectations where feasible.
  - **Dependencies:** HARD #6 and #15; SOFT #13 expansion.
- **[#14](https://github.com/vllm-project/dllm-plugin/issues/14) Role:** operator runbook.
  - **Scope:** VLLM_PLUGINS/CLI/first-block guidance and caveats.
  - **Dependencies:** HARD #8 and #10 stable behavior; SOFT #17 usage.
- **[#17](https://github.com/vllm-project/dllm-plugin/issues/17) Role:** final integration evidence.
  - **Scope:** integration test or reproducible manual checklist; **`VLLM_USE_V2_MODEL_RUNNER`** runbook vs v1 fallback; **mock stack** vs **real weights** path.
  - **Dependencies:** HARD #8, #10, #14; SOFT #16 and #2 confidence.

**Phase 6 exit:** reproducible evidence confirms **mock-stack** MVP end-to-end viability (real LLaDA2 optional **Phase 7**).

### Phase 7 - real LLaDA2 model, attention, and **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)**

**Goal:** Ship **[#12](https://github.com/vllm-project/dllm-plugin/issues/12)** and **#11**, then close **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)** (real-model integration on real weights). **#10** / **#13** / **#17** do **not** depend on **#25**—**#25** depends on them (mock baseline) plus **#12**/**#11**.

**Implementation (can overlap in time):**

- **[#12](https://github.com/vllm-project/dllm-plugin/issues/12) Role:** real LLaDA2 model surface.
  - **Scope:** HF config → vLLM module; real forward path; coordinate with **#11** in the same phase.
  - **Dependencies:** HARD #5; SOFT **Phase 6 mock baseline** (so integration is not blind to prior **#24** behavior).
- **[#11](https://github.com/vllm-project/dllm-plugin/issues/11) Role:** attention spike + implementation (deferred from Phase 2).
  - **Scope:** feasibility memo / prototype **and** chosen attention path for real LLaDA2; align with **#12**.
  - **Dependencies:** SOFT **#12**.

- **[#25](https://github.com/vllm-project/dllm-plugin/issues/25) Role:** real-model integration (checklist / evidence owner).
  - **Scope:** Extend **#17** for real weights; capture proof that **#10**/**#13** paths behave on real tensors; file targeted fixes against **#10**/**#13**/**#12**/**#11**/**#4** only if needed—**no issue** lists **#25** as a **HARD** prerequisite for mock phases.
  - **Dependencies:** **HARD:** Phase 6 mock exit (**#17**), **#12**, **#11**; **SOFT:** #2, #14.

**Phase 7 exit:** **#12** + **#11** complete; **#25** complete with real-weights integration evidence.

**Note:** Phase 7 completion and Phase 8 initial delivery occurred together via [PR #38](https://github.com/vllm-project/dllm-plugin/pull/38), which combined real model implementation with torch.compile optimization and benchmarking infrastructure.

### Phase 8 - benchmarking and runtime optimizations

> **Delivered:** [PR #38](https://github.com/vllm-project/dllm-plugin/pull/38) (combined with Phase 7 delivery)

**Goal:** Establish performance baselines and optimize runtime with vLLM-native compilation.

- **[PR #38](https://github.com/vllm-project/dllm-plugin/pull/38) Role:** torch.compile integration + benchmarking infrastructure.
  - **Scope:** vLLM-native `@support_torch_compile` decorator; GPU capability detection (A100/H100/B200); GuideLLM benchmark harness; baseline metrics documentation (TTFT, ITL, TPS).
  - **Delivered metrics:** 346 tokens/sec median, 522ms TTFT, 2.4ms ITL on A100-SXM4-40GB (vLLM 0.20.1).
  - **Documentation:** `PHASE8_BENCHMARKS.md`, `tools/benchmark_optimization.sh`, GPU capability detection via `dllm_plugin/gpu_capability.py`.
  - **Dependencies:** 
    - HARD: #12 + #11 + #25 (real model required for meaningful benchmarks)
    - SOFT: #2 (vLLM 0.20+ for torch.compile support)

**Phase 8 exit:** Baseline performance documented; torch.compile verified operational; GPU detection working; benchmarking reproducible via tools.
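
For readers unfamiliar with the three baseline metrics, the sketch below shows one common way to derive TTFT, ITL, and TPS for a single request from token arrival timestamps. It is an assumption-laden illustration: `request_metrics` is a hypothetical helper, and the definitions GuideLLM applies (for example how it aggregates across requests) may differ.

```python
def request_metrics(start, token_times):
    """Compute per-request latency metrics.

    start:       timestamp (seconds) when the request was sent
    token_times: timestamps (seconds) at which each output token arrived
    """
    if not token_times:
        raise ValueError("at least one output token is required")

    # Time to first token: delay before the first output token arrives.
    ttft = token_times[0] - start

    # Inter-token latency: mean gap between consecutive output tokens.
    gaps = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0

    # Throughput: output tokens divided by total request duration.
    tps = len(token_times) / (token_times[-1] - start)

    return {"ttft_ms": ttft * 1000.0, "itl_ms": itl * 1000.0, "tps": tps}
```

Because block decoding can deliver several tokens at the same instant, some gaps may be zero; a benchmark report would typically summarize these per-request values with a median across many requests.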

**Future iterations** (post-MVP):
- Phase 8.2: Single-pass attention (target: 10-20% TTFT improvement)
- Phase 8.3: CUTLASS FusedMoE (+15-30% TPS on A100)
- Phase 8.4: FlashInfer fused topk (+20-40% TPS on H100+)

### Phase 9 - output correctness validation

**Goal:** Validate numerical correctness and benchmark accuracy against reference implementations (HF, optionally SGLang).

**Sub-phases:**

**Phase 9.1: Numerical Validation (Incremental Layer-by-Layer)**

- **[#42](https://github.com/vllm-project/dllm-plugin/issues/42) Role:** Layer-by-layer numerical correctness validation owner.
  - **Scope:** 8 incremental validation points: Embedding → Attention (QKV, norms, computation) → MoE (router, group-limited routing, experts, scaling) → Decoder → Transformer stack → Final norm → LM head → E2E tokens→logits.
  - **Critical tests:**
    - Router precision: FP32 (default) vs BF16 (experimental `VLLM_LLADA2_BF16_ROUTER=1`)
    - Group-limited routing: 256 experts → 8 groups → top-4 groups → top-k experts
    - Routed scaling factor (2.5x) validation
    - Expert load balancing (no pathological bias)
  - **Validation methodology:** Extract intermediate tensors from HF/SGLang and compare them within tolerance bounds (FP32: atol=1e-5/rtol=1e-4, BF16: atol=1e-3/rtol=1e-2)
  - **Deliverables:** Test suite `tests/test_llada2_numerical_validation.py`, tolerance framework `dllm_plugin/validation_utils.py`, documentation `docs/PHASE9.1_NUMERICAL_VALIDATION.md`
  - **Dependencies:** 
    - HARD: Phase 8 complete (PR #38 merged)
    - HARD: GPU environment (A100), HF weights access
    - SOFT: SGLang installation for cross-reference
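
The tolerance bounds above can be applied with an allclose-style check. The sketch below is pure Python for illustration and is not the content of `dllm_plugin/validation_utils.py`; it assumes the two bounds combine as `|actual - reference| <= atol + rtol * |reference|`, the convention used by `numpy.allclose` and `torch.allclose`. The `TOLERANCES`, `max_violation`, and `within_tolerance` names are hypothetical.

```python
# (atol, rtol) pairs from the validation methodology above.
TOLERANCES = {
    "fp32": (1e-5, 1e-4),
    "bf16": (1e-3, 1e-2),
}


def max_violation(actual, reference, atol, rtol):
    """Largest value of |a - r| - (atol + rtol * |r|) over all elements.

    A result <= 0 means every element is within tolerance.
    """
    if len(actual) != len(reference):
        raise ValueError("shape mismatch between actual and reference")
    if not actual:
        raise ValueError("nothing to compare")
    return max(
        abs(a - r) - (atol + rtol * abs(r))
        for a, r in zip(actual, reference)
    )


def within_tolerance(actual, reference, dtype):
    """Check flattened tensors against the bounds for `dtype`."""
    atol, rtol = TOLERANCES[dtype]
    return max_violation(actual, reference, atol, rtol) <= 0.0
```

Reporting the worst violation rather than a bare boolean makes it easier to see which validation point drifts first as layers are added.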

**Phase 9.2: E2E Evaluation Validation (lm-eval Integration)**

- **[#43](https://github.com/vllm-project/dllm-plugin/issues/43) Role:** E2E lm-eval integration and benchmark validation owner.
  - **Scope:** 5 standard benchmark tasks via lm-evaluation-harness:
    - Categorical: MMLU (5-shot, 57 subjects), HellaSwag (10-shot), ARC-Challenge (25-shot)
    - Generation: GSM8K (8-shot chain-of-thought), TruthfulQA (0-shot)
  - **Evaluation strategy:**
    - Stage 1: Sanity subset (100-500 examples, ~30 min/task)
    - Stage 2: Full dataset (~4-8 hours total)
  - **Comparison:** dllm-plugin vLLM vs HuggingFace baseline vs SGLang (optional)
  - **Tolerance bounds:** ±1-2% categorical accuracy, ±2-3% generation exact match
  - **Deliverables:** lm-eval integration, test suite `tests/test_lm_eval_integration.py`, comparison tooling `tools/compare_lm_eval_results.py`, documentation `docs/PHASE9.2_LMEVAL_RESULTS.md`
  - **Dependencies:**
    - HARD: Phase 9.1 complete (#42)
    - HARD: lm-eval>=0.4.0, GPU environment, HF weights
    - SOFT: SGLang for cross-reference
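
A comparison along these lines could be scripted as below. This is a sketch, not `tools/compare_lm_eval_results.py`: the task keys are illustrative labels rather than confirmed lm-eval task names, scores are assumed to be fractions in [0, 1], and the tolerance is interpreted as absolute percentage points using the loose end of each stated range.

```python
# Assumed interpretation: absolute percentage points, loose end of each range.
TOLERANCE_PP = {"categorical": 2.0, "generation": 3.0}

# Illustrative task labels grouped as in the scope above.
TASK_KIND = {
    "mmlu": "categorical",
    "hellaswag": "categorical",
    "arc_challenge": "categorical",
    "gsm8k": "generation",
    "truthfulqa": "generation",
}


def compare_scores(candidate, baseline):
    """Return {task: (delta_in_points, within_tolerance)} for every task.

    candidate / baseline: {task: score} with scores as fractions in [0, 1].
    """
    report = {}
    for task, kind in TASK_KIND.items():
        delta = (candidate[task] - baseline[task]) * 100.0
        report[task] = (round(delta, 2), abs(delta) <= TOLERANCE_PP[kind])
    return report


def failing_tasks(report):
    """Tasks whose delta falls outside the tolerance for their kind."""
    return sorted(task for task, (_, ok) in report.items() if not ok)
```

A missing task raises `KeyError`, so an incomplete evaluation run cannot silently pass the comparison.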

**Phase 9 exit:** All numerical validation passes (Phase 9.1); all lm-eval tasks within tolerance (Phase 9.2); discrepancies documented; correctness test suite established.

---

## PR sequencing and review contract

### Preferred merge order (default)

1. Phase 0 process/trackers (#18/#2/#20) and Phase 1 contracts (#3/#6/#15).
2. Phase 2 mock model + registration (#5, **#24**) — **no #11** in Phase 2.
3. Phase 3 remasking (#7/#13).
4. Phase 4 runtime (#8 -> #9 and #10).
5. Phase 5 validation (#4).
6. Phase 6 tests/docs/integration (#16/#14/#17) — **mock** stack *(core #4/#14/#16/#17/#31/#32 in [PR #33](https://github.com/vllm-project/dllm-plugin/pull/33), 2026-04-29; extended semantics + HTTP smoke in [PR #36](https://github.com/vllm-project/dllm-plugin/pull/36), 2026-05-04)*.
7. **Phase 7** **#12** + **#11**, then **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)** (after Phase 6 mock evidence).
8. **Phase 8** PR #38 (torch.compile + benchmarking), delivered alongside Phase 7.
9. **Phase 9** #42 (numerical validation), then #43 (lm-eval), tracked under #39, after Phase 8 baselines exist.

### Milestone PR checklist (required in PR description)

- Issue and phase: `Closes #<n>; Phase <0-9>` (use **Phase 7** for **#12** / **#11** / **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)**—**not** Phase 2)
- Dependencies: `HARD` and `SOFT` status
- Scope statement: in scope / out of scope
- Validation evidence: tests/manual output/checklist
- Docs impact: updated or N/A with reason
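
The checklist lends itself to a lightweight automated check on PR descriptions. The sketch below is hypothetical tooling, not an existing repository script: `check_pr_description` and its constants are invented for illustration, and it assumes each checklist item appears in the description with the exact label used above.

```python
import re

# Matches the first checklist item, e.g. "Closes #12; Phase 7".
HEADER_RE = re.compile(r"Closes #(\d+); Phase (\d)\b")

REQUIRED_LABELS = (
    "Dependencies:",
    "Scope statement:",
    "Validation evidence:",
    "Docs impact:",
)

# Issues that must be filed under Phase 7, not Phase 2.
PHASE7_ISSUES = {11, 12, 25}


def check_pr_description(text):
    """Return a list of human-readable problems; empty means compliant."""
    problems = []
    match = HEADER_RE.search(text)
    if match is None:
        problems.append("missing 'Closes #<n>; Phase <0-9>' line")
    else:
        issue, phase = int(match.group(1)), int(match.group(2))
        if issue in PHASE7_ISSUES and phase != 7:
            problems.append(f"#{issue} belongs to Phase 7, not Phase {phase}")
    for label in REQUIRED_LABELS:
        if label not in text:
            problems.append(f"missing checklist item: {label}")
    return problems
```

Returning every problem at once, rather than stopping at the first, gives contributors a complete list to fix in one pass.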

---

## Milestone definition of done

- [ ] Every milestone issue has explicit role/scope/dependencies/acceptance checks.
- [x] Phase names are consistent across phase map, timeline, dependency graph, and deep-dive (**0–9**).
- [x] Runtime path (Phase **4**: [#8](https://github.com/vllm-project/dllm-plugin/issues/8), [#9](https://github.com/vllm-project/dllm-plugin/issues/9), [#10](https://github.com/vllm-project/dllm-plugin/issues/10), including **v2 runner** expectations per #10) is complete and coherent for the **mock** MVP (**#24**) — **delivered** [PR #30](https://github.com/vllm-project/dllm-plugin/pull/30) + [PR #34](https://github.com/vllm-project/dllm-plugin/pull/34).
- [x] Validation path ([#4](https://github.com/vllm-project/dllm-plugin/issues/4)) is complete for the **mock** MVP per [PR #33](https://github.com/vllm-project/dllm-plugin/pull/33). **#12 / Phase 7** tracked separately for real LLaDA2.
- [x] Operator docs ([#14](https://github.com/vllm-project/dllm-plugin/issues/14)), unit tests ([#16](https://github.com/vllm-project/dllm-plugin/issues/16)), integration evidence ([#17](https://github.com/vllm-project/dllm-plugin/issues/17)), and follow-ups ([#31](https://github.com/vllm-project/dllm-plugin/issues/31), [#32](https://github.com/vllm-project/dllm-plugin/issues/32)) for the **mock** stack — **delivered** [PR #33](https://github.com/vllm-project/dllm-plugin/pull/33); extended semantics / `vllm serve` HTTP smoke — [PR #36](https://github.com/vllm-project/dllm-plugin/pull/36) / [#35](https://github.com/vllm-project/dllm-plugin/issues/35). **Phase 7** completes **#12**/**#11** and **[#25](https://github.com/vllm-project/dllm-plugin/issues/25)** (real-weights proof).
- [ ] Phase 8 benchmarking complete: torch.compile operational, baselines documented in PHASE8_BENCHMARKS.md, GPU detection verified
- [ ] Phase 9.1 numerical validation complete: all 8 validation points pass, tolerance bounds documented, router precision validated
- [ ] Phase 9.2 lm-eval integration complete: all 5 tasks evaluated, results within tolerance (±1-2% categorical, ±2-3% generation), comparison reports merged
- [ ] Upstream dependency tracker ([#2](https://github.com/vllm-project/dllm-plugin/issues/2)) reflects current minimum-version confidence for MVP.

---

## Maintenance note

When adding, splitting, or closing milestone issues, update this issue's canonical phase map, dependency graph, and deep-dive in the same change to avoid drift.


