Overview
Phase 9 focuses on validating the numerical correctness of LLaDA2.0 model outputs after Phase 7 (virtual batch attention) and Phase 8 (torch.compile optimization) implementation.
Milestone: Phase 9 - Correctness Validation
Blocked by: PR #38 (Phase 7+8 implementation)
Priority: P0 (Required before production deployment)
Background
Phase 7/8 implementation passes structural validation (API contracts, shape checks) but does not validate output correctness. Integration tests verify that the model generates something, not that it generates the right thing.
Risk: The model could be generating nonsense and the existing tests would still pass.
Scope
Required Validations
- Numerical Precision Validation
  - Compare logits/tensors against reference implementation
  - Verify attention patterns match specification
  - Validate MoE routing decisions are correct
  - Check block-style attention mask geometry
- lm-eval Integration
  - Run standard benchmark tasks (MMLU, HellaSwag, etc.)
  - Compare scores against baseline
  - Verify generation quality metrics
- Reference Implementation Comparison
  - SGLang: Compare against SGLang LLaDA2.0 implementation
  - HuggingFace: Compare against HF Transformers baseline
  - Match outputs token-for-token on fixed inputs
- Regression Testing
  - Establish golden outputs for key test cases
  - Add snapshot tests to CI
  - Prevent silent correctness degradation
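As an illustration of the "block-style attention mask geometry" check, here is a minimal sketch in plain Python. It assumes a block-causal layout in which each position attends to every position in its own block and in all earlier blocks; that layout, the function names, and the nested-list representation are assumptions for illustration, and the actual LLaDA2.0 specification may differ.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Build the assumed reference mask: mask[q][k] is True when query q may attend to key k."""
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def check_mask_geometry(mask: list[list[bool]], block_size: int) -> list[str]:
    """Return human-readable violations of the assumed block-causal layout (empty if none)."""
    problems = []
    n = len(mask)
    for q, row in enumerate(mask):
        if len(row) != n:
            problems.append(f"row {q} has length {len(row)}, expected {n}")
            continue
        for k, allowed in enumerate(row):
            # A key is visible when its block index does not exceed the query's block index.
            expected = (k // block_size) <= (q // block_size)
            if allowed != expected:
                problems.append(f"mask[{q}][{k}] is {allowed}, expected {expected}")
    return problems
```

In a real test the mask under inspection would come from the model's attention code rather than from `block_causal_mask`; returning a list of violations instead of a single boolean makes a failing assertion easier to read.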
Deliverables
1. Numerical Validation Tests
File: tests/test_llada2_correctness.py (NEW)
import pytest


@pytest.mark.dllm_correctness
class TestLLaDA2Correctness:
    def test_attention_output_vs_reference(self):
        """Compare attention outputs against known-good reference."""
        pass

    def test_moe_routing_vs_spec(self):
        """Verify MoE routing matches specification."""
        pass

    def test_logits_precision_fp16(self):
        """Check numerical precision of logits."""
        pass
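One way the logits precision check could be filled in is an elementwise tolerance comparison. The sketch below works on flat sequences of floats and uses the same `|a - r| <= atol + rtol * |r|` form as `numpy.isclose`; the helper name and the default tolerances are placeholders, not values taken from this issue, and appropriate fp16 tolerances would need to be chosen empirically.

```python
import math


def compare_logits(actual, reference, atol=1e-3, rtol=1e-2):
    """Compare two equal-length float sequences and summarize the differences.

    atol and rtol are placeholder tolerances (assumptions, not project values).
    """
    if len(actual) != len(reference):
        raise ValueError(f"length mismatch: {len(actual)} vs {len(reference)}")

    max_abs_diff = 0.0
    num_mismatched = 0
    for a, r in zip(actual, reference):
        # NaN never matches; infinities match only an identical infinity.
        if not (math.isfinite(a) and math.isfinite(r)):
            if a != r:
                num_mismatched += 1
            continue
        diff = abs(a - r)
        max_abs_diff = max(max_abs_diff, diff)
        if diff > atol + rtol * abs(r):
            num_mismatched += 1

    return {
        "max_abs_diff": max_abs_diff,
        "num_mismatched": num_mismatched,
        "passed": num_mismatched == 0,
    }
```

Reporting the maximum absolute difference alongside the pass/fail flag helps distinguish ordinary fp16 rounding noise from a genuinely wrong computation when a test fails.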
2. lm-eval Integration
File: tools/run_lm_eval.sh (NEW)
#!/bin/bash
# Run lm-evaluation-harness on LLaDA2.0 model
lm_eval --model vllm \
    --model_args pretrained=inclusionAI/LLaDA2.0-mini,trust_remote_code=True \
    --tasks mmlu,hellaswag,arc_easy \
    --device cuda:0 \
    --batch_size 1
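The "compare scores against baseline" step could be automated with a small helper like the sketch below. It assumes the per-task scores have already been extracted into a plain `{task: score}` dictionary; the function name and the `max_drop` threshold are hypothetical, and no real baseline numbers are implied.

```python
def compare_to_baseline(scores, baseline, max_drop=0.01):
    """Return {task: reason} for every baseline task that is missing or regressed.

    scores and baseline map task names to accuracy-like values in [0, 1].
    max_drop is a placeholder threshold (an assumption, not a project value).
    """
    failures = {}
    for task, expected in baseline.items():
        if task not in scores:
            failures[task] = "missing from results"
        elif scores[task] < expected - max_drop:
            failures[task] = (
                f"{scores[task]:.4f} is below baseline {expected:.4f} "
                f"by more than {max_drop}"
            )
    return failures
```

An empty result means every baseline task was evaluated and none dropped by more than the allowed margin, so a CI step can simply fail when the returned dictionary is non-empty.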
3. Reference Comparison Tests
File: tests/test_llada2_reference_comparison.py (NEW)
import pytest


@pytest.mark.dllm_reference
class TestLLaDA2ReferenceComparison:
    def test_vs_sglang_implementation(self):
        """Compare outputs with SGLang LLaDA2.0."""
        pass

    def test_vs_huggingface_transformers(self):
        """Compare outputs with HF reference."""
        pass

    def test_golden_outputs(self):
        """Verify outputs match saved golden snapshots."""
        pass
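For the token-for-token and golden-snapshot checks, a rough sketch using only the standard library might look like this. The JSON layout, the `golden_dir` location, and both helper names are assumptions made for illustration; the token IDs would come from a fixed prompt with deterministic decoding settings.

```python
import json
from pathlib import Path


def first_divergence(actual, expected):
    """Return the index of the first differing token, or None if the sequences are identical."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        # One sequence is a strict prefix of the other.
        return min(len(actual), len(expected))
    return None


def check_golden(name, token_ids, golden_dir, update=False):
    """Compare token_ids with the stored snapshot, or rewrite the snapshot when update=True.

    Raises FileNotFoundError if the snapshot is missing and update is False,
    so a forgotten snapshot cannot silently pass in CI.
    """
    path = Path(golden_dir) / f"{name}.json"
    if update:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps({"token_ids": list(token_ids)}))
        return None
    expected = json.loads(path.read_text())["token_ids"]
    return first_divergence(list(token_ids), expected)
```

Returning the first divergent index rather than a boolean shows where two implementations start to disagree, which is usually the most useful starting point for debugging.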
4. CI Integration
- Add correctness pytest marker
- Run on every PR (with caching for speed)
- Fail PR if correctness tests don't pass
- Document expected accuracy ranges
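Registering the markers could be done in a `conftest.py` along the lines of the sketch below. The marker names `dllm_correctness` and `dllm_reference` are taken from the test stubs in this issue (the list above says "correctness", so the final name is still an open choice), and the descriptions are placeholders.

```python
# conftest.py (sketch)

# Marker names follow the decorators used in the test stubs; descriptions are placeholders.
MARKERS = {
    "dllm_correctness": "numerical correctness checks for LLaDA2.0",
    "dllm_reference": "comparisons against reference implementations",
}


def pytest_configure(config):
    """Register custom markers so that `pytest --strict-markers` accepts them."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, the correctness suite can be selected in CI with `pytest -m "dllm_correctness or dllm_reference"`.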
Success Criteria
Non-Goals (Out of Scope)
- Performance optimization (covered in Phase 8)
- Multi-request batching (deferred to Phase 7.1)
- Custom attention kernels (future work)
Dependencies
Timeline
Estimated effort: 3-5 days
Target completion: 1 week after PR #38 merge
Breakdown:
- Day 1: Set up lm-eval integration and baseline
- Day 2-3: Implement numerical validation tests
- Day 3-4: Reference comparison (SGLang/HF)
- Day 4-5: Golden snapshots and CI integration
References
- .claude/plans/let-s-plan-phase-7-agile-mochi.md
Related Issues
This issue is required before declaring Phase 7+8 production-ready. Phase 7/8 PR (#38) can merge with structural validation only, but this issue must be completed before production deployment.