Phase 9: Output Correctness Validation and Reference Comparison #40

## Overview

Phase 9 validates the **numerical correctness** of LLaDA2.0 model outputs once Phase 7 (virtual batch attention) and Phase 8 (`torch.compile` optimization) are implemented.

**Milestone:** Phase 9 - Correctness Validation  
**Blocked by:** PR #38 (Phase 7+8 implementation)  
**Priority:** P0 (Required before production deployment)

## Background

The Phase 7/8 implementation passes structural validation (API contracts, shape checks), but nothing validates output correctness. Integration tests verify that the model generates *something*, not that it generates the *right thing*.

**Risk:** The model could generate nonsense and the existing tests would still pass.

## Scope

### Required Validations

1. **Numerical Precision Validation**
   - Compare logits/tensors against reference implementation
   - Verify attention patterns match specification
   - Validate MoE routing decisions are correct
   - Check block-style attention mask geometry

2. **lm-eval Integration**
   - Run standard benchmark tasks (MMLU, HellaSwag, etc.)
   - Compare scores against baseline
   - Verify generation quality metrics

3. **Reference Implementation Comparison**
   - **SGlang:** Compare against SGlang LLaDA2.0 implementation
   - **HuggingFace:** Compare against HF Transformers baseline
   - Match outputs token-for-token on fixed inputs

4. **Regression Testing**
   - Establish golden outputs for key test cases
   - Add snapshot tests to CI
   - Prevent silent correctness degradation
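
The block-style mask geometry check in item 1 could start from a small reference builder like the one below. This is a minimal sketch that assumes a block-causal layout (full attention inside a block, causal attention across blocks); the actual LLaDA2.0 specification should be confirmed before treating it as ground truth. `block_attention_mask` and `block_size` are illustrative names, not existing code.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query i may attend to key j.

    Assumed geometry (not confirmed by the spec): bidirectional inside
    a block, causal across blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Print the mask for 6 tokens in blocks of 2 ("x" = may attend).
mask = block_attention_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

A test could then compare the mask the model actually builds against this reference element by element.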

## Deliverables

### 1. Numerical Validation Tests

**File:** `tests/test_llada2_correctness.py` (NEW)

```python
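import pytest


# Placeholders skip instead of passing, so unimplemented checks stay visible.
@pytest.mark.dllm_correctness
class TestLLaDA2Correctness:
    def test_attention_output_vs_reference(self):
        """Compare attention outputs against known-good reference."""
        pytest.skip("TODO: not implemented yet")

    def test_moe_routing_vs_spec(self):
        """Verify MoE routing matches specification."""
        pytest.skip("TODO: not implemented yet")

    def test_logits_precision_fp16(self):
        """Check numerical precision of logits."""
        pytest.skip("TODO: not implemented yet")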
```
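
For `test_moe_routing_vs_spec`, one possible shape is to compare the set of experts selected per token against a reference run, without assuming how the router scores experts. The helper below is a hypothetical sketch; `routing_mismatches` and the list-of-lists input format are assumptions made for illustration.

```python
def routing_mismatches(
    actual: list[list[int]], expected: list[list[int]]
) -> list[int]:
    """Return indices of tokens whose selected expert sets differ.

    Each entry is the list of expert IDs chosen for one token. Order
    within a token is ignored, since top-k order is not always stable.
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must cover the same tokens")
    return [
        i
        for i, (a, e) in enumerate(zip(actual, expected))
        if set(a) != set(e)
    ]


# Illustrative values only: token 2 routes to expert 5 instead of 6.
actual = [[0, 3], [1, 2], [4, 5]]
expected = [[3, 0], [1, 2], [4, 6]]
print(routing_mismatches(actual, expected))
```

Router scores that are nearly tied may flip under reduced precision, so a test might allow a small mismatch budget rather than requiring an empty list.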

### 2. lm-eval Integration

**File:** `tools/run_lm_eval.sh` (NEW)

```bash
#!/bin/bash
# Run lm-evaluation-harness on LLaDA2.0 model
set -euo pipefail  # stop on the first error so a failed run is not mistaken for a result

lm_eval --model vllm \
  --model_args pretrained=inclusionAI/LLaDA2.0-mini,trust_remote_code=True \
  --tasks mmlu,hellaswag,arc_easy \
  --device cuda:0 \
  --batch_size 1
```
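
To compare scores against a baseline, the results from a run could be checked with a small helper. The sketch below assumes a results dictionary shaped like `{"results": {task: {"acc,none": value}}}`, which is how recent lm-evaluation-harness versions appear to lay out their output; verify the metric key against the version in use. The scores and baselines are made-up numbers for illustration, not measured results.

```python
def find_score_regressions(
    results: dict,
    baseline: dict[str, float],
    tolerance: float = 0.01,
    metric: str = "acc,none",  # assumed metric key; check your lm-eval version
) -> dict[str, tuple[float, float]]:
    """Return {task: (score, baseline)} for tasks below baseline - tolerance."""
    regressions = {}
    for task, expected in baseline.items():
        score = results["results"][task][metric]
        if score < expected - tolerance:
            regressions[task] = (score, expected)
    return regressions


# Placeholder numbers, not real LLaDA2.0-mini scores.
results = {
    "results": {
        "hellaswag": {"acc,none": 0.50},
        "arc_easy": {"acc,none": 0.70},
    }
}
baseline = {"hellaswag": 0.55, "arc_easy": 0.70}
print(find_score_regressions(results, baseline))
```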

### 3. Reference Comparison Tests

**File:** `tests/test_llada2_reference_comparison.py` (NEW)

```python
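import pytest


# Placeholders skip instead of passing, so unimplemented checks stay visible.
@pytest.mark.dllm_reference
class TestLLaDA2ReferenceComparison:
    def test_vs_sglang_implementation(self):
        """Compare outputs with SGlang LLaDA2.0."""
        pytest.skip("TODO: not implemented yet")

    def test_vs_huggingface_transformers(self):
        """Compare outputs with HF reference."""
        pytest.skip("TODO: not implemented yet")

    def test_golden_outputs(self):
        """Verify outputs match saved golden snapshots."""
        pytest.skip("TODO: not implemented yet")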
```
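
`test_golden_outputs` could be backed by a small snapshot helper. The sketch below assumes snapshots are stored as JSON lists of token IDs, one file per test case; the `check_golden` name, the file layout, and the `update` flag are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_golden(
    name: str, token_ids: list[int], golden_dir: Path, update: bool = False
) -> None:
    """Compare token_ids with the stored snapshot, or rewrite it if update=True."""
    path = Path(golden_dir) / f"{name}.json"
    if update:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(token_ids))
        return
    # A missing snapshot raises FileNotFoundError rather than passing silently.
    expected = json.loads(path.read_text())
    assert token_ids == expected, f"golden mismatch for {name!r}"


# Usage: record a snapshot once, then verify later runs against it.
with tempfile.TemporaryDirectory() as tmp:
    check_golden("prompt_0", [101, 7, 42], Path(tmp), update=True)
    check_golden("prompt_0", [101, 7, 42], Path(tmp))  # matches, no error
```

Snapshots should only be regenerated deliberately (for example behind an explicit flag), otherwise a regression could be recorded as the new baseline.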

### 4. CI Integration

- Register the `dllm_correctness` and `dllm_reference` pytest markers used by the test files above
- Run on every PR (with caching for speed)
- Fail the PR if any correctness test fails
- Document expected accuracy ranges
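
The markers used in the test skeletons (`dllm_correctness`, `dllm_reference`) could be registered in `conftest.py` via pytest's `pytest_configure` hook. This is a sketch; the marker descriptions are placeholders.

```python
# conftest.py (sketch)
DLLM_MARKERS = {
    "dllm_correctness": "numerical correctness checks for LLaDA2.0 outputs",
    "dllm_reference": "comparisons against SGlang / HF reference implementations",
}


def pytest_configure(config):
    """Register custom markers so `pytest --strict-markers` accepts them."""
    for name, description in DLLM_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

CI could then select these tests with `pytest -m "dllm_correctness or dllm_reference"`.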

## Success Criteria

- [ ] Logits match reference implementation within tolerance (< 1e-3 difference)
- [ ] Attention patterns verified to match block-style specification
- [ ] lm-eval scores within expected range for LLaDA2.0-mini
- [ ] Token-for-token match with reference on fixed inputs
- [ ] Golden output snapshots established and passing in CI
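
The first and fourth criteria could be checked with helpers along these lines. The sketch treats the `1e-3` tolerance as a maximum absolute difference, which is an assumption (the criterion does not say absolute or relative), and works on flat Python sequences for clarity. For real tensors, `torch.testing.assert_close` with explicit `atol`/`rtol` is likely the more practical tool. The function names are illustrative.

```python
from typing import Optional, Sequence


def max_abs_diff(actual: Sequence[float], expected: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def first_divergence(
    actual: Sequence[int], expected: Sequence[int]
) -> Optional[int]:
    """Index of the first differing token, or None if the sequences match."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        # One sequence is a strict prefix of the other.
        return min(len(actual), len(expected))
    return None


# Illustrative values only.
assert max_abs_diff([0.10, -2.50, 3.2004], [0.10, -2.50, 3.2000]) < 1e-3
print(first_divergence([5, 9, 2, 7], [5, 9, 3, 7]))
```

Reporting the first divergent index, rather than a bare pass/fail, makes it easier to tell an early routing or masking bug from a late numerical drift.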

## Non-Goals (Out of Scope)

- Performance optimization (covered in Phase 8)
- Multi-request batching (deferred to Phase 7.1)
- Custom attention kernels (future work)

## Dependencies

- **Blocked by:** PR #38 merge (Phase 7+8 implementation)
- **Requires:** Access to reference implementations (SGlang, HF)
- **Requires:** GPU for lm-eval and correctness tests

## Timeline

**Estimated effort:** 3-5 days  
**Target completion:** 1 week after PR #38 merge

**Breakdown:**
- Day 1: Set up lm-eval integration and baseline
- Day 2-3: Implement numerical validation tests
- Day 3-4: Reference comparison (SGlang/HF)
- Day 4-5: Golden snapshots and CI integration

## References

- **lm-evaluation-harness:** https://github.com/EleutherAI/lm-evaluation-harness
- **SGlang:** https://github.com/sgl-project/sglang
- **Phase 7 Implementation:** PR #38
- **Phase 7 Plan:** `.claude/plans/let-s-plan-phase-7-agile-mochi.md`

## Related Issues

- #12 - Real LLaDA2.0 Model Implementation (Phase 7)
- #11 - Block-Style Attention (Phase 7)
- #25 - GPU Integration Testing (Phase 7)
- #19 - Milestone: Phase 7+8 completion

---

**This issue must be completed before Phase 7+8 is declared production-ready.** The Phase 7/8 PR (#38) can merge with structural validation only, but production deployment has to wait for the work described here.
