Phase 9: Output Correctness Validation and Reference Comparison #40

@AlonKellner-RedHat

Overview

Phase 9 validates the numerical correctness of LLaDA2.0 model outputs once the Phase 7 (virtual batch attention) and Phase 8 (torch.compile optimization) implementations are in place.

Milestone: Phase 9 - Correctness Validation
Blocked by: PR #38 (Phase 7+8 implementation)
Priority: P0 (Required before production deployment)

Background

The Phase 7/8 implementation passes structural validation (API contracts, shape checks) but does not validate output correctness. The integration tests verify that the model generates something, not that it generates the right thing.

Risk: The model could be generating nonsense and the tests would still pass.

Scope

Required Validations

  1. Numerical Precision Validation

    • Compare logits/tensors against reference implementation
    • Verify attention patterns match specification
    • Validate MoE routing decisions are correct
    • Check block-style attention mask geometry
  2. lm-eval Integration

    • Run standard benchmark tasks (MMLU, HellaSwag, etc.)
    • Compare scores against baseline
    • Verify generation quality metrics
  3. Reference Implementation Comparison

    • SGLang: Compare against the SGLang LLaDA2.0 implementation
    • HuggingFace: Compare against HF Transformers baseline
    • Match outputs token-for-token on fixed inputs
  4. Regression Testing

    • Establish golden outputs for key test cases
    • Add snapshot tests to CI
    • Prevent silent correctness degradation
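
The MoE routing check in item 1 could be framed as comparing the experts the implementation selected with the experts a plain top-k selection picks from the same router logits. The sketch below assumes top-k routing with ties broken toward the lower expert index; this issue confirms neither detail, and `reference_topk_experts` and `routing_matches` are hypothetical helper names.

```python
from typing import List, Sequence


def reference_topk_experts(router_logits: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest router logits, highest first.

    Assumption: ties go to the lower expert index.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError(f"k must be in 1..{len(router_logits)}, got {k}")
    order = sorted(range(len(router_logits)), key=lambda e: (-router_logits[e], e))
    return order[:k]


def routing_matches(selected: Sequence[int], router_logits: Sequence[float]) -> bool:
    """True if `selected` is the same expert set as the reference top-k (order ignored)."""
    reference = reference_topk_experts(router_logits, len(selected))
    return set(selected) == set(reference)
```

Comparing sets rather than ordered lists keeps the check independent of how an implementation orders the selected experts for one token.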

Deliverables

1. Numerical Validation Tests

File: tests/test_llada2_correctness.py (NEW)

import pytest


@pytest.mark.dllm_correctness
class TestLLaDA2Correctness:
    def test_attention_output_vs_reference(self):
        """Compare attention outputs against known-good reference."""
        pass

    def test_moe_routing_vs_spec(self):
        """Verify MoE routing matches specification."""
        pass

    def test_logits_precision_fp16(self):
        """Check numerical precision of logits."""
        pass
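
One way the logits check might be filled in is a maximum absolute difference against reference values. This is a framework-agnostic sketch: it treats the 1e-3 tolerance from the success criteria as an absolute bound, which is an assumption, and `max_abs_diff` and `logits_match` are hypothetical helpers that operate on flattened lists of floats.

```python
import math
from typing import Sequence

# Tolerance from the success criteria; reading it as an absolute bound is an assumption.
ATOL = 1e-3


def max_abs_diff(actual: Sequence[float], expected: Sequence[float]) -> float:
    """Return the largest element-wise absolute difference between two flat sequences."""
    if len(actual) != len(expected):
        raise ValueError(f"length mismatch: {len(actual)} vs {len(expected)}")
    if any(math.isnan(x) for x in actual) or any(math.isnan(x) for x in expected):
        raise ValueError("NaN found in logits")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def logits_match(actual: Sequence[float], expected: Sequence[float], atol: float = ATOL) -> bool:
    """True if every element differs from its reference by less than atol."""
    return max_abs_diff(actual, expected) < atol
```

In a test that already holds PyTorch tensors, `torch.testing.assert_close` with explicit `rtol` and `atol` arguments would serve the same purpose and produces a more detailed failure message.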

2. lm-eval Integration

File: tools/run_lm_eval.sh (NEW)

#!/bin/bash
# Run lm-evaluation-harness on LLaDA2.0 model

lm_eval --model vllm \
  --model_args pretrained=inclusionAI/LLaDA2.0-mini,trust_remote_code=True \
  --tasks mmlu,hellaswag,arc_easy \
  --device cuda:0 \
  --batch_size 1
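
Comparing scores against a baseline could be reduced to a per-task drop check once the scores are available as plain numbers. The sketch below assumes both runs have already been reduced to `{task: score}` dictionaries; how those scores are extracted from the lm-eval output is not shown, and `find_regressions` and `max_drop` are hypothetical names.

```python
from typing import Dict


def find_regressions(
    current: Dict[str, float],
    baseline: Dict[str, float],
    max_drop: float,
) -> Dict[str, float]:
    """Return {task: drop} for tasks that scored more than max_drop below baseline.

    A task missing from `current` is reported with its full baseline score as the drop.
    """
    regressions = {}
    for task, base in baseline.items():
        drop = base - current.get(task, 0.0)
        if drop > max_drop:
            regressions[task] = drop
    return regressions
```

An empty result means no task fell outside the allowed range; a CI job could fail whenever the returned dictionary is non-empty.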

3. Reference Comparison Tests

File: tests/test_llada2_reference_comparison.py (NEW)

import pytest


@pytest.mark.dllm_reference
class TestLLaDA2ReferenceComparison:
    def test_vs_sglang_implementation(self):
        """Compare outputs with SGLang LLaDA2.0."""
        pass

    def test_vs_huggingface_transformers(self):
        """Compare outputs with HF reference."""
        pass

    def test_golden_outputs(self):
        """Verify outputs match saved golden snapshots."""
        pass
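
The token-for-token and golden-snapshot checks could share one comparison helper. The sketch below assumes snapshots are stored as JSON lists of token IDs; the storage format, `first_mismatch`, and `check_golden` are illustrative assumptions, not part of the existing code.

```python
import json
from pathlib import Path
from typing import Optional, Sequence


def first_mismatch(actual: Sequence[int], expected: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the sequences are identical."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        # One sequence is a strict prefix of the other.
        return min(len(actual), len(expected))
    return None


def check_golden(path: Path, token_ids: Sequence[int], update: bool = False) -> Optional[int]:
    """Compare token_ids with the snapshot at path; rewrite the snapshot only when update=True."""
    if update:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(list(token_ids)))
        return None
    if not path.exists():
        raise FileNotFoundError(f"golden snapshot missing: {path}")
    return first_mismatch(token_ids, json.loads(path.read_text()))
```

Raising on a missing snapshot, instead of silently creating one, keeps a CI run from passing when the golden file was never committed.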

4. CI Integration

  • Add the dllm_correctness and dllm_reference pytest markers
  • Run on every PR (with caching for speed)
  • Fail PR if correctness tests don't pass
  • Document expected accuracy ranges
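
The markers used in the test stubs could be registered in a `conftest.py` so that pytest does not warn about unknown marks. This is a minimal sketch; the marker descriptions are placeholders and the file location is an assumption.

```python
# conftest.py (location assumed)


def pytest_configure(config):
    """Register the correctness markers used by the LLaDA2.0 tests."""
    config.addinivalue_line(
        "markers",
        "dllm_correctness: numerical correctness tests for LLaDA2.0",
    )
    config.addinivalue_line(
        "markers",
        "dllm_reference: comparisons against reference implementations",
    )
```

With the markers registered, a CI job could select both groups with `pytest -m "dllm_correctness or dllm_reference"`.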

Success Criteria

  • Logits match reference implementation within tolerance (< 1e-3 difference)
  • Attention patterns verified to match block-style specification
  • lm-eval scores within expected range for LLaDA2.0-mini
  • Token-for-token match with reference on fixed inputs
  • Golden output snapshots established and passing in CI
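
The attention-pattern criterion could be checked by building the expected mask independently and listing every position where the implementation's mask differs. The sketch below assumes "block-style" means block-causal: a query may attend to every key in its own block and in all earlier blocks. That reading is an assumption to confirm against the specification, and `block_causal_mask` and `mask_violations` are hypothetical helpers.

```python
from typing import List, Tuple


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Build a seq_len x seq_len mask where mask[q][k] is True if query q may attend to key k.

    Assumption: attention is allowed within a block and to all earlier blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def mask_violations(mask: List[List[bool]], block_size: int) -> List[Tuple[int, int]]:
    """Return (query, key) positions where a square mask deviates from the expected geometry."""
    expected = block_causal_mask(len(mask), block_size)
    return [
        (q, k)
        for q, row in enumerate(mask)
        for k, allowed in enumerate(row)
        if allowed != expected[q][k]
    ]
```

Returning the offending positions rather than a single boolean makes a failing test show where the geometry breaks.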

Non-Goals (Out of Scope)

  • Performance optimization (covered in Phase 8)
  • Multi-request batching (deferred to Phase 7.1)
  • Custom attention kernels (future work)

Dependencies

Timeline

Estimated effort: 3-5 days
Target completion: 1 week after PR #38 merge

Breakdown:

  • Day 1: Set up lm-eval integration and baseline
  • Days 2-3: Implement numerical validation tests
  • Days 3-4: Reference comparison (SGLang/HF)
  • Days 4-5: Golden snapshots and CI integration

References

Related Issues


This issue must be completed before Phase 7+8 is declared production-ready. The Phase 7/8 PR (#38) can merge with structural validation only, but production deployment is blocked until this issue is complete.
