chrishayuk · Dec 25, 2025
diff --git a/README.md b/README.md
@@ -2,11 +2,18 @@
 
 **A deterministic data & execution substrate that enables reliable training.**
 
+> Lazarus makes training runs reproducible **the same way lockfiles make builds reproducible**.
+
 Offline batch plans, reproducible batching, measurable efficiency — the stuff you wish every training stack shipped with.
 
-Built on MLX for Apple Silicon.
+Runs on macOS; designed for Apple Silicon first (MLX backend).
+
+**The core idea:** The BatchPlan is the contract. Trainers enforce it; they don't invent it. Build plans offline, version them, verify them in CI/CD (fingerprints + schema validation), and replay them exactly across distributed workers that share the same plan artifact. BatchPlans are fingerprinted against the tokenizer and length cache, so you can detect drift when data or tokenization changes.
 
-**The core idea:** The BatchPlan becomes the contract, not the dataloader. Build plans offline, version them, verify them in CI/CD, and replay them exactly across distributed workers. BatchPlans are fingerprinted against the tokenizer and length cache, so you can detect drift when data or tokenization changes.
+```
+Dataset → Tokenizer → Length Cache → BatchPlan Artifact → Trainer (enforces) → Checkpoints
+                 fingerprint └─────────── fingerprint ┘
+```
 
 Most training pipelines entangle data loading, batching, and execution inside the trainer, making runs hard to reproduce, debug, or scale. Lazarus separates *planning* from *execution*: batching decisions are made once, recorded as artifacts, and enforced consistently across runs and workers.
 
@@ -103,7 +110,8 @@ chuk-lazarus data batching analyze --cache lengths.jsonl --bucket-edges 128,256,
 chuk-lazarus bench --num-samples 1000
 chuk-lazarus bench -d train.jsonl -t gpt2 --bucket-edges 128,256,512
 
-# Benchmark reports: length histogram, bucket efficiency, pack vs pad comparison,
+# Benchmark reports are saved as JSON + markdown for tracking regressions:
+# length histogram, bucket efficiency, pack vs pad comparison,
 # throughput metrics, memory footprint, and actionable recommendations
 ```
 
@@ -160,13 +168,26 @@ async with TelnetGymClient(config) as client:
 Training in Lazarus is driven entirely by precomputed BatchPlans. The trainer does not decide batching, sequencing, or token budgets — it enforces them.
 
 > **Invariant:** If two runs use the same BatchPlan artifact (including its fingerprints) and seed, Lazarus guarantees identical batch structure and ordering across runs and workers.
+>
+> *Identical* means: same sample IDs per step, in the same order, with the same packing boundaries and token budgets. (Numerical results may differ slightly across hardware/kernel implementations; the **batch schedule** remains identical.)
 
 ```bash
-# Train with SFT (uses BatchPlan for deterministic batching)
-chuk-lazarus train sft --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --data train.jsonl --use-lora
+# Canonical deterministic training (always use --batch-plan)
+chuk-lazarus train sft \
+  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+  --data train.jsonl \
+  --batch-plan batch_plan/ \
+  --use-lora
+
+# Dev convenience (builds plan on the fly; still fingerprints and saves it)
+chuk-lazarus train sft \
+  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+  --data train.jsonl \
+  --build-plan --predictable \
+  --use-lora
 
 # Train with DPO
-chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl
+chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl --batch-plan batch_plan/
 
 # Generate synthetic training data
 chuk-lazarus generate --type math --output ./data/lazarus
@@ -224,7 +245,7 @@ src/chuk_lazarus/
 │   ├── models/             # CausalLM, classifiers
 │   ├── families/           # Llama, Mamba implementations
 │   ├── adapters/           # LoRA adapters
-│   └── training/           # Loss functions
+│   └── losses/             # Loss functions (pure math)
 ├── training/               # BatchPlan-driven reference trainers (SFT, DPO, GRPO, PPO)
 ├── inference/              # Text generation
 ├── distributed/            # Distributed training utilities
@@ -251,12 +272,53 @@ src/chuk_lazarus/
 - **BatchPlan Artifacts**: Versioned, fingerprinted batch schedules for reproducibility and CI/CD
 - **Pipeline Benchmark**: Pack vs pad comparison, throughput metrics, memory footprint analysis
 - **BatchPlan-Driven Training**: Trainers enforce plans, not build them — deterministic by design
-- **Focused Scope**: Lazarus does not optimize model architectures, optimizers, or schedulers — it makes data and execution deterministic and inspectable
 - **Puzzle Arcade Integration**: Stream training data from 24 puzzle types for online/RL learning
 - **Replay Buffers**: Priority sampling, difficulty tracking, curriculum support
 - **Analysis**: Coverage, entropy, efficiency, fit scoring, vocabulary induction
 - **Instrumentation**: Histograms, OOV analysis, waste metrics, vocab comparison
 
+**What Lazarus is NOT:**
+- Not a trainer framework competing with Lightning/Accelerate
+- Not a new optimizer zoo or model architecture lab
+- Not a "magic trainer" that decides things for you
+
+**What Lazarus IS:** A reproducible planning/execution substrate you can plug into anything.
+
+## Artifacts
+
+BatchPlans are the core artifact. When you build a batch plan, Lazarus creates:
+
+```
+batch_plan/
+├── plan.jsonl          # Batch schedule: sample IDs, packing, token counts per step
+├── metadata.json       # Epochs, token budget, strategy, version info
+├── fingerprints.json   # Tokenizer + length cache fingerprints for drift detection
+└── stats.json          # Efficiency metrics: utilization, waste, packing ratio
+```
+
+**Schema promise:** The `plan.jsonl` format is stable. Each line is a JSON object:
+
+```json
+{"step":0,"samples":[12,88,104],"tokens":4096,"packing":[[0,128],[128,256]]}
+```
+
+Fields: `step` (global index), `samples` (sample IDs), `tokens` (batch total), `packing` (boundaries).
+
+**metadata.json** includes:
+- `plan_format_version`: Schema version for forward compatibility
+- `tool_version`: Lazarus version that created the plan
+- `seed`: Random seed used (if predictable mode)
+- `created_at`: Timestamp
+
+**CI/CD validation:**
+
+```bash
+# Validate a plan artifact before training (CI-friendly)
+chuk-lazarus data batchplan validate -p batch_plan/ --strict
+```
+
+If the tokenizer or data changes, fingerprint mismatch is detected before training starts.
+
 ## Documentation
 
 - [Getting Started](docs/getting-started.md) - Installation and quick reference
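Reviewer sketch for the README hunk above: one way a CI job could check a `plan.jsonl` line against the four documented fields and detect fingerprint drift. This is an illustration under stated assumptions, not Lazarus's validator. The names `parse_plan_line`, `fingerprint_bytes`, and `detect_drift` are hypothetical, each `packing` entry is assumed to be a `[start, end]` pair with `start < end`, and `fingerprints.json` is assumed to be a flat name-to-SHA-256 map.

```python
import hashlib
import json

# The four fields the README documents for each plan.jsonl line.
REQUIRED_FIELDS = {"step": int, "samples": list, "tokens": int, "packing": list}


def parse_plan_line(line: str) -> dict:
    """Parse one plan.jsonl line and check the documented fields."""
    record = json.loads(line)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    # Assumption: each packing entry is a [start, end] pair with start < end.
    for bounds in record["packing"]:
        if len(bounds) != 2 or bounds[0] >= bounds[1]:
            raise ValueError(f"bad packing boundary: {bounds}")
    return record


def fingerprint_bytes(data: bytes) -> str:
    """SHA-256 hex digest of raw bytes (hypothetical fingerprint scheme)."""
    return hashlib.sha256(data).hexdigest()


def detect_drift(recorded: dict, current: dict) -> list:
    """Return the names whose fingerprint changed or is missing."""
    return sorted(
        name for name, digest in recorded.items() if current.get(name) != digest
    )
```

A CI step built this way would fail the job whenever `detect_drift` returns a non-empty list, which matches the README's promise that a mismatch is caught before training starts.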

diff --git a/docs/models.md b/docs/models.md
@@ -690,7 +690,7 @@ models_v2/
 ├── adapters/                # Parameter-efficient fine-tuning
 │   └── lora.py              # LoRAConfig, LoRALinear, apply_lora
 │
-├── training/                # Training utilities
+├── losses/                  # Loss functions (pure math)
 │   └── loss.py              # compute_lm_loss
 │
 └── loader.py                # load_model, load_model_async

diff --git a/examples/integration/end_to_end_training.py b/examples/integration/end_to_end_training.py
@@ -206,7 +206,7 @@ def demo_model_architecture():
     print(f"   Target modules: {lora_config.target_modules}")
 
     # 4. Show loss function (the only training-related thing in models_v2)
-    print("\n4. Basic LM Loss (models_v2.training.loss):")
+    print("\n4. Basic LM Loss (models_v2.losses.loss):")
     print("   compute_lm_loss(model, input_ids, labels, attention_mask)")
     print("   → Returns (loss, num_tokens)")
 

diff --git a/src/chuk_lazarus/models_v2/README.md b/src/chuk_lazarus/models_v2/README.md
@@ -1,6 +1,8 @@
 # Models v2
 
-A composable, async-native, Pydantic-native model framework for building and training language models on MLX.
+A composable, Pydantic-native model framework for building and training language models on MLX.
+
+Models v2 treats **model architecture as a composable system**, not a monolith.
 
 ## Architecture
 
@@ -67,16 +69,16 @@ mamba_model = MambaForCausalLM(mamba_config)
 | `models/` | CausalLM, SequenceClassifier, TokenClassifier |
 | `families/` | LlamaForCausalLM, MambaForCausalLM |
 | `adapters/` | LoRA adapters for efficient fine-tuning |
-| `training/` | Loss functions |
+| `losses/` | Loss functions (pure math) |
 | `loader.py` | Async model loading |
 
 ## Design Principles
 
-- **Pydantic-native**: Configs use BaseModel with frozen=True
-- **Async-native**: All I/O is async
-- **No magic strings**: Enums for type safety
-- **No dictionary goop**: Structured output types
-- **Backend-agnostic**: Works on MLX, PyTorch, JAX
+- **Pydantic-native**: Configs use BaseModel with frozen=True for validation and serialization
+- **Async-native**: Model loading, checkpoint I/O, and external resources are async-safe
+- **No magic strings**: Enums for type safety (AttentionType, FFNType, NormType, etc.)
+- **No dictionary goop**: Structured output types (ModelOutput, BlockOutput, BackboneOutput)
+- **Backend-agnostic by design**: Core abstractions are backend-neutral; MLX is the reference implementation
 
 ## Documentation
 

diff --git a/src/chuk_lazarus/models_v2/__init__.py b/src/chuk_lazarus/models_v2/__init__.py
@@ -123,11 +123,13 @@
     FLOPsEstimate,
     MemoryEstimate,
     ModelCapabilities,
+    ModelInfo,
     ParameterStats,
     count_parameters,
     detect_model_capabilities,
     estimate_flops,
     estimate_memory,
+    get_model_info,
     introspect,
     print_introspection,
 )
@@ -140,6 +142,9 @@
     load_model_async,
 )
 
+# Loss functions
+from .losses import compute_lm_loss
+
 # Models
 from .models import (
     CausalLM,
@@ -149,9 +154,6 @@
     TokenClassifier,
 )
 
-# Training utilities
-from .training import compute_lm_loss
-
 __all__ = [
     # === Core ===
     # Enums
@@ -258,13 +260,15 @@
     "FLOPsEstimate",
     "MemoryEstimate",
     "ModelCapabilities",
+    "ModelInfo",
     "count_parameters",
     "estimate_flops",
     "estimate_memory",
     "get_model_capabilities",
     "detect_model_capabilities",
+    "get_model_info",
     "introspect",
     "print_introspection",
-    # === Training ===
+    # === Losses ===
     "compute_lm_loss",
 ]
diff --git a/src/chuk_lazarus/models_v2/introspection.py b/src/chuk_lazarus/models_v2/introspection.py
@@ -205,6 +205,76 @@ def summary(self) -> str:
         )
 
 
+@dataclass
+class ModelInfo:
+    """
+    Core model information for routing, benchmarking, and deployment.
+
+    This is the stable introspection contract that every Model should expose.
+    Mirrors the structure of BatchPlan metadata for consistency.
+
+    Used for:
+    - MoE routing decisions
+    - Distributed execution planning
+    - Memory-constrained deployment
+    - Gym-driven model selection
+    - Registry queries
+    """
+
+    # Identity
+    name: str = ""
+    family: str = ""  # e.g., "llama", "mamba"
+
+    # Architecture
+    params: int = 0
+    d_model: int = 0  # hidden_size
+    n_layers: int = 0
+    n_heads: int = 0
+    vocab_size: int = 0
+
+    # Sequence limits
+    max_seq_len: int = 0
+    context_window: int = 0  # May differ from max_seq_len for sliding window
+
+    # Capabilities (boolean flags for fast filtering)
+    supports_kv_cache: bool = False
+    supports_generation: bool = False
+    supports_lora: bool = True  # Most models do
+    is_causal: bool = True
+
+    # Resource estimates (for routing/scheduling)
+    memory_mb: float = 0.0  # Inference memory
+    flops_per_token: int = 0
+
+    def summary(self) -> str:
+        """Human-readable one-liner."""
+        return (
+            f"{self.name}: {self.params:,} params, "
+            f"d={self.d_model}, L={self.n_layers}, "
+            f"ctx={self.max_seq_len}"
+        )
+
+    def to_dict(self) -> dict:
+        """Convert to dictionary for serialization."""
+        return {
+            "name": self.name,
+            "family": self.family,
+            "params": self.params,
+            "d_model": self.d_model,
+            "n_layers": self.n_layers,
+            "n_heads": self.n_heads,
+            "vocab_size": self.vocab_size,
+            "max_seq_len": self.max_seq_len,
+            "context_window": self.context_window,
+            "supports_kv_cache": self.supports_kv_cache,
+            "supports_generation": self.supports_generation,
+            "supports_lora": self.supports_lora,
+            "is_causal": self.is_causal,
+            "memory_mb": self.memory_mb,
+            "flops_per_token": self.flops_per_token,
+        }
+
+
 @dataclass
 class ModelCapabilities:
     """
@@ -498,3 +568,54 @@ def print_introspection(model: nn.Module, config: ModelConfig | None = None) ->
         print(f"\n{mem.summary()}")
 
     print("=" * 60)
+
+
+def get_model_info(
+    model: nn.Module,
+    config: ModelConfig | None = None,
+    name: str = "",
+    family: str = "",
+) -> ModelInfo:
+    """
+    Build ModelInfo from a model instance.
+
+    This is the canonical way to get the stable introspection contract.
+
+    Args:
+        model: The model to introspect
+        config: Model configuration (required for full info)
+        name: Model name (optional, for display)
+        family: Model family (optional, e.g., "llama", "mamba")
+
+    Returns:
+        ModelInfo with all available information
+    """
+    params = count_parameters(model)
+    caps = detect_model_capabilities(model)
+
+    info = ModelInfo(
+        name=name or model.__class__.__name__,
+        family=family,
+        params=params.total,
+        supports_kv_cache=caps.supports_kv_cache,
+        supports_generation=caps.is_causal_lm,
+        supports_lora=caps.supports_lora,
+        is_causal=caps.is_causal_lm,
+    )
+
+    if config is not None:
+        info.d_model = config.hidden_size
+        info.n_layers = config.num_hidden_layers
+        info.n_heads = config.num_attention_heads
+        info.vocab_size = config.vocab_size
+        info.max_seq_len = getattr(config, "max_position_embeddings", 0)
+        info.context_window = info.max_seq_len
+
+        # Estimate resources
+        flops = estimate_flops(config, seq_length=1, batch_size=1)
+        info.flops_per_token = flops.per_token
+
+        memory = estimate_memory(model, config, seq_length=1, batch_size=1)
+        info.memory_mb = memory.total_inference_mb
+
+    return info
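Reviewer sketch for the `ModelInfo` hunk above: the docstring lists routing and memory-constrained deployment as intended uses, and the flat fields make that kind of filtering straightforward. The example below uses a stand-in dataclass with a subset of the fields so it runs without the package. `ModelInfoStub` and `select_models` are hypothetical names, and the selection policy (fit the budget, then prefer more parameters) is an assumption, not something the change defines.

```python
from dataclasses import dataclass


@dataclass
class ModelInfoStub:
    """Stand-in carrying a subset of the ModelInfo fields from the diff."""

    name: str = ""
    params: int = 0
    max_seq_len: int = 0
    supports_kv_cache: bool = False
    memory_mb: float = 0.0


def select_models(candidates, memory_budget_mb, min_context):
    """Keep models that fit the memory budget and context need, largest first."""
    fitting = [
        m
        for m in candidates
        if m.memory_mb <= memory_budget_mb and m.max_seq_len >= min_context
    ]
    return sorted(fitting, key=lambda m: m.params, reverse=True)
```

With the real class, the candidates would presumably come from `get_model_info(model, config)`. Note that when `config` is omitted the resource fields stay at their zero defaults, so a filter like this would treat every such model as fitting any memory budget.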
diff --git a/src/chuk_lazarus/models_v2/losses/__init__.py b/src/chuk_lazarus/models_v2/losses/__init__.py
@@ -0,0 +1,11 @@
+"""
+Loss functions for models_v2.
+
+Pure math: CE, DPO loss, etc. Training loops live in src/chuk_lazarus/training/.
+"""
+
+from .loss import compute_lm_loss
+
+__all__ = [
+    "compute_lm_loss",
+]
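Reviewer sketch for the `losses/` move: the example script in this change describes `compute_lm_loss(model, input_ids, labels, attention_mask)` as returning `(loss, num_tokens)`. The pure-Python function below illustrates that return shape with a masked cross-entropy over precomputed logits. It is not the package's implementation: `masked_lm_loss` is a hypothetical name, it takes logits instead of a model, and it assumes labels are already aligned with the logits (no shifting).

```python
import math


def masked_lm_loss(logits, labels, attention_mask):
    """Mean cross-entropy over unmasked positions; returns (loss, num_tokens)."""
    total = 0.0
    num_tokens = 0
    for row, label, keep in zip(logits, labels, attention_mask):
        if not keep:
            continue  # Masked positions contribute neither loss nor count.
        # Numerically stable log-sum-exp over the vocabulary dimension.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        num_tokens += 1
    loss = total / num_tokens if num_tokens else 0.0
    return loss, num_tokens
```

Returning the token count alongside the mean lets a caller weight per-batch losses correctly when batches hold different numbers of unmasked tokens, which is the usual reason for this return shape.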
diff --git a/src/chuk_lazarus/models_v2/training/loss.py b/src/chuk_lazarus/models_v2/losses/loss.py
rename from src/chuk_lazarus/models_v2/training/loss.py
rename to src/chuk_lazarus/models_v2/losses/loss.py
diff --git a/src/chuk_lazarus/models_v2/training/__init__.py b/src/chuk_lazarus/models_v2/training/__init__.py