78 changes: 70 additions & 8 deletions README.md
@@ -2,11 +2,18 @@

**A deterministic data & execution substrate that enables reliable training.**

> Lazarus makes training runs reproducible **the same way lockfiles make builds reproducible**.

Offline batch plans, reproducible batching, measurable efficiency — the stuff you wish every training stack shipped with.

Built on MLX for Apple Silicon.
Runs on macOS; designed for Apple Silicon first (MLX backend).

**The core idea:** The BatchPlan is the contract. Trainers enforce it; they don't invent it. Build plans offline, version them, verify them in CI/CD (fingerprints + schema validation), and replay them exactly across distributed workers that share the same plan artifact. BatchPlans are fingerprinted against the tokenizer and length cache, so you can detect drift when data or tokenization changes.

**The core idea:** The BatchPlan becomes the contract, not the dataloader. Build plans offline, version them, verify them in CI/CD, and replay them exactly across distributed workers. BatchPlans are fingerprinted against the tokenizer and length cache, so you can detect drift when data or tokenization changes.
```
Dataset → Tokenizer → Length Cache → BatchPlan Artifact → Trainer (enforces) → Checkpoints
         fingerprint       └─────────── fingerprint ┘
```

Most training pipelines entangle data loading, batching, and execution inside the trainer, making runs hard to reproduce, debug, or scale. Lazarus separates *planning* from *execution*: batching decisions are made once, recorded as artifacts, and enforced consistently across runs and workers.
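
The split between planning and execution can be sketched in a few lines. This is a minimal illustration, not Lazarus's implementation: `PlannedBatch`, `build_plan`, and `execute` are hypothetical names, and the greedy token-budget packer is an assumed stand-in for whatever strategy the real planner uses.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedBatch:
    """One step of a plan (hypothetical structure, for illustration only)."""
    step: int
    samples: tuple[int, ...]
    tokens: int


def build_plan(lengths: dict[int, int], token_budget: int, seed: int) -> list[PlannedBatch]:
    """Planning phase: decide every batch up front with a greedy token-budget packer."""
    order = sorted(lengths)             # canonical starting order
    random.Random(seed).shuffle(order)  # seeded shuffle, independent of global RNG state
    plan: list[PlannedBatch] = []
    current: list[int] = []
    used = 0
    for sample_id in order:
        n = lengths[sample_id]
        if current and used + n > token_budget:
            plan.append(PlannedBatch(len(plan), tuple(current), used))
            current, used = [], 0
        current.append(sample_id)
        used += n
    if current:
        plan.append(PlannedBatch(len(plan), tuple(current), used))
    return plan


def execute(plan: list[PlannedBatch], dataset: dict[int, str]) -> list[list[str]]:
    """Execution phase: follow the plan verbatim; no batching decisions are made here."""
    return [[dataset[i] for i in batch.samples] for batch in plan]


lengths = {0: 120, 1: 300, 2: 64, 3: 500, 4: 90}
dataset = {i: f"sample-{i}" for i in lengths}
plan_a = build_plan(lengths, token_budget=512, seed=7)
plan_b = build_plan(lengths, token_budget=512, seed=7)
batches = execute(plan_a, dataset)
```

Because the plan is plain data, it can be written to disk, diffed, and handed to any trainer; the executor never needs to know how the batches were chosen.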

@@ -103,7 +110,8 @@ chuk-lazarus data batching analyze --cache lengths.jsonl --bucket-edges 128,256,
chuk-lazarus bench --num-samples 1000
chuk-lazarus bench -d train.jsonl -t gpt2 --bucket-edges 128,256,512

# Benchmark reports: length histogram, bucket efficiency, pack vs pad comparison,
# Benchmark reports are saved as JSON + markdown for tracking regressions:
# length histogram, bucket efficiency, pack vs pad comparison,
# throughput metrics, memory footprint, and actionable recommendations
```

@@ -160,13 +168,26 @@ async with TelnetGymClient(config) as client:
Training in Lazarus is driven entirely by precomputed BatchPlans. The trainer does not decide batching, sequencing, or token budgets — it enforces them.

> **Invariant:** If two runs use the same BatchPlan artifact (including its fingerprints) and seed, Lazarus guarantees identical batch structure and ordering across runs and workers.
>
> *Identical* means: same sample IDs per step, in the same order, with the same packing boundaries and token budgets. (Numerical results may differ slightly across hardware/kernel implementations; the **batch schedule** remains identical.)
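
One way to check the invariant above between two runs is to hash only the fields that define the batch schedule and compare digests. The sketch below assumes plan entries shaped like the `plan.jsonl` example in this README; `schedule_digest` is a hypothetical helper, not a Lazarus API.

```python
import hashlib
import json


def schedule_digest(plan: list[dict]) -> str:
    """Hash the schedule-defining fields of each step, in step order."""
    h = hashlib.sha256()
    for entry in plan:
        canonical = json.dumps(
            {
                "step": entry["step"],
                "samples": entry["samples"],  # list order is significant
                "tokens": entry["tokens"],
                "packing": entry["packing"],
            },
            sort_keys=True,
            separators=(",", ":"),
        )
        h.update(canonical.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


base = {"step": 0, "samples": [12, 88, 104], "tokens": 4096, "packing": [[0, 128], [128, 256]]}
run_1 = [base]
run_2 = [dict(base, loss=2.31)]                 # extra numerical result is ignored
reordered = [dict(base, samples=[88, 12, 104])]  # same IDs, different order

same_schedule = schedule_digest(run_1) == schedule_digest(run_2)
order_changed = schedule_digest(run_1) != schedule_digest(reordered)
```

Numerical outputs such as the loss are deliberately left out of the digest, matching the note that only the batch schedule is guaranteed to be identical.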

```bash
# Train with SFT (uses BatchPlan for deterministic batching)
chuk-lazarus train sft --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --data train.jsonl --use-lora
# Canonical deterministic training (always use --batch-plan)
chuk-lazarus train sft \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --data train.jsonl \
  --batch-plan batch_plan/ \
  --use-lora

# Dev convenience (builds plan on the fly; still fingerprints and saves it)
chuk-lazarus train sft \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --data train.jsonl \
  --build-plan --predictable \
  --use-lora

# Train with DPO
chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl
chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl --batch-plan batch_plan/

# Generate synthetic training data
chuk-lazarus generate --type math --output ./data/lazarus
@@ -224,7 +245,7 @@ src/chuk_lazarus/
│   ├── models/         # CausalLM, classifiers
│   ├── families/       # Llama, Mamba implementations
│   ├── adapters/       # LoRA adapters
│   └── training/       # Loss functions
│   └── losses/         # Loss functions (pure math)
├── training/           # BatchPlan-driven reference trainers (SFT, DPO, GRPO, PPO)
├── inference/          # Text generation
├── distributed/        # Distributed training utilities
@@ -251,12 +272,53 @@ src/chuk_lazarus/
- **BatchPlan Artifacts**: Versioned, fingerprinted batch schedules for reproducibility and CI/CD
- **Pipeline Benchmark**: Pack vs pad comparison, throughput metrics, memory footprint analysis
- **BatchPlan-Driven Training**: Trainers enforce plans rather than build them, so batching is deterministic by design
- **Focused Scope**: Lazarus does not optimize model architectures, optimizers, or schedulers — it makes data and execution deterministic and inspectable
- **Puzzle Arcade Integration**: Stream training data from 24 puzzle types for online/RL learning
- **Replay Buffers**: Priority sampling, difficulty tracking, curriculum support
- **Analysis**: Coverage, entropy, efficiency, fit scoring, vocabulary induction
- **Instrumentation**: Histograms, OOV analysis, waste metrics, vocab comparison

**What Lazarus is NOT:**
- Not a trainer framework competing with Lightning/Accelerate
- Not a new optimizer zoo or model architecture lab
- Not a "magic trainer" that decides things for you

**What Lazarus IS:** A reproducible planning/execution substrate you can plug into anything.

## Artifacts

BatchPlans are the core artifact. When you build a batch plan, Lazarus creates:

```
batch_plan/
├── plan.jsonl # Batch schedule: sample IDs, packing, token counts per step
├── metadata.json # Epochs, token budget, strategy, version info
├── fingerprints.json # Tokenizer + length cache fingerprints for drift detection
└── stats.json # Efficiency metrics: utilization, waste, packing ratio
```

**Schema promise:** The `plan.jsonl` format is stable. Each line is a JSON object:

```json
{"step":0,"samples":[12,88,104],"tokens":4096,"packing":[[0,128],[128,256]]}
```

Fields: `step` (global index), `samples` (sample IDs), `tokens` (batch total), `packing` (boundaries).
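
A reader for this format can stay small. The following sketch parses and sanity-checks `plan.jsonl` lines using only the four documented fields; `parse_plan_line` and `parse_plan` are hypothetical helpers, and the requirement that steps be contiguous from 0 is an assumption rather than something the schema states.

```python
import json

REQUIRED_FIELDS = {"step", "samples", "tokens", "packing"}


def parse_plan_line(line: str) -> dict:
    """Parse one plan.jsonl line and check the documented fields."""
    entry = json.loads(line)
    if not isinstance(entry, dict):
        raise ValueError("each line must be a JSON object")
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(entry["step"], int) or entry["step"] < 0:
        raise ValueError("step must be a non-negative integer")
    if not all(isinstance(s, int) for s in entry["samples"]):
        raise ValueError("samples must be integer sample IDs")
    for start, end in entry["packing"]:
        if not 0 <= start < end:
            raise ValueError(f"bad packing boundary: [{start}, {end}]")
    return entry


def parse_plan(lines: list[str]) -> list[dict]:
    """Parse all lines; assumes steps are contiguous and start at 0."""
    entries = [parse_plan_line(line) for line in lines if line.strip()]
    for expected, entry in enumerate(entries):
        if entry["step"] != expected:
            raise ValueError(f"expected step {expected}, got {entry['step']}")
    return entries


example = '{"step":0,"samples":[12,88,104],"tokens":4096,"packing":[[0,128],[128,256]]}'
entries = parse_plan([example])
```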

**metadata.json** includes:
- `plan_format_version`: Schema version for forward compatibility
- `tool_version`: Lazarus version that created the plan
- `seed`: Random seed used (set when the plan is built in predictable mode)
- `created_at`: Timestamp

**CI/CD validation:**

```bash
# Validate a plan artifact before training (CI-friendly)
chuk-lazarus data batchplan validate -p batch_plan/ --strict
```

If the tokenizer or the data changes, validation reports a fingerprint mismatch before training starts.
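
Drift detection of this kind can be illustrated with content hashes. The sketch below is an assumption about the general technique, not a description of how Lazarus computes `fingerprints.json`: the key names `tokenizer` and `length_cache`, the helper names, and the choice of SHA-256 over canonical JSON are all hypothetical.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Stable SHA-256 of a JSON-serializable object (canonical key order)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_drift(stored: dict[str, str], vocab: dict[str, int], lengths: dict[str, int]) -> list[str]:
    """Return the names of fingerprints that no longer match the stored values."""
    current = {"tokenizer": fingerprint(vocab), "length_cache": fingerprint(lengths)}
    return [name for name in sorted(current) if stored.get(name) != current[name]]


vocab = {"<pad>": 0, "hello": 1, "world": 2}
lengths = {"sample-0": 12, "sample-1": 40}

# Recorded at plan-build time.
stored = {"tokenizer": fingerprint(vocab), "length_cache": fingerprint(lengths)}

# Later: the tokenizer gains a token, the data is unchanged.
new_vocab = dict(vocab, lazarus=3)
drifted = check_drift(stored, new_vocab, lengths)
```

A CI job could fail the build whenever the returned list is non-empty, which is the behavior the `--strict` validation step is described as providing.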

## Documentation

- [Getting Started](docs/getting-started.md) - Installation and quick reference
2 changes: 1 addition & 1 deletion docs/models.md
@@ -690,7 +690,7 @@ models_v2/
├── adapters/           # Parameter-efficient fine-tuning
│   └── lora.py         # LoRAConfig, LoRALinear, apply_lora
├── training/           # Training utilities
├── losses/             # Loss functions (pure math)
│   └── loss.py         # compute_lm_loss
└── loader.py           # load_model, load_model_async
2 changes: 1 addition & 1 deletion examples/integration/end_to_end_training.py
@@ -206,7 +206,7 @@ def demo_model_architecture():
    print(f" Target modules: {lora_config.target_modules}")

    # 4. Show loss function (the only training-related thing in models_v2)
    print("\n4. Basic LM Loss (models_v2.training.loss):")
    print("\n4. Basic LM Loss (models_v2.losses.loss):")
    print(" compute_lm_loss(model, input_ids, labels, attention_mask)")
    print(" → Returns (loss, num_tokens)")

16 changes: 9 additions & 7 deletions src/chuk_lazarus/models_v2/README.md
@@ -1,6 +1,8 @@
# Models v2

A composable, async-native, Pydantic-native model framework for building and training language models on MLX.
A composable, Pydantic-native model framework for building and training language models on MLX.

Models v2 treats **model architecture as a composable system**, not a monolith.

## Architecture

@@ -67,16 +69,16 @@ mamba_model = MambaForCausalLM(mamba_config)
| `models/` | CausalLM, SequenceClassifier, TokenClassifier |
| `families/` | LlamaForCausalLM, MambaForCausalLM |
| `adapters/` | LoRA adapters for efficient fine-tuning |
| `training/` | Loss functions |
| `losses/` | Loss functions (pure math) |
| `loader.py` | Async model loading |

## Design Principles

- **Pydantic-native**: Configs use BaseModel with frozen=True
- **Async-native**: All I/O is async
- **No magic strings**: Enums for type safety
- **No dictionary goop**: Structured output types
- **Backend-agnostic**: Works on MLX, PyTorch, JAX
- **Pydantic-native**: Configs use BaseModel with frozen=True for validation and serialization
- **Async-native**: Model loading, checkpoint I/O, and external resources are async-safe
- **No magic strings**: Enums for type safety (AttentionType, FFNType, NormType, etc.)
- **No dictionary goop**: Structured output types (ModelOutput, BlockOutput, BackboneOutput)
- **Backend-agnostic by design**: Core abstractions are backend-neutral; MLX is the reference implementation
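
The first and third principles (frozen configs, enums instead of strings) can be illustrated without any dependencies. The sketch below uses a frozen standard-library dataclass as a stand-in for a Pydantic `BaseModel` with `frozen=True`. `NormType` is named in the list above, but its members here and the `BlockConfig` class are hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass
from enum import Enum


class NormType(str, Enum):
    """Hypothetical members; the real enum may define different values."""
    RMS_NORM = "rms_norm"
    LAYER_NORM = "layer_norm"


@dataclass(frozen=True)
class BlockConfig:
    """Stand-in for a frozen Pydantic config (stdlib dataclass, not Pydantic)."""
    hidden_size: int
    norm_type: NormType = NormType.RMS_NORM


# Converting a raw string through the enum rejects typos at construction time.
cfg = BlockConfig(hidden_size=256, norm_type=NormType("layer_norm"))

try:
    cfg.hidden_size = 512  # frozen: mutation is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

try:
    NormType("layernrom")  # misspelled value
    typo_accepted = True
except ValueError:
    typo_accepted = False
```

An immutable config is also safe to hash and fingerprint, which fits the artifact-oriented approach used elsewhere in the project.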

## Documentation

Expand Down
12 changes: 8 additions & 4 deletions src/chuk_lazarus/models_v2/__init__.py
@@ -123,11 +123,13 @@
    FLOPsEstimate,
    MemoryEstimate,
    ModelCapabilities,
    ModelInfo,
    ParameterStats,
    count_parameters,
    detect_model_capabilities,
    estimate_flops,
    estimate_memory,
    get_model_info,
    introspect,
    print_introspection,
)
@@ -140,6 +142,9 @@
    load_model_async,
)

# Loss functions
from .losses import compute_lm_loss

# Models
from .models import (
    CausalLM,
@@ -149,9 +154,6 @@
    TokenClassifier,
)

# Training utilities
from .training import compute_lm_loss

__all__ = [
# === Core ===
# Enums
@@ -258,13 +260,15 @@
"FLOPsEstimate",
"MemoryEstimate",
"ModelCapabilities",
"ModelInfo",
"count_parameters",
"estimate_flops",
"estimate_memory",
"get_model_capabilities",
"detect_model_capabilities",
"get_model_info",
"introspect",
"print_introspection",
# === Training ===
# === Losses ===
"compute_lm_loss",
]
121 changes: 121 additions & 0 deletions src/chuk_lazarus/models_v2/introspection.py
@@ -205,6 +205,76 @@ def summary(self) -> str:
)


@dataclass
class ModelInfo:
    """
    Core model information for routing, benchmarking, and deployment.

    This is the stable introspection contract that every Model should expose.
    Mirrors the structure of BatchPlan metadata for consistency.

    Used for:
    - MoE routing decisions
    - Distributed execution planning
    - Memory-constrained deployment
    - Gym-driven model selection
    - Registry queries
    """

    # Identity
    name: str = ""
    family: str = ""  # e.g., "llama", "mamba"

    # Architecture
    params: int = 0
    d_model: int = 0  # hidden_size
    n_layers: int = 0
    n_heads: int = 0
    vocab_size: int = 0

    # Sequence limits
    max_seq_len: int = 0
    context_window: int = 0  # May differ from max_seq_len for sliding window

    # Capabilities (boolean flags for fast filtering)
    supports_kv_cache: bool = False
    supports_generation: bool = False
    supports_lora: bool = True  # Most models do
    is_causal: bool = True

    # Resource estimates (for routing/scheduling)
    memory_mb: float = 0.0  # Inference memory
    flops_per_token: int = 0

    def summary(self) -> str:
        """Human-readable one-liner."""
        return (
            f"{self.name}: {self.params:,} params, "
            f"d={self.d_model}, L={self.n_layers}, "
            f"ctx={self.max_seq_len}"
        )

    def to_dict(self) -> dict:
        """Convert to dictionary for serialization."""
        return {
            "name": self.name,
            "family": self.family,
            "params": self.params,
            "d_model": self.d_model,
            "n_layers": self.n_layers,
            "n_heads": self.n_heads,
            "vocab_size": self.vocab_size,
            "max_seq_len": self.max_seq_len,
            "context_window": self.context_window,
            "supports_kv_cache": self.supports_kv_cache,
            "supports_generation": self.supports_generation,
            "supports_lora": self.supports_lora,
            "is_causal": self.is_causal,
            "memory_mb": self.memory_mb,
            "flops_per_token": self.flops_per_token,
        }


@dataclass
class ModelCapabilities:
"""
@@ -498,3 +568,54 @@ def print_introspection(model: nn.Module, config: ModelConfig | None = None) ->
        print(f"\n{mem.summary()}")

    print("=" * 60)


def get_model_info(
    model: nn.Module,
    config: ModelConfig | None = None,
    name: str = "",
    family: str = "",
) -> ModelInfo:
    """
    Build ModelInfo from a model instance.

    This is the canonical way to get the stable introspection contract.

    Args:
        model: The model to introspect
        config: Model configuration (required for full info)
        name: Model name (optional, for display)
        family: Model family (optional, e.g., "llama", "mamba")

    Returns:
        ModelInfo with all available information
    """
    params = count_parameters(model)
    caps = detect_model_capabilities(model)

    info = ModelInfo(
        name=name or model.__class__.__name__,
        family=family,
        params=params.total,
        supports_kv_cache=caps.supports_kv_cache,
        supports_generation=caps.is_causal_lm,
        supports_lora=caps.supports_lora,
        is_causal=caps.is_causal_lm,
    )

    if config is not None:
        info.d_model = config.hidden_size
        info.n_layers = config.num_hidden_layers
        info.n_heads = config.num_attention_heads
        info.vocab_size = config.vocab_size
        info.max_seq_len = getattr(config, "max_position_embeddings", 0)
        info.context_window = info.max_seq_len

        # Estimate resources
        flops = estimate_flops(config, seq_length=1, batch_size=1)
        info.flops_per_token = flops.per_token

        memory = estimate_memory(model, config, seq_length=1, batch_size=1)
        info.memory_mb = memory.total_inference_mb

    return info
11 changes: 11 additions & 0 deletions src/chuk_lazarus/models_v2/losses/__init__.py
@@ -0,0 +1,11 @@
"""
Loss functions for models_v2.

Pure math: CE, DPO loss, etc. Training loops live in src/chuk_lazarus/training/.
"""

from .loss import compute_lm_loss

__all__ = [
"compute_lm_loss",
]
11 changes: 0 additions & 11 deletions src/chuk_lazarus/models_v2/training/__init__.py

This file was deleted.
