chrishayuk · chrishayuk · Jan 2, 2026 · Dec 26, 2025 · Dec 30, 2025 · Dec 30, 2025
diff --git a/.gitignore b/.gitignore
@@ -70,3 +70,6 @@ sample_data/
 
 # Jupyter
 .ipynb_checkpoints/
+
+# ai
+.claude/
diff --git a/L19_cluster.png b/L19_cluster.png
diff --git a/L20_cluster.png b/L20_cluster.png
diff --git a/L21_cluster.png b/L21_cluster.png
diff --git a/README.md b/README.md
@@ -161,7 +161,7 @@ async with TelnetGymClient(config) as client:
     buffer.add(sample)
 ```
 
-**Supported puzzles:** Sudoku, KenKen, Nonogram, Lights Out, Sokoban, Minesweeper, and [16 others](docs/gym.md).
+**Supported puzzles:** Sudoku, KenKen, Kakuro, Binary, Futoshiki, Nonogram, Logic Grid, Killer Sudoku, Lights Out, Mastermind, Slitherlink, Bridges, Hitori, Shikaku, Hidato, Tents, Fillomino, Star Battle, Sokoban, Knapsack, Nurikabe, Minesweeper.
 
 ### BatchPlan-Driven Training
 
@@ -196,33 +196,30 @@ chuk-lazarus generate --type math --output ./data/lazarus
 chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is 2+2?"
 ```
 
-### Inference Pipeline (New!)
+### UnifiedPipeline
 
-The new unified inference pipeline provides a simplified API for running inference with any supported model family. One-liner setup, no boilerplate:
+The `UnifiedPipeline` auto-detects model family and provides a simplified API. One-liner setup, no boilerplate:
 
 ```python
-from chuk_lazarus.inference import InferencePipeline, PipelineConfig, DType
-from chuk_lazarus.models_v2 import LlamaConfig, LlamaForCausalLM
-
-# One-liner model loading
-pipeline = InferencePipeline.from_pretrained(
-    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-    LlamaForCausalLM,
-    LlamaConfig,
-)
+from chuk_lazarus.inference import UnifiedPipeline, UnifiedPipelineConfig, DType
+
+# One-liner model loading - auto-detects family!
+pipeline = UnifiedPipeline.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
 
 # Simple chat API
 result = pipeline.chat("What is the capital of France?")
 print(result.text)
 print(result.stats.summary)  # "25 tokens in 0.42s (59.5 tok/s)"
+print(f"Model family: {pipeline.family_type}")  # ModelFamilyType.LLAMA
 ```
 
 **Key features:**
-- Typed configuration with Pydantic (`PipelineConfig`, `GenerationConfig`)
-- Async support (`InferencePipeline.from_pretrained_async`)
+- Auto-detection of model family from HuggingFace config
+- Typed configuration with Pydantic (`UnifiedPipelineConfig`, `GenerationConfig`)
+- Async support (`UnifiedPipeline.from_pretrained_async`)
 - Chat history management (`ChatHistory`)
 - Streaming generation (`generate_stream`)
-- No magic strings - uses enums (`DType`, `Role`)
+- No magic strings - uses enums (`DType`, `Role`, `ModelFamilyType`)
 
 ```bash
 # Simplified inference examples
@@ -251,6 +248,14 @@ uv run python examples/inference/granite_inference.py --model granite-3.1-2b
 
 # Llama 4 Scout (Hybrid Mamba-Transformer MoE)
 uv run python examples/inference/llama4_inference.py
+
+# StarCoder2 (Code generation, 3B/7B/15B)
+uv run python examples/inference/starcoder2_inference.py --prompt "def fibonacci(n):"
+uv run python examples/inference/starcoder2_inference.py --interactive  # Interactive mode
+
+# Jamba (Hybrid Mamba-Transformer MoE, 256K context)
+uv run python examples/inference/jamba_inference.py --test-tiny  # Test without download
+uv run python examples/inference/jamba_inference.py --list       # Show models
 ```
 
 ### FunctionGemma (Function Calling)
@@ -270,6 +275,112 @@ FunctionGemma is a 270M parameter model optimized for on-device function calling
 
 See [docs/inference.md](docs/inference.md) for detailed inference documentation.
 
+### Introspection (Model Analysis)
+
+Analyze model behavior using logit lens, ablation studies, attention visualization, and MoE expert identification:
+
+```bash
+# Run logit lens analysis - see how predictions evolve across layers
+chuk-lazarus introspect analyze -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 -p "The capital of France is"
+
+# Track specific tokens through layers
+chuk-lazarus introspect analyze -m model -p "Hello" --track "world,there" --layer-strategy all
+
+# Compare two models' predictions
+chuk-lazarus introspect compare -m1 google/gemma-3-270m-it -m2 google/functiongemma-270m-it -p "Get the weather" --track "get_"
+
+# Ablation study - find causal circuits
+chuk-lazarus introspect ablate -m model -p "What's the weather?" -c function_call --layers 8-15
+
+# Multi-layer ablation - test layers together
+chuk-lazarus introspect ablate -m model -p "45 * 45 = " -c "2025" --layers 22,23 --multi
+
+# Low-level hook demonstration
+chuk-lazarus introspect hooks -m model -p "Test" --layers 0,4,8 --capture-attention
+```
+
+**MoE Expert Identification** - Discover what each expert specializes in:
+
+```python
+from mlx_lm import load
+from chuk_lazarus.introspection import ExpertIdentifier, identify_experts
+
+# Load any MoE model
+model, tokenizer = load("openai/gpt-oss-20b")
+
+# Identify all experts in a layer
+result = identify_experts(model, tokenizer, layer_idx=12)
+print(result.summary())
+
+# Results show expert specializations:
+# CODE: Experts [1, 14, 22, 23, 27, 28]
+# MATH: Experts [6, 7, 19, 24, 30, 31]
+# CONTENT_WORDS: Experts [0, 2, 3, 4, 5, 8, 9, ...]
+# NAMES: Experts [15, 26]
+
+# Get detailed identity for specific expert
+expert_6 = result.expert_identities[6]
+print(expert_6.detailed_report())
+# Expert 6: math (52% confidence)
+# Top tokens: ['+', '2', 'x', '3', ...]
+# Semantic clusters: ['numeric_values']
+```
+
+**MoE Routing Analysis** - Capture and analyze routing decisions:
+
+```python
+from chuk_lazarus.introspection import MoEHooks, MoECaptureConfig
+
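+hooks = MoEHooks(model)  # `model` as loaded in the expert identification example above
+hooks.configure(MoECaptureConfig(
+    capture_router_logits=True,
+    capture_selected_experts=True,
+))
+
+logits = hooks.forward(input_ids)  # `input_ids`: tokenized prompt, prepared beforehand (not shown)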
+
+# Analyze routing
+utilization = hooks.get_expert_utilization(layer_idx=12)
+print(f"Load balance: {utilization.load_balance_score:.2%}")
+
+entropy = hooks.get_router_entropy(layer_idx=12)
+print(f"Router confidence: {1 - entropy.normalized_entropy:.2%}")
+```
+
+**Logit Lens and Ablation:**
+
+```python
+from chuk_lazarus.introspection import ModelAnalyzer, AnalysisConfig, LayerStrategy
+
+# Async API for logit lens analysis (run inside an async function or a notebook with top-level await)
+async with ModelAnalyzer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0") as analyzer:
+    result = await analyzer.analyze("The capital of France is")
+    print(result.predicted_token)  # " Paris"
+    for layer in result.layer_predictions:
+        print(f"Layer {layer.layer_idx}: {layer.top_token}")
+
+    # Track token evolution (still inside the context manager, where `analyzer` is live)
+    config = AnalysisConfig(track_tokens=["Paris", " Paris"])
+    result = await analyzer.analyze("The capital of France is", config)
+    for evo in result.token_evolutions:
+        print(f"{evo.token} emerges at layer {evo.emergence_layer}")
+```
+
+```python
+from chuk_lazarus.introspection import AblationStudy, AblationConfig
+
+# Ablation studies - identify causal circuits
+study = AblationStudy.from_pretrained("openai/gpt-oss-20b")
+config = AblationConfig(max_new_tokens=15)
+
+original = study.ablate_and_generate("45 * 45 = ", layers=[], config=config)
+ablated = study.ablate_and_generate("45 * 45 = ", layers=[22, 23], config=config)
+print(f"Original: {original}")  # "2025..."
+print(f"L22+L23 ablated: {ablated}")  # Broken output
+```
+
+See [docs/introspection.md](docs/introspection.md) for detailed introspection documentation.
+
 ## Python API
 
 ```python
@@ -322,10 +433,18 @@ src/chuk_lazarus/
 │   └── losses/             # Loss functions (pure math)
 ├── training/               # BatchPlan-driven reference trainers (SFT, DPO, GRPO, PPO)
 ├── inference/              # Unified inference pipeline
-│   ├── pipeline.py         # InferencePipeline high-level API
+│   ├── unified.py          # UnifiedPipeline with auto-detection
 │   ├── loader.py           # HFLoader, DType, WeightConverter
 │   ├── chat.py             # ChatHistory, Role, format_chat_prompt
 │   └── generation.py       # GenerationConfig, generate, generate_stream
+├── introspection/          # Model introspection and analysis
+│   ├── analyzer.py         # ModelAnalyzer async API with Pydantic models
+│   ├── hooks.py            # ModelHooks for capturing intermediate states
+│   ├── logit_lens.py       # Layer-by-layer prediction analysis
+│   ├── attention.py        # Attention pattern analysis
+│   ├── moe.py              # MoE introspection (routing, expert identification)
+│   ├── ablation/           # Ablation studies for causal discovery
+│   └── visualizers/        # Heatmaps and evolution plots
 ├── distributed/            # Distributed training utilities
 └── utils/                  # Utilities
 ```
@@ -335,14 +454,16 @@ src/chuk_lazarus/
 | Module | Description |
 |--------|-------------|
 | **Models** | Composable architecture: components, blocks, backbones, heads, families (Llama, Gemma, Granite) |
-| **Inference** | Unified pipeline API: `InferencePipeline`, chat history, streaming generation |
+| **Inference** | `UnifiedPipeline` with auto-detection, chat history, streaming generation |
+| **Introspection** | Model analysis: logit lens, attention visualization, MoE expert identification, ablation studies |
 | **Tokenizers** | Comprehensive toolkit for analysis, preprocessing, and runtime management |
 | **Batching** | Token-budget batching, sequence packing, distributed batch planning |
 | **Streaming** | Puzzle arcade integration, replay buffers, online learning |
 | **Training** | BatchPlan-driven trainers — enforce, don't decide |
 
 ## Features
 
+- **Introspection**: Logit lens, attention visualization, MoE expert identification, ablation studies, token evolution tracking
 - **Tokenizer Toolkit**: Encode, decode, analyze, compare, fingerprint, and debug any tokenizer
 - **Character Tokenizer**: Built-in character-level tokenizer for classification experiments
 - **Tokenizer Doctor**: Health check with auto-fix for missing chat templates
@@ -404,6 +525,7 @@ If the tokenizer or data changes, fingerprint mismatch is detected before traini
 - [CLI Reference](docs/cli.md) - Command-line interface documentation
 - [Models Guide](docs/models.md) - Composable model architecture, components, LoRA adapters
 - [Inference Guide](docs/inference.md) - Run inference with pretrained HuggingFace models
+- [Introspection Guide](docs/introspection.md) - Logit lens, attention visualization, model analysis
 - [Tokenizers Guide](docs/tokenizers.md) - Comprehensive tokenizer toolkit
 - [Batching Guide](docs/batching.md) - Token-budget batching, packing, distributed training
 - [Training Guide](docs/training.md) - BatchPlan-driven training
@@ -419,6 +541,8 @@ If the tokenizer or data changes, fingerprint mismatch is detected before traini
 | **Gemma** | Gemma 3 (270M, 1B, 4B, 12B, 27B), FunctionGemma | 128K context, function calling |
 | **Granite** | 3.0/3.1 (2B, 8B), 4.0 Tiny (1B, 1.5B MoE) | IBM, dense and MoE variants |
 | **StarCoder2** | 3B, 7B, 15B | Code generation |
+| **Jamba** | v0.1, 1.5 Mini (52B), 1.5 Large (398B) | AI21 hybrid Mamba-Transformer MoE, 256K context |
+| **Mamba** | 130M, 370M, 790M, 1.4B, 2.8B | Pure SSM architecture |
 
 ## OpenAI Tokenizers