3 changes: 3 additions & 0 deletions .gitignore
@@ -70,3 +70,6 @@ sample_data/

# Jupyter
.ipynb_checkpoints/

# ai
.claude/
Binary file added L19_cluster.png
Binary file added L20_cluster.png
Binary file added L21_cluster.png
158 changes: 141 additions & 17 deletions README.md
@@ -161,7 +161,7 @@ async with TelnetGymClient(config) as client:
buffer.add(sample)
```

**Supported puzzles:** Sudoku, KenKen, Nonogram, Lights Out, Sokoban, Minesweeper, and [16 others](docs/gym.md).
**Supported puzzles:** Sudoku, KenKen, Kakuro, Binary, Futoshiki, Nonogram, Logic Grid, Killer Sudoku, Lights Out, Mastermind, Slitherlink, Bridges, Hitori, Shikaku, Hidato, Tents, Fillomino, Star Battle, Sokoban, Knapsack, Nurikabe, Minesweeper.

### BatchPlan-Driven Training

@@ -196,33 +196,30 @@ chuk-lazarus generate --type math --output ./data/lazarus
chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is 2+2?"
```

### Inference Pipeline (New!)
### UnifiedPipeline

The new unified inference pipeline provides a simplified API for running inference with any supported model family. One-liner setup, no boilerplate:
The `UnifiedPipeline` auto-detects the model family and provides a simplified API. One-line setup, no boilerplate:

```python
from chuk_lazarus.inference import InferencePipeline, PipelineConfig, DType
from chuk_lazarus.models_v2 import LlamaConfig, LlamaForCausalLM

# One-liner model loading
pipeline = InferencePipeline.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    LlamaForCausalLM,
    LlamaConfig,
)
from chuk_lazarus.inference import UnifiedPipeline, UnifiedPipelineConfig, DType

# One-liner model loading - auto-detects family!
pipeline = UnifiedPipeline.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Simple chat API
result = pipeline.chat("What is the capital of France?")
print(result.text)
print(result.stats.summary) # "25 tokens in 0.42s (59.5 tok/s)"
print(f"Model family: {pipeline.family_type}") # ModelFamilyType.LLAMA
```

**Key features:**
- Typed configuration with Pydantic (`PipelineConfig`, `GenerationConfig`)
- Async support (`InferencePipeline.from_pretrained_async`)
- Auto-detection of model family from HuggingFace config
- Typed configuration with Pydantic (`UnifiedPipelineConfig`, `GenerationConfig`)
- Async support (`UnifiedPipeline.from_pretrained_async`)
- Chat history management (`ChatHistory`)
- Streaming generation (`generate_stream`)
- No magic strings - uses enums (`DType`, `Role`)
- No magic strings - uses enums (`DType`, `Role`, `ModelFamilyType`)
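
The auto-detection feature listed above can be pictured as a lookup on the `model_type` field of a HuggingFace `config.json`. The sketch below is illustrative only: `FamilySketch`, `MODEL_TYPE_TO_FAMILY`, and `detect_family` are hypothetical names, and the mapping is an assumption rather than the table `UnifiedPipeline` actually uses.

```python
from enum import Enum


class FamilySketch(Enum):
    """Hypothetical stand-in for a model-family enum (not the library's)."""

    LLAMA = "llama"
    GEMMA = "gemma"
    GRANITE = "granite"
    STARCODER2 = "starcoder2"
    JAMBA = "jamba"
    MAMBA = "mamba"


# Assumed mapping from the `model_type` field of a HuggingFace config.json
# to a family. The real lookup inside chuk-lazarus may differ.
MODEL_TYPE_TO_FAMILY = {
    "llama": FamilySketch.LLAMA,
    "gemma3": FamilySketch.GEMMA,
    "gemma3_text": FamilySketch.GEMMA,
    "granite": FamilySketch.GRANITE,
    "granitemoe": FamilySketch.GRANITE,
    "starcoder2": FamilySketch.STARCODER2,
    "jamba": FamilySketch.JAMBA,
    "mamba": FamilySketch.MAMBA,
}


def detect_family(config: dict) -> FamilySketch:
    """Return the family for a parsed config.json, or raise if unknown."""
    model_type = str(config.get("model_type", "")).lower()
    try:
        return MODEL_TYPE_TO_FAMILY[model_type]
    except KeyError:
        raise ValueError(f"Unsupported model_type: {model_type!r}") from None


print(detect_family({"model_type": "llama"}).name)  # LLAMA
```

Using an enum for the result keeps callers away from string comparisons, in line with the "no magic strings" point in the feature list.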

```bash
# Simplified inference examples
@@ -251,6 +248,14 @@ uv run python examples/inference/granite_inference.py --model granite-3.1-2b

# Llama 4 Scout (Hybrid Mamba-Transformer MoE)
uv run python examples/inference/llama4_inference.py

# StarCoder2 (Code generation, 3B/7B/15B)
uv run python examples/inference/starcoder2_inference.py --prompt "def fibonacci(n):"
uv run python examples/inference/starcoder2_inference.py --interactive # Interactive mode

# Jamba (Hybrid Mamba-Transformer MoE, 256K context)
uv run python examples/inference/jamba_inference.py --test-tiny # Test without download
uv run python examples/inference/jamba_inference.py --list # Show models
```

### FunctionGemma (Function Calling)
@@ -270,6 +275,112 @@ FunctionGemma is a 270M parameter model optimized for on-device function calling

See [docs/inference.md](docs/inference.md) for detailed inference documentation.

### Introspection (Model Analysis)

Analyze model behavior using logit lens, ablation studies, attention visualization, and MoE expert identification:

```bash
# Run logit lens analysis - see how predictions evolve across layers
chuk-lazarus introspect analyze -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 -p "The capital of France is"

# Track specific tokens through layers
chuk-lazarus introspect analyze -m model -p "Hello" --track "world,there" --layer-strategy all

# Compare two models' predictions
chuk-lazarus introspect compare -m1 google/gemma-3-270m-it -m2 google/functiongemma-270m-it -p "Get the weather" --track "get_"

# Ablation study - find causal circuits
chuk-lazarus introspect ablate -m model -p "What's the weather?" -c function_call --layers 8-15

# Multi-layer ablation - test layers together
chuk-lazarus introspect ablate -m model -p "45 * 45 = " -c "2025" --layers 22,23 --multi

# Low-level hook demonstration
chuk-lazarus introspect hooks -m model -p "Test" --layers 0,4,8 --capture-attention
```

**MoE Expert Identification** - Discover what each expert specializes in:

```python
from mlx_lm import load
from chuk_lazarus.introspection import ExpertIdentifier, identify_experts

# Load any MoE model
model, tokenizer = load("openai/gpt-oss-20b")

# Identify all experts in a layer
result = identify_experts(model, tokenizer, layer_idx=12)
print(result.summary())

# Results show expert specializations:
# CODE: Experts [1, 14, 22, 23, 27, 28]
# MATH: Experts [6, 7, 19, 24, 30, 31]
# CONTENT_WORDS: Experts [0, 2, 3, 4, 5, 8, 9, ...]
# NAMES: Experts [15, 26]

# Get detailed identity for specific expert
expert_6 = result.expert_identities[6]
print(expert_6.detailed_report())
# Expert 6: math (52% confidence)
# Top tokens: ['+', '2', 'x', '3', ...]
# Semantic clusters: ['numeric_values']
```

**MoE Routing Analysis** - Capture and analyze routing decisions:

```python
from chuk_lazarus.introspection import MoEHooks, MoECaptureConfig

hooks = MoEHooks(model)
hooks.configure(MoECaptureConfig(
    capture_router_logits=True,
    capture_selected_experts=True,
))

logits = hooks.forward(input_ids)

# Analyze routing
utilization = hooks.get_expert_utilization(layer_idx=12)
print(f"Load balance: {utilization.load_balance_score:.2%}")

entropy = hooks.get_router_entropy(layer_idx=12)
print(f"Router confidence: {1 - entropy.normalized_entropy:.2%}")
```
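
To make the two metrics above concrete, here is a minimal sketch of how a normalized router entropy and a load-balance score could be computed from raw routing data. The function names and the exact definitions are assumptions for illustration; the values returned by `get_router_entropy` and `get_expert_utilization` may be defined differently.

```python
import math
from collections import Counter


def normalized_entropy(probs: list[float]) -> float:
    """Shannon entropy of a routing distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def load_balance_score(selected: list[list[int]], num_experts: int) -> float:
    """Normalized entropy of how often each expert was selected.

    `selected` holds, for every token, the indices of the experts chosen
    for it. A value near 1.0 means traffic is spread evenly across experts;
    a value near 0.0 means a single expert takes almost all of it.
    """
    counts = Counter(e for token_experts in selected for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    usage = [counts.get(e, 0) / total for e in range(num_experts)]
    return normalized_entropy(usage)


# Toy data: one token's router probabilities and top-2 choices for 4 tokens.
router_probs = [0.7, 0.1, 0.1, 0.1]
top2_choices = [[0, 1], [0, 2], [0, 3], [0, 1]]

print(f"Router confidence: {1 - normalized_entropy(router_probs):.2%}")
print(f"Load balance: {load_balance_score(top2_choices, num_experts=4):.2%}")
```

Under these definitions, a peaked router distribution gives high confidence, while an expert that appears in every top-2 set pulls the load-balance score below 100%.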

**Logit Lens and Ablation:**

```python
import asyncio

from chuk_lazarus.introspection import ModelAnalyzer, AnalysisConfig, LayerStrategy


async def main():
    # Async API for logit lens analysis
    async with ModelAnalyzer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0") as analyzer:
        result = await analyzer.analyze("The capital of France is")
        print(result.predicted_token)  # " Paris"
        for layer in result.layer_predictions:
            print(f"Layer {layer.layer_idx}: {layer.top_token}")

        # Track token evolution
        config = AnalysisConfig(track_tokens=["Paris", " Paris"])
        result = await analyzer.analyze("The capital of France is", config)
        for evo in result.token_evolutions:
            print(f"{evo.token} emerges at layer {evo.emergence_layer}")


asyncio.run(main())
```
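
For readers new to the technique, the logit lens decodes each layer's hidden state through the output (unembedding) matrix and reads off the top token. The sketch below shows that idea on toy data in plain Python. It makes no claim about how `ModelAnalyzer` is implemented: all names are hypothetical, the final normalization step that real models apply before unembedding is omitted, and "emergence" is assumed to mean the first layer at which the tracked token is the top prediction.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a (vocab x hidden) matrix by a hidden-state vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def logit_lens(hidden_states, unembed, vocab) -> list[str]:
    """Top token per layer when each hidden state is decoded directly."""
    top_tokens = []
    for hidden in hidden_states:
        logits = matvec(unembed, hidden)
        best = max(range(len(logits)), key=logits.__getitem__)
        top_tokens.append(vocab[best])
    return top_tokens


def emergence_layer(top_tokens: list[str], token: str):
    """First layer index at which `token` is the top prediction, else None."""
    for idx, top in enumerate(top_tokens):
        if top == token:
            return idx
    return None


# Toy setup: 3-token vocabulary, hidden size 2, three layers.
vocab = [" London", " Paris", " Rome"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per vocab entry
hidden_states = [[0.9, 0.1], [0.5, 0.6], [0.1, 1.2]]  # one vector per layer

tops = logit_lens(hidden_states, unembed, vocab)
print(tops)                              # [' London', ' Paris', ' Paris']
print(emergence_layer(tops, " Paris"))   # 1
```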

```python
from chuk_lazarus.introspection import AblationStudy, AblationConfig

# Ablation studies - identify causal circuits
study = AblationStudy.from_pretrained("openai/gpt-oss-20b")
config = AblationConfig(max_new_tokens=15)

original = study.ablate_and_generate("45 * 45 = ", layers=[], config=config)
ablated = study.ablate_and_generate("45 * 45 = ", layers=[22, 23], config=config)
print(f"Original: {original}") # "2025..."
print(f"L22+L23 ablated: {ablated}") # Broken output
```
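
Conceptually, ablating a layer means removing its contribution to the residual stream and comparing the output with the unmodified run. The toy sketch below illustrates this with scalar "layers". The names are hypothetical, and zeroing the layer update is only one possible ablation strategy, so `AblationStudy` may work differently.

```python
from typing import Callable, Iterable, Sequence

Layer = Callable[[float], float]


def run_residual(x: float, layers: Sequence[Layer], ablate: Iterable[int] = ()) -> float:
    """Run a toy residual stream, zeroing the update of ablated layers."""
    skipped = set(ablate)
    for idx, layer in enumerate(layers):
        # Each layer adds an update to the stream; an ablated layer adds nothing.
        update = 0.0 if idx in skipped else layer(x)
        x = x + update
    return x


# Three toy layers: constant shift, input-dependent boost, small correction.
layers = [lambda x: 1.0, lambda x: 2.0 * x, lambda x: -0.5]

baseline = run_residual(1.0, layers)
ablated = run_residual(1.0, layers, ablate=[1])

print(f"Baseline: {baseline}")                       # Baseline: 5.5
print(f"Layer 1 ablated: {ablated}")                 # Layer 1 ablated: 1.5
print(f"Causal effect of layer 1: {baseline - ablated}")  # Causal effect of layer 1: 4.0
```

A large gap between the baseline and the ablated output marks the layer as causally important for that input, which is the same reasoning applied to layers 22 and 23 in the example above.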

See [docs/introspection.md](docs/introspection.md) for detailed introspection documentation.

## Python API

```python
@@ -322,10 +433,18 @@ src/chuk_lazarus/
│ └── losses/ # Loss functions (pure math)
├── training/ # BatchPlan-driven reference trainers (SFT, DPO, GRPO, PPO)
├── inference/ # Unified inference pipeline
│ ├── pipeline.py # InferencePipeline high-level API
│ ├── unified.py # UnifiedPipeline with auto-detection
│ ├── loader.py # HFLoader, DType, WeightConverter
│ ├── chat.py # ChatHistory, Role, format_chat_prompt
│ └── generation.py # GenerationConfig, generate, generate_stream
├── introspection/ # Model introspection and analysis
│ ├── analyzer.py # ModelAnalyzer async API with Pydantic models
│ ├── hooks.py # ModelHooks for capturing intermediate states
│ ├── logit_lens.py # Layer-by-layer prediction analysis
│ ├── attention.py # Attention pattern analysis
│ ├── moe.py # MoE introspection (routing, expert identification)
│ ├── ablation/ # Ablation studies for causal discovery
│ └── visualizers/ # Heatmaps and evolution plots
├── distributed/ # Distributed training utilities
└── utils/ # Utilities
```
@@ -335,14 +454,16 @@ src/chuk_lazarus/
| Module | Description |
|--------|-------------|
| **Models** | Composable architecture: components, blocks, backbones, heads, families (Llama, Gemma, Granite) |
| **Inference** | Unified pipeline API: `InferencePipeline`, chat history, streaming generation |
| **Inference** | `UnifiedPipeline` with auto-detection, chat history, streaming generation |
| **Introspection** | Model analysis: logit lens, attention visualization, MoE expert identification, ablation studies |
| **Tokenizers** | Comprehensive toolkit for analysis, preprocessing, and runtime management |
| **Batching** | Token-budget batching, sequence packing, distributed batch planning |
| **Streaming** | Puzzle arcade integration, replay buffers, online learning |
| **Training** | BatchPlan-driven trainers — enforce, don't decide |

## Features

- **Introspection**: Logit lens, attention visualization, MoE expert identification, ablation studies, token evolution tracking
- **Tokenizer Toolkit**: Encode, decode, analyze, compare, fingerprint, and debug any tokenizer
- **Character Tokenizer**: Built-in character-level tokenizer for classification experiments
- **Tokenizer Doctor**: Health check with auto-fix for missing chat templates
@@ -404,6 +525,7 @@ If the tokenizer or data changes, fingerprint mismatch is detected before traini
- [CLI Reference](docs/cli.md) - Command-line interface documentation
- [Models Guide](docs/models.md) - Composable model architecture, components, LoRA adapters
- [Inference Guide](docs/inference.md) - Run inference with pretrained HuggingFace models
- [Introspection Guide](docs/introspection.md) - Logit lens, attention visualization, model analysis
- [Tokenizers Guide](docs/tokenizers.md) - Comprehensive tokenizer toolkit
- [Batching Guide](docs/batching.md) - Token-budget batching, packing, distributed training
- [Training Guide](docs/training.md) - BatchPlan-driven training
@@ -419,6 +541,8 @@ If the tokenizer or data changes, fingerprint mismatch is detected before traini
| **Gemma** | Gemma 3 (270M, 1B, 4B, 12B, 27B), FunctionGemma | 128K context, function calling |
| **Granite** | 3.0/3.1 (2B, 8B), 4.0 Tiny (1B, 1.5B MoE) | IBM, dense and MoE variants |
| **StarCoder2** | 3B, 7B, 15B | Code generation |
| **Jamba** | v0.1, 1.5 Mini (52B), 1.5 Large (398B) | AI21 hybrid Mamba-Transformer MoE, 256K context |
| **Mamba** | 130M, 370M, 790M, 1.4B, 2.8B | Pure SSM architecture |

## OpenAI Tokenizers
