Skip to content

Conversation

@ericelliott
Copy link
Contributor

Repurposing deprecated Cloverfield as a next-gen streaming cross-modal AI transformer model.

- Remove all previous cf-package content
- Initialize AI Driven Development with npx aidd --cursor
- Add comprehensive README documenting SPCE (Spectral Phase-Coherent Encoding)
- SPCE: continuous spectral phase field encoding for cross-modal alignment
- Architecture supports streaming cross-modal attention across text, audio, video

This marks the transition from a package generation tool to a research project
focused on next-token prediction with synchronized multi-channel inputs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Document continuous field representation philosophy
- Specify absolute time evaluation to prevent drift
- Detail learnable spectral distributions with shared atom pools
- Explain cross-modal synchronization with global tick
- Provide training dynamics and gradient flow strategies
- Outline computational efficiency with fused CUDA kernels
- Define keyframe anchoring and spectral regularization
- Include minimal viable recipe and success metrics

Core principle: compute phase from absolute time, never integrate noise.
Use dual-view phase (absolute + local), shared ω atoms with per-head gates,
and scheduled keyframes for long-term stability.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Major revision grounded in "Textbooks Are All You Need" philosophy:

Training Philosophy:
- Quality over scale (Phi-1/1.5/2 approach)
- Physics simulations = textbook quality for multimodal
- Curated educational content vs web scrapes

Simplified Core Architecture:
- SPCE: shared ω atoms (12-24), per-head gates only
- Spectral SSM carry: diagonal, low-rank, unit circle eigenvalues
- Keyframes: periodic anchors every T seconds

Training Pipeline:
- Stage 1: Captioned lectures (static scenes)
- Stage 2: Unreal Engine physics (dynamic scenes)
- Stage 3: Mixed datasets for transfer
- Input structure: tick-ordered multimodal tokens

Success Metrics:
- Phase stability, A/V sync error, caption timing
- Force vector prediction, trajectory extrapolation
- Physics understanding benchmarks

Implementation Roadmap:
- Phase 1: SPCE validation (1-2 months)
- Phase 2: Single-modal streaming (2-3 months)
- Phase 3: Physics-grounded video (3-6 months)
- Phase 4: Full multimodal 6B (6-12 months)

Removed theoretical clutter, focused on measurable outcomes,
grounded in proven quality-over-scale principles.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
claude added 10 commits October 28, 2025 11:05
Detailed comparison of Cloverfield vs standard transformers:

Key Results:
- Time complexity: O(w²) vs O(n²) → 200× ops reduction for 4-hour video
- Memory: 0.27 GB vs 52 GB → 193× memory savings
- Throughput: 4-6× faster than standard attention
- Unbounded streaming with constant memory

Detailed Breakdown:
1. SPCE overhead vs RoPE (1-3%, negligible)
2. SSM carry vs full attention (40,000× attention ops reduction)
3. Keyframe overhead (<1% of compute)
4. Streaming comparison (Standard/FlashAttention/Mamba/Cloverfield)
5. Memory footprint analysis (KV cache explosion avoided)
6. Throughput estimates (competitive with Mamba)
7. Training efficiency (2-3× faster convergence expected)

When Cloverfield Wins:
- Long-form streaming (hours of video)
- Phase-sensitive tasks (A/V sync)
- Real-time inference (predictable latency)
- Multi-hour reasoning (SSM carry state)

Potential Weaknesses:
- Strong copying/in-context learning (Mamba limitation)
- Very short sequences (<2K tokens)
- Random access patterns

Architecture Comparison Table:
Standard Transformer vs Mamba vs Cloverfield across 8 dimensions

Implementation Notes:
- CUDA kernel priorities and development timeline
- Fused SPCE rotary, diagonal SSM update, keyframe injection

Grounded in published benchmarks from Mamba, FlashAttention,
and RoPE literature (2023-2024).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Expanded References section with 6 categories of related work:

1. Training Philosophy: Quality Over Scale
   - Textbooks Are All You Need (Phi-1, Phi-1.5, Phi-2)
   - Establishes quality-over-scale principle for SPCE approach

2. Unified Multimodal Architectures
   - Gemini 1.5 (joint transformers, direct cross-modal tokenization)
   - Meta-Transformer (unified tokenizer, 12 modalities)
   - ImageBind (joint embedding, 6 modalities)
   - Chameleon (early-fusion token-based)
   - UniForm (unified latent space for audio-video)

3. State Space Models for Streaming
   - Mamba (linear-time SSMs, 5× faster)
   - Mamba-2 (state space duality, 8× faster inference)

4. Spectral & Frequency Domain Methods
   - FNO (Fourier Neural Operator)
   - AFNO (Adaptive FNO for transformers)
   - GFNet (Global Filter Network)
   - SpectFormer (hybrid spectral + attention)

5. Complex-Valued & Phase-Aware Networks
   - Complex CNNs for non-stationary data
   - Survey of complex-valued architectures
   - Phase-aware audio processing

6. Continuous & Coordinate-Based Representations
   - Neural ODEs (continuous depth)
   - SIREN (periodic activations)
   - NeRF (continuous 3D scenes)

7. Positional Encoding Methods
   - RoPE (rotary embeddings)
   - ALiBi (attention with linear biases)
   - Time-aware encoding for video

8. Temporal Coherence
   - TimeSformer (space-time attention)

Added "SPCE's Unique Contribution" section highlighting how SPCE
differs from and builds upon all these approaches:
- Continuous spectral phase field (not discrete)
- Absolute time evaluation (no drift)
- Cross-modal phase coherence (structural, not learned)
- Physics-grounded coordinates
- Unbounded streaming with SSM + keyframes

All references verified with arXiv IDs, conference venues, and
project URLs where available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…oding

Major architectural revision incorporating efficiency and NLU improvements:

Core Architecture Updates:
1. **SPCE preserved** - Spectral Phase-Coherent Encoding remains core innovation
2. **Windowed attention** - Mandatory 2048-token windows for O(w²) efficiency
3. **Visual text encoding** - DeepSeek OCR approach, 10× compression with 97% fidelity
4. **Three-tier retrieval** - Perfect token recall, unbounded context:
   - Tier 1: Windowed attention (local, 2K tokens)
   - Tier 2: RETRO-style chunked knowledge retrieval (physics DB)
   - Tier 3: kNN exact token retrieval (FAISS, phase-aware)
5. **SSM carry** - Cross-window state persistence
6. **Stateful inference** - Low-rank Hebbian adaptation (future capability)

Computational Efficiency:
- 38,000× reduction in attention ops vs standard transformers
- 173× memory savings (0.3 GB vs 52 GB for 400K tokens)
- Retrieval overhead negligible (<10% of compute)
- DeepSeek visual text: 10× compression, unified representation

Architecture Comparison:
- Added "Perfect recall" row (kNN retrieval)
- Added "Knowledge retrieval" row (RETRO-style)
- Added "NLU capability" row (Excellent via visual text)
- Updated complexity to O(w² + w·log n)

Success Metrics:
- Added retrieval performance metrics
- kNN recall accuracy, retrieval latency, phase-aware similarity

New References:
- **DeepSeek-OCR** - Context optical compression (arXiv:2510.18234)
- **RETRO** - Retrieval-enhanced transformer (arXiv:2112.04426)
- **Memorizing Transformers** - kNN-augmented attention (arXiv:2203.08913)
- **REALM** - End-to-end learned retrieval (arXiv:2002.08909)
- **FiD** - Fusion-in-decoder architecture
- **ATLAS** - Few-shot retrieval-augmented learning (arXiv:2208.03299)
- **Perceiver IO** - Cross-attention latent bottleneck (arXiv:2107.14795)

Key Properties:
✅ Full NLU capability maintained (visual text + retrieval)
✅ Unbounded context via kNN memory (perfect recall)
✅ Phase-coherent cross-modal alignment (SPCE)
✅ Efficient streaming (windowed O(w²))
✅ Physics knowledge retrieval (domain-specific DB)

Architecture now balances efficiency, NLU capability, and cross-modal
phase coherence through integrated retrieval mechanisms.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Define "SOTA in weight class" (4-10B params, M4 Max trainable)
- Specify task-specific window sizes (4K-16K) based on retrieval architecture
- Detail use cases for all modalities: NLU, video, image, audio, code
- Document physics grounding competitive advantage (learned world model)
- Establish quantitative benchmarks and qualitative metrics
- Emphasize flexible windowing: retrieval + SSM carry handle long-range dependencies

Key insight: Architecture enables smaller windows (video gen ~4K, code gen ~8K)
because three-tier retrieval and low-frequency phase coherence capture
context outside window boundaries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
SPCE is a generalization of RoPE that extends rotary position embeddings to:
- Continuous absolute time (t = timestamp, not discrete positions)
- Shared frequency palette across modalities (cross-modal phase coherence)
- Explicit phase offsets for alignment

Fixed the analogy section to make it clear SPCE *is* the rotary encoding
mechanism, replacing RoPE rather than working alongside it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…oding

Major corrections to align README with REQUIREMENTS and architectural reality:

1. CRITICAL: Removed false "orthogonal to content" claim
   - NLU preserved via full O(w²) cross-attention within window
   - Phase rotation enhances cross-modal alignment without compromising semantics

2. Flexible window sizing (4K-32K) replacing hardcoded 2048
   - Video gen (standard): 4K-8K tokens
   - Video gen (ultra quality): 16K-32K tokens (supports 8-12 sec @ ultra)
   - Code gen: ~8K tokens
   - NLU: 8K-16K tokens

3. Added key design principle: "Why small windows work"
   - Three-tier retrieval + SSM carry = unbounded effective context
   - Constant compute cost regardless of sequence length

4. Emphasized architectural moat: Unbounded understanding with bounded compute
   - Small finite windows + retrieval + SSM = unlimited context
   - Perfect recall via kNN, knowledge via RETRO, state via SSM carry

5. SPCE information density advantage
   - Continuous time + shared cross-modal frequencies + explicit phase offsets
   - MORE information than RoPE at similar computational cost

6. Updated all computational examples
   - Use 16K window as realistic baseline for video understanding
   - 25× memory reduction (52 GB → 2.1 GB active)
   - 620× ops reduction for long sequences

All calculations now reflect production-quality architecture designed
for 6B model training on M4 Max 128GB.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Major reframing throughout documentation:

## Core Vision Update
- Target users: Creative professionals, game developers, AI avatar users
- Long-term goal: Real-time holodeck-like interactive world generation
- Physics grounding is TRAINING METHODOLOGY, not use case
- General-purpose foundation model, not narrow physics domain

## Use Case Examples Replaced Throughout
OLD (physics-focused):
- "What is projectile motion?" → formula application
- Physics overlays, force vectors, trajectory prediction
- Math lectures, equation grounding

NEW (creative professional-focused):
- "Give Mom a funny Christmas sweater in this family photo"
- AI avatar conversations with visual characters
- "Make this sunset more dramatic and add ambient ocean sounds"
- "Add a glowing particle effect to this Unity scene"
- Explore AI-generated virtual worlds
- Instruct-edit capabilities for image/video

## Training Curriculum Updated
- Stage 1: Text-to-image with dialogue (not math lectures)
- Stage 2: Dynamic scenes & interaction (not physics equations)
- Stage 3: Conversational & editing tasks (not physics problems)
- Stage 4: Real-world augmentation for creative workflows

## Success Metrics Revised
- Instruction following & editing accuracy
- Physical plausibility (lighting, materials, not F=ma predictions)
- Conversational AI with visual grounding
- Interactive world generation (real-time, consistent)

## Implementation Roadmap Updated
- Phase 2: Text-to-image foundation (not single-modal streaming)
- Phase 3: Temporal & editing capabilities (not physics video)
- Phase 4: Conversational AI & interactive worlds (holodeck prototype)

## Key Positioning
Physics-grounded training teaches world models → enables physically
plausible generation for creative applications. Not building a physics
tutor; building Adobe Firefly meets holodeck for consumer hardware.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…xibility

CRITICAL FIXES:

1. Remove pretrained model dependencies (LLaMA/Mistral incompatible)
   - Novel SPCE + windowed attention breaks pretrained weights
   - Must train 6B from scratch on M4 Max 128GB
   - Training footprint: 50-60GB (fits comfortably)

2. Document SPCE's unique window flexibility advantage
   - SPCE uses absolute time (θ = ω·t), not position-based encoding
   - Model learns temporal relationships, not position patterns
   - Enables variable window training (4K-32K randomly sampled)
   - Same weights work with any window at inference
   - Cannot do this with RoPE/LLaMA/Mistral architectures

3. Update training timelines (training time, not dev effort)
   - Phase 1: SPCE validation
   - Phase 2: Base model (3-6 months training time)
   - Phase 3: Temporal & editing (additional 2-4 months)
   - Phase 4: Conversational & holodeck (additional 2-4 months)
   - Total: 7-14 months continuous training on M4 Max

4. Computational examples show window flexibility
   - 16K window: 258M ops/layer (video understanding)
   - 4K window: 17M ops/layer (real-time holodeck, 16× faster)
   - Same model, different windows, task-adaptive performance
   - Memory: 0.5-2.1 GB depending on window choice

5. Minor fixes
   - "Physics knowledge" → "relevant examples" (general use cases)
   - Remove QLoRA references (training from scratch, not fine-tuning)
   - Clarify M4 Max 128GB as training hardware

Key insight: SPCE's temporal encoding is window-agnostic, enabling
unprecedented flexibility that RoPE-based models cannot achieve.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
CRITICAL: Restored original vision that was lost in previous revisions.

SPCE is NOT a positional encoding - it's a continuous manifold where all
modalities coexist. Each token's coordinates are a point in spectral phase
field that evolves with time, space, and energy.

Full formulation: θ = ω·t + k·r + φ₀
- ω: Log-spaced spectral frequencies (10⁻⁴ to 10³ Hz)
- t: Absolute continuous time
- k: Spatial frequency vectors (k_x, k_y, k_z)
- r: Spatial coordinates (x, y, z)
- φ₀: Phase offset

Crystal lattice spiral geometry:
- Log-spaced ω atoms form quasi-crystalline lattice in phase space
- Low-ω: Wide spirals (hours, narrative arc)
- High-ω: Tight spirals (milliseconds, frame details)
- All modalities at (t,r) map to same lattice coordinates

Multi-scale temporal decomposition:
- Per-head gates select temporal scales (Fourier-like decomposition)
- Low-freq heads: long-range dependencies
- High-freq heads: short-range patterns
- Mixed heads: cross-scale relationships

Key properties restored:
1. Temporal continuity: phase unwrapped across infinite timesteps
2. Spatial coherence: 3D objects share coordinate system via k·r
3. Spectral control: learnable ω adapts to phenomena
4. Cross-modal alignment: audio/video/text sync via shared phase
5. Memory anchoring: low-ω provides stable SSM carry references
6. Window agnostic: ∆θ = ω·∆t invariant to window size

This is fundamentally different from RoPE/ALiBi - it's geometric embedding
into multi-scale phase manifold, not position-based encoding.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Key additions:

1. SPCE pronounced "space" - mnemonic and literal (encodes space + time)

2. Framed as "field embedding" not positional encoding
   - Continuous spectral phase field spanning space and time
   - Each token = point on helical manifold
   - Shared coordinate system for all modalities

3. Implementation details:
   - Complex exponential basis: e^(iωt)
   - Phase continuity: θ_{t+Δt} = θ_t + ω·Δt
   - Per-head ω distributions for multi-scale temporal sensitivity
   - Re-normalization to prevent long-run drift

4. Expanded SSM coupling property:
   - Low-frequency ω bands persist in SSM carry (narrative state)
   - High-frequency bands refresh with attention (local details)

5. Multi-view potential:
   - SPCE = shared coordinate frame for multiple cameras
   - Spatial frequencies encode same 3D world from different viewpoints

This captures the original vision: SPCE is not RoPE++, it's a fundamentally
different approach using field theory to embed events in continuous phase space.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants