docs/architecture.md

# System Architecture

GLaDOS is built on Marvin Minsky's **Society of Mind** architecture, where multiple specialized agents contribute to a unified intelligence. Rather than a single monolithic AI, GLaDOS assembles a dynamic context from independent subagents (emotion, memory, observation) for each LLM interaction.

## Society of Mind Overview

Each subagent runs its own loop, processes its domain independently, and writes outputs to shared **slots**. The main agent reads all slot contents as part of its context, giving it awareness of emotional state, environment, memory, and more — without coupling these systems together.

```mermaid
flowchart TB
subgraph Minds["Subagents (Minds)"]
E[Emotion Agent]
O[Observer Agent]
C[Compaction Agent]
W[Weather / News]
end

subgraph Slots["Shared Slots"]
S1["[emotion] excited, engaged"]
S2["[observer] modifiers active"]
S3["[weather] 22Β°C, sunny"]
end

Minds --> Slots
Slots --> CTX[Context Builder]
CTX --> LLM[Main LLM Agent]
USER[User Input] --> LLM
LLM --> TTS[Speech Output]
```
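
The slot mechanism can be sketched as a thread-safe store that subagents write to and the context builder reads from. The `SlotStore` class and method names below are illustrative, not the project's actual API:

```python
import threading

class SlotStore:
    """Thread-safe key-value store: subagents write, the context builder reads."""

    def __init__(self) -> None:
        self._slots: dict[str, str] = {}
        self._lock = threading.Lock()

    def write(self, name: str, summary: str) -> None:
        # Each subagent overwrites only its own slot, so minds stay decoupled.
        with self._lock:
            self._slots[name] = summary

    def render(self) -> str:
        # Render every slot as a "[name] content" line for the LLM context.
        with self._lock:
            return "\n".join(f"[{name}] {text}" for name, text in sorted(self._slots.items()))

slots = SlotStore()
slots.write("emotion", "excited, engaged")
slots.write("weather", "22°C, sunny")
print(slots.render())
```

The lock matters because each subagent runs in its own thread while the context builder reads on every LLM request.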

## Two-Lane LLM Orchestration

GLaDOS separates user-facing and background inference into two independent lanes:

```mermaid
flowchart LR
A[User Input<br>speech / text] --> B[Priority Lane<br>1 dedicated worker]
C[Autonomy Loop<br>subagents / jobs] --> D[Autonomy Lane<br>N pooled workers]
B --> E[TTS → Audio]
D --> E
```

- **Priority lane**: A single dedicated LLM worker that handles user input. User requests are never blocked by background work.
- **Autonomy lane**: A configurable pool of 1–16 workers (default 2) for background processing — autonomy ticks, subagent LLM calls, and background jobs.

Both lanes share the TTS and audio output pipeline.
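
A minimal sketch of the two lanes using `queue.Queue` and worker threads. The queue and thread names mirror the project's, but the worker logic is a simplified stand-in, and the autonomy bound of 32 is an arbitrary example value:

```python
import queue
import threading

llm_queue_priority: "queue.Queue[dict | None]" = queue.Queue()            # user input, never blocked
llm_queue_autonomy: "queue.Queue[dict | None]" = queue.Queue(maxsize=32)  # background work, bounded
tts_queue: "queue.Queue[str]" = queue.Queue()                             # shared output path

def llm_worker(source: queue.Queue, name: str) -> None:
    while True:
        request = source.get()
        if request is None:  # sentinel: shut this worker down
            break
        # ... real LLM inference would happen here ...
        tts_queue.put(f"{name} handled: {request['text']}")

# One dedicated priority worker; a small pool (default 2) for autonomy.
priority_worker = threading.Thread(
    target=llm_worker, args=(llm_queue_priority, "priority"), name="LLMProcessor")
priority_worker.start()
autonomy_workers = [
    threading.Thread(target=llm_worker, args=(llm_queue_autonomy, f"autonomy-{i + 1}"),
                     name=f"LLMProcessorAutonomy-{i + 1}")
    for i in range(2)]
for w in autonomy_workers:
    w.start()

# Demo: a user request flows through the dedicated priority worker.
llm_queue_priority.put({"text": "hello"})
response = tts_queue.get(timeout=5)  # "priority handled: hello"

# Sentinels let the non-daemon workers exit cleanly.
llm_queue_priority.put(None)
for _ in autonomy_workers:
    llm_queue_autonomy.put(None)
```

Because the priority lane has its own queue and worker, a burst of autonomy jobs can never delay a user request.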

## Thread Architecture

All components run in dedicated threads connected by `queue.Queue` instances.

| Thread | Class | Daemon | Shutdown Priority | Purpose |
|--------|-------|--------|-------------------|---------|
| `SpeechListener` | `SpeechListener` | Yes | INPUT | VAD → ASR transcription |
| `TextListener` | `TextListener` | Yes | INPUT | stdin / TUI text input |
| `LLMProcessor` | `LanguageModelProcessor` | No | PROCESSING | Priority lane LLM inference |
| `LLMProcessorAutonomy-N` | `LanguageModelProcessor` | No | PROCESSING | Autonomy lane LLM inference (1–16 workers) |
| `ToolExecutor` | `ToolExecutor` | No | PROCESSING | Native + MCP tool dispatch |
| `TTSSynthesizer` | `TextToSpeechSynthesizer` | No | OUTPUT | Text → audio synthesis |
| `AudioPlayer` | `SpeechPlayer` | No | OUTPUT | Audio playback via sounddevice |
| `AutonomyLoop` | `AutonomyLoop` | Yes | BACKGROUND | Autonomy tick orchestration |
| `AutonomyTicker` | (timer thread) | Yes | BACKGROUND | Periodic tick generation |
| `VisionProcessor` | `VisionProcessor` | Yes | BACKGROUND | Camera capture → FastVLM inference |

**Daemon vs non-daemon**: Daemon threads (`True`) are stateless input threads that can be killed immediately. Non-daemon threads (`False`) have in-flight state (conversation updates, pending audio) and must complete gracefully.
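
The distinction can be seen in miniature (thread names come from the table above; the work functions are stand-ins):

```python
import threading
import time

def listener() -> None:
    # Stateless input loop: safe to kill at any instant when the process exits.
    while True:
        time.sleep(0.1)

def processor(done: threading.Event) -> None:
    # Holds in-flight state; must be allowed to finish its current item.
    time.sleep(0.05)  # pretend to commit a conversation update
    done.set()

finished = threading.Event()
threading.Thread(target=listener, name="TextListener", daemon=True).start()
t = threading.Thread(target=processor, args=(finished,), name="LLMProcessor", daemon=False)
t.start()
t.join()  # graceful: wait for in-flight work before exiting
```

The daemon listener is simply abandoned at interpreter exit, while the non-daemon processor is joined so its state lands safely.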

## Queue-Based Message Flow

```mermaid
flowchart LR
SL[SpeechListener] -->|text| PQ[llm_queue_priority]
TL[TextListener] -->|text| PQ
AL[AutonomyLoop] -->|tick| AQ[llm_queue_autonomy]

PQ --> LLM1[LLMProcessor]
AQ --> LLM2[LLMProcessor<br>Autonomy 1..N]

LLM1 -->|tool calls| TCQ[tool_calls_queue]
LLM2 -->|tool calls| TCQ
TCQ --> TE[ToolExecutor]
TE -->|results| PQ
TE -->|results| AQ

LLM1 -->|text| TQ[tts_queue]
LLM2 -->|text| TQ
TQ --> TTS[TTSSynthesizer]
TTS -->|audio| AAQ[audio_queue]
AAQ --> AP[AudioPlayer]
```

### Queue Details

| Queue | Type | Bounded | Connects |
|-------|------|---------|----------|
| `llm_queue_priority` | `Queue[dict]` | Unbounded | Input → Priority LLM worker |
| `llm_queue_autonomy` | `Queue[dict]` | Configurable | Autonomy → Autonomy LLM workers |
| `tool_calls_queue` | `Queue[dict]` | Unbounded | LLM → ToolExecutor |
| `tts_queue` | `Queue[str]` | Unbounded | LLM → TTSSynthesizer |
| `audio_queue` | `Queue[AudioMessage]` | Unbounded | TTSSynthesizer → AudioPlayer |

## Shutdown Orchestration

Shutdown proceeds in priority phases, each fully completing before the next begins:

```mermaid
flowchart LR
A["1. INPUT<br>Stop listeners"] --> B["2. PROCESSING<br>Drain LLM + tools"]
B --> C["3. OUTPUT<br>Drain TTS + audio"]
C --> D["4. BACKGROUND<br>Abandon autonomy"]
D --> E["5. CLEANUP<br>Final teardown"]
```

| Phase | Priority | Components | Behavior |
|-------|----------|------------|----------|
| INPUT | 1 | SpeechListener, TextListener | Stop accepting new work |
| PROCESSING | 2 | LLMProcessor, ToolExecutor | Complete in-flight work, drain queues |
| OUTPUT | 3 | TTSSynthesizer, AudioPlayer | Complete pending output |
| BACKGROUND | 4 | AutonomyLoop, VisionProcessor | Can safely abandon |
| CLEANUP | 5 | (final operations) | Final teardown |

The `ShutdownOrchestrator` manages this process with configurable timeouts (global: 30s, per-phase: 10s). For each phase, it drains component queues first, then joins threads.
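
The drain-then-join pattern can be sketched as below, assuming workers acknowledge items with `task_done()`. The helper names are hypothetical; the real `ShutdownOrchestrator` internals are not shown in this document:

```python
import queue
import threading

def drain(q: queue.Queue, timeout: float = 10.0) -> bool:
    """Wait until every item put on the queue has been acknowledged via task_done()."""
    done = threading.Event()

    def waiter() -> None:
        q.join()  # blocks until all pending items are processed
        done.set()

    threading.Thread(target=waiter, daemon=True).start()
    return done.wait(timeout)  # False if the phase timeout expired

def shutdown_phase(queues: list[queue.Queue], threads: list[threading.Thread],
                   per_phase_timeout: float = 10.0) -> None:
    # Drain component queues first, then join the threads.
    for q in queues:
        drain(q, per_phase_timeout)
    for t in threads:
        t.join(per_phase_timeout)
```

Running `q.join()` on a helper thread lets the timeout apply even though `Queue.join` itself has no timeout parameter.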

## Context Building Pipeline

Each LLM request assembles context from registered sources, ordered by priority (higher = earlier in context):

| Priority | Source | Content |
|----------|--------|---------|
| 10 | `preferences` | User preferences (name, language, etc.) |
| 8 | `slots` | Autonomy slot summaries (weather, news, etc.) |
| 7 | `memory` | Relevant long-term memories |
| 5 | `emotion` | Current PAD emotional state |
| 5 | `knowledge` | Local knowledge notes |
| 3 | `constitution` | Constitutional behavioral modifiers |

The `ContextBuilder` calls each source function on every request. Sources returning `None` are skipped. The resulting system messages are prepended to the conversation before sending to the LLM.

The full message assembly order:
1. Personality preprompt (system/user/assistant messages)
2. Context builder system messages (table above)
3. MCP resource messages (cached, TTL-based)
4. Conversation history
5. Current user message

## Component Interaction Overview

```mermaid
flowchart TB
subgraph Input
MIC[Microphone] --> VAD[VAD] --> ASR[ASR Engine]
KB[Keyboard/TUI] --> TL[TextListener]
CAM[Camera] --> VP[VisionProcessor]
end

subgraph Processing
ASR --> SL[SpeechListener]
SL --> LLM[LLMProcessor<br>Priority]
TL --> LLM
VP --> AL[AutonomyLoop]
AL --> LLMA[LLMProcessor<br>Autonomy]
LLM --> TE[ToolExecutor]
LLMA --> TE
TE --> MCP[MCP Servers]
TE --> NT[Native Tools]
end

subgraph Output
LLM --> TTS[TTSSynthesizer]
LLMA --> TTS
TTS --> SP[SpeechPlayer]
SP --> SPKR[Speaker]
end

subgraph Background
SM[SubagentManager] --> EA[EmotionAgent]
SM --> OA[ObserverAgent]
SM --> CA[CompactionAgent]
EA --> SS[SlotStore]
OA --> SS
SS --> CTX[ContextBuilder]
CTX --> LLM
CTX --> LLMA
end
```

## See Also

- [README](../README.md) — Full project overview
- [autonomy.md](./autonomy.md) — Autonomy loop and subagent details
- [mcp.md](./mcp.md) — MCP tool system
- [audio.md](./audio.md) — Audio pipeline details

docs/audio.md

# Audio Pipeline

GLaDOS uses a fully local audio pipeline with ONNX-based models for voice activity detection, speech recognition, and text-to-speech synthesis. All inference runs on-device with no cloud dependencies.

## Pipeline Overview

```mermaid
flowchart LR
MIC[Microphone<br>16kHz mono] --> VAD[Silero VAD<br>32ms chunks]
VAD -->|speech detected| BUF[Pre-activation<br>Buffer 800ms]
BUF --> ASR[ASR Engine<br>Parakeet ONNX]
ASR -->|text| LLM[LLM Processor]
LLM -->|text| TTS[TTS Engine<br>GLaDOS / Kokoro]
TTS -->|audio| SP[SpeechPlayer<br>sounddevice]
SP --> SPKR[Speaker]
```

## Voice Activity Detection (VAD)

GLaDOS uses **Silero VAD** (ONNX) to detect when the user is speaking.

| Parameter | Value |
|-----------|-------|
| Model | Silero VAD (ONNX) |
| Sample rate | 16,000 Hz |
| Chunk size | 32ms (512 samples) |
| Trigger threshold | 0.8 (configurable) |
| Audio format | Mono float32 (converted from 16-bit capture) |

The VAD processes audio in 32ms chunks. When the VAD confidence exceeds the threshold (default 0.8), the system transitions to recording mode and begins accumulating audio for ASR.

### Pre-Activation Buffer

A rolling buffer captures audio **before** VAD triggers, preventing the loss of word beginnings:

- **Buffer size**: 800ms (25 chunks at 32ms each)
- **Implementation**: `deque(maxlen=25)` of 32ms audio chunks
- When VAD triggers, the buffer contents are prepended to the recording
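
A sketch of the pre-roll mechanism (the function names are illustrative; the deque size matches the 25-chunk / 800ms buffer described above):

```python
from collections import deque

VAD_SIZE_MS = 32
BUFFER_CHUNKS = 800 // VAD_SIZE_MS  # 25 chunks = 800 ms of pre-roll

pre_buffer: deque = deque(maxlen=BUFFER_CHUNKS)

def on_silence_chunk(chunk) -> None:
    # While no speech is detected, keep a rolling window of the latest 800 ms;
    # deque(maxlen=...) silently drops the oldest chunk on overflow.
    pre_buffer.append(chunk)

def start_recording() -> list:
    # On VAD trigger, seed the recording with the buffered pre-roll so the
    # onset of the first word is not lost.
    return list(pre_buffer)
```
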

### Speech Segmentation

Speech is segmented by silence gaps:

- **Pause limit**: 640ms of silence ends a speech segment
- When the gap counter exceeds `PAUSE_LIMIT / VAD_SIZE` (20 chunks), the accumulated audio is sent to ASR
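
The gap-counting logic can be sketched over a sequence of per-chunk VAD decisions (the helper is illustrative; real segmentation accumulates audio chunks, not booleans):

```python
VAD_SIZE_MS = 32
PAUSE_LIMIT_MS = 640
PAUSE_CHUNKS = PAUSE_LIMIT_MS // VAD_SIZE_MS  # 20 silent chunks end a segment

def segment(vad_flags: list[bool]) -> list[int]:
    """Return the length (in chunks) of each speech segment, split on 640 ms silences."""
    segments: list[int] = []
    current = 0  # chunks accumulated in the open segment
    gap = 0      # consecutive silent chunks since the last speech chunk
    for is_speech in vad_flags:
        if is_speech:
            current += gap + 1  # short silences inside a segment still count as audio
            gap = 0
        elif current:
            gap += 1
            if gap > PAUSE_CHUNKS:  # silence exceeded the pause limit: flush to ASR
                segments.append(current)
                current, gap = 0, 0
    if current:
        segments.append(current)
    return segments
```
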

## ASR Engines

GLaDOS supports two NVIDIA Parakeet ASR engines, selectable via the `asr_engine` config option.

### Parakeet TDT (Token and Duration Transducer)

The default and recommended engine, offering the best accuracy.

| Aspect | Value |
|--------|-------|
| Config value | `asr_engine: "tdt"` |
| Architecture | Encoder + Decoder + Joiner (transducer) |
| Model size | 0.6B parameters |
| Models | `parakeet-tdt-0.6b-v3_encoder.onnx`, `_decoder.onnx`, `_joiner.onnx` |
| Sample rate | 16,000 Hz |
| Backend | ONNX Runtime (CPU/CUDA) |

### Parakeet CTC (Connectionist Temporal Classification)

A lighter alternative with faster inference at the cost of some accuracy.

| Aspect | Value |
|--------|-------|
| Config value | `asr_engine: "ctc"` |
| Architecture | Single encoder with CTC head |
| Model size | 110M parameters |
| Model | `nemo-parakeet_tdt_ctc_110m.onnx` |
| Sample rate | 16,000 Hz |
| Backend | ONNX Runtime (CPU/CUDA) |

Both engines use mel spectrogram preprocessing at 16 kHz, with `n_fft`, window size, and mel-bin count read from each model's config YAML.

## TTS Engines

The TTS engine is selected by the `voice` config option. Setting `voice: "glados"` uses the GLaDOS engine; any other value selects a Kokoro voice.

### GLaDOS Voice (Piper VITS)

The signature GLaDOS voice from the Portal games.

| Aspect | Value |
|--------|-------|
| Config value | `voice: "glados"` |
| Architecture | Piper VITS (ONNX) |
| Model | `models/TTS/glados.onnx` |
| Sample rate | 22,050 Hz |
| Phonemizer | Custom ONNX phonemizer (`phomenizer_en.onnx`) |
| Pipeline | Text → Phonemizer → VITS → Audio |

### Kokoro (Multi-Voice)

A multi-voice TTS engine supporting various voice styles.

| Aspect | Value |
|--------|-------|
| Config value | `voice: "<voice_name>"` (e.g., `af_bella`, `am_adam`) |
| Architecture | Kokoro ONNX |
| Model | `models/TTS/kokoro-v1.0.fp16.onnx` |
| Sample rate | 24,000 Hz |
| Max phoneme length | 510 |
| Default voice | `af_alloy` |

Available voice prefixes:
- `af_` — Female voices (e.g., `af_bella`, `af_alloy`, `af_nova`, `af_shimmer`)
- `am_` — Male voices (e.g., `am_adam`, `am_echo`, `am_orion`, `am_sage`)
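
Engine selection reduces to a single string check. The tuple returned here is illustrative, with model paths and sample rates taken from the tables above:

```python
def select_tts_engine(voice: str) -> tuple[str, str, int]:
    """Return (engine, model_path, sample_rate) for a configured voice."""
    if voice == "glados":
        return ("glados", "models/TTS/glados.onnx", 22_050)
    # Any other value is treated as a Kokoro voice name.
    return ("kokoro", "models/TTS/kokoro-v1.0.fp16.onnx", 24_000)

engine, model_path, sample_rate = select_tts_engine("af_bella")
```
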

## Interruption Handling

When `interruptible: true` (default), user speech interrupts GLaDOS mid-response:

```mermaid
sequenceDiagram
participant U as User
participant VAD as VAD
participant SP as SpeechPlayer
participant LLM as LLMProcessor
participant E as EmotionAgent

SP->>SP: Playing GLaDOS response
U->>VAD: Starts speaking
VAD->>SP: Stop playback
SP->>SP: Record percentage spoken
SP->>LLM: Clip response at interruption point
SP->>E: EmotionEvent("user", "User interrupted me mid-sentence")
VAD->>LLM: New user input (priority lane)
```

Key behaviors:
1. **Playback stops immediately** when VAD detects speech during output
2. **Response is clipped** — the conversation history records only the portion that was actually spoken
3. **Emotion event fires** — the emotion agent receives an interruption event, which may increase arousal
4. **Priority lane** ensures the new user input is processed immediately

## Wake Word Support

When `wake_word` is configured, GLaDOS only processes speech that contains the wake word.

- **Matching**: Uses Levenshtein distance (edit distance) for fuzzy matching
- **Threshold**: A word matches if its Levenshtein distance to the wake word is small enough
- **Case-insensitive**: Both the transcription and wake word are lowercased before comparison
- **Per-word check**: Each word in the transcription is checked independently

```yaml
wake_word: "glados" # Only respond when "glados" (or similar) is spoken
```

If the wake word is not detected in the transcription, the input is silently discarded.
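
A sketch of the matching logic with a plain dynamic-programming edit distance; the `max_distance` bound of 2 is an assumed value for illustration, not GLaDOS's actual threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def contains_wake_word(transcript: str, wake_word: str, max_distance: int = 2) -> bool:
    # Each transcribed word is checked independently, case-insensitively,
    # so minor ASR errors ("glado's", "glad0s") still activate.
    wake = wake_word.lower()
    return any(levenshtein(word.lower(), wake) <= max_distance
               for word in transcript.split())
```
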

## Audio I/O Backend

GLaDOS uses the `sounddevice` library for audio I/O, wrapped in the `AudioProtocol` interface.

```python
class AudioProtocol(Protocol):
def __init__(self, vad_threshold: float | None = None) -> None: ...
def start_speaking(self, audio_data, sample_rate=None, text="") -> None: ...
def measure_percentage_spoken(self, total_samples, sample_rate=None) -> tuple[bool, int]: ...
def is_speaking(self) -> bool: ...
def stop_speaking(self) -> None: ...
```

The protocol-based design allows swapping audio backends. Currently `sounddevice` is the only implementation; a `websocket` backend is planned.

## Configuration Reference

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `voice` | string | required | TTS voice: `"glados"` or any Kokoro voice name |
| `asr_engine` | string | required | ASR engine: `"tdt"` (best) or `"ctc"` (faster) |
| `audio_io` | string | required | Audio backend: `"sounddevice"` |
| `interruptible` | bool | required | Allow user to interrupt mid-response |
| `wake_word` | string/null | `null` | Optional wake word for activation |
| `asr_muted` | bool | `false` | Start with ASR muted |
| `tts_enabled` | bool | `true` | Enable TTS output |
| `announcement` | string/null | `null` | Text to speak on startup |

## See Also

- [README](../README.md) — Full project overview
- [architecture.md](./architecture.md) — System architecture and thread model
- [configuration.md](./configuration.md) — Complete configuration reference