docs/architecture.md

# System Architecture

GLaDOS is built on Marvin Minsky's **Society of Mind** architecture, where multiple specialized agents contribute to a unified intelligence. Rather than a single monolithic AI, GLaDOS assembles a dynamic context from independent subagents (emotion, memory, observation) for each LLM interaction.

## Society of Mind Overview

Each subagent runs its own loop, processes its domain independently, and writes outputs to shared **slots**. The main agent reads all slot contents as part of its context, giving it awareness of emotional state, environment, memory, and more — without coupling these systems together.

```mermaid
flowchart TB
subgraph Minds["Subagents (Minds)"]
E[Emotion Agent]
O[Observer Agent]
C[Compaction Agent]
W[Weather / News]
end

subgraph Slots["Shared Slots"]
S1["[emotion] excited, engaged"]
S2["[observer] modifiers active"]
S3["[weather] 22Β°C, sunny"]
end

Minds --> Slots
Slots --> CTX[Context Builder]
CTX --> LLM[Main LLM Agent]
USER[User Input] --> LLM
LLM --> TTS[Speech Output]
```
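
The slot mechanism can be sketched as a thread-safe store that subagents write to and the context builder reads from. The `SlotStore` class and method names below are illustrative, not the project's actual API:

```python
import threading

class SlotStore:
    """Thread-safe key-value store: subagents write, the context builder reads."""

    def __init__(self) -> None:
        self._slots: dict[str, str] = {}
        self._lock = threading.Lock()

    def write(self, name: str, summary: str) -> None:
        # Each subagent overwrites only its own slot, so minds stay decoupled.
        with self._lock:
            self._slots[name] = summary

    def render(self) -> str:
        # Render every slot as a "[name] content" line for the LLM context.
        with self._lock:
            return "\n".join(f"[{name}] {text}" for name, text in sorted(self._slots.items()))

slots = SlotStore()
slots.write("emotion", "excited, engaged")
slots.write("weather", "22°C, sunny")
print(slots.render())
```

The lock matters because each subagent runs in its own thread while the context builder reads on every LLM request.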

## Two-Lane LLM Orchestration

GLaDOS separates user-facing and background inference into two independent lanes:

```mermaid
flowchart LR
A[User Input<br>speech / text] --> B[Priority Lane<br>1 dedicated worker]
C[Autonomy Loop<br>subagents / jobs] --> D[Autonomy Lane<br>N pooled workers]
B --> E[TTS → Audio]
D --> E
```

- **Priority lane**: A single dedicated LLM worker that handles user input. User requests are never blocked by background work.
- **Autonomy lane**: A configurable pool of 1–16 workers (default 2) for background processing — autonomy ticks, subagent LLM calls, and background jobs.

Both lanes share the TTS and audio output pipeline.
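
A minimal sketch of the two lanes using `queue.Queue` and worker threads. The queue and thread names mirror the project's, but the worker logic is a simplified stand-in, and the autonomy bound of 32 is an arbitrary example value:

```python
import queue
import threading

llm_queue_priority: "queue.Queue[dict | None]" = queue.Queue()            # user input, never blocked
llm_queue_autonomy: "queue.Queue[dict | None]" = queue.Queue(maxsize=32)  # background work, bounded
tts_queue: "queue.Queue[str]" = queue.Queue()                             # shared output path

def llm_worker(source: queue.Queue, name: str) -> None:
    while True:
        request = source.get()
        if request is None:  # sentinel: shut this worker down
            break
        # ... real LLM inference would happen here ...
        tts_queue.put(f"{name} handled: {request['text']}")

# One dedicated priority worker; a small pool (default 2) for autonomy.
priority_worker = threading.Thread(
    target=llm_worker, args=(llm_queue_priority, "priority"), name="LLMProcessor")
priority_worker.start()
autonomy_workers = [
    threading.Thread(target=llm_worker, args=(llm_queue_autonomy, f"autonomy-{i + 1}"),
                     name=f"LLMProcessorAutonomy-{i + 1}")
    for i in range(2)]
for w in autonomy_workers:
    w.start()

# Demo: a user request flows through the dedicated priority worker.
llm_queue_priority.put({"text": "hello"})
response = tts_queue.get(timeout=5)  # "priority handled: hello"

# Sentinels let the non-daemon workers exit cleanly.
llm_queue_priority.put(None)
for _ in autonomy_workers:
    llm_queue_autonomy.put(None)
```

Because the priority lane has its own queue and worker, a burst of autonomy jobs can never delay a user request.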

## Thread Architecture

All components run in dedicated threads connected by `queue.Queue` instances.

| Thread | Class | Daemon | Shutdown Priority | Purpose |
|--------|-------|--------|-------------------|---------|
| `SpeechListener` | `SpeechListener` | Yes | INPUT | VAD → ASR transcription |
| `TextListener` | `TextListener` | Yes | INPUT | stdin / TUI text input |
| `LLMProcessor` | `LanguageModelProcessor` | No | PROCESSING | Priority lane LLM inference |
| `LLMProcessorAutonomy-N` | `LanguageModelProcessor` | No | PROCESSING | Autonomy lane LLM inference (1–16 workers) |
| `ToolExecutor` | `ToolExecutor` | No | PROCESSING | Native + MCP tool dispatch |
| `TTSSynthesizer` | `TextToSpeechSynthesizer` | No | OUTPUT | Text → audio synthesis |
| `AudioPlayer` | `SpeechPlayer` | No | OUTPUT | Audio playback via sounddevice |
| `AutonomyLoop` | `AutonomyLoop` | Yes | BACKGROUND | Autonomy tick orchestration |
| `AutonomyTicker` | (timer thread) | Yes | BACKGROUND | Periodic tick generation |
| `VisionProcessor` | `VisionProcessor` | Yes | BACKGROUND | Camera capture → FastVLM inference |

**Daemon vs non-daemon**: Daemon threads (`True`) are stateless input threads that can be killed immediately. Non-daemon threads (`False`) have in-flight state (conversation updates, pending audio) and must complete gracefully.
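
The distinction can be seen in miniature (thread names come from the table above; the work functions are stand-ins):

```python
import threading
import time

def listener() -> None:
    # Stateless input loop: safe to kill at any instant when the process exits.
    while True:
        time.sleep(0.1)

def processor(done: threading.Event) -> None:
    # Holds in-flight state; must be allowed to finish its current item.
    time.sleep(0.05)  # pretend to commit a conversation update
    done.set()

finished = threading.Event()
threading.Thread(target=listener, name="TextListener", daemon=True).start()
t = threading.Thread(target=processor, args=(finished,), name="LLMProcessor", daemon=False)
t.start()
t.join()  # graceful: wait for in-flight work before exiting
```

The daemon listener is simply abandoned at interpreter exit, while the non-daemon processor is joined so its state lands safely.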

## Queue-Based Message Flow

```mermaid
flowchart LR
SL[SpeechListener] -->|text| PQ[llm_queue_priority]
TL[TextListener] -->|text| PQ
AL[AutonomyLoop] -->|tick| AQ[llm_queue_autonomy]

PQ --> LLM1[LLMProcessor]
AQ --> LLM2[LLMProcessor<br>Autonomy 1..N]

LLM1 -->|tool calls| TCQ[tool_calls_queue]
LLM2 -->|tool calls| TCQ
TCQ --> TE[ToolExecutor]
TE -->|results| PQ
TE -->|results| AQ

LLM1 -->|text| TQ[tts_queue]
LLM2 -->|text| TQ
TQ --> TTS[TTSSynthesizer]
TTS -->|audio| AAQ[audio_queue]
AAQ --> AP[AudioPlayer]
```

### Queue Details

| Queue | Type | Bounded | Connects |
|-------|------|---------|----------|
| `llm_queue_priority` | `Queue[dict]` | Unbounded | Input → Priority LLM worker |
| `llm_queue_autonomy` | `Queue[dict]` | Configurable | Autonomy → Autonomy LLM workers |
| `tool_calls_queue` | `Queue[dict]` | Unbounded | LLM → ToolExecutor |
| `tts_queue` | `Queue[str]` | Unbounded | LLM → TTSSynthesizer |
| `audio_queue` | `Queue[AudioMessage]` | Unbounded | TTSSynthesizer → AudioPlayer |

## Shutdown Orchestration

Shutdown proceeds in priority phases, each fully completing before the next begins:

```mermaid
flowchart LR
A["1. INPUT<br>Stop listeners"] --> B["2. PROCESSING<br>Drain LLM + tools"]
B --> C["3. OUTPUT<br>Drain TTS + audio"]
C --> D["4. BACKGROUND<br>Abandon autonomy"]
D --> E["5. CLEANUP<br>Final teardown"]
```

| Phase | Priority | Components | Behavior |
|-------|----------|------------|----------|
| INPUT | 1 | SpeechListener, TextListener | Stop accepting new work |
| PROCESSING | 2 | LLMProcessor, ToolExecutor | Complete in-flight work, drain queues |
| OUTPUT | 3 | TTSSynthesizer, AudioPlayer | Complete pending output |
| BACKGROUND | 4 | AutonomyLoop, VisionProcessor | Can safely abandon |
| CLEANUP | 5 | (final operations) | Final teardown |

The `ShutdownOrchestrator` manages this process with configurable timeouts (global: 30s, per-phase: 10s). For each phase, it drains component queues first, then joins threads.
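
The drain-then-join pattern can be sketched as below, assuming workers acknowledge items with `task_done()`. The helper names are hypothetical; the real `ShutdownOrchestrator` internals are not shown in this document:

```python
import queue
import threading

def drain(q: queue.Queue, timeout: float = 10.0) -> bool:
    """Wait until every item put on the queue has been acknowledged via task_done()."""
    done = threading.Event()

    def waiter() -> None:
        q.join()  # blocks until all pending items are processed
        done.set()

    threading.Thread(target=waiter, daemon=True).start()
    return done.wait(timeout)  # False if the phase timeout expired

def shutdown_phase(queues: list[queue.Queue], threads: list[threading.Thread],
                   per_phase_timeout: float = 10.0) -> None:
    # Drain component queues first, then join the threads.
    for q in queues:
        drain(q, per_phase_timeout)
    for t in threads:
        t.join(per_phase_timeout)
```

Running `q.join()` on a helper thread lets the timeout apply even though `Queue.join` itself has no timeout parameter.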

## Context Building Pipeline

Each LLM request assembles context from registered sources, ordered by priority (higher = earlier in context):

| Priority | Source | Content |
|----------|--------|---------|
| 10 | `preferences` | User preferences (name, language, etc.) |
| 8 | `slots` | Autonomy slot summaries (weather, news, etc.) |
| 7 | `memory` | Relevant long-term memories |
| 5 | `emotion` | Current PAD emotional state |
| 5 | `knowledge` | Local knowledge notes |
| 3 | `constitution` | Constitutional behavioral modifiers |

The `ContextBuilder` calls each source function on every request. Sources returning `None` are skipped. The resulting system messages are prepended to the conversation before sending to the LLM.

The full message assembly order:
1. Personality preprompt (system/user/assistant messages)
2. Context builder system messages (table above)
3. MCP resource messages (cached, TTL-based)
4. Conversation history
5. Current user message

## Component Interaction Overview

```mermaid
flowchart TB
subgraph Input
MIC[Microphone] --> VAD[VAD] --> ASR[ASR Engine]
KB[Keyboard/TUI] --> TL[TextListener]
CAM[Camera] --> VP[VisionProcessor]
end

subgraph Processing
ASR --> SL[SpeechListener]
SL --> LLM[LLMProcessor<br>Priority]
TL --> LLM
VP --> AL[AutonomyLoop]
AL --> LLMA[LLMProcessor<br>Autonomy]
LLM --> TE[ToolExecutor]
LLMA --> TE
TE --> MCP[MCP Servers]
TE --> NT[Native Tools]
end

subgraph Output
LLM --> TTS[TTSSynthesizer]
LLMA --> TTS
TTS --> SP[SpeechPlayer]
SP --> SPKR[Speaker]
end

subgraph Background
SM[SubagentManager] --> EA[EmotionAgent]
SM --> OA[ObserverAgent]
SM --> CA[CompactionAgent]
EA --> SS[SlotStore]
OA --> SS
SS --> CTX[ContextBuilder]
CTX --> LLM
CTX --> LLMA
end
```

## See Also

- [README](../README.md) — Full project overview
- [autonomy.md](./autonomy.md) — Autonomy loop and subagent details
- [mcp.md](./mcp.md) — MCP tool system
- [audio.md](./audio.md) — Audio pipeline details

docs/audio.md

# Audio Pipeline

GLaDOS uses a fully local audio pipeline with ONNX-based models for voice activity detection, speech recognition, and text-to-speech synthesis. All inference runs on-device with no cloud dependencies.

## Pipeline Overview

```mermaid
flowchart LR
MIC[Microphone<br>16kHz mono] --> VAD[Silero VAD<br>32ms chunks]
VAD -->|speech detected| BUF[Pre-activation<br>Buffer 800ms]
BUF --> ASR[ASR Engine<br>Parakeet ONNX]
ASR -->|text| LLM[LLM Processor]
LLM -->|text| TTS[TTS Engine<br>GLaDOS / Kokoro]
TTS -->|audio| SP[SpeechPlayer<br>sounddevice]
SP --> SPKR[Speaker]
```

## Voice Activity Detection (VAD)

GLaDOS uses **Silero VAD** (ONNX) to detect when the user is speaking.

| Parameter | Value |
|-----------|-------|
| Model | Silero VAD (ONNX) |
| Sample rate | 16,000 Hz |
| Chunk size | 32ms (512 samples) |
| Trigger threshold | 0.8 (configurable) |
| Audio format | Mono float32 (converted from 16-bit capture) |

The VAD processes audio in 32ms chunks. When the VAD confidence exceeds the threshold (default 0.8), the system transitions to recording mode and begins accumulating audio for ASR.

### Pre-Activation Buffer

A rolling buffer captures audio **before** VAD triggers, preventing the loss of word beginnings:

- **Buffer size**: 800ms (25 chunks at 32ms each)
- **Implementation**: `deque(maxlen=25)` of 32ms audio chunks
- When VAD triggers, the buffer contents are prepended to the recording
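
A sketch of the pre-roll mechanism (the function names are illustrative; the deque size matches the 25-chunk / 800ms buffer described above):

```python
from collections import deque

VAD_SIZE_MS = 32
BUFFER_CHUNKS = 800 // VAD_SIZE_MS  # 25 chunks = 800 ms of pre-roll

pre_buffer: deque = deque(maxlen=BUFFER_CHUNKS)

def on_silence_chunk(chunk) -> None:
    # While no speech is detected, keep a rolling window of the latest 800 ms;
    # deque(maxlen=...) silently drops the oldest chunk on overflow.
    pre_buffer.append(chunk)

def start_recording() -> list:
    # On VAD trigger, seed the recording with the buffered pre-roll so the
    # onset of the first word is not lost.
    return list(pre_buffer)
```
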

### Speech Segmentation

Speech is segmented by silence gaps:

- **Pause limit**: 640ms of silence ends a speech segment
- When the gap counter exceeds `PAUSE_LIMIT / VAD_SIZE` (20 chunks), the accumulated audio is sent to ASR
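
The gap-counting logic can be sketched over a sequence of per-chunk VAD decisions (the helper is illustrative; real segmentation accumulates audio chunks, not booleans):

```python
VAD_SIZE_MS = 32
PAUSE_LIMIT_MS = 640
PAUSE_CHUNKS = PAUSE_LIMIT_MS // VAD_SIZE_MS  # 20 silent chunks end a segment

def segment(vad_flags: list[bool]) -> list[int]:
    """Return the length (in chunks) of each speech segment, split on 640 ms silences."""
    segments: list[int] = []
    current = 0  # chunks accumulated in the open segment
    gap = 0      # consecutive silent chunks since the last speech chunk
    for is_speech in vad_flags:
        if is_speech:
            current += gap + 1  # short silences inside a segment still count as audio
            gap = 0
        elif current:
            gap += 1
            if gap > PAUSE_CHUNKS:  # silence exceeded the pause limit: flush to ASR
                segments.append(current)
                current, gap = 0, 0
    if current:
        segments.append(current)
    return segments
```
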

## ASR Engines

GLaDOS supports two NVIDIA Parakeet ASR engines, selectable via the `asr_engine` config option.

### Parakeet TDT (Token and Duration Transducer)

The default and recommended engine, offering the best accuracy.

| Aspect | Value |
|--------|-------|
| Config value | `asr_engine: "tdt"` |
| Architecture | Encoder + Decoder + Joiner (transducer) |
| Model size | 0.6B parameters |
| Models | `parakeet-tdt-0.6b-v3_encoder.onnx`, `_decoder.onnx`, `_joiner.onnx` |
| Sample rate | 16,000 Hz |
| Backend | ONNX Runtime (CPU/CUDA) |

### Parakeet CTC (Connectionist Temporal Classification)

A lighter alternative with faster inference at the cost of some accuracy.

| Aspect | Value |
|--------|-------|
| Config value | `asr_engine: "ctc"` |
| Architecture | Single encoder with CTC head |
| Model size | 110M parameters |
| Model | `nemo-parakeet_tdt_ctc_110m.onnx` |
| Sample rate | 16,000 Hz |
| Backend | ONNX Runtime (CPU/CUDA) |

Both engines use mel spectrogram preprocessing at 16 kHz, with `n_fft`, window size, and mel-bin count read from each model's config YAML.

## TTS Engines

The TTS engine is selected by the `voice` config option. Setting `voice: "glados"` uses the GLaDOS engine; any other value selects a Kokoro voice.

### GLaDOS Voice (Piper VITS)

The signature GLaDOS voice from the Portal games.

| Aspect | Value |
|--------|-------|
| Config value | `voice: "glados"` |
| Architecture | Piper VITS (ONNX) |
| Model | `models/TTS/glados.onnx` |
| Sample rate | 22,050 Hz |
| Phonemizer | Custom ONNX phonemizer (`phomenizer_en.onnx`) |
| Pipeline | Text → Phonemizer → VITS → Audio |

### Kokoro (Multi-Voice)

A multi-voice TTS engine supporting various voice styles.

| Aspect | Value |
|--------|-------|
| Config value | `voice: "<voice_name>"` (e.g., `af_bella`, `am_adam`) |
| Architecture | Kokoro ONNX |
| Model | `models/TTS/kokoro-v1.0.fp16.onnx` |
| Sample rate | 24,000 Hz |
| Max phoneme length | 510 |
| Default voice | `af_alloy` |

Available voice prefixes:
- `af_` — Female voices (e.g., `af_bella`, `af_alloy`, `af_nova`, `af_shimmer`)
- `am_` — Male voices (e.g., `am_adam`, `am_echo`, `am_orion`, `am_sage`)
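
Engine selection reduces to a single string check. The tuple returned here is illustrative, with model paths and sample rates taken from the tables above:

```python
def select_tts_engine(voice: str) -> tuple[str, str, int]:
    """Return (engine, model_path, sample_rate) for a configured voice."""
    if voice == "glados":
        return ("glados", "models/TTS/glados.onnx", 22_050)
    # Any other value is treated as a Kokoro voice name.
    return ("kokoro", "models/TTS/kokoro-v1.0.fp16.onnx", 24_000)

engine, model_path, sample_rate = select_tts_engine("af_bella")
```
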

## Interruption Handling

When `interruptible: true` (default), user speech interrupts GLaDOS mid-response:

```mermaid
sequenceDiagram
participant U as User
participant VAD as VAD
participant SP as SpeechPlayer
participant LLM as LLMProcessor
participant E as EmotionAgent

SP->>SP: Playing GLaDOS response
U->>VAD: Starts speaking
VAD->>SP: Stop playback
SP->>SP: Record percentage spoken
SP->>LLM: Clip response at interruption point
SP->>E: EmotionEvent("user", "User interrupted me mid-sentence")
VAD->>LLM: New user input (priority lane)
```

Key behaviors:
1. **Playback stops immediately** when VAD detects speech during output
2. **Response is clipped** — the conversation history records only the portion that was actually spoken
3. **Emotion event fires** — the emotion agent receives an interruption event, which may increase arousal
4. **Priority lane** ensures the new user input is processed immediately

## Wake Word Support

When `wake_word` is configured, GLaDOS only processes speech that contains the wake word.

- **Matching**: Uses Levenshtein distance (edit distance) for fuzzy matching
- **Threshold**: A word matches if its Levenshtein distance to the wake word is small enough
- **Case-insensitive**: Both the transcription and wake word are lowercased before comparison
- **Per-word check**: Each word in the transcription is checked independently

```yaml
wake_word: "glados" # Only respond when "glados" (or similar) is spoken
```

If the wake word is not detected in the transcription, the input is silently discarded.
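
A sketch of the matching logic with a plain dynamic-programming edit distance; the `max_distance` bound of 2 is an assumed value for illustration, not GLaDOS's actual threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def contains_wake_word(transcript: str, wake_word: str, max_distance: int = 2) -> bool:
    # Each transcribed word is checked independently, case-insensitively,
    # so minor ASR errors ("glado's", "glad0s") still activate.
    wake = wake_word.lower()
    return any(levenshtein(word.lower(), wake) <= max_distance
               for word in transcript.split())
```
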

## Audio I/O Backend

GLaDOS uses the `sounddevice` library for audio I/O, wrapped in the `AudioProtocol` interface.

```python
class AudioProtocol(Protocol):
def __init__(self, vad_threshold: float | None = None) -> None: ...
def start_speaking(self, audio_data, sample_rate=None, text="") -> None: ...
def measure_percentage_spoken(self, total_samples, sample_rate=None) -> tuple[bool, int]: ...
def is_speaking(self) -> bool: ...
def stop_speaking(self) -> None: ...
```

The protocol-based design allows swapping audio backends. Currently `sounddevice` is the only implementation; a `websocket` backend is planned.

## Configuration Reference

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `voice` | string | required | TTS voice: `"glados"` or any Kokoro voice name |
| `asr_engine` | string | required | ASR engine: `"tdt"` (best) or `"ctc"` (faster) |
| `audio_io` | string | required | Audio backend: `"sounddevice"` |
| `interruptible` | bool | required | Allow user to interrupt mid-response |
| `wake_word` | string/null | `null` | Optional wake word for activation |
| `asr_muted` | bool | `false` | Start with ASR muted |
| `tts_enabled` | bool | `true` | Enable TTS output |
| `announcement` | string/null | `null` | Text to speak on startup |

## See Also

- [README](../README.md) — Full project overview
- [architecture.md](./architecture.md) — System architecture and thread model
- [configuration.md](./configuration.md) — Complete configuration reference