VoxAI

Give your AI a voice. Talk to Claude Code, local models, and CLI tools — and have them talk back.

The core idea

Most AI coding tools are text-only. You type, they type back.

VoxAI adds the missing voice layer — on both sides:

Input: speak naturally, get real-time transcription
Output: AI responses read aloud through a customizable voice, powered by an MCP server

Connect it to Claude Code, a local model, or any MCP-compatible tool, and you get a complete voice interface for AI-assisted development. Code with your voice.

The MCP server exposes TTS as a tool. Your AI calls speak(text) and VoxAI handles the rest — no audio pipeline plumbing needed on the AI side.

Screenshots

[Screenshot: Dialog mode, idle]

[Screenshot: Dialog mode, recording]

[Screenshot: Meeting mode, multi-speaker transcription with speaker labels]

[Screenshot: Menu bar]

[Screenshot: Settings, voice engine and speaker diarization]

Features

🎙️ Voice output for AI — via MCP

Expose TTS as an MCP tool: any AI that supports MCP can call speak(text) to produce voice output
Works with Claude Code, local models (Ollama, LM Studio), and any MCP-compatible CLI tool
Cloud engine: edge-tts — high-quality Microsoft neural voices, no API key needed
Local engine: Qwen3-TTS via mlx-audio — runs fully offline on Apple Silicon
Customizable voice, language, and speech rate per session
stop_speaking tool lets the AI interrupt itself mid-sentence

🗣️ Real-time voice input

Floating overlay window stays on top of any app — speak, see text, keep working
Powered by SFSpeechRecognizer + AVAudioEngine with automatic session restart at the 60s system limit
Lyrics-style display: completed segments fade; the active segment stays bright
Auto-punctuation on macOS 13+

📋 Meeting recorder with speaker diarization

Full-window recording mode for multi-person sessions
After recording, whisperx identifies and labels each speaker's lines
Inline speaker renaming — click a label to rename; propagates across all segments
Export as Markdown, or send directly to an AI for summarization

⚙️ Settings

Switch between cloud and local TTS engine without restarting
Per-language voice selection (separate Chinese / English voices)
Silence detection threshold: 0.5s – 2.0s
HuggingFace token for speaker diarization (optional)
UI language follows macOS system setting (Chinese / English)

Usage

Basic workflow

Launch VoxAI: the app lives in your menu bar
Open Dialog mode: a floating window appears and stays on top of everything
Speak: transcription appears in real time; copy the text to your AI tool (Claude Code, terminal, chat interface)
AI responds: if the MCP server is connected, the AI calls speak(text) and VoxAI reads the response aloud automatically

That's the full loop: your voice in, AI voice out.

First-time setup

Menu bar icon → 偏好设置 / Preferences

Recognition language: set to Auto, Chinese, or English
TTS engine: Cloud (edge-tts, needs internet) or Local (Qwen3-TTS, fully offline)
Voice: pick from available voices for your chosen engine
HuggingFace Token: paste your token to enable speaker diarization in meeting mode (free at huggingface.co)

Meeting mode

Click 会议记录 / Meeting Notes in the menu bar
Press + to start a new recording
Speak — transcription segments appear in real time
Click 完成 / Done when finished; whisperx then runs on the saved recording and labels each speaker
Click any speaker name to rename it; the rename applies to all their segments
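
The rename step above amounts to rewriting one label wherever it occurs. The sketch below shows that idea together with a plain Markdown export, assuming a hypothetical `Segment` record and hypothetical helpers `rename_speaker` and `to_markdown`; the real logic lives in the Swift app and may differ. The `SPEAKER_00` style labels in the usage note are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed utterance (hypothetical shape, for illustration only)."""
    speaker: str
    text: str


def rename_speaker(segments: list[Segment], old: str, new: str) -> int:
    """Rename a speaker label in place on every matching segment.

    Returns the number of segments that were changed.
    """
    changed = 0
    for seg in segments:
        if seg.speaker == old:
            seg.speaker = new
            changed += 1
    return changed


def to_markdown(title: str, segments: list[Segment]) -> str:
    """Render a transcript as Markdown, one bold speaker label per line."""
    lines = [f"# {title}", ""]
    lines += [f"**{seg.speaker}:** {seg.text}" for seg in segments]
    return "\n".join(lines) + "\n"
```

For example, calling `rename_speaker(segments, "SPEAKER_00", "Alice")` once updates every line attributed to that speaker, so a later `to_markdown("Standup", segments)` shows the new name throughout.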

How the MCP integration works

You speak
  → VoxAI transcribes in real time
  → text goes to your AI tool (Claude Code, CLI, etc.)
  → AI processes and calls speak(text) via MCP
  → VoxAI reads the response aloud

The MCP server runs locally alongside the app. Once your AI tool is configured to connect to it, the AI can trigger speech output on its own by calling the tools below.

Available MCP tools:

| Tool | Description |
| --- | --- |
| `speak(text, voice?, speed?)` | Speak text aloud with optional voice/speed override |
| `stop_speaking()` | Interrupt current playback |
| `list_voices()` | List available voices for current engine |
| `update_voice_config(...)` | Change engine, voice, or speed at runtime |
| `list_meetings()` | List all recorded meetings with metadata |
| `get_meeting(id)` | Get full transcript for a meeting |
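
To make the tool list concrete: MCP carries tool invocations as JSON-RPC 2.0 requests with the method `tools/call`. The sketch below builds the kind of request a client might send for `speak` and `stop_speaking`. The helper name `build_tool_call` and the sample text and speed are illustrative assumptions, not part of VoxAI; only the tool names and parameter names come from the list above.

```python
import json
from itertools import count

# JSON-RPC requests on one connection need distinct ids.
_request_ids = count(1)


def build_tool_call(name: str, **arguments) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP tool.

    `build_tool_call` is a hypothetical helper for illustration.
    """
    # Arguments passed as None are dropped, so optional parameters
    # such as voice or speed are simply omitted from the request.
    args = {key: value for key, value in arguments.items() if value is not None}
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": args},
    }


if __name__ == "__main__":
    print(json.dumps(build_tool_call("speak", text="Build finished.", speed=1.2), indent=2))
    print(json.dumps(build_tool_call("stop_speaking")))
```

In practice an MCP client such as Claude Code constructs these requests itself; the sketch only shows what travels over the wire when the AI decides to speak.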

Tech stack

| Layer | Technology |
| --- | --- |
| App | Swift 5.9 + SwiftUI, macOS 14+ |
| Real-time ASR | `SFSpeechRecognizer` + `AVAudioEngine` |
| Offline transcription + diarization | whisperx 3.8.5 |
| Local TTS | mlx-audio + Qwen3-TTS-0.6B (Apple Silicon) |
| Cloud TTS | edge-tts |
| AI integration | MCP server (Python) |
| Python runtime | venv (Python 3.13) |

Architecture highlights

Session generation counter: SFSpeechRecognizer sessions time out after roughly 60 s. Each restart increments a generation counter; a callback from an earlier session compares the generation it captured with the current one and returns early on a mismatch. This keeps stale results from being written into the current segment after a restart.
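
The pattern is language-independent, so here is a minimal Python sketch of it. The class and method names (`RecognitionSession`, `restart`, `on_result`) are hypothetical, and clearing the segment on restart is an assumption; the app's Swift code is not shown in this README.

```python
from typing import Callable


class RecognitionSession:
    """Tracks a generation number so results from old sessions are ignored."""

    def __init__(self) -> None:
        self.generation = 0
        self.current_segment = ""

    def restart(self) -> Callable[[str], bool]:
        """Begin a new session and return a result callback bound to it."""
        self.generation += 1
        self.current_segment = ""  # assumption: a new session starts a fresh segment
        captured = self.generation  # frozen when the callback is created

        def on_result(text: str) -> bool:
            # A callback from a superseded session returns early
            # without touching the current segment.
            if captured != self.generation:
                return False
            self.current_segment = text
            return True

        return on_result
```

If `old = session.restart()` is followed by `new = session.restart()`, a late `old("...")` call returns `False` and leaves `current_segment` untouched, while `new("...")` is accepted.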

Continuous WAV recording across session restarts — AVAudioFile stays open across recognition session restarts so meeting audio is one contiguous file, ready for whisperx diarization after the session ends.
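
As a rough analogue in Python, the standard `wave` module can play the role of `AVAudioFile`: the file is opened once, recognition restarts leave it alone, and the header is finalized only when the meeting ends. The class name `ContinuousRecorder` and the 16 kHz mono 16-bit format are assumptions made for this sketch, not details taken from the app.

```python
import wave


class ContinuousRecorder:
    """Keeps one WAV file open while recognition sessions come and go."""

    def __init__(self, path: str, sample_rate: int = 16000) -> None:
        self._wav = wave.open(path, "wb")
        self._wav.setnchannels(1)   # mono (assumed)
        self._wav.setsampwidth(2)   # 16-bit samples (assumed)
        self._wav.setframerate(sample_rate)
        self.sessions = 0

    def restart_session(self) -> None:
        """A recognition restart only bumps a counter; the file stays open."""
        self.sessions += 1

    def write(self, pcm: bytes) -> None:
        """Append raw PCM frames to the single output file."""
        self._wav.writeframes(pcm)

    def close(self) -> None:
        """Finalize the WAV header once the whole meeting is over."""
        self._wav.close()
```

Because `restart_session` never reopens the file, audio written before and after a restart ends up in one contiguous recording that a diarization script can read in a single pass.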

Window management via stored references — rather than looking up windows by title (fragile with localization), TranscriptionService stores direct NSWindow references for both the floating dialog and meeting windows. Switching modes is an orderOut + makeKeyAndOrderFront on the stored references.

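Stable speaker colors: Swift's hashValue is seeded randomly on each launch, so it cannot be relied on to map a speaker name to the same color every time. Speaker colors instead come from a hash computed by reducing over the name's Unicode scalars, which gives each speaker a stable color across sessions.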
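
Python's built-in `hash()` for strings is also randomized per process by default, so the same fix applies there. The sketch below folds code points with `functools.reduce`; the multiplier 31, the 32-bit mask, and the palette are illustrative choices, since the README does not give the app's exact formula or colors.

```python
from functools import reduce

# Hypothetical palette; the app's actual colors are not listed in this README.
PALETTE = ["#E57373", "#64B5F6", "#81C784", "#FFD54F", "#BA68C8", "#4DB6AC"]


def stable_hash(name: str) -> int:
    """Fold the name's code points into a 32-bit integer.

    Unlike the built-in hash(), this gives the same value on every run.
    """
    return reduce(lambda h, ch: (h * 31 + ord(ch)) & 0xFFFFFFFF, name, 0)


def speaker_color(name: str) -> str:
    """Map a speaker name to a palette entry that stays fixed across launches."""
    return PALETTE[stable_hash(name) % len(PALETTE)]
```

Two speakers can still land on the same palette entry when the palette is small; the hash guarantees stability, not uniqueness.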

Setup

Requirements

macOS 14 (Sonoma) or later
Python 3.11+ (for the backend)
Xcode 15+ (to build from source)

Quick start

git clone https://github.com/Ethan-YS/VoxAI.git
cd VoxAI

# Set up Python backend
python3 -m venv venv
source venv/bin/activate
pip install whisperx mlx-audio edge-tts

# Configure
cp config.example.json config.json
# Edit config.json — add your HuggingFace token for speaker diarization (optional)

# Build and run the app
open VoxSage-App/VoxSage.xcodeproj

Grant microphone and speech recognition permissions when prompted.

Connect to Claude Code

Add the server to your Claude Code MCP config, replacing /path/to/VoxAI with the location of your clone:

{
  "mcpServers": {
    "voxai": {
      "command": "/path/to/VoxAI/venv/bin/python3",
      "args": ["/path/to/VoxAI/src/mcp/server.py"]
    }
  }
}

config.json options

{
  "cn_engine": "cloud",
  "cn_voice": "xiaoxiao",
  "en_voice": "default",
  "speed": 1.0,
  "language": "auto",
  "hf_token": "hf_...",
  "silence_duration": 1.5,
  "recognition_language": "auto"
}
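
A small loader can make these options safer to consume from the Python side. The sketch below merges a config file over defaults copied from the sample above and clamps `silence_duration` to the 0.5 s to 2.0 s range listed under Settings. The function name `load_config`, the empty default for `hf_token`, and the clamping behavior itself are assumptions; the README does not say how the backend actually reads or validates the file.

```python
import json
from pathlib import Path

# Defaults mirror the sample config; hf_token is left empty because it is optional.
DEFAULTS = {
    "cn_engine": "cloud",
    "cn_voice": "xiaoxiao",
    "en_voice": "default",
    "speed": 1.0,
    "language": "auto",
    "hf_token": "",
    "silence_duration": 1.5,
    "recognition_language": "auto",
}


def load_config(path: str = "config.json") -> dict:
    """Return the config at `path` merged over DEFAULTS (hypothetical helper)."""
    config = dict(DEFAULTS)
    file = Path(path)
    if file.exists():
        config.update(json.loads(file.read_text(encoding="utf-8")))
    # Keep the silence threshold inside the range the Settings panel offers.
    config["silence_duration"] = min(2.0, max(0.5, float(config["silence_duration"])))
    return config
```

With this approach a missing file or a partial file still yields a complete dictionary, so callers never need to check whether a key exists.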

Download

Pre-built .dmg for Apple Silicon is available on the Releases page.

Project structure

VoxSage-App/VoxSage/
├── VoxSageApp.swift              # Entry point — 3 scenes: dialog / meeting / menubar
├── ContentView.swift             # Meeting window — 3-panel layout
├── Views/
│   ├── DialogView.swift          # Floating overlay + lyrics view
│   ├── MeetingView.swift         # Menubar dropdown
│   └── SettingsView.swift        # Settings panel
├── Services/
│   └── TranscriptionService.swift  # Audio engine, ASR session management
├── Models/
│   └── MeetingStore.swift        # Meeting persistence + diarization orchestration
└── zh-Hans.lproj/
    └── Localizable.strings       # Chinese localization

src/
├── mcp/server.py                 # MCP server — TTS tools + meeting data access
└── stt/diarize.py                # whisperx diarization script

License

MIT
