Give your AI a voice. Talk to Claude Code, local models, and CLI tools — and have them talk back.
Most AI coding tools are text-only. You type, they type back.
VoxAI adds the missing voice layer — on both sides:
- Input: speak naturally, get real-time transcription
- Output: AI responses read aloud through a customizable voice, powered by an MCP server
Connect it to Claude Code, a local model, or any MCP-compatible tool, and you get a complete voice interface for AI-assisted development. Code with your voice.
The MCP server exposes TTS as a tool. Your AI calls `speak(text)` and VoxAI handles the rest — no audio pipeline plumbing needed on the AI side.
Screenshots: Dialog mode (idle) · Dialog mode (recording) · Menu bar · Settings (voice engine, speaker diarization)
- Expose TTS as an MCP tool: any AI that supports MCP can call `speak(text)` to produce voice output
- Works with Claude Code, local models (Ollama, LM Studio), and any MCP-compatible CLI tool
- Cloud engine: `edge-tts` — high-quality Microsoft neural voices, no API key needed
- Local engine: `Qwen3-TTS` via `mlx-audio` — runs fully offline on Apple Silicon
- Customizable voice, language, and speech rate per session
- `stop_speaking` tool lets the AI interrupt itself mid-sentence
- Floating overlay window stays on top of any app — speak, see text, keep working
- Powered by `SFSpeechRecognizer` + `AVAudioEngine` with automatic session restart at the 60s system limit
- Lyrics-style display: completed segments fade; the active segment stays bright
- Auto-punctuation on macOS 13+
- Full-window recording mode for multi-person sessions
- After recording, whisperx identifies and labels each speaker's lines
- Inline speaker renaming — click a label to rename; propagates across all segments
- Export as Markdown, or send directly to an AI for summarization
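The rename propagation can be sketched as an ID-to-name map resolved at render time, so a single rename reaches every segment. `Segment` and `render` are illustrative names for this sketch, not the app's actual Swift types:

```python
from dataclasses import dataclass

# Illustrative model: diarization assigns stable IDs ("SPEAKER_00", ...),
# and a separate id -> display-name map makes one rename propagate everywhere.
@dataclass
class Segment:
    speaker_id: str
    text: str

def render(segments, names):
    """Resolve each segment's display name at render time."""
    return [(names.get(s.speaker_id, s.speaker_id), s.text) for s in segments]

segments = [
    Segment("SPEAKER_00", "Let's start the standup."),
    Segment("SPEAKER_01", "I finished the MCP server."),
    Segment("SPEAKER_00", "Great, demo it after."),
]
names = {"SPEAKER_00": "Ethan"}  # one rename...
print(render(segments, names))   # ...applies to all of SPEAKER_00's lines
```

Because segments store the diarization ID rather than the display name, renaming never rewrites the transcript itself.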
- Switch between cloud and local TTS engine without restarting
- Per-language voice selection (separate Chinese / English voices)
- Silence detection threshold: 0.5s – 2.0s
- HuggingFace token for speaker diarization (optional)
- UI language follows macOS system setting (Chinese / English)
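The silence threshold above works roughly like energy gating: a pause longer than the configured duration closes the current segment. A hypothetical frame-based sketch (the real app works on `AVAudioEngine` buffers):

```python
def split_on_silence(frames, threshold=0.01, silence_frames=15):
    """Split per-frame energies into voiced segments. A run of
    `silence_frames` quiet frames (e.g. 1.5 s at 10 frames/s) closes a segment."""
    segments, current, quiet = [], [], 0
    for i, energy in enumerate(frames):
        if energy >= threshold:
            current.append(i)
            quiet = 0
        elif current:
            quiet += 1
            if quiet >= silence_frames:
                segments.append((current[0], current[-1]))
                current, quiet = [], 0
    if current:
        segments.append((current[0], current[-1]))
    return segments

# Two bursts of speech separated by a long pause -> two segments
energies = [0.2] * 5 + [0.0] * 20 + [0.3] * 5
print(split_on_silence(energies))  # [(0, 4), (25, 29)]
```

Raising the threshold toward 2.0 s merges hesitations into one segment; lowering it toward 0.5 s splits more aggressively.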
- Launch VoxAI — the app lives in your menubar
- Open Dialog mode — a floating window appears, stays on top of everything
- Speak — transcription appears in real time; copy the text to your AI tool (Claude Code, terminal, chat interface)
- AI responds — if the MCP server is connected, the AI calls `speak(text)` and VoxAI reads the response aloud automatically
That's the full loop: your voice in, AI voice out.
menubar icon → 偏好设置 / Preferences
- Recognition language: set to Auto, Chinese, or English
- TTS engine: Cloud (edge-tts, needs internet) or Local (Qwen3-TTS, fully offline)
- Voice: pick from available voices for your chosen engine
- HuggingFace Token: paste your token to enable speaker diarization in meeting mode (free at huggingface.co)
- Click 会议记录 / Meeting Notes in the menubar
- Press + to start a new recording
- Speak — transcription segments appear in real time
- Click 完成 / Done when finished — whisperx runs offline and labels each speaker
- Click any speaker name to rename it; the rename applies to all their segments
You speak
→ VoxAI transcribes in real time
→ text goes to your AI tool (Claude Code, CLI, etc.)
→ AI processes and calls `speak(text)` via MCP
→ VoxAI reads the response aloud
The MCP server runs locally alongside the app. Configure your AI tool to connect to it, and voice I/O is handled automatically.
Available MCP tools:
| Tool | Description |
|---|---|
| `speak(text, voice?, speed?)` | Speak text aloud with optional voice/speed override |
| `stop_speaking()` | Interrupt current playback |
| `list_voices()` | List available voices for current engine |
| `update_voice_config(...)` | Change engine, voice, or speed at runtime |
| `list_meetings()` | List all recorded meetings with metadata |
| `get_meeting(id)` | Get full transcript for a meeting |
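Behind the protocol layer, each tool in the table is an ordinary function selected by name. A hypothetical dispatch sketch — the function bodies are stand-ins; the real server registers these with the MCP SDK and drives edge-tts or mlx-audio:

```python
# Hypothetical tool registry mirroring the table above.
STATE = {"engine": "cloud", "voice": "xiaoxiao", "speed": 1.0, "playing": None}

def speak(text, voice=None, speed=None):
    STATE["playing"] = text  # stand-in for actual audio playback
    return {"spoken": text, "voice": voice or STATE["voice"],
            "speed": speed or STATE["speed"]}

def stop_speaking():
    STATE["playing"] = None
    return {"stopped": True}

def list_voices():
    return ["xiaoxiao", "default"] if STATE["engine"] == "cloud" else ["qwen3"]

TOOLS = {"speak": speak, "stop_speaking": stop_speaking, "list_voices": list_voices}

def call_tool(name, **kwargs):
    """What an MCP tool call boils down to after the protocol layer."""
    return TOOLS[name](**kwargs)

print(call_tool("speak", text="Build finished."))
```

The AI never touches audio APIs: it sends a tool name and arguments, and the server owns engine selection and playback state.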
| Layer | Technology |
|---|---|
| App | Swift 5.9 + SwiftUI, macOS 14+ |
| Real-time ASR | SFSpeechRecognizer + AVAudioEngine |
| Offline transcription + diarization | whisperx 3.8.5 |
| Local TTS | mlx-audio + Qwen3-TTS-0.6B (Apple Silicon) |
| Cloud TTS | edge-tts |
| AI integration | MCP server (Python) |
| Python runtime | venv (Python 3.13) |
**Session generation counter** — `SFSpeechRecognizer` sessions time out at ~60s. Each restart increments a generation counter; stale callbacks compare their captured generation and early-return on a mismatch. This prevents old results from writing into the current segment after a restart.
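The app side is Swift, but the guard pattern translates directly. A minimal Python sketch of the generation counter (names are illustrative):

```python
class RecognitionSession:
    """Each restart bumps `generation`; callbacks created under an older
    generation detect the mismatch and drop their late results."""
    def __init__(self):
        self.generation = 0
        self.transcript = []

    def restart(self):
        self.generation += 1

    def make_callback(self):
        captured = self.generation           # captured at session start
        def on_result(text):
            if captured != self.generation:  # stale session -> ignore
                return False
            self.transcript.append(text)
            return True
        return on_result

s = RecognitionSession()
cb_old = s.make_callback()
s.restart()                    # 60s limit hit, new session begins
cb_new = s.make_callback()
cb_old("late result")          # dropped: generation mismatch
cb_new("current result")       # accepted
print(s.transcript)            # ['current result']
```

The same check works for any async API that can deliver results after its owner has moved on.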
**Continuous WAV recording across session restarts** — `AVAudioFile` stays open across recognition session restarts, so meeting audio is one contiguous file, ready for whisperx diarization after the session ends.
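The same idea in Python terms: one writer outlives the recognition sessions. A sketch with the stdlib `wave` module (the app itself uses `AVAudioFile`):

```python
import struct
import wave

def write_session(wav, samples):
    """Append one recognition session's audio to the shared file."""
    wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# The file handle is opened once and survives session restarts,
# so the meeting ends up as a single contiguous WAV for whisperx.
with wave.open("/tmp/meeting.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 16-bit PCM
    wav.setframerate(16000)
    write_session(wav, [0, 100, -100])   # session 1 (pre-restart)
    write_session(wav, [200, -200])      # session 2 (post-restart)

with wave.open("/tmp/meeting.wav", "rb") as wav:
    print(wav.getnframes())    # 5 frames — one contiguous stream
```

Restarting only the recognizer, never the file writer, is what keeps diarization input seamless.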
**Window management via stored references** — rather than looking up windows by title (fragile under localization), `TranscriptionService` stores direct `NSWindow` references for both the floating dialog and meeting windows. Switching modes is an `orderOut` + `makeKeyAndOrderFront` on the stored references.
**Stable speaker colors** — Swift's `hashValue` is randomized per launch. Speaker colors use a Unicode-scalar reduce hash instead, giving each speaker a stable color across sessions.
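A Python analogue of the stable hash (Python's built-in `hash()` on strings is likewise randomized per process unless `PYTHONHASHSEED` is set):

```python
PALETTE = ["red", "orange", "green", "blue", "purple", "teal"]

def stable_color(speaker: str) -> str:
    """Deterministic across launches: fold code points instead of hash()."""
    h = 0
    for ch in speaker:
        h = (h * 31 + ord(ch)) % 2**32
    return PALETTE[h % len(PALETTE)]

# Same name -> same color in every session
print(stable_color("SPEAKER_00") == stable_color("SPEAKER_00"))  # True
```

Any deterministic fold works; the point is only that the mapping survives process restarts.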
- macOS 14 (Sonoma) or later
- Python 3.11+ (for the backend)
- Xcode 15+ (to build from source)
```bash
git clone https://github.com/Ethan-YS/VoxAI.git
cd VoxAI

# Set up Python backend
python3 -m venv venv
source venv/bin/activate
pip install whisperx mlx-audio edge-tts

# Configure
cp config.example.json config.json
# Edit config.json — add your HuggingFace token for speaker diarization (optional)

# Build and run the app
open VoxSage-App/VoxSage.xcodeproj
```

Grant microphone and speech recognition permissions when prompted.
Add to your Claude Code MCP config:
```json
{
  "mcpServers": {
    "voxai": {
      "command": "/path/to/VoxAI/venv/bin/python3",
      "args": ["/path/to/VoxAI/src/mcp/server.py"]
    }
  }
}
```

Example `config.json`:

```json
{
  "cn_engine": "cloud",
  "cn_voice": "xiaoxiao",
  "en_voice": "default",
  "speed": 1.0,
  "language": "auto",
  "hf_token": "hf_...",
  "silence_duration": 1.5,
  "recognition_language": "auto"
}
```

Pre-built `.dmg` for Apple Silicon is available on the Releases page.
```
VoxSage-App/VoxSage/
├── VoxSageApp.swift                   # Entry point — 3 scenes: dialog / meeting / menubar
├── ContentView.swift                  # Meeting window — 3-panel layout
├── Views/
│   ├── DialogView.swift               # Floating overlay + lyrics view
│   ├── MeetingView.swift              # Menubar dropdown
│   └── SettingsView.swift             # Settings panel
├── Services/
│   └── TranscriptionService.swift     # Audio engine, ASR session management
├── Models/
│   └── MeetingStore.swift             # Meeting persistence + diarization orchestration
└── zh-Hans.lproj/
    └── Localizable.strings            # Chinese localization

src/
├── mcp/server.py                      # MCP server — TTS tools + meeting data access
└── stt/diarize.py                     # whisperx diarization script
```
MIT




