A flexible, modular voice assistant server with pluggable STT, LLM, and TTS providers. Supports both cloud and local (CPU-based) providers.
```
Audio Input → STT → LLM → TTS → Audio Output
               ↓     ↓     ↓
        [Provider Interfaces]
               ↓     ↓     ↓
     [Multiple Implementations]
```
- **STT (Speech-to-Text)**
  - Local: Faster Whisper (CPU, multilingual)
  - Cloud: Deepgram
- **LLM (Language Model)**
  - OpenAI (gpt-4o-mini, gpt-4o)
- **TTS (Text-to-Speech)**
  - Local: Piper (CPU)
  - Cloud: Cartesia
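All implementations plug into abstract interfaces defined in `providers/base.py`. A minimal sketch of the STT interface (the method names come from the custom-provider example later in this README; the docstrings, and the idea that the LLM and TTS interfaces follow the same pattern, are assumptions):

```python
# providers/base.py (sketch, not the verbatim source)
from abc import ABC, abstractmethod

class STTProvider(ABC):
    @abstractmethod
    async def transcribe_stream(self, audio_stream):
        """Consume raw PCM chunks, yield transcribed text."""

    @abstractmethod
    async def close(self):
        """Release models/connections held by the provider."""

# LLMProvider and TTSProvider are assumed to follow the same async pattern.
```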
Install dependencies:

```bash
pip install -r requirements.txt
```

Whisper models download automatically on first use. Supported sizes:

- `tiny` - Fastest, least accurate
- `base` - Good balance (recommended)
- `small` - Better accuracy
- `medium` - High accuracy
- `large-v3` - Best accuracy
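For reference, this is roughly how a model size maps onto the faster-whisper API (a sketch, not the project's actual `stt_local.py`; the filename is a placeholder):

```python
from faster_whisper import WhisperModel

# "base" balances speed and accuracy on CPU; int8 keeps memory low
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("sample.wav", language=None)  # None = auto-detect
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```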
Download a voice model from Piper voices:
```bash
# Example: Download English US voice
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
```

Update `TTS_CONFIG["model_path"]` in `server.py` with the path to your `.onnx` file.
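A quick sanity check that both files are where the config expects them (plain `pathlib`, nothing Piper-specific; the path is a placeholder):

```python
from pathlib import Path

# Use the same value you put in TTS_CONFIG["model_path"]
model_path = Path("/path/to/your/piper-model.onnx")
config_path = Path(str(model_path) + ".json")  # Piper expects model.onnx.json alongside
assert model_path.is_file(), f"missing model: {model_path}"
assert config_path.is_file(), f"missing config: {config_path}"
```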
Create a `.env` file:

```bash
# Only needed if using cloud providers
OPENAI_API_KEY=your_openai_key
DEEPGRAM_API_KEY=your_deepgram_key
CARTESIA_API_KEY=your_cartesia_key
```
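If the server loads these keys with python-dotenv (an assumption about how `server.py` reads them), it looks like:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
openai_key = os.environ.get("OPENAI_API_KEY")  # None if unset; fine for local-only setups
```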
Edit `server.py` to switch providers.

Local providers:

```python
STT_CONFIG = {
    "provider": "local",
    "model_size": "base",  # tiny, base, small, medium, large-v3
    "language": None,      # Auto-detect, or specify: "en", "es", "fr", etc.
    "device": "cpu",
    "compute_type": "int8",
}

TTS_CONFIG = {
    "provider": "local",
    "model_path": "/path/to/your/piper-model.onnx",
}
```

Cloud providers:

```python
STT_CONFIG = {
    "provider": "deepgram",
    "model": "nova-2",
    "language": "en",
}

TTS_CONFIG = {
    "provider": "cartesia",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "model_id": "sonic-english",
}
```

You can mix and match! For example (see the config sketch after this list):
- Local STT + Cloud LLM + Local TTS
- Cloud STT + Cloud LLM + Cloud TTS
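A concrete mixed setup (the first bullet above). The `LLM_CONFIG` shape here is an assumption based on the startup banner; check `server.py` for the actual keys:

```python
STT_CONFIG = {"provider": "local", "model_size": "base", "device": "cpu", "compute_type": "int8"}
LLM_CONFIG = {"provider": "openai", "model": "gpt-4o-mini"}  # assumed shape
TTS_CONFIG = {"provider": "local", "model_path": "/path/to/your/piper-model.onnx"}
```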
Start the server:

```bash
python server.py
```

You should see:

```
==================================================
🎙️ Voice Assistant Server
==================================================
STT Provider: local
LLM Provider: openai (gpt-4o-mini)
TTS Provider: local
==================================================
🚀 Server running at ws://0.0.0.0:9000
==================================================
```
Connect via WebSocket and send raw PCM audio (16-bit, 16kHz, mono):
```javascript
const ws = new WebSocket('ws://localhost:9000');

// Send audio chunks (raw PCM bytes)
ws.send(audioChunk);

// Receive responses
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'audio') {
    const audioBytes = base64ToArrayBuffer(data.data);
    // Play audio
  }
};

// Helper to decode the base64 audio payload
function base64ToArrayBuffer(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes.buffer;
}
```
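The same exchange from Python, using the `websockets` library (a sketch under the message format shown above; the silence chunk stands in for real microphone audio):

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

async def main():
    async with websockets.connect("ws://localhost:9000") as ws:
        await ws.send(b"\x00\x00" * 16000)  # 1 s of 16-bit, 16 kHz, mono silence
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "audio":
                audio_bytes = base64.b64decode(data["data"])
                # Hand audio_bytes to your playback code
                break

asyncio.run(main())
```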
Faster Whisper supports 99+ languages. Set the `language` parameter:

```python
STT_CONFIG = {
    "provider": "local",
    "language": "es",  # Spanish
    # Or None for auto-detection
}
```

Common codes: `en`, `es`, `fr`, `de`, `it`, `pt`, `ru`, `ja`, `zh`, `ko`, `ar`, `hi`
- **STT (Faster Whisper)**:
  - `base` model: ~100-300ms latency
  - `tiny` model: ~50-100ms latency
- **TTS (Piper)**: ~50-200ms per sentence
- **STT (Deepgram)**: ~50-150ms
- **TTS (Cartesia)**: ~100-300ms
A faster LLM (e.g., gpt-4o-mini) significantly reduces the time to first audio.
To add a new provider, inherit from the appropriate base class in `providers/base.py`:
```python
# providers/stt_whisper_cpp.py
from .base import STTProvider

class WhisperCppSTT(STTProvider):
    async def transcribe_stream(self, audio_stream):
        # Your implementation
        pass

    async def close(self):
        pass
```

Then register it in `providers/factory.py`:
```python
elif provider == "whisper_cpp":
    from .stt_whisper_cpp import WhisperCppSTT
    return WhisperCppSTT(**kwargs)
```

Finally, select it in `server.py`:

```python
STT_CONFIG = {
    "provider": "whisper_cpp",
    # Your config
}
```
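For context, the factory function that `elif` slots into presumably looks something like this (the function name and the built-in provider class names are assumptions, not the project's verbatim code):

```python
# providers/factory.py (hypothetical shape)
def create_stt_provider(provider: str, **kwargs):
    if provider == "local":
        from .stt_local import FasterWhisperSTT  # assumed class name
        return FasterWhisperSTT(**kwargs)
    elif provider == "deepgram":
        from .stt_deepgram import DeepgramSTT    # assumed class name
        return DeepgramSTT(**kwargs)
    elif provider == "whisper_cpp":
        from .stt_whisper_cpp import WhisperCppSTT
        return WhisperCppSTT(**kwargs)
    raise ValueError(f"Unknown STT provider: {provider}")
```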
Project structure:

```
.
├── providers/
│   ├── __init__.py
│   ├── base.py            # Abstract base classes
│   ├── factory.py         # Provider factory
│   ├── stt_local.py       # Faster Whisper STT
│   ├── stt_deepgram.py    # Deepgram STT
│   ├── llm_openai.py      # OpenAI LLM
│   ├── tts_local.py       # Piper TTS
│   └── tts_cartesia.py    # Cartesia TTS
├── pipeline.py            # Main pipeline orchestration
├── server.py              # WebSocket server
├── requirements.txt
└── README.md
```
- **Whisper model won't download**:
  - Check your internet connection
  - Models are cached in `~/.cache/huggingface/`
- **Piper voice not loading**:
  - Ensure the `.onnx` and `.onnx.json` files are in the same directory
  - Provide an absolute path to the model
- **High CPU usage**:
  - Use smaller models: `tiny` for STT, smaller Piper voices
  - Reduce `sample_rate` if possible
- **Audio format problems**:
  - Input must be 16-bit PCM, 16kHz, mono (see the sketch below)
  - Output format varies by TTS provider (check `get_audio_format()`)
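A quick way to confirm a WAV file matches the required input format, using only the standard-library `wave` module (the filename is a placeholder):

```python
import wave

with wave.open("input.wav", "rb") as w:
    assert w.getsampwidth() == 2        # 16-bit samples
    assert w.getframerate() == 16000    # 16 kHz
    assert w.getnchannels() == 1        # mono
    pcm = w.readframes(w.getnframes())  # raw PCM bytes, ready to send over the socket
```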
MIT