
Modular Voice Assistant Pipeline

A flexible, modular voice assistant server with pluggable STT, LLM, and TTS providers. Supports both cloud and local (CPU-based) providers.

Architecture

Audio Input → STT → LLM → TTS → Audio Output
              ↓     ↓     ↓
         [Provider Interfaces]
              ↓     ↓     ↓
    [Multiple Implementations]
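
Each stage is programmed against an abstract interface in providers/base.py. Below is a minimal sketch of what those interfaces plausibly look like, inferred from this README: only STTProvider's transcribe_stream/close and the TTS get_audio_format are named elsewhere in this document, so the other method names are assumptions.

# providers/base.py -- illustrative sketch; only transcribe_stream, close,
# and get_audio_format are confirmed elsewhere in this README. The LLM and
# TTS method names below are assumptions.
from abc import ABC, abstractmethod

class STTProvider(ABC):
    @abstractmethod
    async def transcribe_stream(self, audio_stream):
        """Yield transcripts for a stream of raw PCM audio chunks."""

    @abstractmethod
    async def close(self):
        """Release connections or model handles."""

class LLMProvider(ABC):
    @abstractmethod
    async def generate(self, text):  # method name assumed
        """Yield response tokens for the transcribed user turn."""

class TTSProvider(ABC):
    @abstractmethod
    async def synthesize(self, text):  # method name assumed
        """Yield raw audio bytes for a piece of response text."""

    @abstractmethod
    def get_audio_format(self):
        """Describe the output audio format (see Troubleshooting)."""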

Components

  • STT (Speech-to-Text)
    • Local: Faster Whisper (CPU, multilingual)
    • Cloud: Deepgram
  • LLM (Language Model)
    • OpenAI (gpt-4o-mini, gpt-4o)
  • TTS (Text-to-Speech)
    • Local: Piper (CPU)
    • Cloud: Cartesia

Installation

1. Install Python dependencies

pip install -r requirements.txt

2. Download local models (if using local providers)

Faster Whisper (STT)

Models download automatically on first use (see the snippet after this list). Supported sizes:

  • tiny - Fastest, least accurate
  • base - Good balance (recommended)
  • small - Better accuracy
  • medium - High accuracy
  • large-v3 - Best accuracy
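
For reference, the snippet below shows how the underlying faster-whisper package loads a model and triggers that first-use download; the STT_CONFIG keys shown later (model_size, device, compute_type, language) map directly onto these arguments.

from faster_whisper import WhisperModel

# First instantiation downloads the weights to ~/.cache/huggingface/
# and reuses the cached copy afterwards.
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("sample.wav", language=None)  # None = auto-detect
print(f"detected language: {info.language}")
for segment in segments:
    print(segment.text)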

Piper (TTS)

Download a voice model from Piper voices:

# Example: Download English US voice
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

Update TTS_CONFIG["model_path"] in server.py with the path to your .onnx file.

3. Configure environment variables

Create a .env file:

# Only needed if using cloud providers
OPENAI_API_KEY=your_openai_key
DEEPGRAM_API_KEY=your_deepgram_key
CARTESIA_API_KEY=your_cartesia_key
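
The server presumably reads these keys at startup. A typical pattern uses the python-dotenv package (an assumption on my part; check server.py for how the keys are actually loaded):

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

# None when unset, which is fine for a purely local configuration.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")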

Configuration

Edit server.py to switch providers:

Use Local STT + Local TTS (No API keys needed!)

STT_CONFIG = {
    "provider": "local",
    "model_size": "base",  # tiny, base, small, medium, large-v3
    "language": None,  # Auto-detect or specify: "en", "es", "fr", etc.
    "device": "cpu",
    "compute_type": "int8",
}

TTS_CONFIG = {
    "provider": "local",
    "model_path": "/path/to/your/piper-model.onnx",
}

Use Cloud Providers

STT_CONFIG = {
    "provider": "deepgram",
    "model": "nova-2",
    "language": "en",
}

TTS_CONFIG = {
    "provider": "cartesia",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "model_id": "sonic-english",
}

Mixed Configuration

You can mix and match! For example (see the config sketch after this list):

  • Local STT + Cloud LLM + Local TTS
  • Cloud STT + Cloud LLM + Cloud TTS
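
As a sketch, the first combination could be configured like this. Note that this README only spells out STT_CONFIG and TTS_CONFIG; the LLM_CONFIG name and keys below are assumptions based on the startup banner's "openai (gpt-4o-mini)" line.

# Local STT + Cloud LLM + Local TTS
STT_CONFIG = {
    "provider": "local",
    "model_size": "base",
    "language": None,
    "device": "cpu",
    "compute_type": "int8",
}

LLM_CONFIG = {  # name and keys assumed from the startup banner
    "provider": "openai",
    "model": "gpt-4o-mini",
}

TTS_CONFIG = {
    "provider": "local",
    "model_path": "/path/to/your/piper-model.onnx",
}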

Usage

Start the server

python server.py

You should see:

==================================================
πŸŽ™οΈ  Voice Assistant Server
==================================================
STT Provider: local
LLM Provider: openai (gpt-4o-mini)
TTS Provider: local
==================================================
🚀 Server running at ws://0.0.0.0:9000
==================================================

Client connection

Connect via WebSocket and send raw PCM audio (16-bit, 16 kHz, mono):

const ws = new WebSocket('ws://localhost:9000');

// Send audio chunks
ws.send(audioChunk);  // Raw PCM bytes (16-bit, 16 kHz, mono)

// Helper: decode the base64-encoded audio payload into an ArrayBuffer
function base64ToArrayBuffer(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes.buffer;
}

// Receive responses
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'audio') {
    const audioBytes = base64ToArrayBuffer(data.data);
    // Play audio (e.g., via the Web Audio API)
  }
};
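
To test from Python instead of a browser, here is a minimal client sketch using the websockets package. The message shape is taken from the JS example above; sending silence merely exercises the round trip.

import asyncio
import base64
import json

import websockets  # pip install websockets

async def main():
    async with websockets.connect("ws://localhost:9000") as ws:
        # One second of silence (16-bit, 16 kHz, mono PCM).
        await ws.send(b"\x00\x00" * 16000)

        # Read responses until the server goes quiet.
        while True:
            try:
                message = await asyncio.wait_for(ws.recv(), timeout=10)
            except asyncio.TimeoutError:
                break
            data = json.loads(message)
            if data.get("type") == "audio":
                audio = base64.b64decode(data["data"])
                print(f"received {len(audio)} bytes of audio")

asyncio.run(main())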

Supported Languages (Local STT)

Faster Whisper supports 99+ languages. Set the language parameter:

STT_CONFIG = {
    "provider": "local",
    "language": "es",  # Spanish
    # Or None for auto-detection
}

Common codes: en, es, fr, de, it, pt, ru, ja, zh, ko, ar, hi

Performance

Local Providers (CPU)

  • STT (Faster Whisper):
    • base model: ~100-300ms latency
    • tiny model: ~50-100ms latency
  • TTS (Piper):
    • ~50-200ms per sentence

Cloud Providers

  • STT (Deepgram): ~50-150ms
  • TTS (Cartesia): ~100-300ms

Faster LLM models such as gpt-4o-mini significantly reduce time to first audio.

Adding New Providers

1. Create provider class

Inherit from the base class in providers/base.py:

# providers/stt_whisper_cpp.py
from .base import STTProvider

class WhisperCppSTT(STTProvider):
    async def transcribe_stream(self, audio_stream):
        # Your implementation
        pass
    
    async def close(self):
        pass

2. Add to factory

Edit providers/factory.py:

elif provider == "whisper_cpp":
    from .stt_whisper_cpp import WhisperCppSTT
    return WhisperCppSTT(**kwargs)
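
For orientation, that elif branch most likely lives inside a dispatch function shaped roughly like the sketch below. The function and class names here are illustrative, not taken from the repository; check providers/factory.py for the real ones.

# providers/factory.py -- illustrative shape only; actual names may differ.
def create_stt(provider, **kwargs):
    if provider == "local":
        from .stt_local import FasterWhisperSTT   # class name assumed
        return FasterWhisperSTT(**kwargs)
    elif provider == "deepgram":
        from .stt_deepgram import DeepgramSTT     # class name assumed
        return DeepgramSTT(**kwargs)
    elif provider == "whisper_cpp":
        from .stt_whisper_cpp import WhisperCppSTT
        return WhisperCppSTT(**kwargs)
    raise ValueError(f"Unknown STT provider: {provider}")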

3. Use in configuration

STT_CONFIG = {
    "provider": "whisper_cpp",
    # Your config
}

Project Structure

.
├── providers/
│   ├── __init__.py
│   ├── base.py              # Abstract base classes
│   ├── factory.py           # Provider factory
│   ├── stt_local.py         # Faster Whisper STT
│   ├── stt_deepgram.py      # Deepgram STT
│   ├── llm_openai.py        # OpenAI LLM
│   ├── tts_local.py         # Piper TTS
│   └── tts_cartesia.py      # Cartesia TTS
├── pipeline.py              # Main pipeline orchestration
├── server.py                # WebSocket server
├── requirements.txt
└── README.md

Troubleshooting

Faster Whisper model download fails

  • Check internet connection
  • Models are cached in ~/.cache/huggingface/

Piper model not found

  • Ensure .onnx and .onnx.json files are in the same directory
  • Provide absolute path to model

High CPU usage

  • Use smaller models: tiny for STT, smaller Piper voices
  • Reduce sample_rate if possible

Audio format issues

  • Input must be 16-bit PCM, 16 kHz, mono (see the conversion sketch below)
  • Output format varies by TTS provider (check get_audio_format())
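
If your source audio is in another format, one way to resample it is with pydub (an assumption: pydub and ffmpeg are installed; any resampling tool works).

from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

audio = AudioSegment.from_file("input.mp3")
# 16 kHz, mono, 16-bit (sample width of 2 bytes)
audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
pcm_bytes = audio.raw_data  # raw PCM, ready to send over the WebSocket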

License

MIT
