Real-Time, AI-Driven Voice Assistant for College Admissions
Zentry is a real-time AI telephony assistant designed to handle college admission inquiries for TIST (Toc H Institute of Science and Technology). It processes natural spoken Malayalam, retrieves accurate admissions data, and responds contextually over a standard phone call.
The system connects callers via a cloud telephony gateway to a local inference engine. Audio streams are transcribed, translated, processed for intent, and synthesized back into Malayalam speech with sub-second latency targets.
- Telephony Gateway: Twilio handles incoming calls, bridging the SIP/voice traffic to the backend processing server.
- Speech-to-Text (STT): A Whisper Medium model fine-tuned on Malayalam by thennal, chosen for its better recognition of conversational Malayalam.
- Translation Layer: IndicTrans2 translates the Malayalam transcripts into English for the English-centric reasoning engine, then translates the English response back into Malayalam.
- Reasoning Engine (LLM): Phi-4 evaluates queries, fetches TIST-specific admissions data, and constructs the response.
- Text-to-Speech (TTS): A hybrid approach utilizing optimized TTS models (incorporating frameworks like Piper and Parler) to generate natural, real-time Malayalam audio.
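The components above form a linear chain for each caller turn. The following is a minimal sketch of that chain, not the project's actual code: `run_turn`, `TurnResult`, and the stage names are hypothetical, and each stage is passed in as a plain callable so the real Whisper, IndicTrans2, Phi-4, and TTS wrappers (not shown here) could be plugged in. It also records per-stage wall-clock time, which is one way to check a turn against a latency target.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TurnResult:
    """Output of one caller turn (hypothetical container)."""
    reply_audio: bytes
    timings_ms: Dict[str, float] = field(default_factory=dict)


def run_turn(
    audio: bytes,
    *,
    stt: Callable[[bytes], str],
    to_english: Callable[[str], str],
    llm: Callable[[str], str],
    to_malayalam: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> TurnResult:
    """Run STT -> ml-to-en -> LLM -> en-to-ml -> TTS, timing each stage."""
    timings: Dict[str, float] = {}

    def timed(name, fn, arg):
        # Measure wall-clock time of a single stage in milliseconds.
        start = time.perf_counter()
        out = fn(arg)
        timings[name] = (time.perf_counter() - start) * 1000.0
        return out

    ml_text = timed("stt", stt, audio)
    en_text = timed("ml_to_en", to_english, ml_text)
    en_reply = timed("llm", llm, en_text)
    ml_reply = timed("en_to_ml", to_malayalam, en_reply)
    reply_audio = timed("tts", tts, ml_reply)
    return TurnResult(reply_audio=reply_audio, timings_ms=timings)
```

With stub callables in place of the models, `run_turn(...)` returns the synthesized bytes plus a `timings_ms` dictionary keyed by stage, which makes it easy to see which stage dominates a slow turn.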

Prerequisites:

- Ubuntu 22.04 LTS (Recommended) / Windows with WSL2
- Python 3.10+
- Twilio Account (SID, Auth Token, and active phone number)
- CUDA-compatible GPU for local model inference
- Clone the repository:
    git clone https://github.com/Habel2005/zentry.git
    cd zentry
- Set up the virtual environment:
    python -m venv venv
    source venv/bin/activate  # Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Environment Variables: Create a `.env` file and add your Twilio credentials and server configurations.
- Start the Application:
    python -m backend.main_server
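The setup steps do not list the exact variable names the server expects, so the sketch below is illustrative only. `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, and `TWILIO_PHONE_NUMBER` are assumed names, and `load_twilio_settings` is a hypothetical helper; check the repository for the real keys. It shows one way to fail fast at startup when a credential is missing. Note that a `.env` file is not read into the process environment automatically; a loader such as python-dotenv, or exporting the variables in the shell, is assumed.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed variable names; the project's actual keys may differ.
REQUIRED_KEYS = ("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN", "TWILIO_PHONE_NUMBER")


@dataclass(frozen=True)
class TwilioSettings:
    account_sid: str
    auth_token: str
    phone_number: str


def load_twilio_settings(env: Mapping[str, str] = os.environ) -> TwilioSettings:
    """Read Twilio credentials from the environment, failing fast if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return TwilioSettings(
        account_sid=env["TWILIO_ACCOUNT_SID"],
        auth_token=env["TWILIO_AUTH_TOKEN"],
        phone_number=env["TWILIO_PHONE_NUMBER"],
    )
```

Calling `load_twilio_settings()` once at startup surfaces a misconfigured environment before the first call arrives rather than in the middle of one.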
Building an AI that speaks native Malayalam and operates over a phone line required navigating a complex landscape of telecom protocols and rapidly evolving open-source models. Here is the story of how the current stack came to be:
The initial vision was a completely on-premise PBX system. The journey started with Asterisk, but the configuration and SIP trunking complexities proved to be a heavy bottleneck. The next logical step was FreeSWITCH, which offered better documentation for modern application integration. However, managing RTP audio streams, compiling modules, and battling firewall NAT issues took focus away from the AI logic. Ultimately, the architecture pivoted to Twilio. Offloading the telecom infrastructure to Twilio's reliable cloud APIs allowed for a streamlined focus purely on the conversational AI and low-latency websocket streaming.
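For context on what the websocket side involves: Twilio's Media Streams feature sends JSON messages over a WebSocket, and each `media` event carries a base64-encoded chunk of 8 kHz mu-law audio. Assuming Zentry consumes that feature (the text above only says "websocket streaming"), a minimal sketch of turning one message into 16-bit PCM might look like this. `handle_message` and `mulaw_to_pcm16` are hypothetical names, and the resampling to the 16 kHz input that Whisper expects is not shown.

```python
import base64
import json
import struct
from typing import Optional


def mulaw_to_pcm16(data: bytes) -> bytes:
    """Decode G.711 mu-law bytes into little-endian signed 16-bit PCM."""
    samples = []
    for byte in data:
        u = ~byte & 0xFF                      # mu-law bytes are stored inverted
        t = ((u & 0x0F) << 3) + 0x84          # mantissa plus bias
        t <<= (u & 0x70) >> 4                 # apply the 3-bit exponent
        samples.append(0x84 - t if u & 0x80 else t - 0x84)
    return struct.pack("<%dh" % len(samples), *samples)


def handle_message(raw: str) -> Optional[bytes]:
    """Return PCM16 audio for a 'media' event, or None for other events."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None                           # e.g. 'connected', 'start', 'stop'
    payload = base64.b64decode(msg["media"]["payload"])
    return mulaw_to_pcm16(payload)
```

The decoder is written out by hand so the sketch has no dependency on the `audioop` module, which has been removed from recent Python releases.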
Finding an LLM that could "think" and "speak" Malayalam accurately was the biggest hurdle. Extensive testing was done in Google Colab, heavily evaluating various open-weight models using custom prompts.
- Native Fine-tunes: Models like Sarvam, and various Malayalam fine-tunes of Llama and Gemma were tested. While promising, they often hallucinated, struggled with complex reasoning regarding college data, or lacked the inference speed needed for real-time voice.
- The Pivot: The solution was a translation bridge. Using IndicTrans2, Malayalam input is translated to English, processed by the fast and capable Phi-4 model, and the English response is then translated back to Malayalam. This keeps the reasoning in a language the model handles well while the caller still hears Malayalam.
- Hearing (STT): Standard Whisper models struggled with the specific intonations and speed of conversational Malayalam. The breakthrough came by integrating a Whisper Medium model fine-tuned by thennal, which drastically improved transcription accuracy.
- Speaking (TTS): Finding a natural Malayalam voice was an iterative grind. The project cycled through many of the open-source TTS frameworks available, testing Coqui, exploring MMS (Massively Multilingual Speech), and experimenting with Parler. The final TTS pipeline uses a tailored configuration (often relying on Piper's efficiency) to balance realistic voice inflection with the strict latency requirements of a live phone call.
Zentry is the result of continuous prototyping, testing, and pivoting to find a workable balance between local AI inference and reliable telecom infrastructure.