A conversational AI medical assistant that combines vision (image analysis), voice (speech-to-text), and natural language processing to provide medical insights. This project uses Gradio for the web interface and multiple AI APIs to deliver a seamless doctor-patient interaction simulation.
- Voice Input: Record patient queries directly from the microphone
- Medical Image Analysis: Upload and analyze medical images using advanced vision models
- AI-Powered Responses: Get detailed medical insights using Groq's LLaMA model
- Voice Output: Listen to AI doctor responses through text-to-speech conversion
- Web Interface: User-friendly Gradio-based interface accessible via browser
- Real-time Processing: Instant transcription and analysis with streaming responses
- Python 3.13 or newer (the project was developed on 3.13)
- Microphone (for voice recording)
- API Keys:
  - Groq API Key - for LLaMA model access and speech-to-text
  - ElevenLabs API Key - for high-quality voice synthesis
- Internet Connection - Required for API calls and model inference
Create and activate a virtual environment:

```
cd c:\Users\HP ENVY\Desktop\ai_voicebot
python -m venv .venv
.venv\Scripts\activate
```

Install dependencies using pip:

```
pip install -r requirements.txt
```

Or using Pipenv:

```
pipenv install
```

Required packages:

- `groq` - Groq API client
- `gradio` - Web UI framework
- `elevenlabs` - Text-to-speech synthesis
- `gtts` - Google Text-to-Speech (fallback)
- `pygame` - Audio playback
- `pydub` - Audio processing
- `speechrecognition` - Speech recognition
- `pyaudio` - Audio input/output
- `ffmpeg` - Audio/video processing (a system dependency required by `pydub`, not a pip package)
Create a `.env` file in the project root:

```
GROQ_API_KEY=your_groq_api_key_here
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
```

Or set them in your system environment variables (Windows PowerShell):

```powershell
$env:GROQ_API_KEY = "your_groq_api_key_here"
$env:ELEVENLABS_API_KEY = "your_elevenlabs_api_key_here"
```

Then start the application:

```
python gradio_app.py
```

The application will start on: http://127.0.0.1:7860
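For reference, a minimal sketch of how the modules can read these keys at startup; the `require_env` helper is hypothetical (not part of this project) and simply fails fast when a key is missing:

```python
import os


def require_env(name: str) -> str:
    # Hypothetical helper: look up an environment variable and fail
    # fast with a readable error instead of a confusing API failure.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


GROQ_API_KEY = require_env("GROQ_API_KEY")
ELEVENLABS_API_KEY = require_env("ELEVENLABS_API_KEY")
```

If you keep the keys in a `.env` file rather than the shell environment, load it first (for example with python-dotenv's `load_dotenv()`) before these lookups.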
1. Record Audio: Click the microphone icon and speak your medical query.
   - The app will automatically transcribe your speech to text.
2. Upload Image (Optional): Upload a medical image for analysis.
   - Supported formats: JPG, PNG, etc.
   - Images are analyzed along with your verbal query.
3. Submit: Click the "Submit" button to process.
4. View Results:
   - Speech to Text: your recorded question, transcribed
   - Doctor's Response: the AI-generated medical insight
   - Audio Response: the doctor's response played back as voice
```
ai_voicebot/
├── gradio_app.py             # Main Gradio UI application
├── brain_of_doctor.py        # Image analysis and LLaMA model integration
├── voice_of_the_patient.py   # Audio recording and speech-to-text
├── voice_of_the_doctor.py    # Text-to-speech synthesis
├── .env                      # Environment variables (keep secret!)
├── .gitignore                # Git ignore rules
├── Pipfile                   # Pipenv dependencies
├── Pipfile.lock              # Locked dependency versions
├── README.md                 # This file
└── .venv/                    # Virtual environment
```
gradio_app.py is the main entry point for the application. It creates the Gradio interface with audio and image inputs, orchestrates the workflow, and displays the results.

Key function:

- `process_inputs(audio_filepath, image_filepath)` - the main processing pipeline
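A minimal sketch of what this pipeline can look like, assembled from the module functions documented below; the fallback text, output filename, and hard-coded model names are illustrative assumptions rather than the project's verbatim code:

```python
import os

from brain_of_doctor import analyze_image_with_query, encode_image
from voice_of_the_doctor import text_to_speech_with_elevenlabs
from voice_of_the_patient import transcribe_with_groq


def process_inputs(audio_filepath, image_filepath):
    # Step 1: transcribe the patient's recorded question.
    speech_to_text = transcribe_with_groq(
        stt_model="whisper-large-v3",
        audio_filepath=audio_filepath,
        GROQ_API_KEY=os.environ.get("GROQ_API_KEY"),
    )

    # Steps 2-3: analyze the image (if any) together with the query.
    if image_filepath:
        doctor_response = analyze_image_with_query(
            query=speech_to_text,
            model="meta-llama/llama-4-maverick-17b-128e-instruct",
            encoded_image=encode_image(image_filepath),
        )
    else:
        doctor_response = "No image provided for me to analyze."

    # Step 4: synthesize the doctor's reply as audio.
    output_filepath = "final.mp3"
    text_to_speech_with_elevenlabs(doctor_response, output_filepath)

    # Step 5: return transcription, response text, and audio path
    # so Gradio can display all three.
    return speech_to_text, doctor_response, output_filepath
```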
brain_of_doctor.py handles medical image analysis using Groq's vision models.

Key functions:

- `encode_image(image_path)` - converts an image to base64 for API transmission
- `analyze_image_with_query(query, model, encoded_image)` - analyzes the image with the LLaMA model

Model used: `meta-llama/llama-4-maverick-17b-128e-instruct`
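A sketch of how these two functions can be implemented with the `groq` Python client. The multimodal message layout follows Groq's vision chat API; the `image/jpeg` media type in the data URI is an assumption:

```python
import base64
import os

from groq import Groq


def encode_image(image_path):
    # Read the raw bytes and base64-encode them for the API payload.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def analyze_image_with_query(query, model, encoded_image):
    client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
    # One user message carrying both the text query and the image.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encoded_image}"
                    },
                },
            ],
        }
    ]
    chat_completion = client.chat.completions.create(
        messages=messages, model=model
    )
    return chat_completion.choices[0].message.content
```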
voice_of_the_patient.py handles audio recording and speech-to-text transcription.

Key functions:

- `record_audio(file_path, timeout, phrase_time_limit)` - records audio from the microphone
- `transcribe_with_groq(stt_model, audio_filepath, GROQ_API_KEY)` - transcribes audio using Groq's Whisper model

STT model: `whisper-large-v3`
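A sketch of both functions, using the `speechrecognition` library for capture and Groq's Whisper transcription endpoint. The WAV output format and the default timeout value are illustrative assumptions (the project lists `pydub`/`ffmpeg`, which would allow MP3 instead):

```python
import logging
import os

import speech_recognition as sr
from groq import Groq

logging.basicConfig(level=logging.INFO)


def record_audio(file_path, timeout=20, phrase_time_limit=None):
    # Listen on the default microphone and save the capture as WAV.
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)
        logging.info("Start speaking now...")
        audio = recognizer.listen(
            source, timeout=timeout, phrase_time_limit=phrase_time_limit
        )
    with open(file_path, "wb") as f:
        f.write(audio.get_wav_data())


def transcribe_with_groq(stt_model, audio_filepath, GROQ_API_KEY):
    client = Groq(api_key=GROQ_API_KEY)
    with open(audio_filepath, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model=stt_model,  # e.g. "whisper-large-v3"
            file=audio_file,
            language="en",
        )
    return transcription.text
```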
voice_of_the_doctor.py handles text-to-speech synthesis for the AI doctor's responses.

Key functions:

- `text_to_speech_with_gtts(input_text, output_filepath)` - uses Google TTS (with autoplay)
- `text_to_speech_with_elevenlabs(input_text, output_filepath)` - uses ElevenLabs TTS (with autoplay)

Voice ID (ElevenLabs): `21m00Tcm4TlvDq8ikWAM`
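A sketch of both synthesis paths. The gTTS call is stable across versions; the ElevenLabs SDK has changed between releases, so treat those calls as a v1.x-style sketch, and note the `eleven_turbo_v2` model name is an assumption:

```python
import os

from gtts import gTTS


def text_to_speech_with_gtts(input_text, output_filepath):
    # Google TTS fallback: synthesize English speech and save as MP3.
    tts = gTTS(text=input_text, lang="en", slow=False)
    tts.save(output_filepath)


def text_to_speech_with_elevenlabs(input_text, output_filepath):
    # ElevenLabs path; v1.x-style SDK calls (surface varies by release).
    from elevenlabs import save
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(api_key=os.environ.get("ELEVENLABS_API_KEY"))
    audio = client.generate(
        text=input_text,
        voice="21m00Tcm4TlvDq8ikWAM",  # voice ID from this README
        model="eleven_turbo_v2",       # assumed model choice
    )
    save(audio, output_filepath)
    # Autoplay (e.g. via pygame or the OS media player) would follow here.
```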
```
User Input (Audio + Image)
        ↓
1. Audio Recording & Transcription
   └─→ voice_of_the_patient.py
       └─→ Converts speech to text using Groq Whisper
        ↓
2. Image Analysis
   └─→ brain_of_doctor.py
       └─→ Encodes image to base64
       └─→ Sends to LLaMA model with patient query
        ↓
3. AI Response Generation
   └─→ LLaMA model generates doctor response
        ↓
4. Voice Synthesis
   └─→ voice_of_the_doctor.py
       └─→ Converts response to speech via ElevenLabs
        ↓
5. Display Results
   └─→ Show transcription, response, and play audio
```
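Behind the web UI, the Gradio wiring can look like the sketch below; the component choices and labels are assumptions that mirror the Usage section, and `process_inputs` is the pipeline function sketched earlier:

```python
import gradio as gr

# Wire the pipeline to a simple web form: two inputs, three outputs.
iface = gr.Interface(
    fn=process_inputs,  # defined in gradio_app.py (sketched above)
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath"),
        gr.Image(type="filepath"),
    ],
    outputs=[
        gr.Textbox(label="Speech to Text"),
        gr.Textbox(label="Doctor's Response"),
        gr.Audio(label="Audio Response"),
    ],
    title="AI Doctor with Vision and Voice",
)

if __name__ == "__main__":
    iface.launch()  # serves on http://127.0.0.1:7860 by default
```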
Groq:
- Used for: LLaMA vision model and Whisper speech-to-text
- Get API Key
- Rate limits: check the Groq documentation for current limits

ElevenLabs:
- Used for: high-quality voice synthesis
- Get API Key
- Voice ID: `21m00Tcm4TlvDq8ikWAM` (the ElevenLabs premade "Rachel" voice)
- Requires internet connection for API calls
- Medical analysis is for educational purposes only - not for actual medical diagnosis
- Image analysis limited to formats supported by the API
- Response time depends on API availability and model inference speed
- Audio recording limited to microphone input
This project is designed for educational purposes to demonstrate:
- Multimodal AI integration (audio + vision)
- API orchestration and integration
- Speech processing pipelines
- Web UI development with Gradio
- Error handling and user feedback
Feel free to extend this project with:
- Additional voice options
- Support for multiple languages
- Medical specialization options
- Session history and context retention
- Advanced image preprocessing
This project uses third-party APIs and services. Ensure compliance with:
- Groq API Terms of Service
- ElevenLabs API Terms of Service
- Google TTS Terms of Service
- Gradio Documentation
- Groq API Docs
- ElevenLabs Documentation
- SpeechRecognition Library
- Pydub Documentation
- Add medical history tracking
- Support for multiple doctor personas
- Real-time transcription display
- Image annotation features
- Multi-language support
- Custom medical knowledge base integration
- Session persistence
- Advanced diagnostics with confidence scores
👨‍💻 Athar Ramzan
GitHub: @oathar