A conversational AI medical assistant that combines vision (image analysis), voice (speech-to-text), and natural language processing to provide medical insights. This project uses Gradio for the web interface and multiple AI APIs to deliver a seamless doctor-patient interaction simulation.
- Voice Input: Record patient queries directly from the microphone
- Medical Image Analysis: Upload and analyze medical images using advanced vision models
- AI-Powered Responses: Get detailed medical insights using Groq's LLaMA model
- Voice Output: Listen to AI doctor responses through text-to-speech conversion
- Web Interface: User-friendly Gradio-based interface accessible via browser
- Real-time Processing: Instant transcription and analysis with streaming responses
- Python 3.13 or newer (the project was developed on 3.13)
- Microphone (for voice recording)
- API Keys:
  - Groq API Key - for LLaMA model access and speech-to-text
  - ElevenLabs API Key - for high-quality voice synthesis
- Internet Connection - Required for API calls and model inference
Create and activate a virtual environment:

```
cd c:\Users\HP ENVY\Desktop\ai_voicebot
python -m venv .venv
.venv\Scripts\activate
```

Install dependencies using pip:

```
pip install -r requirements.txt
```

Or using Pipenv:

```
pipenv install
```

Required packages:

- `groq` - Groq API client
- `gradio` - Web UI framework
- `elevenlabs` - Text-to-speech synthesis
- `gtts` - Google Text-to-Speech (fallback)
- `pygame` - Audio playback
- `pydub` - Audio processing
- `speechrecognition` - Speech recognition
- `pyaudio` - Audio input/output
- `ffmpeg` - Audio/video processing (a system dependency required by `pydub`, not a pip package)
Create a `.env` file in the project root:

```
GROQ_API_KEY=your_groq_api_key_here
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here
```

Or set them in your system environment variables (Windows PowerShell):

```powershell
$env:GROQ_API_KEY = "your_groq_api_key_here"
$env:ELEVENLABS_API_KEY = "your_elevenlabs_api_key_here"
```

Then start the application:

```
python gradio_app.py
```

The application will start on: http://127.0.0.1:7860
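For reference, a minimal sketch of how the modules can read these keys at startup; the `require_env` helper is hypothetical (not part of this project) and simply fails fast when a key is missing:

```python
import os


def require_env(name: str) -> str:
    # Hypothetical helper: look up an environment variable and fail
    # fast with a readable error instead of a confusing API failure.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


GROQ_API_KEY = require_env("GROQ_API_KEY")
ELEVENLABS_API_KEY = require_env("ELEVENLABS_API_KEY")
```

If you keep the keys in a `.env` file rather than the shell environment, load it first (for example with python-dotenv's `load_dotenv()`) before these lookups.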
1. Record Audio: Click the microphone icon and speak your medical query.
   - The app will automatically transcribe your speech to text.
2. Upload Image (Optional): Upload a medical image for analysis.
   - Supported formats: JPG, PNG, etc.
   - Images are analyzed along with your verbal query.
3. Submit: Click the "Submit" button to process.
4. View Results:
   - Speech to Text: your recorded question, transcribed
   - Doctor's Response: the AI-generated medical insight
   - Audio Response: the doctor's response played back as voice
```
ai_voicebot/
├── gradio_app.py             # Main Gradio UI application
├── brain_of_doctor.py        # Image analysis and LLaMA model integration
├── voice_of_the_patient.py   # Audio recording and speech-to-text
├── voice_of_the_doctor.py    # Text-to-speech synthesis
├── .env                      # Environment variables (keep secret!)
├── .gitignore                # Git ignore rules
├── Pipfile                   # Pipenv dependencies
├── Pipfile.lock              # Locked dependency versions
├── README.md                 # This file
└── .venv/                    # Virtual environment
```
gradio_app.py is the main entry point for the application. It creates the Gradio interface with audio and image inputs, orchestrates the workflow, and displays the results.

Key function:

- `process_inputs(audio_filepath, image_filepath)` - the main processing pipeline
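A minimal sketch of what this pipeline can look like, assembled from the module functions documented below; the fallback text, output filename, and hard-coded model names are illustrative assumptions rather than the project's verbatim code:

```python
import os

from brain_of_doctor import analyze_image_with_query, encode_image
from voice_of_the_doctor import text_to_speech_with_elevenlabs
from voice_of_the_patient import transcribe_with_groq


def process_inputs(audio_filepath, image_filepath):
    # Step 1: transcribe the patient's recorded question.
    speech_to_text = transcribe_with_groq(
        stt_model="whisper-large-v3",
        audio_filepath=audio_filepath,
        GROQ_API_KEY=os.environ.get("GROQ_API_KEY"),
    )

    # Steps 2-3: analyze the image (if any) together with the query.
    if image_filepath:
        doctor_response = analyze_image_with_query(
            query=speech_to_text,
            model="meta-llama/llama-4-maverick-17b-128e-instruct",
            encoded_image=encode_image(image_filepath),
        )
    else:
        doctor_response = "No image provided for me to analyze."

    # Step 4: synthesize the doctor's reply as audio.
    output_filepath = "final.mp3"
    text_to_speech_with_elevenlabs(doctor_response, output_filepath)

    # Step 5: return transcription, response text, and audio path
    # so Gradio can display all three.
    return speech_to_text, doctor_response, output_filepath
```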
brain_of_doctor.py handles medical image analysis using Groq's vision models.

Key functions:

- `encode_image(image_path)` - converts an image to base64 for API transmission
- `analyze_image_with_query(query, model, encoded_image)` - analyzes the image with the LLaMA model

Model used: `meta-llama/llama-4-maverick-17b-128e-instruct`
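A sketch of how these two functions can be implemented with the `groq` Python client. The multimodal message layout follows Groq's vision chat API; the `image/jpeg` media type in the data URI is an assumption:

```python
import base64
import os

from groq import Groq


def encode_image(image_path):
    # Read the raw bytes and base64-encode them for the API payload.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def analyze_image_with_query(query, model, encoded_image):
    client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
    # One user message carrying both the text query and the image.
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encoded_image}"
                    },
                },
            ],
        }
    ]
    chat_completion = client.chat.completions.create(
        messages=messages, model=model
    )
    return chat_completion.choices[0].message.content
```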
voice_of_the_patient.py handles audio recording and speech-to-text transcription.

Key functions:

- `record_audio(file_path, timeout, phrase_time_limit)` - records audio from the microphone
- `transcribe_with_groq(stt_model, audio_filepath, GROQ_API_KEY)` - transcribes audio using Groq's Whisper model

STT model: `whisper-large-v3`
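A sketch of both functions, using the `speechrecognition` library for capture and Groq's Whisper transcription endpoint. The WAV output format and the default timeout value are illustrative assumptions (the project lists `pydub`/`ffmpeg`, which would allow MP3 instead):

```python
import logging
import os

import speech_recognition as sr
from groq import Groq

logging.basicConfig(level=logging.INFO)


def record_audio(file_path, timeout=20, phrase_time_limit=None):
    # Listen on the default microphone and save the capture as WAV.
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)
        logging.info("Start speaking now...")
        audio = recognizer.listen(
            source, timeout=timeout, phrase_time_limit=phrase_time_limit
        )
    with open(file_path, "wb") as f:
        f.write(audio.get_wav_data())


def transcribe_with_groq(stt_model, audio_filepath, GROQ_API_KEY):
    client = Groq(api_key=GROQ_API_KEY)
    with open(audio_filepath, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model=stt_model,  # e.g. "whisper-large-v3"
            file=audio_file,
            language="en",
        )
    return transcription.text
```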
voice_of_the_doctor.py handles text-to-speech synthesis for the AI doctor's responses.

Key functions:

- `text_to_speech_with_gtts(input_text, output_filepath)` - uses Google TTS (with autoplay)
- `text_to_speech_with_elevenlabs(input_text, output_filepath)` - uses ElevenLabs TTS (with autoplay)

Voice ID (ElevenLabs): `21m00Tcm4TlvDq8ikWAM`
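A sketch of both synthesis paths. The gTTS call is stable across versions; the ElevenLabs SDK has changed between releases, so treat those calls as a v1.x-style sketch, and note the `eleven_turbo_v2` model name is an assumption:

```python
import os

from gtts import gTTS


def text_to_speech_with_gtts(input_text, output_filepath):
    # Google TTS fallback: synthesize English speech and save as MP3.
    tts = gTTS(text=input_text, lang="en", slow=False)
    tts.save(output_filepath)


def text_to_speech_with_elevenlabs(input_text, output_filepath):
    # ElevenLabs path; v1.x-style SDK calls (surface varies by release).
    from elevenlabs import save
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(api_key=os.environ.get("ELEVENLABS_API_KEY"))
    audio = client.generate(
        text=input_text,
        voice="21m00Tcm4TlvDq8ikWAM",  # voice ID from this README
        model="eleven_turbo_v2",       # assumed model choice
    )
    save(audio, output_filepath)
    # Autoplay (e.g. via pygame or the OS media player) would follow here.
```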
```
User Input (Audio + Image)
        ↓
1. Audio Recording & Transcription
   └─→ voice_of_the_patient.py
       └─→ Converts speech to text using Groq Whisper
        ↓
2. Image Analysis
   └─→ brain_of_doctor.py
       └─→ Encodes image to base64
       └─→ Sends to LLaMA model with patient query
        ↓
3. AI Response Generation
   └─→ LLaMA model generates doctor response
        ↓
4. Voice Synthesis
   └─→ voice_of_the_doctor.py
       └─→ Converts response to speech via ElevenLabs
        ↓
5. Display Results
   └─→ Show transcription, response, and play audio
```
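Behind the web UI, the Gradio wiring can look like the sketch below; the component choices and labels are assumptions that mirror the Usage section, and `process_inputs` is the pipeline function sketched earlier:

```python
import gradio as gr

# Wire the pipeline to a simple web form: two inputs, three outputs.
iface = gr.Interface(
    fn=process_inputs,  # defined in gradio_app.py (sketched above)
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath"),
        gr.Image(type="filepath"),
    ],
    outputs=[
        gr.Textbox(label="Speech to Text"),
        gr.Textbox(label="Doctor's Response"),
        gr.Audio(label="Audio Response"),
    ],
    title="AI Doctor with Vision and Voice",
)

if __name__ == "__main__":
    iface.launch()  # serves on http://127.0.0.1:7860 by default
```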
Groq:
- Used for: LLaMA vision model and Whisper speech-to-text
- Get API Key
- Rate limits: check the Groq documentation for current limits

ElevenLabs:
- Used for: high-quality voice synthesis
- Get API Key
- Voice ID: `21m00Tcm4TlvDq8ikWAM` (the ElevenLabs premade "Rachel" voice)
- Requires internet connection for API calls
- Medical analysis is for educational purposes only - not for actual medical diagnosis
- Image analysis limited to formats supported by the API
- Response time depends on API availability and model inference speed
- Audio recording limited to microphone input
This project is designed for educational purposes to demonstrate:
- Multimodal AI integration (audio + vision)
- API orchestration and integration
- Speech processing pipelines
- Web UI development with Gradio
- Error handling and user feedback
Feel free to extend this project with:
- Additional voice options
- Support for multiple languages
- Medical specialization options
- Session history and context retention
- Advanced image preprocessing
This project uses third-party APIs and services. Ensure compliance with:
- Groq API Terms of Service
- ElevenLabs API Terms of Service
- Google TTS Terms of Service
- Gradio Documentation
- Groq API Docs
- ElevenLabs Documentation
- SpeechRecognition Library
- Pydub Documentation
- Add medical history tracking
- Support for multiple doctor personas
- Real-time transcription display
- Image annotation features
- Multi-language support
- Custom medical knowledge base integration
- Session persistence
- Advanced diagnostics with confidence scores
👨‍💻 Athar Ramzan
GitHub: @oathar