Voice Assistant Multimodal

A multimodal AI voice assistant with speech recognition, text-to-speech, computer vision, and hardware control capabilities.

Creator/Author: Mohammad Faiz

Repository: https://github.com/Mohammad-Faiz-Cloud-Engineer/Voice-Assistant-Multimodal

Features

Speech Recognition: Whisper-based voice command recognition
Text-to-Speech: Coqui TTS with emotion support and voice cloning
Computer Vision: Image analysis, object detection, face detection
Camera Control: Real-time camera preview and image capture
Screen Capture: Screenshot functionality
Video Recording: Short video recording capability
Hardware Control: Arduino servo motor control
LM Studio Integration: Local LLM for conversational AI

Requirements

Python 3.14+
CUDA-capable GPU (optional, for faster inference)
Webcam (for camera features)
Arduino (optional, for servo control)
LM Studio running locally on port 1234
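
The requirements above can be checked before the first run. The following is a minimal sketch and not part of the project: the function names `python_ok` and `port_open` are hypothetical, and it assumes LM Studio is listening on 127.0.0.1 at the port stated above. It only confirms that a TCP connection succeeds, not that a model is loaded.

```python
import socket
import sys


def python_ok(current=None, minimum=(3, 14)):
    """Return True if the interpreter meets the stated minimum version."""
    # Default to the running interpreter's (major, minor) pair.
    current = tuple(current or sys.version_info[:2])
    return current >= tuple(minimum)


def port_open(host="127.0.0.1", port=1234, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `python_ok()` and `port_open()` with no arguments checks the running interpreter and the default LM Studio port.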

Installation

Clone the repository

Create a virtual environment:

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies:
```
pip install -r requirements.txt
```
Copy .env.example to .env:
```
cp .env.example .env
```
Edit .env with your configuration:
- Set ARDUINO_PORT only if you want servo control enabled
- Keep LM_STUDIO_BASE_URL on a trusted local or private endpoint
- Adjust OUTPUT_DIR and TEMP_DIR if you need non-default storage paths

Usage

```
python3 voice_assistant_multimodal.py
```

Voice Commands

Camera: "Turn camera on/off", "Take picture", "Record video"
Screen: "Take screenshot", "Capture screen"
Vision: "Describe image", "Analyze screenshot"
Detection: "Find car", "Detect faces"
Servo: "Turn left/right", "Look up/down", "Center"
Exit: "Stop listening"
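
One way to picture how a transcribed phrase could be routed to an action is a small phrase table. This is an illustrative sketch only: the intent names, `normalize`, and `match_command` are hypothetical, and the project's actual matching logic may differ. The phrases themselves are taken from the list above, with "Find car" generalized to "find" followed by an object name.

```python
import re

# Hypothetical intent table built from the phrases listed above.
# The intent names on the left are illustrative, not the project's own.
INTENTS = [
    ("camera_on", ("turn camera on",)),
    ("camera_off", ("turn camera off",)),
    ("take_picture", ("take picture",)),
    ("record_video", ("record video",)),
    ("screenshot", ("take screenshot", "capture screen")),
    ("describe_image", ("describe image",)),
    ("analyze_screenshot", ("analyze screenshot",)),
    ("detect_faces", ("detect faces",)),
    ("servo_left", ("turn left",)),
    ("servo_right", ("turn right",)),
    ("servo_up", ("look up",)),
    ("servo_down", ("look down",)),
    ("servo_center", ("center",)),
    ("exit", ("stop listening",)),
]

# "find <object>", with an optional article before the object name.
FIND_PATTERN = re.compile(r"\bfind (?:a |an |the )?(.+)$")


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def match_command(text):
    """Return (intent, argument), or (None, None) when nothing matches."""
    spoken = normalize(text)
    found = FIND_PATTERN.search(spoken)
    if found:
        return "find_object", found.group(1)
    # Pad with spaces so phrases match only on whole-word boundaries.
    padded = f" {spoken} "
    for intent, phrases in INTENTS:
        if any(f" {phrase} " in padded for phrase in phrases):
            return intent, None
    return None, None
```

With this table, `match_command("Find car")` returns `("find_object", "car")`, and an unrecognized phrase returns `(None, None)` so the caller can hand it to the conversational model instead.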

Configuration

All configuration is managed through environment variables in .env:

LM_STUDIO_BASE_URL: LM Studio API endpoint
ARDUINO_PORT: Optional serial port for Arduino (e.g., COM3, /dev/ttyUSB0)
CAMERA_INDEX: Camera device index (usually 0)
WHISPER_MODEL: Whisper model size (tiny, base, small, medium, large)
OUTPUT_DIR / TEMP_DIR: Writable directories for generated media files
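
The variables above could be gathered into a single settings object at startup. The sketch below is an assumption about how that might look, not the project's code: `Settings` and `load_settings` are hypothetical names, and every default value (the base URL path, `base` as the Whisper size, the directory names) is a placeholder chosen for illustration. It reads from any mapping, so it works with `os.environ` after a .env loader has run.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Model sizes listed in the README.
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")


@dataclass(frozen=True)
class Settings:
    lm_studio_base_url: str
    arduino_port: Optional[str]  # None disables servo control
    camera_index: int
    whisper_model: str
    output_dir: str
    temp_dir: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables (defaults are assumed)."""
    env = os.environ if env is None else env
    model = env.get("WHISPER_MODEL", "base").strip().lower()
    if model not in WHISPER_SIZES:
        raise ValueError(f"WHISPER_MODEL must be one of {WHISPER_SIZES}")
    return Settings(
        # Assumed default; the README only states port 1234.
        lm_studio_base_url=env.get(
            "LM_STUDIO_BASE_URL", "http://localhost:1234/v1"
        ).strip(),
        # An unset or empty ARDUINO_PORT means no servo control.
        arduino_port=env.get("ARDUINO_PORT", "").strip() or None,
        camera_index=int(env.get("CAMERA_INDEX", "0")),
        whisper_model=model,
        output_dir=env.get("OUTPUT_DIR", "output"),
        temp_dir=env.get("TEMP_DIR", "temp"),
    )
```

Validating the values once at startup means a typo in .env fails immediately rather than partway through a voice session.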

Security Notes

Never commit the .env file to version control
Use HTTPS for production API endpoints
Keep LM Studio bound to trusted interfaces only
Run with minimal required permissions
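
The advice to keep the LLM endpoint on trusted interfaces can be enforced with a small check before any request is sent. This is a hedged sketch using only the standard library: `is_trusted_endpoint` is a hypothetical helper, and treating "trusted" as "localhost, loopback, or a private-range IP address" is an assumption. Hostnames other than `localhost` are not resolved and are conservatively rejected.

```python
import ipaddress
from urllib.parse import urlsplit


def is_trusted_endpoint(url):
    """Return True if the URL's host is localhost or a loopback/private IP."""
    host = urlsplit(url).hostname
    if host is None:
        # No scheme or no host, e.g. "localhost:1234" without "http://".
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: not resolved here, so treated as untrusted.
        return False
    return addr.is_loopback or addr.is_private
```

A caller could refuse to start when this returns False for the configured LM_STUDIO_BASE_URL, unless the user explicitly opts in to a remote endpoint.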

License

See LICENSE file for details.

Contributing

Contributions welcome! Please ensure code passes all security and quality checks.

Made by Mohammad Faiz

