VigilAudio is a production-ready audio analysis engine designed to detect aggression, distress, and safety risks by analyzing the tone of voice. Built as the audio foundation for a multimodal moderation suite, it utilizes fine-tuned Transformers and optimized ONNX inference to deliver high-speed, real-time emotion detection.
- State-of-the-Art Architecture: Fine-tuned Wav2Vec2 Transformer achieving 84% accuracy.
- Real-time Streaming: WebSocket API for low-latency audio processing (~199ms).
- Active Learning: Automatically captures low-confidence samples to build a "Data Flywheel" for continuous improvement.
- Moderation Alerts: UI automatically flags high-intensity negative emotions (Anger, Fear).
- Edge Optimized: INT8 Quantized ONNX model runs about 1.85x faster than the fine-tuned PyTorch model.
- Dual Deployment: Supports both a Monolithic (Standalone) and Microservice (API + UI) architecture.
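The active-learning and moderation-alert features listed above can be pictured as a small routing step applied to each prediction. The sketch below is illustrative only and is not taken from the project's code: the function name `route_prediction`, the `Routing` type, the two threshold values, and the lowercase label names are all assumptions.

```python
from dataclasses import dataclass

# Assumed values for illustration; the project's real thresholds are not documented here.
LOW_CONFIDENCE_THRESHOLD = 0.60
ALERT_THRESHOLD = 0.80
ALERT_LABELS = {"angry", "fearful"}  # assumed spelling of the Anger/Fear classes


@dataclass
class Routing:
    label: str
    confidence: float
    needs_review: bool  # queue the sample for later labelling (active learning)
    raise_alert: bool   # flag as a high-intensity negative emotion in the UI


def route_prediction(probs: dict[str, float]) -> Routing:
    """Pick the top class and decide whether to queue or flag the sample."""
    if not probs:
        raise ValueError("probs must contain at least one class")
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    return Routing(
        label=label,
        confidence=confidence,
        needs_review=confidence < LOW_CONFIDENCE_THRESHOLD,
        raise_alert=label in ALERT_LABELS and confidence >= ALERT_THRESHOLD,
    )
```

Keeping the two thresholds separate means an uncertain "angry" prediction is queued for review rather than raised as an alert, which is one plausible way to avoid noisy moderation flags.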
- Core: Python 3.10+
- Deep Learning: PyTorch, Transformers (Hugging Face)
- Inference: ONNX Runtime, Optimum
- Audio Processing: Librosa, Torchaudio, FFmpeg
- Backend API: FastAPI, WebSockets
- Frontend UI: Streamlit
- Environment: uv
- Containerization: Docker
The application is deployed and available for testing on Hugging Face Spaces:
Note: The demo uses the optimized INT8 ONNX model for high-performance inference on CPU.
.
├── data/
│ ├── raw/ # Raw audio files (from Kaggle)
│ └── processed/ # Harmonized metadata and stratified splits
├── models/
│ ├── onnx_quantized/ # Optimized INT8 model ready for deployment
│ └── wav2vec2/ # Original fine-tuned PyTorch weights
├── src/
│ ├── api/
│ │ └── app.py # FastAPI backend with WebSocket support
│ ├── data/
│ │ └── harmonize.py # Data standardization and splitting script
│ ├── features/ # Feature extraction logic
│ ├── models/
│ │ ├── train.py # Head-only training script
│ │ ├── predict.py # Inference script
│ │ └── optimize.py # ONNX export and quantization script
│ └── ui/
│   ├── app.py # Frontend for Microservice mode
│   └── app_standalone.py # Monolithic app (recommended for demos)
├── docs/
│ └── VigilAudio_Fine_Tuning.ipynb # Colab notebook for full model training
├── Dockerfile # Multi-stage Docker build
├── pyproject.toml # Dependency management
└── README.md # Project documentation
The project utilizes the Audio Emotions Dataset sourced from Kaggle.
- Source: Audio Emotions Dataset (Kaggle)
- Task: Multi-class classification of emotional states from speech.
- Classes: Angry, Disgusted, Fearful, Happy, Neutral, Sad, Surprised.
- Processing: The `src/data/harmonize.py` script standardizes folder structures, validates audio files, and generates a stratified 80/10/10 split for robust training.
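As a rough illustration of what a stratified 80/10/10 split involves, the sketch below shuffles each class separately and slices it into train, validation, and test portions. It is not the contents of `src/data/harmonize.py`; the function name `stratified_split`, its parameters, and the `(path, label)` input format are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(samples, train=0.8, val=0.1, seed=42):
    """Split (path, label) pairs into train/val/test, keeping class proportions.

    Whatever remains after the train and val slices goes to test.
    """
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = int(len(items) * train)
        n_val = int(len(items) * val)
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits
```

Splitting per class rather than over the whole dataset keeps rare emotions represented in the validation and test sets.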
The model development followed a rigorous experimental path, moving from a simple baseline to a highly optimized production model.
Initially, a simple Multi-Layer Perceptron (MLP) was trained on top of frozen Wav2Vec2 embeddings. This approach was computationally cheap but reached only 52% accuracy, demonstrating the need for full fine-tuning.
We fine-tuned the entire Wav2Vec2 transformer on the dataset using a T4 GPU (via Google Colab). This allowed the model to learn acoustic features specific to emotional expression, raising accuracy to 82%.
To ensure the model could run in real time on CPUs, we exported it to ONNX and applied INT8 quantization. This reduced the model size by about 3x (361 MB to 116 MB) and cut latency by 1.85x (370 ms to 199 ms). Accuracy rose slightly, from 82% to 84%, possibly due to a regularization effect of quantization.
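To make the idea of INT8 quantization concrete, the sketch below applies symmetric per-tensor quantization to a handful of weights in plain Python. It is a teaching example only: the project relied on ONNX tooling for the actual export, whose scheme may differ (for example asymmetric or per-channel), and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Each weight now fits in 1 byte instead of 4 (float32); the round trip
# is off by at most half a quantization step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing weights as 8-bit integers would give up to a 4x reduction for the quantized tensors; a smaller overall reduction is expected whenever some tensors are left in floating point.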
| Model Version | Accuracy | Latency (ms) | Speedup | Size (MB) |
|---|---|---|---|---|
| Baseline (Head-only) | 52.0% | < 50 | ~7x | 3.5 |
| PyTorch (Full) | 82.0% | 370 | 1.00x | 361 |
| ONNX (Standard) | 82.0% | 306.52 | 1.21x | 361 |
| ONNX (INT8) | 84.0% | 199 | 1.85x | 116 |
Note: The Baseline represents a simple neural network head trained on frozen Wav2Vec2 embeddings, illustrating the significant gain achieved by fine-tuning the backbone.
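The Speedup column is the PyTorch latency divided by each model's latency, and the size reduction follows the same pattern. The short script below recomputes both from the figures in the table; the dictionary layout and the helper name `relative_gains` are assumptions for illustration. With the rounded latencies shown, the INT8 ratio comes out near 1.86, close to the reported 1.85x, which was presumably derived from unrounded timings.

```python
# Latency (ms) and size (MB) as reported in the benchmark table.
results = {
    "PyTorch (Full)": {"latency_ms": 370.0, "size_mb": 361.0},
    "ONNX (Standard)": {"latency_ms": 306.52, "size_mb": 361.0},
    "ONNX (INT8)": {"latency_ms": 199.0, "size_mb": 116.0},
}

baseline = results["PyTorch (Full)"]


def relative_gains(name):
    """Return (speedup, size reduction) relative to the PyTorch model."""
    row = results[name]
    return (
        baseline["latency_ms"] / row["latency_ms"],
        baseline["size_mb"] / row["size_mb"],
    )


for name in results:
    speedup, shrink = relative_gains(name)
    print(f"{name}: {speedup:.2f}x faster, {shrink:.2f}x smaller")
```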
- Python 3.10+
- `uv` (recommended) or `pip`
- `ffmpeg` (installed via system package manager)
git clone https://github.com/yourusername/vigilaudio.git
cd vigilaudio
uv sync
- Download the quantized model (`wav2vec2_quantized.zip`) from the Releases page.
- Extract it to `models/onnx_quantized/`.
Best for quick testing or Hugging Face Spaces. Runs the model and UI in a single process.
uv run streamlit run src/ui/app_standalone.py
- Access: http://localhost:8501
Decouples the API from the frontend.
Terminal 1 (Backend API):
uv run uvicorn src.api.app:app --host 0.0.0.0 --port 8000
Terminal 2 (Frontend UI):
uv run streamlit run src/ui/app.py
The application is fully containerized and ready for deployment.
docker build -t vigilaudio .
docker run -p 8501:8501 vigilaudio
Access the app at http://localhost:8501.
Contributions are welcome! Please open an issue or submit a PR for improvements to the streaming logic or additional emotion classes.
MIT