
VigilAudio: AI-Powered Audio Moderation Engine

VigilAudio is a production-ready audio analysis engine designed to detect aggression, distress, and safety risks by analyzing the tone of voice. Built as the audio foundation for a multimodal moderation suite, it utilizes fine-tuned Transformers and optimized ONNX inference to deliver high-speed, real-time emotion detection.

Features

  • State-of-the-Art Architecture: Fine-tuned Wav2Vec2 Transformer achieving 84% accuracy.
  • Real-time Streaming: WebSocket API for low-latency audio processing (~199ms).
  • Active Learning: Automatically captures low-confidence samples to build a "Data Flywheel" for continuous improvement (see the routing sketch after this list).
  • Moderation Alerts: UI automatically flags high-intensity negative emotions (Anger, Fear).
  • Edge Optimized: INT8 Quantized ONNX model runs nearly 2x faster (1.85x speedup) than the full-precision PyTorch model.
  • Dual Deployment: Supports both a Monolithic (Standalone) and Microservice (API + UI) architecture.
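
The Active Learning and Moderation Alerts features above amount to a routing decision on each prediction. The sketch below is purely illustrative: the confidence threshold, review-queue path, and class names are assumptions, not the project's actual implementation.

# Illustrative routing for low-confidence capture and moderation alerts.
# The 0.6 threshold and data/review_queue path are assumptions.
import shutil
from pathlib import Path

ALERT_CLASSES = {"Angry", "Fearful"}
CONFIDENCE_FLOOR = 0.6

def route_prediction(audio_path: str, label: str, confidence: float) -> str:
    if confidence < CONFIDENCE_FLOOR:
        # Low-confidence clip: save it for labeling and future fine-tuning.
        Path("data/review_queue").mkdir(parents=True, exist_ok=True)
        shutil.copy(audio_path, "data/review_queue/")
        return "queued-for-review"
    if label in ALERT_CLASSES:
        # High-confidence negative emotion: surface an alert in the UI.
        return "moderation-alert"
    return "ok"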

Tech Stack

  • Core: Python 3.10+
  • Deep Learning: PyTorch, Transformers (Hugging Face)
  • Inference: ONNX Runtime, Optimum
  • Audio Processing: Librosa, Torchaudio, FFmpeg
  • Backend API: FastAPI, WebSockets
  • Frontend UI: Streamlit
  • Environment: uv
  • Containerization: Docker

Live Demo

The application is deployed and available for testing on Hugging Face Spaces:

VigilAudio Live Demo

Note: The demo uses the optimized INT8 ONNX model for high-performance inference on CPU.

Project Structure

.
├── data/
│   ├── raw/            # Raw audio files (from Kaggle)
│   └── processed/      # Harmonized metadata and stratified splits
├── models/
│   ├── onnx_quantized/ # Optimized INT8 model ready for deployment
│   └── wav2vec2/       # Original fine-tuned PyTorch weights
├── src/
│   ├── api/
│   │   └── app.py      # FastAPI backend with WebSocket support
│   ├── data/
│   │   └── harmonize.py # Data standardization and splitting script
│   ├── features/       # Feature extraction logic
│   ├── models/
│   │   ├── train.py    # Head-only training script
│   │   ├── predict.py  # Inference script
│   │   └── optimize.py # ONNX export and quantization script
│   └── ui/
│       ├── app.py      # Frontend for Microservice mode
│       └── app_standalone.py # Monolithic app (recommended for demos)
├── docs/
│   └── VigilAudio_Fine_Tuning.ipynb # Colab notebook for full model training
├── Dockerfile          # Multi-stage Docker build
├── pyproject.toml      # Dependency management
└── README.md           # Project documentation

Dataset

The project utilizes the Audio Emotions Dataset sourced from Kaggle.

  • Source: Audio Emotions Dataset (Kaggle)
  • Task: Multi-class classification of emotional states from speech.
  • Classes: Angry, Disgusted, Fearful, Happy, Neutral, Sad, Surprised.
  • Processing: The src/data/harmonize.py script standardizes folder structures, validates audio files, and generates a stratified 80/10/10 split for robust training.
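
For reference, a stratified 80/10/10 split like the one harmonize.py produces can be expressed in a few lines of scikit-learn. This is a minimal sketch assuming a metadata table with path and emotion columns, not the script's actual code.

# Minimal sketch of a stratified 80/10/10 split (assumed metadata layout).
import pandas as pd
from sklearn.model_selection import train_test_split

metadata = pd.read_csv("data/processed/metadata.csv")  # hypothetical file

# 80% train, 20% holdout, stratified by emotion label.
train_df, holdout_df = train_test_split(
    metadata, test_size=0.2, stratify=metadata["emotion"], random_state=42
)
# Split the holdout in half: 10% validation, 10% test.
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["emotion"], random_state=42
)

train_df.to_csv("data/processed/train.csv", index=False)
val_df.to_csv("data/processed/val.csv", index=False)
test_df.to_csv("data/processed/test.csv", index=False)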

Model Training & Performance

The model development followed a rigorous experimental path, moving from a simple baseline to a highly optimized production model.

1. Baseline: Head-Only Training

Initially, a simple Multi-Layer Perceptron (MLP) was trained on top of frozen Wav2Vec2 embeddings. This approach was computationally cheap but topped out at 52% accuracy, demonstrating the need for full fine-tuning.
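
In outline, the baseline looks like the sketch below: a frozen backbone used as a feature extractor, with only a small head receiving gradients. The checkpoint name, head sizes, and pooling choice are assumptions rather than the exact contents of src/models/train.py.

# Head-only baseline: MLP on mean-pooled, frozen Wav2Vec2 embeddings (illustrative).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False          # freeze the transformer

head = nn.Sequential(                # the only trainable parameters
    nn.Linear(backbone.config.hidden_size, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 7),               # 7 emotion classes
)

def embed(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (batch, samples) at 16 kHz; mean-pool hidden states over time.
    with torch.no_grad():
        hidden = backbone(waveform).last_hidden_state
    return hidden.mean(dim=1)

# Per batch: logits = head(embed(waveform)); optimize cross-entropy on the labels.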

2. Fine-Tuning

We fine-tuned the entire Wav2Vec2 transformer on the dataset using a T4 GPU (via Google Colab). This allowed the model to learn acoustic features specific to emotional expression, raising accuracy to 82%.
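
A condensed version of that fine-tuning run, using the Hugging Face Trainer, is sketched below. The base checkpoint, hyperparameters, and dataset variables are placeholders; the authoritative recipe is in docs/VigilAudio_Fine_Tuning.ipynb.

# Sketch of full fine-tuning with the Hugging Face Trainer (hyperparameters illustrative).
from transformers import Wav2Vec2ForSequenceClassification, TrainingArguments, Trainer

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=7   # all backbone weights stay trainable
)

args = TrainingArguments(
    output_dir="models/wav2vec2",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=3e-5,
    num_train_epochs=5,
    fp16=True,                               # mixed precision on the T4 GPU
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,             # assumed: preprocessed 16 kHz datasets
    eval_dataset=val_dataset,                # prepared upstream (e.g. from the splits above)
)
trainer.train()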

3. Optimization (ONNX + Quantization)

To ensure the model could run in real time on CPUs, we exported it to ONNX and applied INT8 quantization. This cut the model size by roughly 3x (361 MB to 116 MB) and delivered a 1.85x latency speedup, with a slight increase in accuracy (likely due to the regularization effect of quantization).
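
The export-and-quantize step can be reproduced with Hugging Face Optimum roughly as follows. Paths and the quantization config are assumptions; src/models/optimize.py is the authoritative script.

# Export the fine-tuned model to ONNX, then apply dynamic INT8 quantization (sketch).
from optimum.onnxruntime import ORTModelForAudioClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the PyTorch checkpoint to ONNX.
onnx_model = ORTModelForAudioClassification.from_pretrained("models/wav2vec2", export=True)
onnx_model.save_pretrained("models/onnx")

# Dynamic INT8 quantization targeting modern x86 CPUs.
quantizer = ORTQuantizer.from_pretrained("models/onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="models/onnx_quantized", quantization_config=qconfig)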

Performance Summary

Model Version          Accuracy   Latency (ms)   Speedup   Size (MB)
Baseline (Head-only)   52.0%      < 50           ~7x       3.5
PyTorch (Full)         82.0%      370            1.00x     361
ONNX (Standard)        82.0%      307            1.21x     361
ONNX (INT8)            84.0%      199            1.85x     116

Note: The Baseline represents a simple neural network head trained on frozen Wav2Vec2 embeddings, illustrating the significant gain achieved by fine-tuning the backbone.
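
Latency figures of this kind can be reproduced with a small timing loop over an ONNX Runtime session. The model filename and input name below are assumptions about the exported graph.

# Rough CPU latency measurement for the quantized model (filenames are assumptions).
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "models/onnx_quantized/model_quantized.onnx",
    providers=["CPUExecutionProvider"],
)
dummy = np.random.randn(1, 3 * 16000).astype(np.float32)   # 3 s of 16 kHz audio

for _ in range(5):                                          # warm-up runs
    session.run(None, {"input_values": dummy})

runs = 50
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {"input_values": dummy})
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")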

Setup and Usage

Prerequisites

  • Python 3.10+
  • uv (recommended) or pip
  • ffmpeg (installed via system package manager)

1. Clone the Repository

git clone https://github.com/nice-bills/vigil-audio.git
cd vigil-audio

2. Environment Setup

uv sync

3. Download Model Weights

  1. Download the quantized model (wav2vec2_quantized.zip) from the Releases page.
  2. Extract it to models/onnx_quantized/.

4. Run the Application

Mode A: Standalone Demo (Recommended)

Best for quick testing or Hugging Face Spaces. Runs Model + UI in a single process.

uv run streamlit run src/ui/app_standalone.py
  • Access: http://localhost:8501

Mode B: Microservice Architecture (Production)

Decouples the API from the frontend.

Terminal 1 (Backend API):

uv run uvicorn src.api.app:app --host 0.0.0.0 --port 8000

Terminal 2 (Frontend UI):

uv run streamlit run src/ui/app.py
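
With both services running, the WebSocket endpoint can be exercised with a small client like the one below. The endpoint path ("/ws") and the raw-chunk message format are assumptions; check src/api/app.py for the actual contract.

# Minimal streaming client for the WebSocket API (endpoint path is an assumption).
import asyncio
import websockets

async def stream(path: str) -> None:
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        with open(path, "rb") as f:
            while chunk := f.read(32000):        # ~1 s of 16 kHz 16-bit mono audio
                await ws.send(chunk)
                print(await ws.recv())           # per-chunk emotion prediction

asyncio.run(stream("sample.wav"))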

Docker Deployment

The application is fully containerized and ready for deployment.

Build the Image

docker build -t vigilaudio .

Run the Container

docker run -p 8501:8501 vigilaudio

Access the app at http://localhost:8501.

Contributing

Contributions are welcome! Please open an issue or submit a PR for improvements to the streaming logic or additional emotion classes.

License

MIT
