This project focuses on the critical task of detecting AI-generated speech (deepfake audio). By leveraging advanced deep learning techniques, the goal is to build a robust and transparent system capable of distinguishing between authentic human speech and synthetically generated voice clips. The core of the project involves a pipeline that combines GAN-based audio generation for dataset enrichment, a powerful wav2vec2-based transformer model for classification, and Explainable AI (XAI) methods to ensure the model's decisions are transparent and trustworthy.
The primary motivation is to combat the escalating threat posed by synthetic voices, which are increasingly used for misinformation campaigns, sophisticated fraud, and malicious media manipulation. This work aims to provide a reliable tool for audio forensics and contribute to the broader research field dedicated to ensuring digital media authenticity.
As deepfake audio technology becomes more realistic and widely accessible, the ability to differentiate between real and synthetic speech is paramount for security and trust. This project directly addresses this challenge by developing and evaluating models to detect AI-generated audio from various sources, including known AI samples and newly generated synthetic audio.
Categories of Audio:
- Real: Authentic speech samples from human speakers.
- Fake: A collection of known AI-generated audio samples.
- GAN_Fake: Additional synthetic audio generated specifically for this project using a WaveGAN to augment the training data.
- GAN-based Fake Audio Generation: Utilizes a WaveGAN model to generate novel, realistic synthetic audio samples, which are then used to expand and diversify the training dataset.
- Advanced Data Augmentation: Employs techniques such as noise addition, time-stretching, pitch-shifting, and temporal shifting to enhance the model's robustness and prevent overfitting.
- High-Performance Wav2Vec2 Classifier: A state-of-the-art pretrained transformer model from Hugging Face is fine-tuned to perform the binary classification task of identifying real vs. fake speech.
- Explainability and Transparency (XAI): Integrates leading XAI frameworks like SHAP, LIME, and Grad-CAM to interpret the model's predictions, providing insight into which audio features drive its decisions.
- Comprehensive Evaluation Metrics: Model performance is assessed using a suite of metrics including Accuracy, F1-Score, Confusion Matrices, and Fréchet Audio Distance (FAD) to measure the quality of generated audio.
- Deployment-Ready: The final model is packaged into a Flask/FastAPI service, allowing for real-world inference through a simple API endpoint.
- Deep Learning: PyTorch, Hugging Face Transformers
- Generative Adversarial Networks (GANs): WaveGAN for fake audio generation
- Audio Processing: Librosa, Audiomentations
- Explainable AI (XAI): SHAP, LIME, Grad-CAM
- Web Deployment: Flask, FastAPI
- Data Handling & Visualization: Pandas, Hugging Face Datasets, Matplotlib, Seaborn
- Development Environment: Google Colab, Google Drive for data storage and model checkpointing
The project is organized into modular components for clarity and scalability.
.
├── data/
│   ├── real_audio/
│   ├── fake_audio/
│   └── processed/
│
├── gan/
│   ├── train_wavegan.py
│   ├── generate_fake_audio.py
│
├── classifier/
│   ├── train_wav2vec2.py
│   ├── evaluate.py
│
├── augmentation/
│   ├── pipeline.py
│
├── explainability/
│   ├── shap_explain.py
│   ├── lime_explain.py
│   ├── gradcam_audio.py
│
├── app/
│   └── server.py
│
└── README.md
The project follows a structured workflow from data acquisition to deployment.
The initial phase involves collecting a dataset of both real human speech samples and existing AI-generated speech. To manage the large dataset (over 120GB) within the constraints of Google Colab, a partial loading strategy was implemented, copying a subset of approximately 5000 files for faster experimentation and iteration.
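A minimal sketch of this kind of partial copy (the paths, file pattern, and subset size here are illustrative, not the project's actual Drive locations):

```python
import random
import shutil
from pathlib import Path

# Illustrative paths; the real Drive/Colab locations will differ.
SOURCE_DIR = Path("/content/drive/MyDrive/audio_dataset/real_audio")
LOCAL_DIR = Path("/content/data/real_audio")
SUBSET_SIZE = 5000  # approximate subset size used for faster iteration

LOCAL_DIR.mkdir(parents=True, exist_ok=True)

# Sample a fixed-size subset of .wav files and copy them to fast local storage.
all_files = sorted(SOURCE_DIR.glob("*.wav"))
subset = random.sample(all_files, min(SUBSET_SIZE, len(all_files)))
for src in subset:
    shutil.copy2(src, LOCAL_DIR / src.name)

print(f"Copied {len(subset)} of {len(all_files)} files to {LOCAL_DIR}")
```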
To improve model robustness and its ability to generalize to unseen data, various data augmentation techniques were applied using the `audiomentations` library. The augmentation pipeline randomly applies transformations such as:
- Gaussian Noise Addition: Simulates noisy environments.
- Time Stretching: Varies the speed of the audio without changing the pitch.
- Pitch Shifting: Alters the pitch without changing the speed.
- Shifting: Temporally shifts the audio clip.
This process helps prevent overfitting and ensures the model performs well on diverse and potentially degraded audio inputs.
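A minimal sketch of such a pipeline with `audiomentations` (the probabilities and parameter ranges are illustrative, not the project's exact values):

```python
import numpy as np
import librosa
from audiomentations import AddGaussianNoise, Compose, PitchShift, Shift, TimeStretch

# Example augmentation pipeline; each transform is applied with probability p.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),  # simulate noisy environments
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),                     # change speed, preserve pitch
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),               # change pitch, preserve speed
    Shift(p=0.5),                                                       # temporal shift of the clip
])

def augment_file(path: str, sample_rate: int = 16000) -> np.ndarray:
    """Load a clip and return a randomly augmented copy of the waveform."""
    waveform, _ = librosa.load(path, sr=sample_rate, mono=True)
    return augment(samples=waveform.astype(np.float32), sample_rate=sample_rate)
```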
A key component of this project is the generation of a new, custom dataset of fake audio using a WaveGAN. This generative model is trained on real audio to learn the underlying patterns of human speech.
The Generator network is designed with a series of 1D transposed convolutional layers that systematically upsample a random noise vector into a realistic audio waveform. The final layer uses a `Tanh` activation function to normalize the output waveform to the range [-1, 1].
Once the WaveGAN is trained, the generator is used to produce thousands of synthetic `.wav` files. These `GAN_Fake` samples are then added to the training set, teaching the classifier to recognize a wider variety of AI-generated audio artifacts beyond those present in the initial "Fake" dataset.
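A compact sketch of what such a generator and the sample-generation step might look like in PyTorch (the latent size, layer widths, 16384-sample output length, and file names are illustrative assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn
import soundfile as sf

LATENT_DIM = 100       # size of the input noise vector (illustrative)
SAMPLE_RATE = 16000

class WaveGANGenerator(nn.Module):
    """Minimal WaveGAN-style generator: upsamples noise into a 16384-sample waveform."""
    def __init__(self, latent_dim: int = LATENT_DIM, model_dim: int = 64):
        super().__init__()
        self.model_dim = model_dim
        self.fc = nn.Linear(latent_dim, 16 * 16 * model_dim)
        self.deconv = nn.Sequential(
            nn.ConvTranspose1d(16 * model_dim, 8 * model_dim, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose1d(8 * model_dim, 4 * model_dim, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose1d(4 * model_dim, 2 * model_dim, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose1d(2 * model_dim, model_dim, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose1d(model_dim, 1, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.Tanh(),  # squash output samples into [-1, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.fc(z).view(-1, 16 * self.model_dim, 16)
        return self.deconv(x)

# Generate synthetic clips with an (already trained) generator.
generator = WaveGANGenerator()  # in practice, load trained weights here
generator.eval()
with torch.no_grad():
    for i in range(10):  # the real pipeline produces thousands of clips
        z = torch.randn(1, LATENT_DIM)
        waveform = generator(z).squeeze().cpu().numpy()
        sf.write(f"gan_fake_{i:05d}.wav", waveform, SAMPLE_RATE)
```

Each `ConvTranspose1d` layer quadruples the temporal resolution (16 → 64 → 256 → 1024 → 4096 → 16384 samples), mirroring the upsampling pattern described above.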
The real audio files, the pre-existing fake audio files, and the newly generated GAN-fake files are combined into a single dataset. Each file is assigned a label (1 for Real, 0 for Fake). This information is compiled into a master CSV file, which is then loaded using the Hugging Face `datasets` library. The library's `cast_column("path", Audio())` feature efficiently loads audio data on-the-fly during training. Preprocessing steps also include validation, silence removal, and conversion to a fixed length.
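A minimal loading sketch with the `datasets` library (the CSV file name and the `path`/`label` column names are assumptions for illustration):

```python
import pandas as pd
from datasets import Audio, Dataset

# Assumed CSV layout: one row per clip with columns "path" and "label" (1 = Real, 0 = Fake).
df = pd.read_csv("dataset_index.csv")
dataset = Dataset.from_pandas(df)

# Decode audio lazily at a fixed sampling rate; files are read on-the-fly during training.
dataset = dataset.cast_column("path", Audio(sampling_rate=16000))
dataset = dataset.train_test_split(test_size=0.2, seed=42)

sample = dataset["train"][0]
print(sample["path"]["array"].shape, sample["label"])
```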
Two primary classification models are trained and evaluated:
The main classifier is a Wav2Vec2ForSequenceClassification model from the Hugging Face library. This transformer model, pre-trained on vast amounts of unlabeled speech data, is fine-tuned for the binary task of distinguishing real from fake audio. The Hugging Face `Trainer` API is used to manage the training loop, optimization, and logging, simplifying the fine-tuning process.
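A condensed fine-tuning sketch (the base checkpoint, clip length, and hyperparameters are assumptions; `dataset` is the DatasetDict built in the previous sketch):

```python
from transformers import (
    AutoFeatureExtractor,
    Trainer,
    TrainingArguments,
    Wav2Vec2ForSequenceClassification,
)

MODEL_NAME = "facebook/wav2vec2-base"  # assumed base checkpoint
MAX_SAMPLES = 16000 * 4                # pad/truncate clips to 4 seconds (illustrative)

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def preprocess(batch):
    # Convert raw waveforms into the padded input_values the model expects.
    audio_arrays = [a["array"] for a in batch["path"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        max_length=MAX_SAMPLES,
        truncation=True,
        padding="max_length",
    )
    inputs["labels"] = batch["label"]
    return inputs

encoded = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="wav2vec2-deepfake-detector",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=3e-5,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()
trainer.evaluate()  # detailed metrics are computed separately (see the evaluation section)
```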
As an alternative, a lightweight Convolutional Neural Network (CNN) is also trained. This model operates not on the raw waveform but on its Mel-spectrogram representation, which visualizes the spectrum of frequencies in the audio over time. This serves as a baseline to compare against the more complex transformer-based approach.
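A minimal sketch of such a baseline (the Mel-spectrogram settings and CNN layer sizes are illustrative):

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def to_mel_spectrogram(path: str, sample_rate: int = 16000, n_mels: int = 64) -> torch.Tensor:
    """Convert an audio file into a log-scaled Mel-spectrogram tensor of shape (1, n_mels, frames)."""
    waveform, _ = librosa.load(path, sr=sample_rate, mono=True)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return torch.tensor(log_mel, dtype=torch.float32).unsqueeze(0)

class SpectrogramCNN(nn.Module):
    """Small CNN baseline over Mel-spectrograms for binary real/fake classification."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # handles variable-length clips
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))
```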
The performance of the trained classifiers is rigorously evaluated using multiple metrics:
- Accuracy & F1-Score: Standard metrics for classification performance.
- Confusion Matrix: A visualization tool to analyze the model's errors, showing false positives and false negatives.
- Fréchet Audio Distance (FAD): This advanced metric is used to measure the perceptual similarity between the distributions of real and GAN-generated audio, providing a quantitative measure of the quality of the fake samples.
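A short sketch of how the standard metrics can be computed with scikit-learn (the label arrays are toy values; FAD is not shown here, since it compares embedding distributions of real and generated audio and requires a separate embedding model):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# y_true / y_pred would come from running the trained classifier on the held-out split.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = Real, 0 = Fake (toy example)
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1])

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```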
To ensure the model is not a "black box," several XAI techniques are implemented to interpret its predictions:
- SHAP (SHapley Additive exPlanations): Determines the feature importance of different audio segments, showing which parts of a clip contributed most to its final classification.
- LIME (Local Interpretable Model-agnostic Explanations): Provides local explanations by creating an interpretable model around a specific prediction, answering why a particular audio file was flagged as fake.
- Grad-CAM: Generates heatmaps over audio spectrograms, visually highlighting the specific time-frequency regions the model focused on when making its decision (a minimal sketch follows this list).
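As one example of adapting an image-oriented XAI method to audio, here is a minimal Grad-CAM sketch for a spectrogram-based CNN such as the baseline above (the hook-based implementation and layer choice are illustrative, not the project's exact code):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, spectrogram, conv_layer, target_class=None):
    """Compute a Grad-CAM heatmap over a (1, 1, n_mels, frames) spectrogram batch."""
    activations, gradients = {}, {}

    def forward_hook(_module, _inputs, output):
        activations["value"] = output

    def backward_hook(_module, _grad_in, grad_out):
        gradients["value"] = grad_out[0]

    fwd = conv_layer.register_forward_hook(forward_hook)
    bwd = conv_layer.register_full_backward_hook(backward_hook)
    try:
        model.eval()
        logits = model(spectrogram)
        if target_class is None:
            target_class = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, target_class].backward()

        # Weight each feature map by its average gradient, then ReLU and normalize.
        weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
        cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=spectrogram.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        fwd.remove()
        bwd.remove()
    return cam.squeeze().detach().cpu().numpy(), target_class

# Hypothetical usage with the SpectrogramCNN baseline sketched earlier:
# model = SpectrogramCNN()
# heatmap, cls = grad_cam(model, to_mel_spectrogram("clip.wav").unsqueeze(0), model.features[6])
```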
The best-performing model is exported and exposed via a web API using Flask or FastAPI. This server provides a `/predict` endpoint that accepts an uploaded `.wav` file and returns a JSON response containing the prediction (Real/Fake) along with a confidence score. This makes the model accessible for real-world applications and integrations.
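A minimal FastAPI sketch of such an endpoint (the model directory, 16 kHz preprocessing, and the label mapping of 1 = Real follow the assumptions made earlier in this README):

```python
import io

import librosa
import soundfile as sf
import torch
from fastapi import FastAPI, File, UploadFile
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

MODEL_DIR = "wav2vec2-deepfake-detector"  # assumed path to the exported fine-tuned model

app = FastAPI()
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_DIR)
model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Read the uploaded .wav file, downmix to mono, and resample to 16 kHz.
    raw = await file.read()
    data, sr = sf.read(io.BytesIO(raw), dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)
    if sr != 16000:
        data = librosa.resample(y=data, orig_sr=sr, target_sr=16000)

    inputs = feature_extractor(data, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

    label = "Real" if probs.argmax().item() == 1 else "Fake"
    return {"prediction": label, "confidence": round(probs.max().item(), 4)}
```

Saved as `app/server.py` (per the project structure), this could be served from the `app/` directory with `uvicorn server:app --host 0.0.0.0 --port 8000`.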
- Handling Large Datasets: Managing a 120GB audio dataset in a cloud environment like Google Colab required strategies like partial dataset loading and efficient I/O to avoid memory and storage limitations.
- GAN Training Stability: Achieving stable training for GANs is notoriously difficult. It required careful tuning of hyperparameters to prevent issues like mode collapse and ensure the generation of high-quality audio.
- Resource Constraints: Balancing the dataset size, model complexity, and training time with the memory and runtime limits of the available compute resources was a constant challenge.
- Model Interpretability: Applying XAI tools designed for images or text to audio data required custom adaptations to effectively visualize and explain the decisions of a speech model.
- High Classification Accuracy: The fine-tuned Wav2Vec2 model is expected to achieve high accuracy (above 90%) on benchmark datasets, correctly distinguishing between human speech and audio from GANs or Text-to-Speech (TTS) systems.
- Trained Models: The project delivers a fine-tuned Wav2Vec2 model and a trained CNN model, both ready for inference.
- Generated Audio Samples: A collection of GAN-generated fake audio samples created during the project.
- Explainable Visualizations: A set of visualizations from SHAP, LIME, and Grad-CAM that provide transparency into the model's decision-making process, which is crucial for forensic applications.
- Production-Ready Export: The final model artifacts are saved in a format suitable for deployment in production environments.
- Expand Dataset: Train the models on larger, more diverse, and multi-lingual datasets to improve generalization across different languages and accents.
- Adversarial Robustness: Conduct robustness testing against adversarial attacks, where subtle, imperceptible noise is added to audio to fool the classifier.
- Edge/Mobile Optimization: Optimize the model for deployment on resource-constrained devices like mobile phones or edge hardware.
- Compare SOTA Models: Benchmark the Wav2Vec2 classifier against other state-of-the-art audio models such as Whisper or AudioLM.
This project is open-source and available under the MIT License.