This project focuses on the critical task of detecting AI-generated speech (deepfake audio). By leveraging advanced deep learning techniques, the goal is to build a robust and transparent system capable of distinguishing between authentic human speech and synthetically generated voice clips. The core of the project involves a pipeline that combines GAN-based audio generation for dataset enrichment, a powerful wav2vec2-based transformer model for classification, and Explainable AI (XAI) methods to ensure the model's decisions are transparent and trustworthy.
The primary motivation is to combat the escalating threat posed by synthetic voices, which are increasingly used for misinformation campaigns, sophisticated fraud, and malicious media manipulation. This work aims to provide a reliable tool for audio forensics and contribute to the broader research field dedicated to ensuring digital media authenticity.
As deepfake audio technology becomes more realistic and widely accessible, the ability to differentiate between real and synthetic speech is paramount for security and trust. This project directly addresses this challenge by developing and evaluating models to detect AI-generated audio from various sources, including known AI samples and newly generated synthetic audio.
Categories of Audio:
- Real: Authentic speech samples from human speakers.
- Fake: A collection of known AI-generated audio samples.
- GAN_Fake: Additional synthetic audio generated specifically for this project using a WaveGAN to augment the training data.
- GAN-based Fake Audio Generation: Utilizes a WaveGAN model to generate novel, realistic synthetic audio samples, which are then used to expand and diversify the training dataset.
- Advanced Data Augmentation: Employs techniques such as noise addition, time-stretching, pitch-shifting, and temporal shifting to enhance the model's robustness and prevent overfitting.
- High-Performance Wav2Vec2 Classifier: A state-of-the-art pretrained transformer model from Hugging Face is fine-tuned to perform the binary classification task of identifying real vs. fake speech.
- Explainability and Transparency (XAI): Integrates leading XAI frameworks like SHAP, LIME, and Grad-CAM to interpret the model's predictions, providing insight into which audio features drive its decisions.
- Comprehensive Evaluation Metrics: Model performance is assessed using a suite of metrics including Accuracy, F1-Score, Confusion Matrices, and Fréchet Audio Distance (FAD) to measure the quality of generated audio.
- Deployment-Ready: The final model is packaged into a Flask/FastAPI service, allowing for real-world inference through a simple API endpoint.
- Deep Learning: PyTorch, Hugging Face Transformers
- Generative Adversarial Networks (GANs): WaveGAN for fake audio generation
- Audio Processing: Librosa, Audiomentations
- Explainable AI (XAI): SHAP, LIME, Grad-CAM
- Web Deployment: Flask, FastAPI
- Data Handling & Visualization: Pandas, Hugging Face Datasets, Matplotlib, Seaborn
- Development Environment: Google Colab, Google Drive for data storage and model checkpointing
The project is organized into modular components for clarity and scalability.
.
├── data/
│   ├── real_audio/
│   ├── fake_audio/
│   └── processed/
│
├── gan/
│   ├── train_wavegan.py
│   ├── generate_fake_audio.py
│
├── classifier/
│   ├── train_wav2vec2.py
│   ├── evaluate.py
│
├── augmentation/
│   ├── pipeline.py
│
├── explainability/
│   ├── shap_explain.py
│   ├── lime_explain.py
│   ├── gradcam_audio.py
│
├── app/
│   └── server.py
│
└── README.md
The project follows a structured workflow from data acquisition to deployment.
The initial phase involves collecting a dataset of both real human speech samples and existing AI-generated speech. To manage the large dataset (over 120GB) within the constraints of Google Colab, a partial loading strategy was implemented, copying a subset of approximately 5000 files for faster experimentation and iteration.
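A minimal sketch of this kind of partial copy (the paths, file pattern, and subset size here are illustrative, not the project's actual Drive locations):

```python
import random
import shutil
from pathlib import Path

# Illustrative paths; the real Drive/Colab locations will differ.
SOURCE_DIR = Path("/content/drive/MyDrive/audio_dataset/real_audio")
LOCAL_DIR = Path("/content/data/real_audio")
SUBSET_SIZE = 5000  # approximate subset size used for faster iteration

LOCAL_DIR.mkdir(parents=True, exist_ok=True)

# Sample a fixed-size subset of .wav files and copy them to fast local storage.
all_files = sorted(SOURCE_DIR.glob("*.wav"))
subset = random.sample(all_files, min(SUBSET_SIZE, len(all_files)))
for src in subset:
    shutil.copy2(src, LOCAL_DIR / src.name)

print(f"Copied {len(subset)} of {len(all_files)} files to {LOCAL_DIR}")
```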
To improve model robustness and its ability to generalize to unseen data, various data augmentation techniques were applied using the `audiomentations` library. The augmentation pipeline randomly applies transformations such as:
- Gaussian Noise Addition: Simulates noisy environments.
- Time Stretching: Varies the speed of the audio without changing the pitch.
- Pitch Shifting: Alters the pitch without changing the speed.
- Shifting: Temporally shifts the audio clip.
This process helps prevent overfitting and ensures the model performs well on diverse and potentially degraded audio inputs.
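A minimal sketch of such a pipeline with `audiomentations` (the probabilities and parameter ranges are illustrative, not the project's exact values):

```python
import numpy as np
import librosa
from audiomentations import AddGaussianNoise, Compose, PitchShift, Shift, TimeStretch

# Example augmentation pipeline; each transform is applied with probability p.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),  # simulate noisy environments
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),                     # change speed, preserve pitch
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),               # change pitch, preserve speed
    Shift(p=0.5),                                                       # temporal shift of the clip
])

def augment_file(path: str, sample_rate: int = 16000) -> np.ndarray:
    """Load a clip and return a randomly augmented copy of the waveform."""
    waveform, _ = librosa.load(path, sr=sample_rate, mono=True)
    return augment(samples=waveform.astype(np.float32), sample_rate=sample_rate)
```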
A key component of this project is the generation of a new, custom dataset of fake audio using a WaveGAN. This generative model is trained on real audio to learn the underlying patterns of human speech.
The Generator network is designed with a series of 1D transposed convolutional layers that systematically upsample a random noise vector into a realistic audio waveform. The final layer uses a `Tanh` activation function to normalize the output waveform to the range [-1, 1].
Once the WaveGAN is trained, the generator is used to produce thousands of synthetic `.wav` files. These `GAN_Fake` samples are then added to the training set, teaching the classifier to recognize a wider variety of AI-generated audio artifacts beyond those present in the initial "Fake" dataset.
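A compact sketch of what such a generator and the sample-generation step might look like in PyTorch (the latent size, layer widths, 16384-sample output length, and file names are illustrative assumptions, not the project's exact configuration):

```python
import torch
import torch.nn as nn
import soundfile as sf

LATENT_DIM = 100       # size of the input noise vector (illustrative)
SAMPLE_RATE = 16000

class WaveGANGenerator(nn.Module):
    """Minimal WaveGAN-style generator: upsamples noise into a 16384-sample waveform."""
    def __init__(self, latent_dim: int = LATENT_DIM, model_dim: int = 64):
        super().__init__()
        self.model_dim = model_dim
        self.fc = nn.Linear(latent_dim, 16 * 16 * model_dim)
        self.deconv = nn.Sequential(
            nn.ConvTranspose1d(16 * model_dim, 8 * model_dim, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose1d(8 * model_dim, 4 * model_dim, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose1d(4 * model_dim, 2 * model_dim, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose1d(2 * model_dim, model_dim, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose1d(model_dim, 1, kernel_size=25, stride=4, padding=11, output_padding=1),
            nn.Tanh(),  # squash output samples into [-1, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.fc(z).view(-1, 16 * self.model_dim, 16)
        return self.deconv(x)

# Generate synthetic clips with an (already trained) generator.
generator = WaveGANGenerator()  # in practice, load trained weights here
generator.eval()
with torch.no_grad():
    for i in range(10):  # the real pipeline produces thousands of clips
        z = torch.randn(1, LATENT_DIM)
        waveform = generator(z).squeeze().cpu().numpy()
        sf.write(f"gan_fake_{i:05d}.wav", waveform, SAMPLE_RATE)
```

Each `ConvTranspose1d` layer quadruples the temporal resolution (16 → 64 → 256 → 1024 → 4096 → 16384 samples), mirroring the upsampling pattern described above.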
The real audio files, the pre-existing fake audio files, and the newly generated GAN-fake files are combined into a single dataset. Each file is assigned a label (1 for Real, 0 for Fake). This information is compiled into a master CSV file, which is then loaded using the Hugging Face `datasets` library. The library's `cast_column("path", Audio())` feature efficiently loads audio data on-the-fly during training. Preprocessing steps also include validation, silence removal, and conversion to a fixed length.
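A minimal loading sketch with the `datasets` library (the CSV file name and the `path`/`label` column names are assumptions for illustration):

```python
import pandas as pd
from datasets import Audio, Dataset

# Assumed CSV layout: one row per clip with columns "path" and "label" (1 = Real, 0 = Fake).
df = pd.read_csv("dataset_index.csv")
dataset = Dataset.from_pandas(df)

# Decode audio lazily at a fixed sampling rate; files are read on-the-fly during training.
dataset = dataset.cast_column("path", Audio(sampling_rate=16000))
dataset = dataset.train_test_split(test_size=0.2, seed=42)

sample = dataset["train"][0]
print(sample["path"]["array"].shape, sample["label"])
```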
Two primary classification models are trained and evaluated:
The main classifier is a Wav2Vec2ForSequenceClassification model from the Hugging Face library. This transformer model, pre-trained on vast amounts of unlabeled speech data, is fine-tuned for the binary task of distinguishing real from fake audio. The Hugging Face `Trainer` API is used to manage the training loop, optimization, and logging, simplifying the fine-tuning process.
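A condensed fine-tuning sketch (the base checkpoint, clip length, and hyperparameters are assumptions; `dataset` is the DatasetDict built in the previous sketch):

```python
from transformers import (
    AutoFeatureExtractor,
    Trainer,
    TrainingArguments,
    Wav2Vec2ForSequenceClassification,
)

MODEL_NAME = "facebook/wav2vec2-base"  # assumed base checkpoint
MAX_SAMPLES = 16000 * 4                # pad/truncate clips to 4 seconds (illustrative)

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def preprocess(batch):
    # Convert raw waveforms into the padded input_values the model expects.
    audio_arrays = [a["array"] for a in batch["path"]]
    inputs = feature_extractor(
        audio_arrays,
        sampling_rate=16000,
        max_length=MAX_SAMPLES,
        truncation=True,
        padding="max_length",
    )
    inputs["labels"] = batch["label"]
    return inputs

encoded = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="wav2vec2-deepfake-detector",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=3e-5,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()
trainer.evaluate()  # detailed metrics are computed separately (see the evaluation section)
```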
As an alternative, a lightweight Convolutional Neural Network (CNN) is also trained. This model operates not on the raw waveform but on its Mel-spectrogram representation, which visualizes the spectrum of frequencies in the audio over time. This serves as a baseline to compare against the more complex transformer-based approach.
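A minimal sketch of such a baseline (the Mel-spectrogram settings and CNN layer sizes are illustrative):

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def to_mel_spectrogram(path: str, sample_rate: int = 16000, n_mels: int = 64) -> torch.Tensor:
    """Convert an audio file into a log-scaled Mel-spectrogram tensor of shape (1, n_mels, frames)."""
    waveform, _ = librosa.load(path, sr=sample_rate, mono=True)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return torch.tensor(log_mel, dtype=torch.float32).unsqueeze(0)

class SpectrogramCNN(nn.Module):
    """Small CNN baseline over Mel-spectrograms for binary real/fake classification."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # handles variable-length clips
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))
```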
The performance of the trained classifiers is rigorously evaluated using multiple metrics:
- Accuracy & F1-Score: Standard metrics for classification performance.
- Confusion Matrix: A visualization tool to analyze the model's errors, showing false positives and false negatives.
- Fréchet Audio Distance (FAD): This advanced metric is used to measure the perceptual similarity between the distributions of real and GAN-generated audio, providing a quantitative measure of the quality of the fake samples.
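A short sketch of how the standard metrics can be computed with scikit-learn (the label arrays are toy values; FAD is not shown here, since it compares embedding distributions of real and generated audio and requires a separate embedding model):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# y_true / y_pred would come from running the trained classifier on the held-out split.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = Real, 0 = Fake (toy example)
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1])

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```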
To ensure the model is not a "black box," several XAI techniques are implemented to interpret its predictions:
- SHAP (SHapley Additive exPlanations): Determines the feature importance of different audio segments, showing which parts of a clip contributed most to its final classification.
- LIME (Local Interpretable Model-agnostic Explanations): Provides local explanations by creating an interpretable model around a specific prediction, answering why a particular audio file was flagged as fake.
- Grad-CAM: Generates heatmaps over audio spectrograms, visually highlighting the specific time-frequency regions the model focused on when making its decision (a minimal sketch follows this list).
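As one example of adapting an image-oriented XAI method to audio, here is a minimal Grad-CAM sketch for a spectrogram-based CNN such as the baseline above (the hook-based implementation and layer choice are illustrative, not the project's exact code):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, spectrogram, conv_layer, target_class=None):
    """Compute a Grad-CAM heatmap over a (1, 1, n_mels, frames) spectrogram batch."""
    activations, gradients = {}, {}

    def forward_hook(_module, _inputs, output):
        activations["value"] = output

    def backward_hook(_module, _grad_in, grad_out):
        gradients["value"] = grad_out[0]

    fwd = conv_layer.register_forward_hook(forward_hook)
    bwd = conv_layer.register_full_backward_hook(backward_hook)
    try:
        model.eval()
        logits = model(spectrogram)
        if target_class is None:
            target_class = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, target_class].backward()

        # Weight each feature map by its average gradient, then ReLU and normalize.
        weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
        cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=spectrogram.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        fwd.remove()
        bwd.remove()
    return cam.squeeze().detach().cpu().numpy(), target_class

# Hypothetical usage with the SpectrogramCNN baseline sketched earlier:
# model = SpectrogramCNN()
# heatmap, cls = grad_cam(model, to_mel_spectrogram("clip.wav").unsqueeze(0), model.features[6])
```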
The best-performing model is exported and exposed via a web API using Flask or FastAPI. This server provides a `/predict` endpoint that accepts an uploaded `.wav` file and returns a JSON response containing the prediction (Real/Fake) along with a confidence score. This makes the model accessible for real-world applications and integrations.
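A minimal FastAPI sketch of such an endpoint (the model directory, 16 kHz preprocessing, and the label mapping of 1 = Real follow the assumptions made earlier in this README):

```python
import io

import librosa
import soundfile as sf
import torch
from fastapi import FastAPI, File, UploadFile
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

MODEL_DIR = "wav2vec2-deepfake-detector"  # assumed path to the exported fine-tuned model

app = FastAPI()
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_DIR)
model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Read the uploaded .wav file, downmix to mono, and resample to 16 kHz.
    raw = await file.read()
    data, sr = sf.read(io.BytesIO(raw), dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)
    if sr != 16000:
        data = librosa.resample(y=data, orig_sr=sr, target_sr=16000)

    inputs = feature_extractor(data, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

    label = "Real" if probs.argmax().item() == 1 else "Fake"
    return {"prediction": label, "confidence": round(probs.max().item(), 4)}
```

Saved as `app/server.py` (per the project structure), this could be served from the `app/` directory with `uvicorn server:app --host 0.0.0.0 --port 8000`.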
- Handling Large Datasets: Managing a 120GB audio dataset in a cloud environment like Google Colab required strategies like partial dataset loading and efficient I/O to avoid memory and storage limitations.
- GAN Training Stability: Achieving stable training for GANs is notoriously difficult. It required careful tuning of hyperparameters to prevent issues like mode collapse and ensure the generation of high-quality audio.
- Resource Constraints: Balancing the dataset size, model complexity, and training time with the memory and runtime limits of the available compute resources was a constant challenge.
- Model Interpretability: Applying XAI tools designed for images or text to audio data required custom adaptations to effectively visualize and explain the decisions of a speech model.
- High Classification Accuracy: The fine-tuned Wav2Vec2 model is expected to achieve high accuracy (above 90%) on benchmark datasets, correctly distinguishing between human speech and audio from GANs or Text-to-Speech (TTS) systems.
- Trained Models: The project delivers a fine-tuned Wav2Vec2 model and a trained CNN model, both ready for inference.
- Generated Audio Samples: A collection of GAN-generated fake audio samples created during the project.
- Explainable Visualizations: A set of visualizations from SHAP, LIME, and Grad-CAM that provide transparency into the model's decision-making process, which is crucial for forensic applications.
- Production-Ready Export: The final model artifacts are saved in a format suitable for deployment in production environments.
- Expand Dataset: Train the models on larger, more diverse, and multi-lingual datasets to improve generalization across different languages and accents.
- Adversarial Robustness: Conduct robustness testing against adversarial attacks, where subtle, imperceptible noise is added to audio to fool the classifier.
- Edge/Mobile Optimization: Optimize the model for deployment on resource-constrained devices like mobile phones or edge hardware.
- Compare SOTA Models: Benchmark the Wav2Vec2 classifier against other state-of-the-art audio models such as Whisper or AudioLM.
This project is open-source and available under the MIT License.