WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing
WavShape is an open-source framework implementing an information-theoretic approach to speech representation learning. It optimizes speech embeddings for fairness and privacy while preserving task-relevant information. By leveraging mutual information (MI) estimation through the Donsker-Varadhan (DV) formulation, WavShape systematically filters out sensitive attributes such as speaker identity, accent, and demographic details, so that downstream tasks (e.g., automatic speech recognition, emotion detection) remain robust and unbiased.
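The DV bound behind this estimation states that $I(X;Z) \ge \mathbb{E}_{P_{XZ}}[T(x,z)] - \log \mathbb{E}_{P_X P_Z}[e^{T(x,z)}]$ for any critic network $T$. Below is a minimal PyTorch sketch of such an estimator in the style of MINE; the network shape and names are illustrative, not WavShape's exact implementation (see src/models/ for that).

```python
import math

import torch
import torch.nn as nn

class DVCritic(nn.Module):
    """Statistics network T(x, z) for the Donsker-Varadhan bound (illustrative)."""
    def __init__(self, x_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_mi_lower_bound(critic, x, z):
    """I(X;Z) >= E_P[T] - log E_{P_X x P_Z}[exp T].

    Joint samples come from paired (x, z); marginal samples are
    approximated by shuffling z within the batch.
    """
    joint = critic(x, z).mean()
    z_marg = z[torch.randperm(z.size(0))]
    marginal = torch.logsumexp(critic(x, z_marg), dim=0) - math.log(z.size(0))
    return joint - marginal
```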
The increasing reliance on deep learning for speech processing raises concerns regarding privacy and bias. WavShape addresses these challenges by:
- Optimizing for Fairness & Privacy: Removing sensitive attributes from speech embeddings.
- Preserving Task-Relevant Information: Retaining essential acoustic and linguistic cues.
- Efficient Compression: Reducing speech representation dimensionality without significant performance loss.
Inspired by information theory and large pretrained speech models (e.g., Whisper, wav2vec 2.0), WavShape combines a frozen speech encoder with a trainable MI-based projection layer that filters out unwanted information (see the sketch below).
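A sketch of that two-stage pipeline, assuming the Hugging Face transformers Whisper encoder as the frozen front end; the checkpoint, mean-pooling, and projection dimensions are illustrative choices, not prescribed by WavShape:

```python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

# Frozen feature-extraction layer (non-trainable).
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base")
whisper.requires_grad_(False)

# Trainable information-theoretic embedding layer (dimensions illustrative).
proj = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 64))

def embed(waveform):
    """Raw 16 kHz audio (1-D numpy array) -> compressed embedding."""
    inputs = fe(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        feats = whisper.encoder(inputs.input_features).last_hidden_state
    return proj(feats.mean(dim=1))  # mean-pool over time, then project
```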
```
WavShape/
├── README.md           # This file.
├── docs/               # Documentation.
├── src/                # Source code.
│   ├── models/         # Encoder, MI evaluator, etc.
│   ├── data/           # Data handling scripts.
│   ├── training/       # Training routines.
│   ├── evaluation/     # Evaluation scripts.
│   └── utils/          # Helper functions.
├── experiments/        # Experiment configurations, logs.
├── requirements.txt    # Python dependencies.
└── LICENSE             # License file.
```
Key features:
- Fair & Privacy-Aware Speech Encoding
- Task-Oriented Embedding Retention
- Mutual Information Optimization
- Modular & Extensible Architecture
- Comprehensive Experimental Validation
Requirements:
- Python 3.8+
- PyTorch 1.9+
- CUDA-enabled GPU (recommended for training)
Clone the repository and install the dependencies:

```bash
git clone https://github.com/wavshapeinterspeech25/wavshape.git
cd wavshape
pip install -r requirements.txt
```
Supported datasets:
- Mozilla Common Voice (MCV): Fairness evaluation in ASR.
- Google Speech Commands (GSC): Privacy-aware keyword spotting.
- MusicNet: Structured spectral learning in music recordings.
Prepare datasets as specified in src/data/.
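As a quick start outside the repo's own scripts, Google Speech Commands can be fetched directly with torchaudio (the download path is illustrative); the canonical preprocessing lives in src/data/:

```python
import torchaudio

# Downloads Google Speech Commands v0.02 (a few GB) into ./data.
gsc = torchaudio.datasets.SPEECHCOMMANDS("./data", download=True)

waveform, sample_rate, label, speaker_id, utterance_number = gsc[0]
print(label, speaker_id, tuple(waveform.shape), sample_rate)  # 1 s clips at 16 kHz
```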
Run training with:

```bash
python src/training/train_wavshape.py --config experiments/config.yaml
```

During training:
- Encoder Training: Updates encoder weights to maximize MI with task-relevant features while suppressing sensitive attributes.
- MI Estimation: Iteratively refines the MI estimates that drive this objective (a schematic step is sketched below).
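A schematic of one such step, reusing `DVCritic`, `dv_mi_lower_bound`, `embed`, and `proj` from the sketches above. The critics here take one-hot (float) labels, and the data loader, class counts, learning rates, and trade-off weight `lam` are all illustrative assumptions, not values from the paper:

```python
import itertools
import torch

task_critic = DVCritic(x_dim=10, z_dim=64)  # 10 = number of task classes (illustrative)
sens_critic = DVCritic(x_dim=2, z_dim=64)   # 2 = binary sensitive attribute
critic_opt = torch.optim.Adam(
    itertools.chain(task_critic.parameters(), sens_critic.parameters()), lr=1e-4)
encoder_opt = torch.optim.Adam(proj.parameters(), lr=1e-4)
lam = 1.0  # privacy/utility trade-off weight (illustrative)

for audio, y_onehot, s_onehot in loader:  # loader yields audio plus one-hot labels
    # 1) MI estimation: tighten both DV bounds on the current embeddings.
    z = embed(audio).detach()
    critic_loss = -(dv_mi_lower_bound(task_critic, y_onehot, z)
                    + dv_mi_lower_bound(sens_critic, s_onehot, z))
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 2) Encoder training: keep task MI high, suppress sensitive-attribute MI.
    z = embed(audio)
    loss = (-dv_mi_lower_bound(task_critic, y_onehot, z)
            + lam * dv_mi_lower_bound(sens_critic, s_onehot, z))
    encoder_opt.zero_grad(); loss.backward(); encoder_opt.step()
```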
Evaluate a trained model:

```bash
python src/evaluation/evaluate.py --model_path experiments/best_model.pth --dataset common_voice
```

Evaluation reports:
- t-SNE Visualizations
- AUROC Scores
- Classification Accuracy Before & After Encoding
Visualize the learned embeddings with t-SNE:

```bash
python src/evaluation/visualize_tsne.py --embeddings experiments/embeddings.npy
```
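For a quick look without the helper script, scikit-learn's TSNE can be run directly on the same saved embeddings (output file name illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = np.load("experiments/embeddings.npy")            # (N, d) embedding matrix
xy = TSNE(n_components=2, perplexity=30).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], s=4)
plt.title("WavShape embeddings (t-SNE)")
plt.savefig("tsne.png", dpi=150)
```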
WavShape consists of two main components:
- Feature Extraction Layer (Non-trainable): Converts raw audio into an intermediate representation using Whisper/wav2vec 2.0.
- Information-Theoretic Embedding Layer (Trainable): Maps extracted features to a lower-dimensional space, ensuring fairness and privacy.
An auxiliary MI estimator:
- Estimates MI using the Donsker-Varadhan formulation (see the sketch near the top of this README).
- Guides encoder training to balance utility and privacy (the objective is written out below).
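In symbols (our notation, not necessarily the paper's exact formulation): with embedding $Z_\theta$, task label $Y$, sensitive attribute $S$, and trade-off weight $\lambda$, the trainable layer is driven toward

$$\max_{\theta}\; I(Z_\theta; Y)\;-\;\lambda\, I(Z_\theta; S),$$

with both MI terms estimated through the DV bound above.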
Experimental configurations:
- Mozilla Common Voice (MCV): Task label = age, sensitive attributes = gender, accent.
- Google Speech Commands (GSC): Task label = general command category, filtered labels = granular commands.
- MusicNet: Task label = instrument labels, filtered metadata = composer, movement.
Evaluation protocol:
- Feature extraction with Whisper.
- Classification with lightweight neural networks.
- t-SNE for high-dimensional visualization.
- AUROC to quantify sensitive-attribute leakage (a minimal probe is sketched below).
- Classification accuracy comparison before and after encoding.
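A minimal version of that leakage probe, assuming embeddings and a binary sensitive attribute saved as .npy files (both file names hypothetical); AUROC near 0.5 means the attribute is close to unrecoverable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

emb = np.load("experiments/embeddings.npy")        # (N, d) embeddings
s = np.load("experiments/sensitive_labels.npy")    # (N,) binary attribute, e.g. gender

X_tr, X_te, s_tr, s_te = train_test_split(emb, s, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
auroc = roc_auc_score(s_te, probe.predict_proba(X_te)[:, 1])
print(f"sensitive-attribute AUROC: {auroc:.3f}")   # ~0.5 = little leakage
```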
Key results:
- Privacy & Fairness: MI between embeddings and sensitive attributes reduced from 0.6208 to 0.2671 while retaining task MI at 0.3791.
- Efficient Compression: Up to 95.5% accuracy maintained even at reduced dimensionality.
- Bias Mitigation: Sensitive-attribute AUROC reduced to 0.47 (near the 0.5 chance level), demonstrating fairness improvements.
If you use WavShape in your research, please cite:
```bibtex
@inproceedings{wavshape2025,
  title={WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing},
  author={Baser, Oguzhan and Tanriverdi, Ahmet E and Kale, Kaan and Chinchali, Sandeep P and Vishwanath, Sriram},
  booktitle={Proc. Interspeech 2025},
  year={2025}
}
```
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or contributions, please contact: