Translatotron: Direct Speech-to-Speech Translation

This project implements the Translatotron model, a sequence-to-sequence approach for direct speech-to-speech translation without relying on intermediate text representations. The implementation is based on the paper "Direct speech-to-speech translation with a sequence-to-sequence model" by Ye Jia et al.

Introduction

Translatotron is an end-to-end model that directly translates speech from one language to another without intermediate text representations. This implementation provides a PyTorch version of the model along with utilities for training and audio playback.

Features

  • Direct speech-to-speech translation
  • Encoder-decoder architecture with attention
  • Auxiliary decoders for multi-task learning
  • Griffin-Lim algorithm for waveform reconstruction (sketched below)
  • Utility functions for audio playback and saving
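
As a minimal sketch of the Griffin-Lim step, here is how a spectrogram can be converted to a waveform with torchaudio. Whether the repository uses this exact transform, and the dimensions shown here, are assumptions for illustration:

import torch
import torchaudio

# Hypothetical spectrogram: (n_fft // 2 + 1) frequency bins x 200 frames
n_fft = 1024
spectrogram = torch.rand(n_fft // 2 + 1, 200)

# Griffin-Lim iteratively estimates phase from magnitudes to recover audio
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, n_iter=32)
waveform = griffin_lim(spectrogram)  # shape: (num_samples,)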

Requirements

  • Python 3.7+
  • PyTorch 1.7+
  • torchaudio
  • numpy
  • soundfile
  • IPython (for notebook environments)

Installation

  1. Clone this repository:

    git clone https://github.com/abdouaziz/translatotron.git
    cd translatotron
  2. Install the required packages:

    pip install -r requirements.txt

Usage

Here's a basic example of how to use the Translatotron model:

from translatotron import Translatotron, play_audio
import torch

# Initialize the model
model = Translatotron()

# Create a dummy input (replace with your actual input)
dummy_input = torch.randn(1, 100, 80)  # (batch_size, sequence_length, input_features)

# Forward pass: returns the synthesized waveform plus the outputs
# of the two auxiliary decoders
waveform, aux_source, aux_target = model(dummy_input)

# Play or save the generated audio
play_audio(waveform[0], filename="output.wav")

Model Architecture

The Translatotron model consists of:

  1. Encoder: Bidirectional LSTM
  2. Decoder: LSTM
  3. Auxiliary Decoder: For multi-task learning
  4. Spectrogram Generator: Linear projection
  5. Waveform Generator: Griffin-Lim algorithm

For the detailed architecture, refer to the original paper: Direct speech-to-speech translation with a sequence-to-sequence model (Jia et al., 2019).
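
To make the encoder concrete, here is a minimal sketch of a bidirectional LSTM encoder in PyTorch; the layer sizes are illustrative assumptions, not the repository's defaults:

import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Illustrative bidirectional-LSTM encoder (sizes are assumptions)."""

    def __init__(self, input_dim=80, hidden_dim=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                            batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, time, input_dim) -> (batch, time, 2 * hidden_dim)
        outputs, _ = self.lstm(x)
        return outputs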

Training

To train the model:

  1. Prepare your dataset of paired speech in source and target languages.
  2. Implement a custom Dataset class for your data (a sketch follows this list).
  3. Define loss functions for waveform reconstruction and the auxiliary tasks (a sketch follows the training loop below).
  4. Set up a training loop with appropriate optimizers.
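
For step 2, a minimal Dataset sketch, assuming spectrograms have already been computed; the class name and dictionary keys are this sketch's own, chosen to match the training loop below:

from torch.utils.data import Dataset

class SpeechPairDataset(Dataset):
    """Hypothetical dataset of precomputed (source, target) spectrogram pairs."""

    def __init__(self, pairs):
        # pairs: list of (source_spectrogram, target_spectrogram) tensor tuples
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        source, target = self.pairs[idx]
        return {"input": source, "target": target}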

Example training loop (pseudo-code):

# ... (set up model, optimizer, and data loader)

for epoch in range(num_epochs):
    for batch in data_loader:
        optimizer.zero_grad()
        waveform, aux_source, aux_target = model(batch['input'])
        loss = compute_loss(waveform, aux_source, aux_target, batch['target'])
        loss.backward()
        optimizer.step()

    # Run validation and save a checkpoint here
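
compute_loss is left undefined above. One plausible composition for the multi-task setup; the L1 waveform loss, the optional phoneme targets, and the weighting are all assumptions of this sketch:

import torch.nn.functional as F

def compute_loss(waveform, aux_source, aux_target, target_waveform,
                 source_phonemes=None, target_phonemes=None, aux_weight=0.1):
    # Main loss: reconstruction of the target speech (the paper scores
    # spectrograms; plain L1 on waveforms is a simplification here)
    loss = F.l1_loss(waveform, target_waveform)

    # Auxiliary losses: cross-entropy on the phoneme decoders, assuming
    # logits of shape (batch, time, num_phonemes) and integer targets
    if source_phonemes is not None:
        loss = loss + aux_weight * F.cross_entropy(
            aux_source.transpose(1, 2), source_phonemes)
    if target_phonemes is not None:
        loss = loss + aux_weight * F.cross_entropy(
            aux_target.transpose(1, 2), target_phonemes)
    return loss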

Evaluation

Evaluate the model using:

  1. Speech recognition on the translated output, scoring the transcripts against reference translations (see the sketch below)
  2. Human evaluation of translation quality
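
A sketch of the ASR-based check, scoring transcripts with sacrebleu (an extra dependency this sketch assumes); the transcribe() hook is hypothetical and stands in for whatever recognizer you use:

import sacrebleu

def transcribe(waveform):
    """Hypothetical ASR hook: plug in any speech recognizer here."""
    raise NotImplementedError

def asr_bleu(translated_waveforms, reference_texts):
    # Transcribe each translated utterance, then score against the references
    hypotheses = [transcribe(w) for w in translated_waveforms]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score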

Audio Playback

The play_audio function provides a convenient way to play or save generated audio:

play_audio(waveform, sample_rate=24000, filename="output.wav")

This function works in Jupyter notebooks, Google Colab, and standard Python scripts.
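
The helper's implementation is not shown here, but a sketch of how such a function can cover all three environments, using soundfile to save and IPython's inline player when available (the name save_and_play is this sketch's own):

import soundfile as sf

def save_and_play(waveform, sample_rate=24000, filename="output.wav"):
    # Accept either a torch tensor or a NumPy array
    data = waveform.detach().cpu().numpy() if hasattr(waveform, "detach") else waveform
    sf.write(filename, data, sample_rate)  # save to disk
    try:
        # In Jupyter or Colab, return an inline audio player
        from IPython.display import Audio
        return Audio(data, rate=sample_rate)
    except ImportError:
        # In a plain script, the saved file can be played externally
        return None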

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
