This project implements the Translatotron model, a sequence-to-sequence approach for direct speech-to-speech translation without relying on intermediate text representations. The implementation is based on the paper "Direct speech-to-speech translation with a sequence-to-sequence model" by Ye Jia et al.
Translatotron is an end-to-end model that maps source-language speech directly to target-language speech. This implementation provides a PyTorch version of the model along with utilities for training and audio playback.
- Direct speech-to-speech translation
- Encoder-decoder architecture with attention
- Auxiliary decoders for multi-task learning
- Griffin-Lim algorithm for waveform reconstruction
- Utility functions for audio playback and saving
- Python 3.7+
- PyTorch 1.7+
- torchaudio
- numpy
- soundfile
- IPython (for notebook environments)
- Clone this repository:

  ```bash
  git clone https://github.com/abdouaziz/translatotron.git
  cd translatotron
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
Here's a basic example of how to use the Translatotron model:
```python
from translatotron import Translatotron, play_audio
import torch

# Initialize the model
model = Translatotron()

# Create a dummy input (replace with your actual input)
dummy_input = torch.randn(1, 100, 80)  # (batch_size, sequence_length, input_features)

# Forward pass
waveform, aux_source, aux_target = model(dummy_input)

# Play or save the generated audio
play_audio(waveform[0], filename="output.wav")
```

The Translatotron model consists of:
- Encoder: Bidirectional LSTM
- Decoder: LSTM
- Auxiliary Decoder: For multi-task learning
- Spectrogram Generator: Linear projection
- Waveform Generator: Griffin-Lim algorithm
For the detailed architecture, refer to the original paper, "Direct speech-to-speech translation with a sequence-to-sequence model".
To train the model:
- Prepare your dataset of paired speech in source and target languages.
- Implement a custom `Dataset` class for your data.
- Define loss functions for waveform reconstruction and the auxiliary tasks.
- Set up a training loop with appropriate optimizers.
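The `Dataset` step above can be sketched as follows. The field names (`input`, `target`) match the training loop below, but the class and its shapes are a hypothetical example, not part of this repository.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SpeechPairDataset(Dataset):
    """Hypothetical paired-spectrogram dataset; names and shapes are assumptions."""

    def __init__(self, source_feats, target_feats):
        self.source_feats = source_feats
        self.target_feats = target_feats

    def __len__(self):
        return len(self.source_feats)

    def __getitem__(self, idx):
        return {"input": self.source_feats[idx],
                "target": self.target_feats[idx]}

# Dummy data: 4 utterances, 100/120 frames of 80 mel bins each
src = torch.randn(4, 100, 80)
tgt = torch.randn(4, 120, 80)
dataset = SpeechPairDataset(src, tgt)
loader = DataLoader(dataset, batch_size=2)
batch = next(iter(loader))
print(batch["input"].shape)
```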
Example training loop (pseudo-code):

```python
# ... (set up model, optimizer, data loader)
for epoch in range(num_epochs):
    for batch in data_loader:
        optimizer.zero_grad()
        waveform, aux_source, aux_target = model(batch['input'])
        loss = compute_loss(waveform, aux_source, aux_target, batch['target'])
        loss.backward()
        optimizer.step()
    # Validation and checkpointing go here
```

Evaluate the model using:
- Speech recognition on the translated output
- Human evaluation of translation quality
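The ASR-based evaluation above can be sketched as a two-step pipeline: transcribe the translated audio with a recognizer, then score transcripts against references. `transcribe` below is a placeholder for a real ASR system, and the simple token-overlap score stands in for a proper metric such as BLEU, so the sketch stays self-contained.

```python
def transcribe(waveform):
    # Placeholder for a real ASR system (e.g. a pretrained recognizer).
    return "hello world"

def token_overlap(hypothesis, reference):
    # Crude unigram-precision stand-in for BLEU (illustrative only)
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum(min(hyp.count(t), ref.count(t)) for t in set(hyp))
    return matches / max(len(hyp), 1)

hypothesis = transcribe(None)  # would take the generated waveform
score = token_overlap(hypothesis, "hello there world")
print(score)
```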
The `play_audio` function provides a convenient way to play or save generated audio:

```python
play_audio(waveform, sample_rate=24000, filename="output.wav")
```

This function works in Jupyter notebooks, Google Colab, and standard Python scripts.
Contributions are welcome! Please feel free to submit a Pull Request.
