Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.
- 2023-07-26: We have released our training recipe for real-time AV-ASR, see here.
- 2023-06-16: We have released our training recipe for AutoAVSR, see here.
- 2023-03-27: We have released our AutoAVSR models for LRS3, see here.
This is the repository for Visual Speech Recognition for Multiple Languages, which is the successor of End-to-End Audio-Visual Speech Recognition with Conformers. By using this repository, you can achieve a WER of 19.1%, 1.0%, and 0.9% for visual, audio, and audio-visual speech recognition (VSR, ASR, and AV-ASR) on LRS3.
We provide a tutorial to show how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs or extract visual speech features.
Demo clips: English -> Mandarin -> Spanish, and French -> Portuguese -> Italian.
- Clone the repository and enter it locally:

```shell
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
```

- Set up the environment:

```shell
conda create -y -n autoavsr python=3.8
conda activate autoavsr
```

- Install pytorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:

```shell
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
```

- Download and extract a pre-trained model and/or language model from the model zoo to:

  - `./benchmarks/${dataset}/models`
  - `./benchmarks/${dataset}/language_models`
- [For VSR and AV-ASR] Install RetinaFace or MediaPipe tracker.
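If you opt for the MediaPipe tracker, a minimal install sketch is shown below; it assumes the stock `mediapipe` package from PyPI is sufficient, whereas RetinaFace should be installed by following its own instructions.

```shell
# Assumption: the stock PyPI mediapipe package is all that is needed for the MediaPipe tracker;
# follow the RetinaFace instructions instead if you prefer that detector.
pip install mediapipe
```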
```shell
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
```

- `[config_filename]` is the model configuration path, located in `./configs`.
- `[labels_filename]` is the labels path, located in `${lipreading_root}/benchmarks/${dataset}/labels`.
- `[data_dir]` and `[landmarks_dir]` are the directories of the original dataset and the corresponding landmarks.
- `gpu_idx=-1` can be added to switch from `cuda:0` to `cpu`.
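As a concrete illustration, an LRS3 evaluation run might look like the sketch below; the bracketed values are placeholders for files from your own setup, not shipped file names.

```shell
# Illustrative only: bracketed values are placeholders from your own setup.
python eval.py config_filename=./configs/[config_filename] \
               labels_filename=./benchmarks/LRS3/labels/[labels_filename] \
               data_dir=[LRS3_data_dir] \
               landmarks_dir=[LRS3_landmarks_dir] \
               gpu_idx=-1   # optional: run on CPU instead of cuda:0
```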
```shell
python infer.py config_filename=[config_filename] data_filename=[data_filename]
```

- `data_filename` is the path to the audio/video file.
- `detector=mediapipe` can be added to switch from RetinaFace to the MediaPipe tracker.
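For example, a single-clip run might look like the sketch below; the bracketed values are placeholders.

```shell
# Illustrative only: bracketed values are placeholders.
python infer.py config_filename=./configs/[config_filename] \
                data_filename=[path_to_audio_or_video_file] \
                detector=mediapipe   # optional: use MediaPipe instead of RetinaFace
```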
```shell
python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
```

- `dst_filename` is the path where the cropped mouth will be saved.
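A minimal usage sketch, with hypothetical file names:

```shell
# Illustrative only: the input clip and output path are hypothetical.
python crop_mouth.py data_filename=clips/speaker1.mp4 dst_filename=clips/speaker1_mouth.mp4
```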
We support a number of datasets for speech recognition:
- Lip Reading Sentences 2 (LRS2)
- Lip Reading Sentences 3 (LRS3)
- Chinese Mandarin Lip Reading (CMLR)
- CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)
- GRID
- Lombard GRID
- TCD-TIMIT
AutoAVSR models for Lip Reading Sentences 3 (LRS3)
| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | |||
| - | 19.1 | GoogleDrive or BaiduDrive(key: dqsy) | 891 |
| Audio-only | |||
| - | 1.0 | GoogleDrive or BaiduDrive(key: dvf2) | 860 |
| Audio-visual | |||
| - | 0.9 | GoogleDrive or BaiduDrive(key: sai5) | 1540 |
| Language models | |||
| - | - | GoogleDrive or BaiduDrive(key: t9ep) | 191 |
| Landmarks | |||
| - | - | GoogleDrive or BaiduDrive(key: mi3c) | 18577 |
VSR models for multiple languages

Lip Reading Sentences 2 (LRS2)
| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | |||
| - | 26.1 | GoogleDrive or BaiduDrive(key: 48l1) | 186 |
| Language models | |||
| - | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | |||
| - | - | GoogleDrive or BaiduDrive(key: 53rc) | 9358 |
Lip Reading Sentences 3 (LRS3)
| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | |||
| - | 32.3 | GoogleDrive or BaiduDrive(key: 1b1s) | 186 |
| Language models | |||
| - | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | |||
| - | - | GoogleDrive or BaiduDrive(key: mi3c) | 18577 |
Chinese Mandarin Lip Reading (CMLR)
| Components | CER | url | size (MB) |
|---|---|---|---|
| Visual-only | |||
| - | 8.0 | GoogleDrive or BaiduDrive(key: 7eq1) | 195 |
| Language models | |||
| - | - | GoogleDrive or BaiduDrive(key: k8iv) | 187 |
| Landmarks | |||
| - | - | GoogleDrive or BaiduDrive(key: 1ret) | 3721 |
CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)
| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | |||
| Spanish | 44.5 | GoogleDrive or BaiduDrive(key: m35h) | 186 |
| Portuguese | 51.4 | GoogleDrive or BaiduDrive(key: wk2h) | 186 |
| French | 58.6 | GoogleDrive or BaiduDrive(key: t1hf) | 186 |
| Language models | |||
| Spanish | - | GoogleDrive or BaiduDrive(key: 0mii) | 180 |
| Portuguese | - | GoogleDrive or BaiduDrive(key: l6ag) | 179 |
| French | - | GoogleDrive or BaiduDrive(key: 6tan) | 179 |
| Landmarks | |||
| - | - | GoogleDrive or BaiduDrive(key: vsic) | 3040 |
GRID
| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | |||
| Overlapped | 1.2 | GoogleDrive or BaiduDrive(key: d8d2) | 186 |
| Unseen | 4.8 | GoogleDrive or BaiduDrive(key: ttsh) | 186 |
| Landmarks | |||
| - | - | GoogleDrive or BaiduDrive(key: 16l9) | 1141 |
You can include `data_ext=.mpg` in your command line to match the video file extension in the GRID dataset.
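For instance, a GRID evaluation run might add the override as sketched below; the bracketed values are placeholders.

```shell
# Illustrative only: bracketed values are placeholders; data_ext matches GRID's .mpg videos.
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[GRID_data_dir] \
               landmarks_dir=[GRID_landmarks_dir] \
               data_ext=.mpg
```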
Lombard GRID
| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | |||
| Unseen (Front Plain) | 4.9 | GoogleDrive or BaiduDrive(key: 38ds) | 186 |
| Unseen (Side Plain) | 8.0 | GoogleDrive or BaiduDrive(key: k6m0) | 186 |
| Landmarks | |||
| - | - | GoogleDrive or BaiduDrive(key: cusv) | 309 |
You can include `data_ext=.mov` in your command line to match the video file extension in the Lombard GRID dataset.
TCD-TIMIT
| Components | WER | url | size (MB) |
|---|---|---|---|
| Visual-only | |||
| Overlapped | 16.9 | GoogleDrive or BaiduDrive(key: jh65) | 186 |
| Unseen | 21.8 | GoogleDrive or BaiduDrive(key: n2gr) | 186 |
| Language models | |||
| - | - | GoogleDrive or BaiduDrive(key: 59u2) | 180 |
| Landmarks | |||
| - | - | GoogleDrive or BaiduDrive(key: bnm8) | 930 |
If you use the AutoAVSR models training code, please consider citing the following paper:
```bibtex
@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
  year={2023},
}
```

If you use the VSR models for multiple languages, please consider citing the following paper:
```bibtex
@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}
```

Note that the code can only be used for comparative or benchmarking purposes, and may only be used under the supplied License for non-commercial purposes.
[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)


