DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video (AAAI2023)

Paper demo video Supplementary materials

🤔 How to achive this boost in inference latency?

To achieve this, several changes were implemented:

Removed DeepSpeech and utilized wav2vec for instant feature extraction, leveraging the speed and power of torch.
Trained a lightweight model to map the wav2vec features to DeepSpeech, maintaining the existing process.
Enhanced frames extraction for improved speed.
These adjustments contribute to a reduction of up to 60% in inference latency compared to the original implementation, all while maintaining quality.

Additionally, Docker has been introduced to facilitate faster, simpler, and more automated facial landmarks extraction.

Tested on:

Windows 11
Python version >= 3.9

📖 Prerequisites

To get started, follow these steps:

Download the resources (asserts.zip) n Google drive. Unzip the file and place the directory in the current directory (./). This zip file includes the model for mapping wav2vec to deepspeech, beside all other models.

Install Instructions

Set up a Conda environment by executing the following commands.

  conda create -n dinet python=3.9
  conda activate dinet

Clone repository

  git clone https://github.com/illeng/DINet_optimized_Win.git
  cd DINet

Install Dependencies

  pip install -r requirements.txt

Install torch 1.11.0

  pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 -f https://download.pytorch.org/whl/torch_stable.html

Install tensorflow 2.5.0

  pip install tensorflow==2.5.0

Installing pysoundfile

  conda install -c conda-forge pysoundfile

🚀 Inference

Run inference with example videos:

python inference.py --mouth_region_size=256 --source_video_path=./asserts/examples/testxxx.mp4 --source_openface_landmark_path=./asserts/examples/testxxx.csv --driving_audio_path=./asserts/examples/driving_audio_xxx.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth

Use OpenFace to detect smooth facial landmarks of your custom video..

Acknowledge

The AdaAT is borrowed from AdaAT. The deepspeech feature is borrowed from AD-NeRF. The basic module is borrowed from first-order. Thanks for their released code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video (AAAI2023)

🤔 How to achive this boost in inference latency?

📖 Prerequisites

Install Instructions

🚀 Inference

Run inference with example videos:

Acknowledge

Files

README.md

Latest commit

History

README.md

File metadata and controls

DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video (AAAI2023)

🤔 How to achive this boost in inference latency?

📖 Prerequisites

Install Instructions

🚀 Inference

Run inference with example videos:

Acknowledge