
RapFlow-TTS: Rapid High-Fidelity Text-to-Speech with Improved Consistency Flow Matching (Interspeech 2025)

This repository is the official implementation of RapFlow-TTS: Rapid High-Fidelity Text-to-Speech with Improved Consistency Flow Matching.

In this repository, we provide steps for running RapFlow-TTS.

πŸ™ We recommend you visit our demo site. πŸ™

RapFlow-TTS is an ODE-based TTS model that synthesizes high-quality speech in fewer steps using improved consistency flow matching. The overall architecture of RapFlow-TTS is shown below:

Figure: Overall architecture of RapFlow-TTS
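As a rough illustration of why only a few function evaluations (NFE) are needed at inference, the sketch below shows generic few-step Euler integration of a flow-matching ODE from noise to a mel-spectrogram. Here, velocity_fn is a hypothetical stand-in for the learned text-conditioned velocity field, not the repository's actual API.

import torch

def euler_sample(velocity_fn, x0, cond, n_timesteps=2):
    """Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (mel) with a few Euler steps."""
    x, dt = x0, 1.0 / n_timesteps
    for i in range(n_timesteps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity_fn(x, t, cond)  # one Euler step along the learned flow
    return x

# With a straightened, consistent flow, even n_timesteps=2 can yield high-quality mels.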

Requirements & Installation

Clone this repository

git clone https://github.com/naver-ai/RapFlow-TTS.git
cd RapFlow-TTS/

Set up your virtual environment. You can use other Python versions.

conda create -n rapflow python=3.9
conda activate rapflow

or

python3.9 -m venv venv
source venv/bin/activate

To install requirements:

# you can use other recent torch versions (we used torch==1.11.0 with CUDA 11.3)
pip install -r requirements.txt

For the MAS (monotonic alignment search) algorithm, run the code below.

cd ./model/monotonic_align; python setup.py build_ext --inplace

For the pre-trained HiFi-GAN vocoders (LJSpeech and universal), download and unzip the weights, then place them in hifigan/weights/LJSpeech or hifigan/weights/universal.

Prepare datasets

We use the LJSpeech and VCTK datasets.

  • The LJSpeech dataset can be downloaded here.

  • The VCTK dataset (trimmed version) can be downloaded here.

  • We follow the train, valid, and test set split of Grad-TTS for the LJSpeech dataset.

  • We use the random set split for the VCTK dataset.

Preprocess

For the set split and text preprocessing, run the following code with the corresponding config (LJSpeech or VCTK):

python ./preprocess/preprocess.py --config ./config/{dataset}/preprocess.yaml

Example:
    python ./preprocess/preprocess.py --config ./config/LJSpeech/preprocess.yaml
    python ./preprocess/preprocess.py --config ./config/VCTK/preprocess.yaml

The code yields two metadata lists for training.

{train, valid, test}.txt :
    Path|Text|Speaker

cleaned_{train, valid, test}.txt :
    Path|Text|Phoneme|Speaker
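For reference, each line in these files is pipe-separated in the field order shown above. A minimal parsing sketch (the function name and example path are illustrative, not part of the repository):

def load_metadata(path, cleaned=True):
    """Parse a pipe-separated metadata list produced by the preprocessing step."""
    fields = ["path", "text", "phoneme", "speaker"] if cleaned else ["path", "text", "speaker"]
    with open(path, encoding="utf-8") as f:
        return [dict(zip(fields, line.rstrip("\n").split("|"))) for line in f]

# e.g., entries = load_metadata("cleaned_train.txt", cleaned=True)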

To obtain the statistics for mel normalization, run the following code.

  • When you run the code below for the first time, model.data_stats should be set to [0, 1].

  • Then update model.data_stats in the model configuration with the computed statistics.

  • In practice, we use mel normalization only for the LJSpeech dataset.

python get_mel_stats.py --config ./config/{dataset}/base_stage1.yaml

Example:
    python get_mel_stats.py --config ./config/LJSpeech/base_stage1.yaml

Configurations:
    ├── path
    ├── preprocess
    ├── model
    │     ├── data_stats: [mean, std], [0, 1] --> [new stats]
    ├── train
    ├── test
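The data_stats entry above amounts to a standard standardization of the mel-spectrogram. A minimal sketch (the statistics are placeholders; use the values computed by get_mel_stats.py):

import torch

# Placeholder statistics: replace with the [mean, std] computed by get_mel_stats.py.
mel_mean, mel_std = 0.0, 1.0

def normalize_mel(mel: torch.Tensor) -> torch.Tensor:
    return (mel - mel_mean) / mel_std

def denormalize_mel(mel_norm: torch.Tensor) -> torch.Tensor:
    return mel_norm * mel_std + mel_mean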

Training

To train RapFlow-TTS from scratch, run the following code.

  • If you want to change training options such as num_worker, the CUDA device, and so on, check argument.py.

  • If you want to edit the model or training settings, check config/{dataset}/base_stage{1,2,3}.yaml.

  • The training process consists of 3 stages (straight flow; straight flow & consistency; straight flow & consistency & adversarial learning).

Configurations:
    ├── path
    ├── preprocess
    ├── model
    │     ├── encoder
    │     ├── decoder
    │     ├── cfm
    │     ├── gan
    ├── train
    ├── test

Total Stage

To train all stages, please refer to train.sh. This repository supports multi-GPU training.

######## LJSpeech ##########
# stage 1
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 --master_port=29500 train_multi.py --config config/LJSpeech/base_stage1.yaml --num_worker 16 

# stage 2 (improved)
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 --master_port=29500 train_multi.py --config config/LJSpeech/base_stage2_ict.yaml --num_worker 16 

# stage 3 (improved)
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 --master_port=29500 train_multi_adv.py --config config/LJSpeech/base_stage3_ict.yaml --num_worker 16 


######## VCTK ##########
# stage 1
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi.py --config config/VCTK/base_stage1.yaml --num_worker 16 

# stage 2 (improved)
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi.py --config config/VCTK/base_stage2_ict.yaml --num_worker 16 

# stage 3 (improved)
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi_adv.py --config config/VCTK/base_stage3_ict.yaml --num_worker 16 

Stage 1

For stage 1, run the code below:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi.py --config config/{dataset}/base_stage1.yaml --num_worker 16

Configurations:
    ├── path
    ├── preprocess
    ├── model
    │     ├── boundary: 0.0
    ├── train
    ├── test

Stage 2

For stage 2, run the code below.

  • Check that stage is set to 2 in the configuration.

  • prev_stage_ckpt should be set to the checkpoint path from stage 1 training.

  • This initializes the stage 2 model with the stage 1 weights (see the sketch after the configuration tree below).

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi.py --config config/{dataset}/base_stage2_ict.yaml --num_worker 16

Configurations:
    ├── path
    ├── preprocess
    ├── model
    │     ├── boundary: 0.9 # consistency training on (you can also use 1.0)
    ├── train
    │     ├── stage: 2
    │     ├── prev_stage_ckpt: {path for stage1}
    ├── test
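Conceptually, this initialization just loads the previous-stage state dict into the model before training continues. A minimal PyTorch sketch (the checkpoint layout and the "model" key are assumptions, not the repository's exact format):

import torch

def init_from_prev_stage(model: torch.nn.Module, ckpt_path: str) -> torch.nn.Module:
    """Initialize the current stage's model from a previous-stage checkpoint."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt  # assumed layout
    model.load_state_dict(state_dict)  # assumes matching architectures across stages
    return model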

Stage 3

For stage 3, run the code below.

  • Check that stage is set to 3 in the configuration.

  • prev_stage_ckpt should be set to the checkpoint path from stage 2 training.

  • This initializes the stage 3 model with the stage 2 weights.

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi_adv.py --config config/{dataset}/base_stage3_ict.yaml --num_worker 16

Configurations:
    ├── path
    ├── preprocess
    ├── model
    │     ├── boundary: 0.9 # consistency training on (you can also use 1.0)
    ├── train
    │     ├── stage: 3
    │     ├── prev_stage_ckpt: {path for stage2}
    ├── test
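Stage 3 adds an adversarial term on top of the stage 2 objective. As an illustration only, the hinge-style losses below show the general shape of such a term; the repository adapts StyleTTS's discriminator and adversarial loss, which may differ in detail.

import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Hinge loss for the discriminator (illustrative, not the repository's exact formulation)."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def generator_adv_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Adversarial term added to the TTS (generator) objective."""
    return -d_fake.mean()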

For stage 3, we train for an additional 150 and 50 epochs on the LJSpeech and VCTK datasets, respectively. These epoch counts were decided empirically, considering the fluctuation of the training discriminator loss. Using checkpoints saved after the loss fluctuation has increased may result in poor synthesis quality; thus, the loss should be monitored when selecting adversarial learning checkpoints.


Figure: Training Discriminator Loss Comparison (LJSpeech and VCTK Datasets)

Evaluation

You can synthesize the test set and check the Word Error Rate (WER) of the synthesized samples by running the code below.

The model and configuration are loaded based on the weight_path option.

python test.py --weight_path {weight_path} --model_name {Output folder name} --n_timesteps {NFE} --weight_name {weight name}

Example:
    python test.py --weight_path ./checkpoints/RapFlow-TTS-LJS-Stage3-Improved --model_name RapFlow-TTS --n_timesteps 2 --weight_name model-last
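test.py handles this end to end. As a rough illustration of the idea, WER can be computed by transcribing the synthesized wavs with an ASR model and comparing against the reference transcripts; the sketch below assumes the openai-whisper and jiwer packages and hypothetical paths, and is not the repository's evaluation pipeline.

import glob
import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper

asr = whisper.load_model("base")

# Hypothetical output directory and references; replace with your synthesized wavs and transcripts.
wav_paths = sorted(glob.glob("./results/RapFlow-TTS/*.wav"))
ref_texts = ["This is a test sentence."] * len(wav_paths)

hyp_texts = [asr.transcribe(p)["text"] for p in wav_paths]
print("WER:", jiwer.wer(ref_texts, hyp_texts))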

Synthesize

If you want to synthesize samples, run the following code with a pre-trained model. The spk_id option is only for multi-speaker models; 0 is the default for a single-speaker model.

python synthesize.py --input_text "" --weight_path {weight path} --weight_name {weight name} --spk_id {spk id} --n_timesteps {NFE}

Example:
    python synthesize.py --input_text "This is a test sentence" --weight_path ./checkpoints/RapFlow-TTS-LJS-Stage3-Improved --model_name RapFlow-TTS --n_timesteps 2 --weight_name model-last

Pre-trained Models

We provide pre-trained weights for the LJSpeech and VCTK datasets.

Note:

These checkpoints were trained in a different environment and with modified configurations compared to the original setup described in the paper. As a result, there may be slight performance differences; we observed a slight decrease in performance. For LJSpeech, we found that 700 epochs for Stages 1 and 2 were sufficient. For VCTK, 500 epochs for each stage were sufficient, and we also modified the number of delta bins in this case. The provided Stage 3 checkpoints for each dataset were selected based on loss curves and correspond to 200 and 50 epochs, respectively.

Citation

@article{park2025rapflow,
  title={RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching},
  author={Park, Hyun Joon and Liu, Jeongmin and Kim, Jin Sob and Yang, Jeong Yeol and Han, Sung Won and Song, Eunwoo},
  journal={arXiv preprint arXiv:2506.16741},
  year={2025}
}

License

RapFlow-TTS
Copyright (c) 2025-present NAVER Cloud Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

References

Thanks to the open-source projects we referred to, listed below.

  • Consistency Flow Matching: Our consistency flow matching loss for TTS starts from this repository.
  • Matcha-TTS: We used the network, dataset, and text normalization code.
  • Grad-TTS: We used the monotonic alignment search source code, CMUdict, and the filelists for the train, valid, and test split.
  • StyleTTS: We used the discriminator code and modified the adversarial learning loss from this repository.
  • DEX-TTS: The evaluation and preprocessing code starts from this repository.
  • HiFi-GAN, BigVGAN: We used the pre-trained vocoders and implementations for converting mel-spectrograms to waveforms.
