RapFlow-TTS: Rapid High-Fidelity Text-to-Speech with Improved Consistency Flow Matching (Interspeech 2025)
This repository is the official implementation of RapFlow-TTS: Rapid High-Fidelity Text-to-Speech with Improved Consistency Flow Matching.
In this repository, we provide steps for running RapFlow-TTS.
We recommend you visit our demo site.
RapFlow-TTS is an ODE-based TTS model that can synthesize high-quality speech in only a few synthesis steps using improved consistency flow matching. The overall architecture of RapFlow-TTS is shown below:
Clone this repository
git clone https://github.com/naver-ai/RapFlow-TTS.git
cd RapFlow-TTS/
Set up your virtual environment. You can use other Python versions.
conda create -n rapflow python=3.9
conda activate rapflow
or
python3.9 -m venv venv
source venv/bin/activate
To install requirements:
# you can use other recent torch versions (we used torch==1.11.0 with CUDA 11.3)
pip install -r requirements.txt
For the MAS (monotonic alignment search) algorithm, run the code below.
cd ./model/monotonic_align; python setup.py build_ext --inplace
For the pre-trained HiFi-GAN vocoders (LJSpeech and universal), download and unzip the weights, then place them in hifigan/weights/LJSpeech or hifigan/weights/universal.
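After unzipping, the weights directory should look roughly like the sketch below; the individual file names (generator checkpoint and config) come from the HiFi-GAN release you download and may differ.
hifigan/weights/
├── LJSpeech/     # unzipped LJSpeech vocoder files (generator checkpoint and config)
└── universal/    # unzipped universal vocoder files (generator checkpoint and config)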
We use the LJSpeech and VCTK datasets.
- The LJSpeech dataset can be downloaded here.
- The VCTK dataset (trimmed version) can be downloaded here.
- We follow the train, valid, and test set split of Grad-TTS for the LJSpeech dataset.
- We use a random set split for the VCTK dataset.
For the set split and text preprocessing, run the following code with the corresponding config option (LJSpeech or VCTK):
python ./preprocess/preprocess.py --config ./config/{dataset}/preprocess.yaml
Example:
python ./preprocess/preprocess.py --config ./config/LJSpeech/preprocess.yaml
python ./preprocess/preprocess.py --config ./config/VCTK/preprocess.yaml
The code yields two metadata lists for training:
- {train, valid, test}.txt : Path|Text|Speaker
- cleaned_{train, valid, test}.txt : Path|Text|Phoneme|Speaker
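As a purely illustrative example (the audio path, phoneme notation, and speaker ID below are placeholders rather than actual script output), one line of cleaned_train.txt could look like:
/path/to/wavs/sample_0001.wav|This is a test sentence.|ðɪs ɪz ɐ tˈɛst sˈɛntəns.|0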
To obtain the statistics for mel normalization, run the following code.
- When you run the code below for the first time, model.data_stats should be set to [0, 1].
- After computing the statistics, modify model.data_stats in the model configuration.
- In practice, we use mel normalization only for the LJSpeech dataset.
python get_mel_stats.py --config ./config/{dataset}/base_stage1.yaml
Example:
python get_mel_stats.py --config ./config/LJSpeech/base_stage1.yaml
Configurations:
├── path
├── preprocess
├── model
│    └── data_stats: [mean, std], [0, 1] --> [new stats]
├── train
└── test
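As a sketch of this edit in the model configuration (the key nesting is assumed from the tree above; <mean> and <std> stand for the values printed by get_mel_stats.py):
model:
  data_stats: [0.0, 1.0]   # initial setting for the first get_mel_stats.py run
  # afterwards, replace with the computed statistics: data_stats: [<mean>, <std>]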
To train RapFlow-TTS from scratch, run the following code.
- If you want to change training options such as num_worker, the CUDA device, and so on, check argument.py.
- If you want to edit the model or training settings, check config/{dataset}/base_stage{1,2,3}.yaml.
- The training process consists of 3 stages (Straight flow; Straight flow & Consistency; Straight flow & Consistency & Adversarial learning).
Configurations:
├── path
├── preprocess
├── model
│    ├── encoder
│    ├── decoder
│    ├── cfm
│    └── gan
├── train
└── test
To train all stages, please refer to train.sh. This repository supports multi-GPU training.
######## LJSpeech ##########
# stage 1
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 --master_port=29500 train_multi.py --config config/LJSpeech/base_stage1.yaml --num_worker 16
# stage 2 (improved)
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 --master_port=29500 train_multi.py --config config/LJSpeech/base_stage2_ict.yaml --num_worker 16
# stage 3 (improved)
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node=1 --master_port=29500 train_multi_adv.py --config config/LJSpeech/base_stage3_ict.yaml --num_worker 16
######## VCTK ##########
# stage 1
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi.py --config config/VCTK/base_stage1.yaml --num_worker 16
# stage 2 (improved)
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi.py --config config/VCTK/base_stage2_ict.yaml --num_worker 16
# stage 3 (improved)
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi_adv.py --config config/VCTK/base_stage3_ict.yaml --num_worker 16
For stage 1, run the code below:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi.py --config config/{dataset}/base_stage1.yaml --num_worker 16
Configurations:
├── path
├── preprocess
├── model
│    └── boundary: 0.0
├── train
└── test
For stage 2, run the code below.
- Check that stage is set to 2 in the configuration.
- prev_stage_ckpt should be set to the checkpoint path from stage 1 training; this enables weight initialization with the stage 1 weights.
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi.py --config config/{dataset}/base_stage2_ict.yaml --num_worker 16
Configurations:
├── path
├── preprocess
├── model
│    └── boundary: 0.9 # consistency training on (you can also use 1.0)
├── train
│    ├── stage: 2
│    └── prev_stage_ckpt: {path for stage1}
└── test
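As a sketch only (the checkpoint path below is a placeholder; point it to the output directory produced by your stage 1 run):
train:
  stage: 2
  prev_stage_ckpt: ./checkpoints/{your stage1 run}   # placeholder path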
For stage 3, run the code below.
- Check that stage is set to 3 in the configuration.
- prev_stage_ckpt should be set to the checkpoint path from stage 2 training; this enables weight initialization with the stage 2 weights.
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc_per_node=2 --master_port=29500 train_multi_adv.py --config config/{dataset}/base_stage3_ict.yaml --num_worker 16
Configurations:
├── path
├── preprocess
├── model
│    └── boundary: 0.9 # consistency training on (you can also use 1.0)
├── train
│    ├── stage: 3
│    └── prev_stage_ckpt: {path for stage2}
└── test
For stage 3, we train for an additional 150 epochs on LJSpeech and 50 epochs on VCTK. These epoch counts were decided empirically, considering the fluctuation of the discriminator training loss. Using checkpoints saved after the loss fluctuation has increased may result in poor synthesis quality; thus, the loss should be monitored when selecting adversarial-learning checkpoints.
Figure: Training Discriminator Loss Comparison (LJSpeech and VCTK Datasets)
You can synthesize the test files and check the word error rate (WER) of the synthesized samples by running the code below. The model and configuration are loaded based on the weight_path option.
python test.py --weight_path {weight_path} --model_name {Output folder name} --n_timesteps {NFE} --weight_name {weight name}
Example:
python test.py --weight_path ./checkpoints/RapFlow-TTS-LJS-Stage3-Improved --model_name RapFlow-TTS --n_timesteps 2 --weight_name model-last
If you want to synthesize samples with the pre-trained models, run the following code. The spk_id option is only for multi-speaker models; 0 is the default for a single-speaker model.
python synthesize.py --input_text "" --weight_path {weight path} --weight_name {weight name} --spk_id {spk id} --n_timesteps {NFE}
Example:
python synthesize.py --input_text "This is a test sentence" --weight_path ./checkpoints/RapFlow-TTS-LJS-Stage3-Improved --model_name RapFlow-TTS --n_timesteps 2 --weight_name model-last
We provide pre-trained weights for the LJSpeech and VCTK datasets.
Note:
These checkpoints were trained in a different environment and with modified configurations compared to the original setup described in the paper. As a result, there may be slight performance differences; we observed a slight decrease in performance. For LJSpeech, we found that 700 epochs for Stages 1 and 2 were sufficient. For VCTK, 500 epochs for each stage were sufficient, and we also modified the number of delta bins in this case. The provided Stage 3 checkpoints for each dataset were selected based on loss curves and correspond to 200 and 50 epochs, respectively.
@article{park2025rapflow,
title={RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching},
author={Park, Hyun Joon and Liu, Jeongmin and Kim, Jin Sob and Yang, Jeong Yeol and Han, Sung Won and Song, Eunwoo},
journal={arXiv preprint arXiv:2506.16741},
year={2025}
}
RapFlow-TTS
Copyright (c) 2025-present NAVER Cloud Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Thanks to the open-source projects we referred to, listed below.
- Consistency Flow Matching: The consistency flow matching loss for TTS starts from this repository.
- Matcha-TTS: We used the network, dataset, and text normalization codes.
- Grad-TTS: We used the monotonic alignment search source code, CMUdict, and the filelists for the train, valid, and test split.
- StyleTTS: We used the discriminator codes and modified the adversarial learning loss from this repository.
- DEX-TTS: The evaluation and preprocessing codes start from this repository.
- HiFi-GAN, BigVGAN: We used the pre-trained vocoders and implementations for converting mel-spectrograms to waveforms.