k2-fsa · drakempham · Apr 8, 2026 · Apr 9, 2026 · Apr 9, 2026 · Apr 9, 2026
diff --git a/.gitignore b/.gitignore
@@ -36,3 +36,4 @@ node_modules
 .DS_Store
 *.fst
 *.arpa
+.venv/
diff --git a/egs/vctk/TTS/README.md b/egs/vctk/TTS/README.md
@@ -1,10 +1,36 @@
 # Introduction
 
-This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive. 
-The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of the newspaper texts selected based a greedy algorithm that increases the contextual and phonetic coverage. 
-The details of the text selection algorithms are described in the following paper: [C. Veaux, J. Yamagishi and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,"](https://doi.org/10.1109/ICSDA.2013.6709856).
+Key features of VITS:
 
-The above information is from the [CSTR VCTK website](https://datashare.ed.ac.uk/handle/10283/3443).
+- Combines a VAE (Variational Autoencoder), normalizing flows, and a GAN (adversarial training with a discriminator).
+- Uses Monotonic Alignment Search (MAS): the model learns the alignment between text and audio automatically, with no separate forced-alignment step as in older models.
+- Supports multi-speaker training (VCTK has ~109 different English speakers).
+- Generates natural-sounding speech with good prosody and voice quality.
+
+The notebook (`vctk-vits-training.ipynb`) uses the icefall implementation of VITS (generator + discriminator).
+
+![alt text](image.png)
+
+# Data Preparation
+
+Run `prepare.sh` to download and prepare the data. All stages are run by default.
+
+**Option A — Download automatically (default):**
+```bash
+bash prepare.sh
+```
+
+**Option B — Use pre-existing local data (skip download):**
+
+If you already have the VCTK corpus available locally (e.g. from [Kaggle](https://www.kaggle.com/datasets/pratt3000/vctk-corpus)
+or another source), pass `--local-data-dir` to skip Stage 0 download:
+
+```bash
+bash prepare.sh --local-data-dir /path/to/your/VCTK
+```
+
+If `download/VCTK` does not already exist, this creates a symlink at `download/VCTK` pointing to your local copy,
+so all subsequent stages work without any modification.
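
The symlink step described above can be sketched in Python. This is a minimal illustration, not code from `prepare.sh`: the function name `link_local_vctk` and the temporary directories are hypothetical, and the logic only mirrors the "link if `download/VCTK` is not already a directory" behavior.

```python
import tempfile
from pathlib import Path


def link_local_vctk(local_data_dir, dl_dir):
    """Point dl_dir/VCTK at an existing corpus copy (hypothetical helper)."""
    target = Path(dl_dir) / "VCTK"
    if target.is_dir():
        # Data (or a valid link) is already in place, so nothing to do.
        return target
    Path(dl_dir).mkdir(parents=True, exist_ok=True)
    if target.is_symlink():
        # Replace a stale link, similar in spirit to `ln -sf`.
        target.unlink()
    target.symlink_to(Path(local_data_dir).resolve(), target_is_directory=True)
    return target


# Small demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "my_vctk"
    corpus.mkdir()
    link = link_local_vctk(corpus, Path(tmp) / "download")
    demo_is_symlink = link.is_symlink()
    demo_points_to_corpus = link.resolve() == corpus.resolve()
```

Resolving the source path before linking avoids the broken links that a relative source path can produce when the link lives in a different directory.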
 
 # VITS
 
@@ -22,7 +48,7 @@ export CUDA_VISIBLE_DEVICES="0,1,2,3"
   --num-epochs 1000 \
   --start-epoch 1 \
   --exp-dir vits/exp \
-  --tokens data/tokens.txt
+  --tokens data/tokens.txt \
   --max-duration 350
 ```
 

diff --git a/egs/vctk/TTS/image.png b/egs/vctk/TTS/image.png
diff --git a/egs/vctk/TTS/infer_pretrained.md b/egs/vctk/TTS/infer_pretrained.md
@@ -0,0 +1,84 @@
+# Inference with Pretrained VITS Model
+
+### Cell 1: Install Dependencies
+```bash
+# Install icefall repo and requirements
+!git clone https://github.com/k2-fsa/icefall.git /kaggle/working/icefall
+!pip install -r /kaggle/working/icefall/requirements.txt
+!grep -v 'numba' /kaggle/working/icefall/requirements-tts.txt | pip install -r /dev/stdin
+!pip install "numba>=0.59.0"
+
+# Install lhotse (audio dataset toolkit)
+!pip install lhotse
+
+# Install k2 (must match CUDA 12.8 + PyTorch 2.10.0)
+!pip install k2==1.24.4.dev20260306+cuda12.8.torch2.10.0 -f https://k2-fsa.github.io/k2/cuda.html
+
+# Install piper_phonemize and register icefall
+!pip install piper_phonemize -f https://k2-fsa.github.io/icefall/piper_phonemize.html
+!pip install -e /kaggle/working/icefall
+```
+
+### Cell 2: Prepare Dataset
+```bash
+%cd /kaggle/working/icefall/egs/vctk/TTS
+
+# Symlink VCTK data to bypass download stage
+!mkdir -p download
+!ln -sfv /kaggle/input/datasets/ download/VCTK
+
+# Build monotonic_align C extension
+!bash prepare.sh --stage -1 --stop_stage -1
+
+# Create manifests, spectrograms, tokens, and data splits
+!bash prepare.sh --stage 1 --stop_stage 4
+```
-!mkdir -p download
-!ln -sfv /kaggle/input/datasets/ download/VCTK
-
-# Build monotonic_align C extension
-!bash prepare.sh --stage -1 --stop_stage -1
-
-# Create manifests, spectrograms, tokens, and data splits
-!bash prepare.sh --stage 1 --stop_stage 4
-```
+!mkdir -p download
+# Build monotonic_align C extension
+!bash prepare.sh --stage -1 --stop_stage -1
+# Link the local VCTK copy instead of downloading (stage 0 only)
+!bash prepare.sh --local-data-dir /kaggle/input/vctk-corpus --stage 0 --stop_stage 0
+
+# Create manifests, spectrograms, tokens, and data splits
+!bash prepare.sh --stage 1 --stop_stage 4
+```
+
+### Cell 3: Download Pretrained Model
+```python
+from huggingface_hub import hf_hub_download
+import os, shutil
+
+MODEL_ID = "zrjin/icefall-tts-vctk-vits-2024-03-18"
+BASE_DIR  = "/kaggle/working/icefall/egs/vctk/TTS"
+
+os.makedirs(f"{BASE_DIR}/vits/exp", exist_ok=True)
+os.makedirs(f"{BASE_DIR}/data", exist_ok=True)
+
+# Download the checkpoint, then copy it into vits/exp (the --exp-dir used for inference)
+hf_hub_download(repo_id=MODEL_ID, filename="exp/epoch-1000.pt", local_dir=BASE_DIR)
+shutil.copy2(f"{BASE_DIR}/exp/epoch-1000.pt", f"{BASE_DIR}/vits/exp/epoch-1000.pt")
+
+# Download tokens and speakers
+hf_hub_download(repo_id=MODEL_ID, filename="data/tokens.txt", local_dir=BASE_DIR)
+hf_hub_download(repo_id=MODEL_ID, filename="data/speakers.txt", local_dir=BASE_DIR)
+
+print("Pretrained model downloaded and copied to the expected directories.")
+```
+
+### Cell 4: Run Inference
+```bash
+%cd /kaggle/working/icefall/egs/vctk/TTS
+
+!CUDA_VISIBLE_DEVICES="0" python vits/infer.py \
+  --epoch 1000 \
+  --exp-dir vits/exp \
+  --tokens data/tokens.txt \
+  --max-duration 500
+```
+
+### Cell 5: Play Generated Audio
+```python
+import os
+from IPython.display import Audio, display
+
+wav_dir = "/kaggle/working/icefall/egs/vctk/TTS/vits/exp/infer/epoch-1000/wav"
+# Choose to play audio from test set directory
+wav_dir_test = os.path.join(wav_dir, "test")
+wav_files = sorted(os.listdir(wav_dir_test))
+
+# Play the first 3 generated audio files
+for f in wav_files[:3]:
+    print(f)
+    display(Audio(os.path.join(wav_dir_test, f)))
+```
diff --git a/egs/vctk/TTS/knowledge.md b/egs/vctk/TTS/knowledge.md
@@ -0,0 +1,87 @@
+# VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
+
+This document explains the core architectural, mathematical, and logical concepts behind VITS, one of the state-of-the-art (SOTA) models for speech synthesis.
+
+---
+
+## 1. What sets VITS apart: end-to-end from text straight to waveform
+
+Before VITS, a TTS system was usually a pipeline of two separate stages:
+1. **Acoustic model** (Tacotron 2, FastSpeech): converts **text** into a **mel-spectrogram** (a time-frequency, image-like representation of audio).
+2. **Vocoder** (WaveNet, HiFi-GAN): converts the **mel-spectrogram** into a **waveform** (the raw audio signal sent to the speaker).
+
+**Drawback of the older approach:** error accumulation. If the acoustic model predicts a slightly blurry spectrogram, the vocoder amplifies that blur into audible artifacts or a robotic timbre.
+
+🔥 **VITS addresses this with an end-to-end model:**
+VITS connects text and waveform directly, with no break in between. Instead of forcing the model to produce a rigid intermediate mel-spectrogram, VITS learns a latent representation $z$.
+- From text, the model **predicts** $z$.
+- From $z$, the model **generates** the waveform directly.
+- If the generated audio does not sound realistic, the training signal updates both the text-to-$z$ predictor and the waveform generator, so the whole system is optimized jointly.
+
+---
+
+## 2. Architecture Flow
+
+```mermaid
+graph TD
+    %% Training Flow
+    subgraph Posterior [Posterior Encoder - training only]
+    Audio[Real audio] --> Spec[Linear Spectrogram]
+    Spec --> PEnc[Posterior Encoder]
+    PEnc -- distribution of z --> Z[Sampled latent z]
+    end
+
+    subgraph Prior [Prior Encoder - from text]
+    Text[Phoneme Text] --> TEnc[Text Encoder]
+    TEnc --> MAS[Monotonic Alignment Search]
+    TEnc --> SDP[Stochastic Duration Predictor]
+    Z -- used by MAS during training --> MAS
+    MAS -- length matching --> Flow[Normalizing Flow]
+    end
+
+    subgraph Generator [Waveform Decoder]
+    Z -- training --> Dec[HiFi-GAN Decoder]
+    Flow -- inference --> Dec
+    Dec --> Wave[Waveform Audio]
+    end
+```
+
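+**What happens at inference (when running `infer.py`):**
+Text → Text Encoder → Stochastic Duration Predictor (stretches the text-side prior to frame length) → Normalizing Flow, run in reverse (transforms the distribution) → Decoder (generates the waveform quickly).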
+
+---
+
+## 3. Core mathematical and logical foundations
+
+VITS combines five of the most influential ideas in generative modeling:
+
+### A. Variational Autoencoder (VAE)
+VITS is built on variational inference.
+- Instead of predicting a single exact value, the model predicts a **probability distribution** (typically a Gaussian).
+- **Posterior $q(z|x)$**: given real audio $x$, the posterior encoder encodes it into the parameters $\mu, \sigma$ of $z$.
+- **Prior $p(z|c)$**: given the text $c$, the model predicts from the phonemes which distribution the latent $z$ should follow.
+- The core of the math is **maximizing the ELBO (Evidence Lower Bound)**, which in practice means minimizing a reconstruction loss together with the **KL divergence** between the posterior (from real audio) and the prior (from text). This pushes the prediction made from text to match what the model infers when it hears the real audio.
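
To make the KL term concrete, here is a minimal sketch of the closed-form KL divergence between two diagonal Gaussians. It is a simplification for illustration: in VITS the prior is further transformed by a normalizing flow, so the real loss is not this exact formula, and the function name `gaussian_kl` is hypothetical.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        # Per-dimension closed form for two univariate Gaussians.
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


# Identical distributions have zero divergence; shifting the prior mean adds a penalty.
kl_same = gaussian_kl([0.0], [1.0], [0.0], [1.0])
kl_shifted = gaussian_kl([0.0], [1.0], [1.0], [1.0])
```

The divergence grows as the text-side prior drifts away from the audio-side posterior, which is the pressure that makes the two agree.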
+
+### B. Normalizing Flows
+Human speech is *one-to-many*: the same sentence can be spoken low or high, cheerfully or sadly. A plain Gaussian (bell curve) is too simple to represent that variety.
+- **Normalizing flows** are a chain of invertible functions that reshape a simple Gaussian into a much more complex distribution that fits real speech.
+- They turn the text encoder's "generic" prediction into one that captures prosody in much finer detail.
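
The invertibility property can be illustrated with a generic affine coupling layer. This is a sketch under assumptions, not the icefall implementation: `_scale_and_shift` is a hypothetical stand-in for the small network a real flow would learn, and the layer operates on a plain Python list.

```python
import math


def _scale_and_shift(xa):
    # Hypothetical stand-in for a learned network: any function of xa works,
    # because xa passes through the layer unchanged.
    log_scale = [math.tanh(v) for v in xa]
    shift = [2.0 * v + 1.0 for v in xa]
    return log_scale, shift


def coupling_forward(x):
    """Transform the second half of x conditioned on the first half."""
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    log_scale, shift = _scale_and_shift(xa)
    yb = [b * math.exp(s) + t for b, s, t in zip(xb, log_scale, shift)]
    log_det = sum(log_scale)  # log |det Jacobian| of the transform
    return xa + yb, log_det


def coupling_inverse(y):
    """Exactly undo coupling_forward."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_scale, shift = _scale_and_shift(ya)
    xb = [(b - t) * math.exp(-s) for b, s, t in zip(yb, log_scale, shift)]
    return ya + xb


x = [0.5, -1.0, 2.0, 3.0]
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Because the first half is copied through untouched, the inverse can recompute the same scale and shift, which is what makes the layer exactly invertible regardless of how complex the conditioning network is.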
+
+### C. Stochastic Duration Predictor
+The letter 'A' is sometimes held long (Aaaaa) and sometimes short (A).
+- The VITS duration predictor is also a *flow-based model*; it does not predict one fixed number tied to each token.
+- It combines random noise with the text to sample durations, giving the speech a natural rhythm with varied pacing and pauses. It is trained with maximum likelihood estimation (MLE).
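
Once durations are available, each token's representation is repeated for that many frames. The sketch below shows only this expansion step, assuming durations are rounded up to whole frames; the function name `expand_by_duration` and the toy inputs are hypothetical, and the flow-based sampling itself is not shown.

```python
import math


def expand_by_duration(token_states, durations):
    """Repeat each token state for ceil(duration) frames."""
    frames = []
    for state, duration in zip(token_states, durations):
        frames.extend([state] * math.ceil(duration))
    return frames


# Three tokens with sampled (fractional) durations in frames.
frames = expand_by_duration(["h", "e", "y"], [1.2, 3.0, 0.4])
```

Sampling different durations for the same text yields a different frame sequence, which is how the same sentence can come out with a different rhythm on each run.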
+
+### D. Monotonic Alignment Search (MAS)
+An algorithm that searches for a **monotonic** alignment.
+- *Monotonic* means time only moves forward: you cannot pronounce the second token before the first.
+- MAS uses **dynamic programming** (similar to Viterbi decoding in HMMs) to find the most probable alignment path between the spectrogram frames (audio) and the token sequence (text).
+- Thanks to MAS, VITS **needs no millisecond-level labels** (it does not need to know how long the word "Xin" lasts). The model learns the alignment on its own over the epochs.
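
The dynamic program can be sketched in plain Python. This is an illustrative version, not the compiled `monotonic_align` extension: it assumes a precomputed matrix `log_lik[i][j]` holding the log-likelihood of frame `j` under token `i`, and the function name is hypothetical.

```python
import math


def monotonic_alignment_search(log_lik):
    """Return, for each frame, the token index on the best monotonic path."""
    n_tokens, n_frames = len(log_lik), len(log_lik[0])
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")
    neg_inf = -math.inf
    # q[i][j]: best score of a path that ends with frame j assigned to token i.
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]  # same token as the previous frame
            advance = q[i - 1][j - 1] if i > 0 else neg_inf  # move to next token
            q[i][j] = max(stay, advance) + log_lik[i][j]
    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return path


# Token 0 fits frames 0-1, token 1 fits frame 2, token 2 fits frames 3-4.
scores = [
    [0, 0, -5, -5, -5],
    [-5, -5, 0, -5, -5],
    [-5, -5, -5, 0, 0],
]
alignment = monotonic_alignment_search(scores)
```

Each frame either stays on the current token or advances by exactly one, so the returned path never goes backwards and never skips a token.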
+
+### E. Adversarial Training (Generative Adversarial Network - GAN)
+Because the VAE reconstruction loss tends to make the audio sound "blurry" and "muffled", VITS uses a **HiFi-GAN** generator as its decoder.
+This sets up a two-player game:
+1. **Decoder (generator):** tries to produce raw audio that fools the discriminator.
+2. **Discriminator:** tries to tell synthesized audio apart from recordings of a real speaker. Training uses an LSGAN loss plus a feature-matching loss.
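
The two loss families named above can be written out in a few lines. This is a minimal sketch on plain Python lists with hypothetical function names; real implementations operate on tensors from several discriminators and weight the terms.

```python
def discriminator_loss(d_real, d_fake):
    """LSGAN discriminator loss: push real scores to 1 and fake scores to 0."""
    real_term = sum((1.0 - r) ** 2 for r in d_real) / len(d_real)
    fake_term = sum(f ** 2 for f in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_adversarial_loss(d_fake):
    """LSGAN generator loss: push the discriminator's fake scores toward 1."""
    return sum((1.0 - f) ** 2 for f in d_fake) / len(d_fake)


def feature_matching_loss(feats_real, feats_fake):
    """Mean absolute difference of discriminator features, summed over layers."""
    total = 0.0
    for layer_real, layer_fake in zip(feats_real, feats_fake):
        diffs = [abs(a - b) for a, b in zip(layer_real, layer_fake)]
        total += sum(diffs) / len(diffs)
    return total


perfect_d = discriminator_loss([1.0, 1.0], [0.0, 0.0])
fooled_nobody = generator_adversarial_loss([0.0, 0.0])
fm = feature_matching_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

The feature-matching term gives the generator a denser signal than the scalar real/fake score, since it compares intermediate discriminator activations for real and synthesized audio.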
+
+> VITS removes the dependence on the complex pipelines of traditional TTS: the VAE provides a unified statistical framework, MAS learns the text-audio alignment on its own, and the GAN makes the generated waveform crisp and clear.
diff --git a/egs/vctk/TTS/prepare.sh b/egs/vctk/TTS/prepare.sh
@@ -9,6 +9,12 @@ stage=0
 stop_stage=100
 use_edinburgh_vctk_url=true
 
+# If you have VCTK already downloaded locally (e.g. from Kaggle),
+# set this to the path of the existing VCTK directory to skip downloading.
+# Example:
+#   --local-data-dir /kaggle/input/vctk-corpus
+local_data_dir=
+
 dl_dir=$PWD/download
 
 . shared/parse_options.sh || exit 1
@@ -44,8 +50,18 @@ if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
   #
   #   ln -sfv /path/to/VCTK $dl_dir/VCTK
   #
+  # Alternatively, use --local-data-dir to point to an existing VCTK directory:
+  #
+  #   bash prepare.sh --local-data-dir /path/to/VCTK
+  #
   if [ ! -d $dl_dir/VCTK ]; then
-    lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} $dl_dir
+    if [ -n "$local_data_dir" ]; then
+      log "Using local data directory: $local_data_dir"
+      mkdir -p "$dl_dir"
+      ln -sfv "$local_data_dir" "$dl_dir/VCTK"
+    else
+      lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} $dl_dir
+    fi
   fi
 fi
 

diff --git a/egs/vctk/TTS/train_from_scratch.md b/egs/vctk/TTS/train_from_scratch.md
@@ -0,0 +1,64 @@
+# Train VITS Model From Scratch
+
+### Cell 1: Install Dependencies
+```bash
+# Install icefall repo and requirements
+!git clone https://github.com/k2-fsa/icefall.git /kaggle/working/icefall
+!pip install -r /kaggle/working/icefall/requirements.txt
+!grep -v 'numba' /kaggle/working/icefall/requirements-tts.txt | pip install -r /dev/stdin
+!pip install "numba>=0.59.0"
+
+# Install lhotse (audio dataset toolkit)
+!pip install lhotse
+
+# Install k2 (must match CUDA 12.8 + PyTorch 2.10.0)
+!pip install k2==1.24.4.dev20260306+cuda12.8.torch2.10.0 -f https://k2-fsa.github.io/k2/cuda.html
+
+# Install piper_phonemize and register icefall
+!pip install piper_phonemize -f https://k2-fsa.github.io/icefall/piper_phonemize.html
+!pip install -e /kaggle/working/icefall
+```
+
+### Cell 2: Prepare Dataset
+```bash
+%cd /kaggle/working/icefall/egs/vctk/TTS
+
+# Symlink VCTK data to bypass download stage
+!mkdir -p download
+!ln -sfv /kaggle/input/datasets/ download/VCTK
+
+# Build monotonic_align C extension
+!bash prepare.sh --stage -1 --stop_stage -1
+
+# Create manifests, spectrograms, tokens, and data splits
+!bash prepare.sh --stage 1 --stop_stage 6
+```
-!mkdir -p download
-!ln -sfv /kaggle/input/datasets/ download/VCTK
-
-# Build monotonic_align C extension
-!bash prepare.sh --stage -1 --stop_stage -1
-
-# Create manifests, spectrograms, tokens, and data splits
-!bash prepare.sh --stage 1 --stop_stage 6
-```
+!mkdir -p download
+# Build monotonic_align C extension
+!bash prepare.sh --stage -1 --stop_stage -1
+# Link the local VCTK copy instead of downloading (stage 0 only)
+!bash prepare.sh --local-data-dir /kaggle/input/vctk-corpus --stage 0 --stop_stage 0
+
+# Create manifests, spectrograms, tokens, and data splits
+!bash prepare.sh --stage 1 --stop_stage 6
+```
+
+### Cell 3: Train Model
+```bash
+%cd /kaggle/working/icefall/egs/vctk/TTS
+
+!CUDA_VISIBLE_DEVICES="0" python vits/train.py \
+  --world-size 1 \
+  --num-epochs 1000 \
+  --start-epoch 1 \
+  --exp-dir vits/exp \
+  --tokens data/tokens.txt \
+  --max-duration 350
+```
+
+### Cell 4: View TensorBoard Logs
+```python
+%load_ext tensorboard
+%tensorboard --logdir /kaggle/working/icefall/egs/vctk/TTS/vits/exp/tensorboard
+```
+
+### Cell 5: Export to ONNX (After Training)
+```bash
+%cd /kaggle/working/icefall/egs/vctk/TTS
+
+!python vits/export-onnx.py \
+  --epoch 1000 \
+  --exp-dir vits/exp \
+  --tokens data/tokens.txt
+```
diff --git a/egs/vctk/TTS/vctk-vits-training.ipynb b/egs/vctk/TTS/vctk-vits-training.ipynb