1 change: 1 addition & 0 deletions .gitignore
@@ -36,3 +36,4 @@ node_modules
.DS_Store
*.fst
*.arpa
.venv/
36 changes: 31 additions & 5 deletions egs/vctk/TTS/README.md
@@ -1,10 +1,36 @@
# Introduction

This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of the newspaper texts, selected based on a greedy algorithm that increases the contextual and phonetic coverage.
The details of the text selection algorithms are described in the following paper: [C. Veaux, J. Yamagishi and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,"](https://doi.org/10.1109/ICSDA.2013.6709856).
Key features of VITS:

The above information is from the [CSTR VCTK website](https://datashare.ed.ac.uk/handle/10283/3443).
- Combines a VAE (Variational Autoencoder), normalizing flows, and a GAN (adversarial training with a discriminator).
- Uses Monotonic Alignment Search (MAS): the model learns the alignment between text and audio automatically, so no separate forced alignment is needed as in older models.
- Supports multi-speaker training (VCTK has ~109 different English speakers).
- Generates natural-sounding speech with good prosody and voice quality.

The notebook uses the icefall implementation of VITS (generator + discriminator).

![alt text](image.png)

# Data Preparation

Run `prepare.sh` to download and prepare the data. All stages are run by default.

**Option A — Download automatically (default):**
```bash
bash prepare.sh
```

**Option B — Use pre-existing local data (skip download):**

If you already have the VCTK corpus available locally (e.g. from [Kaggle](https://www.kaggle.com/datasets/pratt3000/vctk-corpus)
or another source), pass `--local-data-dir` to skip Stage 0 download:

```bash
bash prepare.sh --local-data-dir /path/to/your/VCTK
```

This will create a symlink at `download/VCTK` pointing to your local copy,
so all subsequent stages work without any modification.

# VITS

@@ -22,7 +48,7 @@ export CUDA_VISIBLE_DEVICES="0,1,2,3"
--num-epochs 1000 \
--start-epoch 1 \
--exp-dir vits/exp \
--tokens data/tokens.txt
--tokens data/tokens.txt \
--max-duration 350
```

Binary file added egs/vctk/TTS/image.png
84 changes: 84 additions & 0 deletions egs/vctk/TTS/infer_pretrained.md
@@ -0,0 +1,84 @@
# Inference with Pretrained VITS Model

### Cell 1: Install Dependencies

⚠️ Potential issue | 🟡 Minor

Fix heading level jump (H1 → H3).

Line 3 should use ## to keep heading levels incremental.

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 3-3: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@egs/vctk/TTS/infer_pretrained.md` at line 3, The heading "Cell 1: Install
Dependencies" is using an H3 (###) which jumps from the document H1 to H3;
change that heading to H2 (replace "### Cell 1: Install Dependencies" with "##
Cell 1: Install Dependencies") so heading levels are incremental and consistent.

```bash
# Install icefall repo and requirements
!git clone https://github.com/k2-fsa/icefall.git /kaggle/working/icefall
!pip install -r /kaggle/working/icefall/requirements.txt
!grep -v 'numba' /kaggle/working/icefall/requirements-tts.txt | pip install -r /dev/stdin
!pip install "numba>=0.59.0"

# Install lhotse (audio dataset toolkit)
!pip install lhotse

# Install k2 (must match CUDA 12.8 + PyTorch 2.10.0)
!pip install k2==1.24.4.dev20260306+cuda12.8.torch2.10.0 -f https://k2-fsa.github.io/k2/cuda.html

Comment on lines +14 to +16

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For k2 installation, what are the supported CUDA/PyTorch wheel combinations and how should users choose the correct wheel in notebooks?

💡 Result:

Supported CUDA/PyTorch wheel combinations (prebuilt Linux CUDA wheels)

k2’s CUDA wheels are built for specific (CUDA runtime, PyTorch) pairs; you must match both (as encoded in the wheel tag +cudaX.Y.torchA.B.C). The currently listed combinations include, for example:

  • torch 2.10.0 with CUDA 12.6 / 12.8 / 12.9 / 13.0 wheels (multiple Python versions). [1]
  • torch 2.4.0 with CUDA 11.8 / 12.1 / 12.4 wheels (multiple Python versions). [1]

The authoritative, up-to-date list is the wheel index page: https://k2-fsa.github.io/k2/cuda.html. [1]

How users should choose the correct wheel in notebooks

  1. Check the PyTorch you already have in the notebook (this is what you must match):
       import torch, sys
       print("python:", sys.version)
       print("torch :", torch.__version__)
       print("cuda  :", torch.version.cuda)  # None if CPU-only torch
  2. Pick the k2 wheel whose tag matches exactly:
       • torch.__version__ → matches the torch… part
       • torch.version.cuda → matches the cuda… part
       (Example naming: ...+cuda11.7.torch2.0.1...). [2]
  3. Install following the documented pattern: install torch first, then install the matching k2 wheel from the wheel index:
       pip install k2==<k2_version>+cuda<cuda_version>.torch<torch_version> -f https://k2-fsa.github.io/k2/cuda.html

(And if you’re on Windows, CUDA wheels aren’t provided; you’d use CPU wheels or build from source.) [2]

Rule of thumb for notebooks: don’t “guess” a k2 wheel—read torch.__version__ / torch.version.cuda, then select the matching entry from cuda.html. [1][2]


Add compatibility check before k2 installation to match user's CUDA/PyTorch versions.

The pinned wheel k2==1.24.4.dev20260306+cuda12.8.torch2.10.0 only works for CUDA 12.8 + PyTorch 2.10.0; users with different versions will fail. For PyTorch 2.10.0, k2 also provides wheels for CUDA 12.6, 12.9, and 13.0. Add a check to detect the user's current versions and select the matching wheel, or document which CUDA versions the notebook requires.

import torch
print(f"PyTorch: {torch.__version__}, CUDA: {torch.version.cuda}")
# Then select the k2 wheel from https://k2-fsa.github.io/k2/cuda.html matching your versions
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@egs/vctk/TTS/infer_pretrained.md` around lines 14 - 16, Replace the hardcoded
k2 pip install line with a compatibility check that inspects the runtime PyTorch
and CUDA versions (e.g., using torch.__version__ and torch.version.cuda) and
selects the matching k2 wheel from the k2 CUDA wheel index (or else print a
clear error/requirement message); alternatively add a short documented note
above the existing pip install explaining the exact required CUDA and PyTorch
versions and listing the alternate wheel tags (CUDA 12.6, 12.8, 12.9, 13.0 for
torch 2.10.0) so users can choose the correct pip target instead of the pinned
cuda12.8.torch2.10.0 wheel.

# Install piper_phonemize and register icefall
!pip install piper_phonemize -f https://k2-fsa.github.io/icefall/piper_phonemize.html
!pip install -e /kaggle/working/icefall
```
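
The wheel-matching rule from the review comment above (read `torch.__version__` and `torch.version.cuda`, then pick the matching tag) can be sketched as a small helper. This is only an illustration: `k2_wheel_spec` is a hypothetical name, not part of k2 or icefall, and the k2 version passed in still has to be one that is actually listed on the wheel index.

```python
def k2_wheel_spec(k2_version: str, torch_version: str, cuda_version: str) -> str:
    """Build a pip requirement in the `k2==<ver>+cuda<X.Y>.torch<A.B.C>` form.

    `torch_version` may carry a local suffix such as "2.10.0+cu128";
    only the part before "+" appears in the k2 wheel tag.
    """
    base_torch = torch_version.split("+")[0]
    return f"k2=={k2_version}+cuda{cuda_version}.torch{base_torch}"


# In a notebook the last two arguments would come from torch.__version__ and
# torch.version.cuda; they are hard-coded here so the sketch runs without PyTorch.
spec = k2_wheel_spec("1.24.4.dev20260306", "2.10.0+cu128", "12.8")
print(spec)
```

The resulting string would then be passed to `pip install ... -f https://k2-fsa.github.io/k2/cuda.html`, after confirming that the exact tag exists on that page.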

### Cell 2: Prepare Dataset
```bash
%cd /kaggle/working/icefall/egs/vctk/TTS

# Symlink VCTK data to bypass download stage
!mkdir -p download
!ln -sfv /kaggle/input/datasets/ download/VCTK

# Build monotonic_align C extension
!bash prepare.sh --stage -1 --stop_stage -1

# Create manifests, spectrograms, tokens, and data splits
!bash prepare.sh --stage 1 --stop_stage 4
```
Comment on lines +27 to +35

⚠️ Potential issue | 🟠 Major

Prefer --local-data-dir over symlinking an ambiguous Kaggle path.

Line 28 symlinks a generic directory and may not resolve to the VCTK root. Use the new prepare.sh --local-data-dir option for a precise, reproducible setup.

Suggested doc update
 !mkdir -p download
-!ln -sfv /kaggle/input/datasets/ download/VCTK
-
-# Build monotonic_align C extension
-!bash prepare.sh --stage -1 --stop_stage -1
+!bash prepare.sh --stage -1 --stop_stage -1
+!bash prepare.sh --local-data-dir /kaggle/input/vctk-corpus --stage 0 --stop_stage 0
 
 # Create manifests, spectrograms, tokens, and data splits
 !bash prepare.sh --stage 1 --stop_stage 4
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-!mkdir -p download
-!ln -sfv /kaggle/input/datasets/ download/VCTK
-# Build monotonic_align C extension
-!bash prepare.sh --stage -1 --stop_stage -1
-# Create manifests, spectrograms, tokens, and data splits
-!bash prepare.sh --stage 1 --stop_stage 4
-```
+!mkdir -p download
+!bash prepare.sh --stage -1 --stop_stage -1
+!bash prepare.sh --local-data-dir /kaggle/input/vctk-corpus --stage 0 --stop_stage 0
+# Create manifests, spectrograms, tokens, and data splits
+!bash prepare.sh --stage 1 --stop_stage 4
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@egs/vctk/TTS/infer_pretrained.md` around lines 27 - 35, The current notebook
creates an ambiguous symlink to /kaggle/input/datasets/ then runs prepare.sh;
instead remove the ln -sfv step and call the script with the explicit
--local-data-dir argument (prepare.sh --local-data-dir /path/to/VCTK) so the
prep stages use the precise VCTK root; update the two prepare.sh invocations
(the build/extension call and the manifest/spectrogram/token/data-split call) to
pass --local-data-dir where needed and delete the symlink creation to make setup
reproducible.


### Cell 3: Download Pretrained Model
```python
from huggingface_hub import hf_hub_download
import os, shutil

MODEL_ID = "zrjin/icefall-tts-vctk-vits-2024-03-18"
BASE_DIR = "/kaggle/working/icefall/egs/vctk/TTS"

os.makedirs(f"{BASE_DIR}/vits/exp", exist_ok=True)
os.makedirs(f"{BASE_DIR}/data", exist_ok=True)

# Download checkpoint and move to correct directory
hf_hub_download(repo_id=MODEL_ID, filename="exp/epoch-1000.pt", local_dir=BASE_DIR)
shutil.copy2(f"{BASE_DIR}/exp/epoch-1000.pt", f"{BASE_DIR}/vits/exp/epoch-1000.pt")

# Download tokens and speakers
hf_hub_download(repo_id=MODEL_ID, filename="data/tokens.txt", local_dir=BASE_DIR)
hf_hub_download(repo_id=MODEL_ID, filename="data/speakers.txt", local_dir=BASE_DIR)

print("Pretrained model downloaded and moved to correct directories.")
```

### Cell 4: Run Inference
```bash
%cd /kaggle/working/icefall/egs/vctk/TTS

!CUDA_VISIBLE_DEVICES="0" python vits/infer.py \
--epoch 1000 \
--exp-dir vits/exp \
--tokens data/tokens.txt \
--max-duration 500
```

### Cell 5: Play Generated Audio
```python
import os
from IPython.display import Audio, display

wav_dir = "/kaggle/working/icefall/egs/vctk/TTS/vits/exp/infer/epoch-1000/wav"
# Choose to play audio from test set directory
wav_dir_test = os.path.join(wav_dir, "test")
wav_files = sorted(os.listdir(wav_dir_test))

# Play the first 3 generated audio files
for f in wav_files[:3]:
    print(f)
    display(Audio(os.path.join(wav_dir_test, f)))
```
87 changes: 87 additions & 0 deletions egs/vctk/TTS/knowledge.md
@@ -0,0 +1,87 @@
# VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)

This document explains the core architectural, mathematical, and logical concepts behind VITS, one of the state-of-the-art (SOTA) models in speech synthesis.

---

## 1. What Sets VITS Apart: End-to-End from Text Straight to Waveform

Before VITS, a TTS system was typically a pipeline of two separate stages:
1. **Acoustic model** (Tacotron 2, FastSpeech): converts **text** into a **mel-spectrogram** (an image-like time-frequency representation of the audio).
2. **Vocoder** (WaveNet, HiFi-GAN): converts the **mel-spectrogram** into a **waveform** (the raw audio samples sent to the speaker).

**Drawback of the old approach:** error accumulation. If the acoustic model predicts a slightly blurry spectrogram, the vocoder amplifies (original wording: "khuyếch đại") that blur into noise (artifacts) or a robotic sound.

⚠️ Potential issue | 🟡 Minor

Fix spelling in the Vietnamese technical term.

Line 13 uses “khuyếch đại”; the standard spelling is “khuếch đại”.

🧰 Tools
🪛 LanguageTool

[grammar] ~13-~13: Ensure spelling is correct
Context: ... dự đoán Spectrogram hơi mờ, Vocoder sẽ khuyếch đại cái "mờ" đó thành tiếng nhiễu (arti...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@egs/vctk/TTS/knowledge.md` at line 13, Replace the misspelled Vietnamese
technical term "khuyếch đại" with the correct spelling "khuếch đại" in the
sentence describing the drawback (Error Accumulation) so the line reads that the
Vocoder will "khuếch đại" the blurred spectrogram into artifacts or robotic
noise; update the Occurrence of "khuyếch đại" in the TTS knowledge text to the
correct form.


🔥 **VITS addresses this with an end-to-end model:**
VITS connects text and waveform directly, with no break in between. Instead of forcing the model to produce a rigid intermediate mel-spectrogram, VITS learns a latent representation $z$.
- From text, the model **predicts** $z$.
- From $z$, the model **generates** the waveform directly.
- If the waveform does not sound realistic, training adjusts both the text-to-$z$ predictor and the waveform generator, so the whole system is optimized jointly.

---

## 2. Architecture Flow

```mermaid
graph TD
%% Training Flow
subgraph Posterior [Posterior Encoder - training only]
Audio[Real audio] --> Spec[Linear Spectrogram]
Spec --> PEnc[Posterior Encoder]
PEnc -- Distribution of z --> Z[Sample latent z]
end

subgraph Prior [Prior Encoder - from text]
Text[Phoneme Text] --> TEnc[Text Encoder]
TEnc --> MAS[Monotonic Alignment Search]
TEnc --> SDP[Stochastic Duration Predictor]
Z -- Used by MAS during training --> MAS
MAS -- Length matching --> Flow[Normalizing Flow]
end

subgraph Generator [Waveform Decoder]
Z -- Training --> Dec[HiFi-GAN Decoder]
Flow -- Inference --> Dec
Dec --> Wave[Waveform Audio]
end
end
```

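**At inference time (when running `infer.py`):**
Text → Text Encoder → Stochastic Duration Predictor (sets how many frames each phoneme spans) → sample $z$ from the prior → Normalizing Flow (run in the inverse direction) → Decoder (generates the waveform). The Posterior Encoder and MAS are not used at this stage.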

---

## 3. Core Mathematical and Logical Foundations

VITS combines five techniques:

### A. Variational Autoencoder (VAE)
VITS is built on variational inference.
- Instead of predicting a single exact value, the model predicts a **probability distribution** (typically a Gaussian).
- **Posterior $q(z|x)$**: given real audio $x$, the model encodes it into the parameters $\mu, \sigma$ of $z$.
- **Prior $p(z|c)$**: given text $c$, the model predicts from the phonemes which distribution the latent $z$ should follow.
- The central objective is to **maximize the ELBO (Evidence Lower Bound)**, which combines a reconstruction term with a **KL divergence** term between the posterior (from real audio) and the prior (from text). Minimizing the KL term forces what the model predicts from text to match what it infers when it hears the real audio.
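
To make the KL term concrete, here is a minimal sketch of the closed-form KL divergence between two diagonal Gaussians, parameterized by mean and log standard deviation. It is an illustration only, not the loss used in icefall: in VITS the prior is additionally passed through a normalizing flow, so the actual KL term is evaluated differently. The function name `kl_diag_gaussians` is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, logs_q, mu_p, logs_p):
    """KL(q || p) for diagonal Gaussians, summed over dimensions.

    Each argument is a list of floats; logs_* are log standard deviations.
    Per dimension: log(s_p / s_q) + (s_q^2 + (mu_q - mu_p)^2) / (2 s_p^2) - 1/2
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logs_q, mu_p, logs_p):
        var_q = math.exp(2.0 * lq)
        var_p = math.exp(2.0 * lp)
        total += lp - lq + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total


# Identical distributions have zero divergence; shifting the mean increases it.
print(kl_diag_gaussians([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
print(kl_diag_gaussians([1.0], [0.0], [0.0], [0.0]))
```

The divergence is zero only when the posterior and prior coincide, which is the sense in which the objective pushes the text-side prediction toward the audio-side one.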

### B. Normalizing Flows
Human speech is *one-to-many*: the same sentence can be read low or high, cheerfully or sadly. A plain Gaussian (bell-shaped) distribution is too simple to represent this variety.
- **Normalizing flows** are a chain of invertible functions that reshape a basic Gaussian into a far more complex distribution that better fits real speech.
- They let the text-conditioned prior move from a "generic" prediction to one that captures fine detail in prosody.
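
The invertibility property can be illustrated with a generic affine coupling layer, a common building block of normalizing flows. This is a sketch under assumptions: the function names and the toy conditioner functions are made up for illustration, and the coupling layers in the icefall VITS code may differ in detail.

```python
import math


def coupling_forward(x, shift_fn, log_scale_fn):
    """Keep the first half of x; scale and shift the second half based on the first.

    Returns the transformed vector and the log-determinant of the Jacobian.
    """
    half = len(x) // 2
    xa, xb = x[:half], x[half:]
    shift, log_scale = shift_fn(xa), log_scale_fn(xa)
    yb = [b * math.exp(s) + t for b, s, t in zip(xb, log_scale, shift)]
    return xa + yb, sum(log_scale)


def coupling_inverse(y, shift_fn, log_scale_fn):
    """Exact inverse: the untouched half lets us recompute the same shift and scale."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    shift, log_scale = shift_fn(ya), log_scale_fn(ya)
    xb = [(b - t) * math.exp(-s) for b, s, t in zip(yb, log_scale, shift)]
    return ya + xb


# Toy conditioners standing in for small neural networks (assumption).
def toy_shift(xa):
    return [2.0 * a for a in xa]


def toy_log_scale(xa):
    return [0.1 * a for a in xa]


x = [1.0, -2.0, 0.5, 3.0]
y, log_det = coupling_forward(x, toy_shift, toy_log_scale)
x_back = coupling_inverse(y, toy_shift, toy_log_scale)
print(y, log_det, x_back)
```

Because the first half passes through unchanged, the inverse can recompute the same shift and scale, which is what makes the layer usable in both directions (one during training, the other during inference).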

### C. Stochastic Duration Predictor
The letter 'A' is sometimes drawn out (Aaaaa) and sometimes short (A).
- The VITS duration predictor is also a *flow-based model*; it does not predict one fixed number per phoneme.
- It combines random noise with the text encoding to sample durations, which gives the speech a natural rhythm with varied pauses, closer to a human speaker. It is trained by maximum likelihood estimation (MLE).
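
Once durations are available as integer frame counts, they are used to stretch the phoneme-level sequence to frame level. The sketch below shows only that expansion step, with a hypothetical helper name and made-up durations; it does not implement the stochastic predictor itself, and in the real model the expanded items are prior statistics rather than phoneme symbols.

```python
def expand_by_duration(tokens, durations):
    """Repeat each token according to its duration (number of frames)."""
    frames = []
    for token, n_frames in zip(tokens, durations):
        frames.extend([token] * n_frames)
    return frames


# Hypothetical phonemes and durations, for illustration only.
print(expand_by_duration(["HH", "AH", "L"], [2, 3, 1]))
```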

### D. Monotonic Alignment Search (MAS)
An algorithm that searches for a **monotonic** alignment.
- *Monotonic* means time only moves forward: you cannot pronounce the second phoneme before the first.
- MAS uses **dynamic programming** (similar to Viterbi decoding in HMMs) to find the most probable alignment path between the audio frames and the phoneme sequence.
- Thanks to MAS, VITS **does not need millisecond-level alignment labels** (it does not need to know how many seconds the word "Xin" lasts). The model learns the alignment on its own over the epochs.
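
The dynamic-programming idea can be sketched in plain Python. This is a simplified illustration, not the compiled `monotonic_align` extension that `prepare.sh` builds: it assumes a precomputed table `log_p[j][i]` with the log-likelihood of frame `j` under token `i`, and the function name is hypothetical.

```python
def monotonic_alignment_search(log_p):
    """Return path[j] = token index aligned to frame j.

    The path starts at token 0, ends at the last token, and at each frame
    either stays on the same token or advances by exactly one.
    """
    n_frames, n_tokens = len(log_p), len(log_p[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")

    # q[j][i]: best cumulative score of any monotonic path ending at (frame j, token i).
    q = [[neg_inf] * n_tokens for _ in range(n_frames)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[j - 1][i]
            advance = q[j - 1][i - 1] if i > 0 else neg_inf
            q[j][i] = max(stay, advance) + log_p[j][i]

    # Backtrack from the last frame and last token.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and (i == j or q[j - 1][i - 1] >= q[j - 1][i]):
            i -= 1
    return path


# Two tokens, four frames: the first two frames fit token 0, the last two fit token 1.
scores = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
print(monotonic_alignment_search(scores))
```

The per-token frame counts obtained from such a path are what serve as duration targets, which is why no external forced aligner is needed.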

### E. Adversarial Training (Generative Adversarial Network - GAN)
Because the VAE reconstruction loss tends to make the audio sound "blurry" and "muffled", VITS uses a **HiFi-GAN** generator as its decoder.
This sets up a two-player game:
1. **Decoder (generator):** tries to produce raw audio that fools the discriminator.
2. **Discriminator:** tries to tell synthesized audio from audio of a real speaker. Training uses a feature matching loss and an LSGAN (least-squares GAN) loss.
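
The two losses named above can be written down in a few lines. This is a sketch on plain Python lists with hypothetical function names; the icefall implementation operates on tensors from several sub-discriminators and may weight or average the terms differently.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN: push scores for real audio toward 1 and fake toward 0."""
    real_term = sum((1.0 - r) ** 2 for r in real_scores) / len(real_scores)
    fake_term = sum(f ** 2 for f in fake_scores) / len(fake_scores)
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    """The generator wants the discriminator to score its output as 1."""
    return sum((1.0 - f) ** 2 for f in fake_scores) / len(fake_scores)


def feature_matching_loss(real_features, fake_features):
    """Mean absolute difference between discriminator feature maps, summed over layers."""
    total = 0.0
    for real_layer, fake_layer in zip(real_features, fake_features):
        total += sum(abs(r - f) for r, f in zip(real_layer, fake_layer)) / len(real_layer)
    return total


print(lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0]))
print(lsgan_generator_loss([0.0, 0.0]))
print(feature_matching_loss([[1.0, 2.0]], [[1.0, 4.0]]))
```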

> VITS removes the dependence on the complex multi-stage pipelines of traditional TTS: it uses a VAE for a unified statistical framework, MAS to learn the text-to-audio alignment on its own, and a GAN so that the generated waveform sounds crisp and clear.
18 changes: 17 additions & 1 deletion egs/vctk/TTS/prepare.sh
@@ -9,6 +9,12 @@ stage=0
stop_stage=100
use_edinburgh_vctk_url=true

# If you have VCTK already downloaded locally (e.g. from Kaggle),
# set this to the path of the existing VCTK directory to skip downloading.
# Example:
# --local-data-dir /kaggle/input/vctk-corpus
local_data_dir=

dl_dir=$PWD/download

. shared/parse_options.sh || exit 1
@@ -44,8 +50,18 @@ if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
#
# ln -sfv /path/to/VCTK $dl_dir/VCTK
#
# Alternatively, use --local-data-dir to point to an existing VCTK directory:
#
# bash prepare.sh --local-data-dir /path/to/VCTK
#
if [ ! -d $dl_dir/VCTK ]; then
lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} $dl_dir
if [ -n "$local_data_dir" ]; then
log "Using local data directory: $local_data_dir"

Collaborator

Can you check that "$local_data_dir" exists?

mkdir -p $dl_dir
ln -sfv $local_data_dir $dl_dir/VCTK

P2: Quote local-data-dir when creating symlink

The new --local-data-dir path is expanded unquoted in ln -sfv $local_data_dir $dl_dir/VCTK, so any directory containing whitespace (or shell glob characters) is split into multiple arguments and Stage 0 fails before data preparation starts. This makes the new option unreliable for valid filesystem paths such as mounted datasets with spaces in their names; quoting both operands avoids this regression.


else
lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} $dl_dir
fi
Comment on lines +58 to +64

Contributor

medium

The implementation of the --local-data-dir option should be more robust to handle common edge cases:

  1. Quoting: Path variables (like $local_data_dir and $dl_dir) should be double-quoted to prevent the script from breaking if the user provides a path containing spaces.
  2. Absolute Paths: If a relative path is passed to --local-data-dir, the symlink created at $dl_dir/VCTK will likely be broken because symlinks are resolved relative to their parent directory. Converting the path to an absolute one using readlink -f (or similar) ensures the symlink remains valid.
  3. Validation: It is better to verify that the provided directory actually exists before attempting to symlink it, providing a clear error message if it is missing.
References
  1. Always quote variables that contain file names or paths to handle spaces and special characters correctly.

Comment on lines +58 to +64

⚠️ Potential issue | 🟡 Minor

Add validation for the local data directory path.

The script doesn't verify that $local_data_dir exists before creating the symlink. If an invalid path is provided, subsequent stages will fail with confusing errors.

🛡️ Proposed fix to validate the directory exists
     if [ -n "$local_data_dir" ]; then
+      if [ ! -d "$local_data_dir" ]; then
+        log "Error: local data directory does not exist: $local_data_dir"
+        exit 1
+      fi
       log "Using local data directory: $local_data_dir"
-      mkdir -p $dl_dir
-      ln -sfv $local_data_dir $dl_dir/VCTK
+      mkdir -p "$dl_dir"
+      ln -sfv "$local_data_dir" "$dl_dir/VCTK"
     else
-      lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} $dl_dir
+      lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} "$dl_dir"
     fi
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 60-60: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 61-61: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 61-61: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 63-63: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@egs/vctk/TTS/prepare.sh` around lines 58 - 64, In prepare.sh, validate that
the provided local_data_dir exists and is a directory before creating the
symlink: check [ -n "$local_data_dir" ] && [ -d "$local_data_dir" ] (or use test
-d) and if the check fails call log with a clear error and exit non-zero; only
run mkdir -p "$dl_dir" and ln -sfv "$local_data_dir" "$dl_dir/VCTK" when the
directory check passes so invalid paths don't produce confusing errors later.

fi
fi
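
The three points raised by the reviewers (validate the directory, resolve it to an absolute path, and avoid word splitting) can be illustrated with the following sketch. It is written in Python purely to show the logic; `prepare.sh` itself is bash, and the function name `link_local_data` is hypothetical.

```python
from pathlib import Path


def link_local_data(local_data_dir: str, dl_dir: str) -> Path:
    """Create <dl_dir>/VCTK as a symlink to an existing local VCTK directory."""
    # Resolve to an absolute path so the link stays valid even when a
    # relative path was passed on the command line.
    source = Path(local_data_dir).resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"local data directory does not exist: {source}")

    download_root = Path(dl_dir)
    download_root.mkdir(parents=True, exist_ok=True)

    link = download_root / "VCTK"
    if link.is_symlink():
        link.unlink()  # replace a stale link, similar to `ln -sf`
    link.symlink_to(source, target_is_directory=True)
    return link
```

Paths are handled as single values here, so spaces cause no trouble; in the shell script the equivalent protection is double-quoting `"$local_data_dir"` and `"$dl_dir"`.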

64 changes: 64 additions & 0 deletions egs/vctk/TTS/train_from_scratch.md
@@ -0,0 +1,64 @@
# Train VITS Model From Scratch

### Cell 1: Install Dependencies

⚠️ Potential issue | 🟡 Minor

Fix heading level jump (H1 → H3).

Line 3 should be ## instead of ### to satisfy Markdown heading progression.

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 3-3: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@egs/vctk/TTS/train_from_scratch.md` at line 3, Heading "Cell 1: Install
Dependencies" uses H3 (###) causing a jump from H1 to H3; change the Markdown
heading to H2 by replacing the leading "###" with "##" so the file's heading
progression is H1 → H2 → ... and validates properly.

```bash
# Install icefall repo and requirements
!git clone https://github.com/k2-fsa/icefall.git /kaggle/working/icefall
!pip install -r /kaggle/working/icefall/requirements.txt
!grep -v 'numba' /kaggle/working/icefall/requirements-tts.txt | pip install -r /dev/stdin
!pip install "numba>=0.59.0"

# Install lhotse (audio dataset toolkit)
!pip install lhotse

# Install k2 (must match CUDA 12.8 + PyTorch 2.10.0)
!pip install k2==1.24.4.dev20260306+cuda12.8.torch2.10.0 -f https://k2-fsa.github.io/k2/cuda.html

# Install piper_phonemize and register icefall
!pip install piper_phonemize -f https://k2-fsa.github.io/icefall/piper_phonemize.html
!pip install -e /kaggle/working/icefall
```

### Cell 2: Prepare Dataset
```bash
%cd /kaggle/working/icefall/egs/vctk/TTS

# Symlink VCTK data to bypass download stage
!mkdir -p download
!ln -sfv /kaggle/input/datasets/ download/VCTK

# Build monotonic_align C extension
!bash prepare.sh --stage -1 --stop_stage -1

# Create manifests, spectrograms, tokens, and data splits
!bash prepare.sh --stage 1 --stop_stage 6
```
Comment on lines +27 to +35

⚠️ Potential issue | 🟠 Major

Use --local-data-dir directly instead of a fragile manual symlink target.

Line 28 points to a broad directory (/kaggle/input/datasets/) rather than a concrete VCTK root, which can break downstream prep. This PR already introduces --local-data-dir; use it in the doc to make the flow deterministic.

Suggested doc update
 !mkdir -p download
-!ln -sfv /kaggle/input/datasets/ download/VCTK
-
-# Build monotonic_align C extension
-!bash prepare.sh --stage -1 --stop_stage -1
+!bash prepare.sh --stage -1 --stop_stage -1
+!bash prepare.sh --local-data-dir /kaggle/input/vctk-corpus --stage 0 --stop_stage 0
 
 # Create manifests, spectrograms, tokens, and data splits
 !bash prepare.sh --stage 1 --stop_stage 6
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-!mkdir -p download
-!ln -sfv /kaggle/input/datasets/ download/VCTK
-# Build monotonic_align C extension
-!bash prepare.sh --stage -1 --stop_stage -1
-# Create manifests, spectrograms, tokens, and data splits
-!bash prepare.sh --stage 1 --stop_stage 6
-```
+!mkdir -p download
+!bash prepare.sh --stage -1 --stop_stage -1
+!bash prepare.sh --local-data-dir /kaggle/input/vctk-corpus --stage 0 --stop_stage 0
+# Create manifests, spectrograms, tokens, and data splits
+!bash prepare.sh --stage 1 --stop_stage 6
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@egs/vctk/TTS/train_from_scratch.md` around lines 27 - 35, The doc currently
creates a fragile symlink to a broad path instead of using the new CLI flag;
update the instructions to remove the mkdir/ln steps and call prepare.sh with
the explicit --local-data-dir pointing at the actual VCTK dataset root (use the
same --local-data-dir flag introduced in the PR) so subsequent steps
(monotonic_align build via prepare.sh and stages 1–6) consume the correct
dataset; reference the prepare.sh invocation and the --local-data-dir flag when
making this change.


### Cell 3: Train Model
```bash
%cd /kaggle/working/icefall/egs/vctk/TTS

!CUDA_VISIBLE_DEVICES="0" python vits/train.py \
--world-size 1 \
--num-epochs 1000 \
--start-epoch 1 \
--exp-dir vits/exp \
--tokens data/tokens.txt \
--max-duration 350
```

### Cell 4: View TensorBoard Logs
```python
%load_ext tensorboard
%tensorboard --logdir /kaggle/working/icefall/egs/vctk/TTS/vits/exp/tensorboard
```

### Cell 5: Export to ONNX (After Training)
```bash
%cd /kaggle/working/icefall/egs/vctk/TTS

!python vits/export-onnx.py \
--epoch 1000 \
--exp-dir vits/exp \
--tokens data/tokens.txt
```
1 change: 1 addition & 0 deletions egs/vctk/TTS/vctk-vits-training.ipynb

Large diffs are not rendered by default.
