Changes from 2 commits
1 change: 1 addition & 0 deletions .gitignore
@@ -36,3 +36,4 @@ node_modules
 .DS_Store
 *.fst
 *.arpa
+.venv/
29 changes: 26 additions & 3 deletions egs/vctk/TTS/README.md
@@ -1,11 +1,34 @@
 # Introduction

-This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
-The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of the newspaper texts selected based a greedy algorithm that increases the contextual and phonetic coverage.
+Follow this: https://k2-fsa.github.io/icefall/recipes/TTS/vctk/vits.html
+
+This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
+The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of the newspaper texts selected based a greedy algorithm that increases the contextual and phonetic coverage.
 The details of the text selection algorithms are described in the following paper: [C. Veaux, J. Yamagishi and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,"](https://doi.org/10.1109/ICSDA.2013.6709856).

 The above information is from the [CSTR VCTK website](https://datashare.ed.ac.uk/handle/10283/3443).

+# Data Preparation
+
+Run `prepare.sh` to download and prepare the data. All stages are run by default.
+
+**Option A — Download automatically (default):**
+```bash
+bash prepare.sh
+```
+
+**Option B — Use pre-existing local data (skip download):**
+
+If you already have the VCTK corpus available locally (e.g. from [Kaggle](https://www.kaggle.com/datasets/pratt3000/vctk-corpus)
+or another source), pass `--local-data-dir` to skip Stage 0 download:
+
+```bash
+bash prepare.sh --local-data-dir /path/to/your/VCTK
+```
+
+This will create a symlink at `download/VCTK` pointing to your local copy,
+so all subsequent stages work without any modification.
+
 # VITS

 This recipe provides a VITS model trained on the VCTK dataset.
@@ -22,7 +45,7 @@ export CUDA_VISIBLE_DEVICES="0,1,2,3"
   --num-epochs 1000 \
   --start-epoch 1 \
   --exp-dir vits/exp \
-  --tokens data/tokens.txt
+  --tokens data/tokens.txt \
   --max-duration 350
```

18 changes: 17 additions & 1 deletion egs/vctk/TTS/prepare.sh
@@ -9,6 +9,12 @@ stage=0
 stop_stage=100
 use_edinburgh_vctk_url=true

+# If you have VCTK already downloaded locally (e.g. from Kaggle),
+# set this to the path of the existing VCTK directory to skip downloading.
+# Example:
+# --local-data-dir /kaggle/input/vctk-corpus
+local_data_dir=
+
 dl_dir=$PWD/download

 . shared/parse_options.sh || exit 1
@@ -44,8 +50,18 @@ if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
   #
   # ln -sfv /path/to/VCTK $dl_dir/VCTK
   #
+  # Alternatively, use --local-data-dir to point to an existing VCTK directory:
+  #
+  # bash prepare.sh --local-data-dir /path/to/VCTK
+  #
   if [ ! -d $dl_dir/VCTK ]; then
-    lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} $dl_dir
+    if [ -n "$local_data_dir" ]; then
+      log "Using local data directory: $local_data_dir"

Review comment (Collaborator):

Can you check that `$local_data_dir` exists?

+      mkdir -p $dl_dir
+      ln -sfv $local_data_dir $dl_dir/VCTK

Review comment (P2): Quote `local-data-dir` when creating symlink

The new `--local-data-dir` path is expanded unquoted in `ln -sfv $local_data_dir $dl_dir/VCTK`, so any directory containing whitespace (or shell glob characters) is split into multiple arguments and Stage 0 fails before data preparation starts. This makes the new option unreliable for valid filesystem paths such as mounted datasets with spaces in their names; quoting both operands avoids this regression.
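
To make the failure mode concrete, here is a minimal, self-contained sketch of the word-splitting problem described above. It does not touch `prepare.sh`; the temporary directory layout and the `count_args` helper are assumptions made up for this illustration.

```bash
#!/usr/bin/env bash
# Sketch: how an unquoted path containing a space is split by the shell.
# All paths below live in a throwaway temp directory (hypothetical layout).

work=$(mktemp -d)
local_data_dir="$work/my data/VCTK"   # note the space in "my data"
dl_dir="$work/download"
mkdir -p "$local_data_dir" "$dl_dir"

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

unquoted_count=$(count_args $local_data_dir)    # path is split into words
quoted_count=$(count_args "$local_data_dir")    # path stays one word

# Unquoted: ln receives three operands, and the last one is not an
# existing directory, so the call fails.
if ln -sfv $local_data_dir $dl_dir/VCTK >/dev/null 2>&1; then
  unquoted_status=ok
else
  unquoted_status=failed
fi

# Quoted: ln receives exactly two operands and creates the link.
if ln -sfv "$local_data_dir" "$dl_dir/VCTK" >/dev/null 2>&1 \
    && [ -L "$dl_dir/VCTK" ] && [ -d "$dl_dir/VCTK" ]; then
  quoted_status=ok
else
  quoted_status=failed
fi

echo "unquoted: $unquoted_count args, ln $unquoted_status"
echo "quoted:   $quoted_count args, ln $quoted_status"
```

The sketch assumes the default `IFS` and a temp directory path without whitespace of its own; only the space in `my data` is meant to trigger the split.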


+    else
+      lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} $dl_dir
+    fi
Review comment (Contributor, medium) on lines +58 to +64:

The implementation of the `--local-data-dir` option should be more robust to handle common edge cases:

  1. Quoting: Path variables (like `$local_data_dir` and `$dl_dir`) should be double-quoted to prevent the script from breaking if the user provides a path containing spaces.
  2. Absolute Paths: If a relative path is passed to `--local-data-dir`, the symlink created at `$dl_dir/VCTK` will likely be broken, because a relative symlink target is resolved relative to the directory containing the link, not the directory the script was run from. Converting the path to an absolute one using `readlink -f` (or similar) ensures the symlink remains valid.
  3. Validation: It is better to verify that the provided directory actually exists before attempting to symlink it, providing a clear error message if it is missing.

References
  1. Always quote variables that contain file names or paths to handle spaces and special characters correctly.
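
The three points above could be combined roughly as follows. This is a sketch under stated assumptions, not the change that was merged: `link_local_data` is a hypothetical helper name, and it uses `cd ... && pwd -P` to obtain an absolute path as a portable stand-in for the `readlink -f` mentioned in the comment.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of prepare.sh): validate the source
# directory, convert it to an absolute path, then create the symlink
# with every expansion quoted.
link_local_data() {
  local src=$1 dl_dir=$2

  # Validation: fail early with a clear message.
  if [ ! -d "$src" ]; then
    echo "Error: local data directory does not exist: $src" >&2
    return 1
  fi

  # Absolute path: resolve relative to the caller's working directory,
  # so the link target does not depend on where the link is stored.
  local abs_src
  abs_src=$(cd "$src" && pwd -P) || return 1

  mkdir -p "$dl_dir"
  # -n replaces an existing link instead of creating one inside it.
  ln -sfn "$abs_src" "$dl_dir/VCTK"
}

# Usage: a relative path that also contains a space.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p "local data/VCTK"

link_local_data "local data/VCTK" "$work/download"
link_status=$?

link_local_data "no/such/dir" "$work/download" 2>/dev/null
missing_status=$?

# Where the link actually points, as a physical path.
resolved=$(cd "$work/download/VCTK" && pwd -P)
expected=$(cd "$work/local data/VCTK" && pwd -P)
echo "link status: $link_status, missing-dir status: $missing_status"
```

Returning a non-zero status instead of calling `exit` keeps the helper reusable; a script like `prepare.sh` could still stop on failure with `link_local_data ... || exit 1`.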

Comment on lines +58 to +64


⚠️ Potential issue | 🟡 Minor

Add validation for the local data directory path.

The script doesn't verify that $local_data_dir exists before creating the symlink. If an invalid path is provided, subsequent stages will fail with confusing errors.

🛡️ Proposed fix to validate the directory exists
     if [ -n "$local_data_dir" ]; then
+      if [ ! -d "$local_data_dir" ]; then
+        log "Error: local data directory does not exist: $local_data_dir"
+        exit 1
+      fi
       log "Using local data directory: $local_data_dir"
-      mkdir -p $dl_dir
-      ln -sfv $local_data_dir $dl_dir/VCTK
+      mkdir -p "$dl_dir"
+      ln -sfv "$local_data_dir" "$dl_dir/VCTK"
     else
-      lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} $dl_dir
+      lhotse download vctk --use-edinburgh-vctk-url ${use_edinburgh_vctk_url} "$dl_dir"
     fi
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 60-60: Double quote to prevent globbing and word splitting. (SC2086)
[info] 61-61: Double quote to prevent globbing and word splitting. (SC2086)
[info] 61-61: Double quote to prevent globbing and word splitting. (SC2086)
[info] 63-63: Double quote to prevent globbing and word splitting. (SC2086)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@egs/vctk/TTS/prepare.sh` around lines 58 - 64, In prepare.sh, validate that
the provided local_data_dir exists and is a directory before creating the
symlink: check [ -n "$local_data_dir" ] && [ -d "$local_data_dir" ] (or use test
-d) and if the check fails call log with a clear error and exit non-zero; only
run mkdir -p "$dl_dir" and ln -sfv "$local_data_dir" "$dl_dir/VCTK" when the
directory check passes so invalid paths don't produce confusing errors later.

   fi
 fi
