# 1. List available samples
./demo_infer.sh --list
# 2. Run a quick test
./demo_infer.sh --sample 1 --gen-sample 2
# Done! Audio saved to demo_outputs/

- `./demo_infer.sh --sample 1 --gen-sample 2` ✅ Easiest, works with demo samples, clean output
- `python demo_cli.py --sample 1 --gen-sample 2` ✅ Python-based, good for scripting
- `python demo_inference.py` ✅ Runs 4 demos automatically
# List samples
./demo_infer.sh --list
# Basic test
./demo_infer.sh --sample 1 --gen-sample 2
# Custom text (must be Pinyin with tone numbers)
./demo_infer.sh --sample 1 --gen-text "ni3 hao3 shi4 jie4"
# High quality (slower)
./demo_infer.sh --sample 2 --gen-sample 3 --nfe 64 --cfg 2.5
# Fast test (lower quality)
./demo_infer.sh --sample 1 --gen-sample 2 --nfe 16
# Custom output location
./demo_infer.sh --sample 1 --gen-sample 2 --output my_test.wav

| Parameter | Default | Range | Description |
|---|---|---|---|
| `--nfe` | 32 | 16-64 | Quality (higher = better, slower) |
| `--cfg` | 2.0 | 1.0-3.0 | Text faithfulness |

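When the demo is driven from a script, the ranges in the parameter table can be checked before anything is launched. The sketch below is one possible way to do that. The helper `build_demo_command` is hypothetical and not part of the repository; it only uses flags that appear in this guide. Whether `demo_infer.sh` itself rejects out-of-range values is not stated here, so the check is an assumption layered on top.

```python
import shlex

# Ranges copied from the parameter table; the helper itself is hypothetical.
NFE_RANGE = (16, 64)
CFG_RANGE = (1.0, 3.0)


def build_demo_command(sample, gen_sample, nfe=32, cfg=2.0, output=None):
    """Return the argv list for one ./demo_infer.sh run, rejecting out-of-range values."""
    if not NFE_RANGE[0] <= nfe <= NFE_RANGE[1]:
        raise ValueError(f"--nfe must be in {NFE_RANGE}, got {nfe}")
    if not CFG_RANGE[0] <= cfg <= CFG_RANGE[1]:
        raise ValueError(f"--cfg must be in {CFG_RANGE}, got {cfg}")
    cmd = ["./demo_infer.sh", "--sample", str(sample), "--gen-sample", str(gen_sample),
           "--nfe", str(nfe), "--cfg", str(cfg)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


# Same settings as the "High quality (slower)" example
high_quality = build_demo_command(2, 3, nfe=64, cfg=2.5)
print(shlex.join(high_quality))
# ./demo_infer.sh --sample 2 --gen-sample 3 --nfe 64 --cfg 2.5
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.
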
| Sample | Duration | Description |
|---|---|---|
| 1 | 2.90s | Short utterance |
| 2 | 3.53s | Medium-short |
| 3 | 3.92s | Medium (median) |
| 4 | 4.34s | Medium-long |
| 5 | 5.62s | Long utterance |
All samples are from the Cantonese training dataset with Pinyin transcriptions.
For custom audio files, use Python directly:
python3 << 'EOF'
import sys
sys.path.append('/home/husrcf/Code/AIAA/AIAA2205-assignment2-F5-TTS/')
import torch, soundfile as sf
from src.f5_tts.infer.utils_infer import load_model, load_vocoder, infer_process
from src.f5_tts.Models.DiT import DiT
# Your audio and text (text MUST be in Pinyin with tone numbers)
ref_audio = "/path/to/your/audio.wav"
ref_text = "your ref text in pinyin"
gen_text = "text to generate in pinyin"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_model(
    DiT, dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4),
    "./ckpts/cantonese_data/model_last.pt",
    vocab_file="./data/cantonese_data_pinyin/vocab.txt", device=device,
)
vocoder = load_vocoder("vocos")
audio, sr, _ = infer_process(
    ref_audio, ref_text, gen_text, model, vocoder,
    mel_spec_type="vocos", nfe_step=32, cfg_strength=2.0, device=device,
)
sf.write("output.wav", audio, sr)
print(f"✓ Generated {len(audio)/sr:.2f}s audio → output.wav")
EOF

The script will automatically use CPU if CUDA fails.

Make sure the model is trained: `ls -lh ckpts/cantonese_data/model_last.pt`
- Ensure reference text exactly matches the reference audio
- Try lowering `--cfg` to 1.5
- Text must be in Pinyin format with tone numbers
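
The Pinyin requirement can be checked before running inference. The sketch below flags tokens that do not look like a syllable followed by a tone number. The pattern is an assumption and not taken from the project: it treats each whitespace-separated token as lowercase ASCII letters plus one digit from 1 to 6, and the authoritative symbol set is whatever `vocab.txt` contains. `invalid_tokens` is a hypothetical helper name.

```python
import re

# Assumption: a valid token is lowercase ASCII letters followed by one tone digit.
# The 1-6 digit range is a guess; adjust it to the tone inventory in vocab.txt.
TOKEN_RE = re.compile(r"[a-z]+[1-6]")


def invalid_tokens(text):
    """Return the whitespace-separated tokens that lack the letters+tone-digit shape."""
    return [tok for tok in text.split() if not TOKEN_RE.fullmatch(tok)]


print(invalid_tokens("ni3 hao3 shi4 jie4"))  # []
print(invalid_tokens("ni3 hao shi4 Jie4"))   # ['hao', 'Jie4']
```

An empty result only means the text has the expected shape; it does not confirm that every syllable exists in the model vocabulary.
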
chmod +x demo_infer.sh
./demo_infer.sh --list

- NATIVE_CLI_GUIDE.md - All inference methods
- CLI_USAGE_GUIDE.md - Detailed CLI usage
- DEMO_README.md - Complete demo documentation
- DEMO_SETUP_SUMMARY.md - Overview of what was created
These demos use samples from the training set. This is fine for:
- ✅ Testing that everything works
- ✅ Demonstrating model capabilities
- ✅ Quick sanity checks
But not ideal for:
- ❌ Evaluating generalization
- ❌ Claiming model performance on unseen data
For proper evaluation, use a held-out test set.
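
One way to get a held-out set is to reserve a fixed fraction of utterance IDs before training and never include them in the training data. The sketch below shows a seeded, reproducible split. The function name, the 10% fraction, and the `utt_0000` style IDs are illustrative assumptions, not part of this project's data pipeline.

```python
import random


def split_held_out(utterance_ids, held_out_fraction=0.1, seed=0):
    """Shuffle IDs with a fixed seed and reserve a fraction that training never sees."""
    ids = sorted(utterance_ids)  # sort first so the split does not depend on input order
    random.Random(seed).shuffle(ids)
    n_held_out = max(1, round(len(ids) * held_out_fraction))
    return ids[n_held_out:], ids[:n_held_out]


# Hypothetical IDs; in practice these would come from the dataset's metadata file.
train_ids, test_ids = split_held_out([f"utt_{i:04d}" for i in range(100)])
print(len(train_ids), len(test_ids))  # 90 10
```

Because the seed is fixed, rerunning the split yields the same held-out IDs, so evaluation numbers stay comparable across training runs.
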
Quick Start: `./demo_infer.sh --list`, then `./demo_infer.sh --sample 1 --gen-sample 2`