Quick Start Guide - F5-TTS Demo Inference

Fastest Way to Get Started

# 1. List available samples
./demo_infer.sh --list

# 2. Run a quick test
./demo_infer.sh --sample 1 --gen-sample 2

# Done! Audio saved to demo_outputs/

All Available Methods

Method 1: Bash Wrapper (Recommended)

./demo_infer.sh --sample 1 --gen-sample 2

✅ Easiest, works with demo samples, clean output

Method 2: Python CLI

python demo_cli.py --sample 1 --gen-sample 2

✅ Python-based, good for scripting
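Since the Python CLI is the scripting-friendly entry point, a sweep over sample pairs can be generated programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and it assumes `demo_cli.py` accepts the same `--nfe` and `--output` flags that `demo_infer.sh` does, which this guide does not confirm.

```python
import itertools
import shlex


def build_command(sample, gen_sample, nfe=None, output=None, script="demo_cli.py"):
    """Return the argv list for one demo run (hypothetical helper)."""
    cmd = ["python", script, "--sample", str(sample), "--gen-sample", str(gen_sample)]
    # ASSUMPTION: demo_cli.py accepts --nfe and --output like demo_infer.sh does.
    if nfe is not None:
        cmd += ["--nfe", str(nfe)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


# Pair every reference sample (1-5) with every *other* sample as generation text.
commands = [
    build_command(ref, gen, output=f"demo_outputs/ref{ref}_gen{gen}.wav")
    for ref, gen in itertools.permutations(range(1, 6), 2)
]

# Preview only; pass each list to subprocess.run(cmd, check=True) to execute it.
for cmd in commands[:3]:
    print(shlex.join(cmd))
```

Building argv lists rather than shell strings avoids quoting problems when the generation text contains spaces.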

Method 3: Preset Demos

python demo_inference.py

✅ Runs 4 demos automatically

Common Commands

# List samples
./demo_infer.sh --list

# Basic test
./demo_infer.sh --sample 1 --gen-sample 2

# Custom text (must be Pinyin with tone numbers)
./demo_infer.sh --sample 1 --gen-text "ni3 hao3 shi4 jie4"

# High quality (slower)
./demo_infer.sh --sample 2 --gen-sample 3 --nfe 64 --cfg 2.5

# Fast test (lower quality)
./demo_infer.sh --sample 1 --gen-sample 2 --nfe 16

# Custom output location
./demo_infer.sh --sample 1 --gen-sample 2 --output my_test.wav

Parameters

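| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| --nfe | 32 | 16-64 | Number of function evaluations (sampling steps); higher gives better quality but is slower |
| --cfg | 2.0 | 1.0-3.0 | Classifier-free guidance strength; higher follows the text more closely |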
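When driving the demos from your own scripts, it can help to reject out-of-range values before starting a slow inference run. This is a minimal sketch using the defaults and ranges documented above; `check_params` is a hypothetical helper, and whether the demo scripts themselves enforce these ranges is not stated in this guide.

```python
# Ranges copied from the parameter table in this guide.
NFE_RANGE = (16, 64)
CFG_RANGE = (1.0, 3.0)


def check_params(nfe=32, cfg=2.0):
    """Return (nfe, cfg) unchanged, or raise ValueError if either is out of range."""
    if not NFE_RANGE[0] <= nfe <= NFE_RANGE[1]:
        raise ValueError(f"--nfe {nfe} is outside {NFE_RANGE[0]}-{NFE_RANGE[1]}")
    if not CFG_RANGE[0] <= cfg <= CFG_RANGE[1]:
        raise ValueError(f"--cfg {cfg} is outside {CFG_RANGE[0]}-{CFG_RANGE[1]}")
    return nfe, cfg


print(check_params())           # defaults pass through
print(check_params(64, 2.5))    # the "high quality" settings from Common Commands
```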

What Are The 5 Demo Samples?

| Sample | Duration | Description |
|--------|----------|-------------|
| 1 | 2.90s | Short utterance |
| 2 | 3.53s | Medium-short |
| 3 | 3.92s | Medium (median) |
| 4 | 4.34s | Medium-long |
| 5 | 5.62s | Long utterance |

All samples are from the Cantonese training dataset with Pinyin transcriptions.

Using Your Own Audio

For custom audio files, use Python directly:

python3 << 'EOF'
import sys
# Adjust this path to the root of your own checkout of the repository
sys.path.append('/home/husrcf/Code/AIAA/AIAA2205-assignment2-F5-TTS/')
import torch, soundfile as sf
from src.f5_tts.infer.utils_infer import load_model, load_vocoder, infer_process
from src.f5_tts.Models.DiT import DiT

# Your audio and text (text MUST be in Pinyin with tone numbers)
ref_audio = "/path/to/your/audio.wav"
ref_text = "your ref text in pinyin"
gen_text = "text to generate in pinyin"

# Use the GPU when one is available, otherwise run on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the trained DiT checkpoint together with its vocabulary file
model = load_model(DiT, dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4),
                   "./ckpts/cantonese_data/model_last.pt",
                   vocab_file="./data/cantonese_data_pinyin/vocab.txt", device=device)
# Load the Vocos vocoder that turns mel spectrograms into waveforms
vocoder = load_vocoder("vocos")

# Synthesize gen_text in the voice of ref_audio; the third return value is not needed here
audio, sr, _ = infer_process(ref_audio, ref_text, gen_text, model, vocoder,
                             mel_spec_type="vocos", nfe_step=32, cfg_strength=2.0, device=device)
sf.write("output.wav", audio, sr)
print(f"✓ Generated {len(audio)/sr:.2f}s audio → output.wav")
EOF

Troubleshooting

"CUDA out of memory"

The script falls back to the CPU automatically if CUDA fails.
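The demo script's own fallback logic is not shown in this guide. If you call inference from your own Python code, a similar fallback could be sketched as below. `run_with_cpu_fallback` and `infer_fn` are hypothetical names; the sketch relies only on the fact that PyTorch raises CUDA out-of-memory errors as a `RuntimeError` subclass whose message contains "out of memory".

```python
def run_with_cpu_fallback(infer_fn, preferred="cuda"):
    """Call infer_fn(device); retry once on the CPU if the device runs out of memory."""
    try:
        return infer_fn(preferred)
    except RuntimeError as err:
        # Re-raise anything that is not an out-of-memory error, and do not
        # retry if we were already running on the CPU.
        if preferred == "cpu" or "out of memory" not in str(err).lower():
            raise
        print(f"{preferred} ran out of memory, retrying on cpu")
        return infer_fn("cpu")


# Stand-in for a real inference call, used here only to demonstrate the control flow.
def fake_infer(device):
    if device == "cuda":
        raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")
    return f"generated on {device}"


print(run_with_cpu_fallback(fake_infer))  # → generated on cpu
```

In real use the model and vocoder would also need to be moved to (or reloaded on) the CPU inside `infer_fn`, and calling `torch.cuda.empty_cache()` before the retry may free GPU memory held by the failed attempt.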

"Error: Checkpoint not found"

Make sure training has finished and the checkpoint file exists: ls -lh ckpts/cantonese_data/model_last.pt
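The same check can be done from Python before loading the model, so a missing file fails fast with a clear message. This is a minimal sketch; `missing_files` is a hypothetical helper, and the two paths are the ones used in the custom-audio example in this guide.

```python
from pathlib import Path

# Paths taken from the "Using Your Own Audio" example, relative to the repo root.
REQUIRED_FILES = [
    "ckpts/cantonese_data/model_last.pt",
    "data/cantonese_data_pinyin/vocab.txt",
]


def missing_files(root=".", required=REQUIRED_FILES):
    """Return the entries of `required` that do not exist as files under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_file()]


missing = missing_files()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Checkpoint and vocab file found.")
```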

Generated audio sounds weird

  • Ensure reference text exactly matches the reference audio
  • Try lowering --cfg to 1.5
  • Text must be in Pinyin format with tone numbers
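A quick format check can catch text that is not in the expected form before it reaches the model. The sketch below assumes each whitespace-separated token is lowercase ASCII letters followed by one tone digit from 1 to 6; that pattern is an assumption for illustration, and the vocabulary file used for training remains the real authority on what the model accepts.

```python
import re

# ASSUMPTION: one syllable = lowercase letters + a single tone digit (1-6).
SYLLABLE = re.compile(r"[a-z]+[1-6]")


def invalid_tokens(text):
    """Return the whitespace-separated tokens that do not look like 'syllable + tone'."""
    return [tok for tok in text.split() if not SYLLABLE.fullmatch(tok)]


print(invalid_tokens("ni3 hao3 shi4 jie4"))  # → []
print(invalid_tokens("ni3 hao shi4 世界"))    # → ['hao', '世界']
```

Note that punctuation attached to a syllable (for example `jie4,`) is also flagged under this pattern; loosen the regular expression if your transcriptions include punctuation.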

Permission denied

chmod +x demo_infer.sh
./demo_infer.sh --list

More Documentation

  • NATIVE_CLI_GUIDE.md - All inference methods
  • CLI_USAGE_GUIDE.md - Detailed CLI usage
  • DEMO_README.md - Complete demo documentation
  • DEMO_SETUP_SUMMARY.md - Overview of what was created

Note About Training Data

These demos use samples from the training set. This is fine for:

  • ✅ Testing that everything works
  • ✅ Demonstrating model capabilities
  • ✅ Quick sanity checks

But not ideal for:

  • ❌ Evaluating generalization
  • ❌ Claiming model performance on unseen data

For proper evaluation, use a held-out test set.
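One simple way to obtain a held-out set is to split the utterance IDs with a fixed seed before training, so the same utterances are excluded every time. This is a minimal sketch with made-up IDs; `split_held_out` is a hypothetical helper and not part of the repository.

```python
import random


def split_held_out(utterance_ids, held_out_fraction=0.1, seed=0):
    """Deterministically split IDs into (train, held_out) lists."""
    ids = sorted(utterance_ids)            # sort first so input order does not matter
    random.Random(seed).shuffle(ids)       # seeded shuffle gives a reproducible split
    n_held = max(1, int(len(ids) * held_out_fraction))
    return ids[n_held:], ids[:n_held]


# Illustrative IDs only; replace with the IDs from your own dataset.
all_ids = [f"utt_{i:04d}" for i in range(100)]
train_ids, test_ids = split_held_out(all_ids)
print(len(train_ids), len(test_ids))  # → 90 10
```

The split has to happen before training for the held-out utterances to be truly unseen; if the dataset has several speakers, splitting by speaker rather than by utterance gives a stricter test.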


Quick Start: ./demo_infer.sh --list then ./demo_infer.sh --sample 1 --gen-sample 2