[Bug] xtts voice generation is better in 0.24.3 than in 0.25 and above #228

C00reNUT · 2024-12-21T13:36:48Z

Describe the bug

Hello, and thank you for maintaining this library. This is probably related to #198: when I run inference with an XTTS model on version 0.24.3, I get much better results than on 0.25.1, so there may still be a bug in the inference code. I did not try exactly the same generation with the same seed, but the quality difference is obvious.

I would provide some samples, but I am using a Czech fine-tuned model, and you could not really hear the difference unless you are a native speaker.

To Reproduce

import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,  # Add custom parameters here
)
torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)

Expected behavior

The outputs should be similar across versions when using the same parameters, allowing for the normal run-to-run sampling variability in the outputs.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "12.4"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.5.1",
        "TTS": "0.24.3",
        "numpy": "1.26.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.11.9",
        "version": "#49-Ubuntu SMP PREEMPT_DYNAMIC Mon Nov  4 02:06:24 UTC 2024"
    }
}

Additional context

No response

eginhard · 2024-12-22T17:28:26Z

I just double-checked with some examples that 0.25.1 and 0.24.3 (and other previous versions) produce exactly the same output when the seed is fixed. If you could share some samples and/or test with a fixed seed, that would be helpful.
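For anyone wanting to repeat the fixed-seed comparison, here is a minimal sketch of a seeding helper. The name `seed_everything` and the seed value are illustrative and not part of the library; the helper seeds Python's `random` module and, if they are installed, NumPy and PyTorch. It may not remove every source of nondeterminism (some GPU kernels can still differ between runs), so treat it as a starting point rather than a guarantee.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the common random number generators (hypothetical helper).

    NumPy and PyTorch are seeded only if they can be imported, so the
    function also works in an environment without them.
    """
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        # Seeds the CPU generator and, when CUDA is in use, the GPU generators.
        torch.manual_seed(seed)
    except ImportError:
        pass


# Same seed, same draw: a quick check that the helper resets the generator.
seed_everything(1234)
first_draw = random.random()
seed_everything(1234)
second_draw = random.random()
```

In the reproduction script above, the call would go immediately before `model.inference(...)`, with the same seed and the same parameters under each installed version, so that the two saved WAV files can be compared directly.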

C00reNUT · 2024-12-28T10:04:20Z

I tried it once again and found that I was using the wrong config for the selected model version. Sorry about the confusion, and happy new year!
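A config mismatch like this can be spotted by diffing the two `config.json` files before comparing audio. The sketch below is only an illustration: `diff_configs` is a hypothetical helper, not part of the library, and the keys in the sample dictionaries are made up rather than taken from a real XTTS config.

```python
from __future__ import annotations

from typing import Any

MISSING = "<missing>"  # placeholder shown when a key exists on one side only


def diff_configs(old: dict[str, Any], new: dict[str, Any], prefix: str = "") -> dict[str, tuple[Any, Any]]:
    """Return {dotted_key: (old_value, new_value)} for every differing entry."""
    absent = object()  # sentinel so that an explicit null is not treated as absent
    changes: dict[str, tuple[Any, Any]] = {}
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        a, b = old.get(key, absent), new.get(key, absent)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested sections and report dotted paths.
            changes.update(diff_configs(a, b, prefix=f"{path}."))
        elif a != b:
            changes[path] = (MISSING if a is absent else a, MISSING if b is absent else b)
    return changes


# Illustrative data only; real configs would be read with json.load().
config_a = {"name": "demo", "audio": {"rate": 22050, "channels": 1}}
config_b = {"name": "demo", "audio": {"rate": 24000, "channels": 1}, "extra": True}
differences = diff_configs(config_a, config_b)
```

With real files, each dictionary would come from `json.load(open(path))` on the config shipped with the corresponding checkpoint; an empty result suggests the two runs used equivalent settings.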

C00reNUT added the "bug" label on Dec 21, 2024

eginhard added the "question" and "XTTS" labels on Dec 22, 2024

C00reNUT closed this as completed on Dec 28, 2024
