I just double-checked on a few examples that 0.25.1 and 0.24.3 (and other previous versions) produce exactly the same output when the seed is fixed. If you could share some samples and/or test with a fixed seed, that would be helpful.
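A fixed-seed check along these lines could look like the sketch below. `seed_everything` is a hypothetical helper, not part of the TTS package; it only uses the standard seeding calls from `random`, NumPy, and PyTorch, and skips the latter two when they are not installed.

```python
import random


def seed_everything(seed: int = 0) -> None:
    """Seed the common RNGs that can influence sampling (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```

Calling `seed_everything(1234)` right before `model.inference(...)` in both the 0.24.3 and 0.25.1 environments should make the sampling comparable. Even then, identical results across different PyTorch or CUDA versions are not guaranteed, so small numerical differences may remain.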
Describe the bug
Hello, thank you for maintaining this library. This is probably related to #198: when I use version 0.24.3 for inference with an XTTS model, I get much better results than with 0.25.1, so there must still be a bug in the inference. I didn't try exactly the same generation with the same seed, but the quality difference is obvious.
I would provide some samples, but I am using a Czech fine-tuned model, and you couldn't really hear the difference unless you are a native speaker...
To Reproduce
import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,  # Add custom parameters here
)
torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)
Expected behavior
The outputs should be similar when using the same parameters, accounting for diffusion variability in the outputs.
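To make "similar" measurable, the waveforms produced by the two versions could be compared sample by sample. The sketch below is only an illustration: `compare_waveforms` and its `tol` threshold are hypothetical, and it assumes each waveform has been converted to a flat sequence of floats (for example by dumping `out["wav"]` to disk in each environment and calling `.tolist()` after loading it).

```python
from typing import Optional, Sequence, TypedDict


class Comparison(TypedDict):
    same_length: bool
    max_abs_diff: Optional[float]
    match: bool


def compare_waveforms(a: Sequence[float], b: Sequence[float], tol: float = 1e-4) -> Comparison:
    """Compare two waveforms sample by sample (hypothetical helper)."""
    if len(a) != len(b):
        # Different lengths usually mean the sampling itself diverged.
        return {"same_length": False, "max_abs_diff": None, "match": False}
    # Largest absolute per-sample difference; 0.0 for two empty inputs.
    max_diff = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    return {"same_length": True, "max_abs_diff": max_diff, "match": max_diff <= tol}
```

With a fixed seed, a result of `match: True` would suggest the two versions behave the same for that input, while a length mismatch or a large `max_abs_diff` would point to a real difference worth reporting along with the samples.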
Logs
No response
Environment
Additional context
No response