
[Bug] tts.tts_with_vc_to_file cannot use cpu #3797

Closed
pieris98 opened this issue Jun 21, 2024 · 7 comments · Fixed by idiap/coqui-ai-TTS#252
Labels: bug, wontfix


@pieris98

Describe the bug

Similar to #3787: when running the xtts_v2 model with voice cloning through the additional voice conversion model (FreeVC), using device='cpu' results in the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)` 

To Reproduce

import torch
from TTS.api import TTS

device = "cpu"
print(device)

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_with_vc_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)

Expected behavior

The inference should run without using CUDA or reporting any CUDA/CUDNN/GPU-related errors.

Logs

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[9], line 12
      7 # !tts --text "hello world" \
      8 # --model_name "tts_models/en/ljspeech/glow-tts" \
      9 # --out_path output.wav
     11 tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
---> 12 tts.tts_with_vc_to_file(text="Hello world!", speaker='Andrew Chipper',speaker_wav="/home/cherry/dev/coqui/steve_taylor.wav", language="en",file_path="/home/cherry/dev/coqui/outputs/xttsv2_en_output.wav")

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:455, in TTS.tts_with_vc_to_file(self, text, language, speaker_wav, file_path, speaker, split_sentences)
    423 def tts_with_vc_to_file(
    424     self,
    425     text: str,
   (...)
    430     split_sentences: bool = True,
    431 ):
    432     """Convert text to speech with voice conversion and save to file.
    433 
    434     Check `tts_with_vc` for more details.
   (...)
    453             applicable to the 🐸TTS models. Defaults to True.
    454     """
--> 455     wav = self.tts_with_vc(
    456         text=text, language=language, speaker_wav=speaker_wav, speaker=speaker, split_sentences=split_sentences
    457     )
    458     save_wav(wav=wav, path=file_path, sample_rate=self.voice_converter.vc_config.audio.output_sample_rate)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:419, in TTS.tts_with_vc(self, text, language, speaker_wav, speaker, split_sentences)
    415     self.tts_to_file(
    416         text=text, speaker=speaker, language=language, file_path=fp.name, split_sentences=split_sentences
    417     )
    418 if self.voice_converter is None:
--> 419     self.load_vc_model_by_name("voice_conversion_models/multilingual/vctk/freevc24")
    420 wav = self.voice_converter.voice_conversion(source_wav=fp.name, target_wav=speaker_wav)
    421 return wav

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:157, in TTS.load_vc_model_by_name(self, model_name, gpu)
    155 self.model_name = model_name
    156 model_path, config_path, _, _, _ = self.download_model_by_name(model_name)
--> 157 self.voice_converter = Synthesizer(vc_checkpoint=model_path, vc_config=config_path, use_cuda=gpu)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/utils/synthesizer.py:101, in Synthesizer.__init__(self, tts_checkpoint, tts_config_path, tts_speakers_file, tts_languages_file, vocoder_checkpoint, vocoder_config, encoder_checkpoint, encoder_config, vc_checkpoint, vc_config, model_dir, voice_dir, use_cuda)
     98     self.output_sample_rate = self.vocoder_config.audio["sample_rate"]
    100 if vc_checkpoint:
--> 101     self._load_vc(vc_checkpoint, vc_config, use_cuda)
    102     self.output_sample_rate = self.vc_config.audio["output_sample_rate"]
    104 if model_dir:

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/utils/synthesizer.py:139, in Synthesizer._load_vc(self, vc_checkpoint, vc_config_path, use_cuda)
    137 # pylint: disable=global-statement
    138 self.vc_config = load_config(vc_config_path)
--> 139 self.vc_model = setup_vc_model(config=self.vc_config)
    140 self.vc_model.load_checkpoint(self.vc_config, vc_checkpoint)
    141 if use_cuda:

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/__init__.py:16, in setup_model(config, samples)
     14 if "model" in config and config["model"].lower() == "freevc":
     15     MyModel = importlib.import_module("TTS.vc.models.freevc").FreeVC
---> 16     model = MyModel.init_from_config(config, samples)
     17 return model

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:552, in FreeVC.init_from_config(config, samples, verbose)
    550 @staticmethod
    551 def init_from_config(config: FreeVCConfig, samples: Union[List[List], List[Dict]] = None, verbose=True):
--> 552     model = FreeVC(config)
    553     return model

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:370, in FreeVC.__init__(self, config, speaker_manager)
    368     self.enc_spk = SpeakerEncoder(model_hidden_size=self.gin_channels, model_embedding_size=self.gin_channels)
    369 else:
--> 370     self.load_pretrained_speaker_encoder()
    372 self.wavlm = get_wavlm()

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:381, in FreeVC.load_pretrained_speaker_encoder(self)
    379 """Load pretrained speaker encoder model as mentioned in the paper."""
    380 print(" > Loading pretrained speaker encoder model ...")
--> 381 self.enc_spk_ex = SpeakerEncoderEx(
    382     "https://github.com/coqui-ai/TTS/releases/download/v0.13.0_models/speaker_encoder.pt"
    383 )

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/modules/freevc/speaker_encoder/speaker_encoder.py:45, in SpeakerEncoder.__init__(self, weights_fpath, device, verbose)
     42 checkpoint = load_fsspec(weights_fpath, map_location="cpu")
     44 self.load_state_dict(checkpoint["model_state"], strict=False)
---> 45 self.to(device)
     47 if verbose:
     48     print("Loaded the voice encoder model on %s in %.2f seconds." % (device.type, timer() - start))

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/module.py:1173, in Module.to(self, *args, **kwargs)
   1170         else:
   1171             raise
-> 1173 return self._apply(convert)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/module.py:779, in Module._apply(self, fn, recurse)
    777 if recurse:
    778     for module in self.children():
--> 779         module._apply(fn)
    781 def compute_should_use_set_data(tensor, tensor_applied):
    782     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    783         # If the new tensor has compatible tensor type as the existing tensor,
    784         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    789         # global flag to let the user control whether they want the future
    790         # behavior of overwriting the existing tensor or not.

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:222, in RNNBase._apply(self, fn, recurse)
    217 ret = super()._apply(fn, recurse)
    219 # Resets _flat_weights
    220 # Note: be v. careful before removing this, as 3rd party device types
    221 # likely rely on this behavior to properly .to() modules like LSTM.
--> 222 self._init_flat_weights()
    224 return ret

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:158, in RNNBase._init_flat_weights(self)
    154 self._flat_weights = [getattr(self, wn) if hasattr(self, wn) else None
    155                       for wn in self._flat_weights_names]
    156 self._flat_weight_refs = [weakref.ref(w) if w is not None else None
    157                           for w in self._flat_weights]
--> 158 self.flatten_parameters()

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:209, in RNNBase.flatten_parameters(self)
    207 if self.proj_size > 0:
    208     num_weights += 1
--> 209 torch._cudnn_rnn_flatten_weight(
    210     self._flat_weights, num_weights,
    211     self.input_size, rnn.get_cudnn_mode(self.mode),
    212     self.hidden_size, self.proj_size, self.num_layers,
    213     self.batch_first, bool(self.bidirectional))

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3060 Laptop GPU"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.3.1+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.9.0",
        "version": "#1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)"
    }
}

Additional context

Note: even though I do have CUDA and an NVIDIA GPU on my laptop, I want to run on the CPU because my GPU does not have enough VRAM for the model I wanted to use.
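A workaround that forces everything onto the CPU (my own sketch, not an official fix): judging from the traceback, the FreeVC speaker encoder selects CUDA by itself whenever PyTorch can see a GPU, so hiding the GPU from PyTorch before it is imported should avoid the crash:

# Workaround sketch; assumes nothing else in this process needs CUDA.
# Hide the GPU from PyTorch *before* torch/TTS are imported, so components
# that auto-select a device fall back to the CPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
tts.tts_with_vc_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)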

pieris98 added the bug label on Jun 21, 2024
@eginhard (Contributor)

The XTTS model natively supports voice cloning, so just use the following (and pick only one of speaker and speaker_wav, depending on which you need):

from TTS.api import TTS

device = "cpu"
print(device)

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)

This should run correctly on the CPU. The with_vc variant would pass the already-cloned output through an additional voice conversion model (FreeVC), but that's not necessary here and probably leads to worse results.
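For reference, tts_with_vc does roughly the following under the hood (a simplified paraphrase of the code visible in the traceback above, not the exact library source):

import tempfile

def tts_with_vc(self, text, language, speaker_wav, speaker, split_sentences=True):
    # 1. Synthesize with the TTS model (XTTS here) into a temporary file.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as fp:
        self.tts_to_file(text=text, speaker=speaker, language=language,
                         file_path=fp.name, split_sentences=split_sentences)
    # 2. Lazily load FreeVC -- the step that crashes on CPU in this issue.
    if self.voice_converter is None:
        self.load_vc_model_by_name("voice_conversion_models/multilingual/vctk/freevc24")
    # 3. Convert the synthesized audio towards the target speaker's voice.
    return self.voice_converter.voice_conversion(source_wav=fp.name, target_wav=speaker_wav)

So with XTTS you already get the cloning in step 1, and steps 2-3 just add another model on top.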

@pieris98 (Author)

Hey Enno, thanks a lot for the pointer. I didn't realise that some models have voice cloning built in, rather than needing tts.tts_with_vc_to_file().

I was then trying to run the model in tts-server and noticed issue #3369, so I just wanted to point it out, as it seems more important to solve in the codebase.


stale bot commented Aug 2, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also take a look at our discussion channels.

stale bot added the wontfix label on Aug 2, 2024
@pieris98 (Author)

pieris98 commented Aug 2, 2024

Dear coqui devs/community,
Any update on this? I'm still encountering this issue.
Please unmark it as stale.

stale bot removed the wontfix label on Aug 2, 2024

stale bot commented Sep 18, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also take a look at our discussion channels.

stale bot added the wontfix label on Sep 18, 2024
stale bot closed this as completed on Jan 3, 2025
@eginhard (Contributor)

@pieris98 Our fork now supports all Coqui TTS models in the server. Specifying a speaker_wav file is not possible yet; this is tracked in idiap#254.

@pieris98 (Author)

Thanks for letting me know!
