layout | background-class | body-class | title | summary | category | image | author | tags | github-link | github-id | featured_image_1 | featured_image_2 | accelerator | order | demo-model-link | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
hub_detail |
hub-background |
hub |
Tacotron 2 |
The Tacotron 2 model for generating mel spectrograms from text |
researchers |
nvidia_logo.png |
NVIDIA |
|
NVIDIA/DeepLearningExamples |
tacotron2_diagram.png |
no-image |
cuda |
10 |
The Tacotron 2 and WaveGlow models form a text-to-speech system that synthesizes natural-sounding speech from raw text without any additional prosody information. The Tacotron 2 model uses an encoder-decoder architecture to generate mel spectrograms from input text. WaveGlow (also available via torch.hub) is a flow-based model that generates speech from the mel spectrograms.
This pre-trained Tacotron 2 model is implemented differently from the one described in the paper: the model provided here uses Dropout instead of Zoneout to regularize its LSTM layers.
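To make the difference between the two regularizers concrete, here is a minimal pure-Python sketch applied to a single hidden-state vector. It is illustrative only and is not the code used in the pre-trained model: the function names, the list-based vectors, and the probability `p` are assumptions made for this example. Dropout zeroes each unit with probability `p` and rescales the survivors, while Zoneout lets each unit keep its previous-timestep value with probability `p`.

```python
import random

def dropout(h_new, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    keep = 1.0 - p
    return [0.0 if rng.random() < p else x / keep for x in h_new]

def zoneout(h_prev, h_new, p, rng):
    """Zoneout: each unit keeps its previous-timestep value with probability p."""
    return [old if rng.random() < p else new for old, new in zip(h_prev, h_new)]

rng = random.Random(0)            # fixed seed so the run is repeatable
h_prev = [0.5, -0.2, 0.9, 0.1]    # hypothetical hidden state at step t-1
h_new = [0.4, 0.3, -0.7, 0.8]     # hypothetical candidate state at step t

print("dropout:", dropout(h_new, 0.5, rng))
print("zoneout:", zoneout(h_prev, h_new, 0.5, rng))
```

Both are applied only during training; at inference time the model is put in evaluation mode and neither mask is sampled.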
In the example below:
- The pre-trained Tacotron2 and Waveglow models are loaded from torch.hub.
- Given a tensor representation of an input text ("Hello world, I missed you so much"), Tacotron2 generates a mel spectrogram like the one shown in the figure.
- Waveglow generates sound from the mel spectrogram.
- The output sound is saved in an 'audio.wav' file.
To run the example, a few extra Python packages must be installed. They are needed for preprocessing the text and audio, as well as for display and input/output.
pip install numpy scipy librosa unidecode inflect
apt-get update
apt-get install -y libsndfile1
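Before loading the models, it can help to confirm that the packages are importable. The following is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the list assumes that each package's import name matches its pip name.

```python
import importlib.util

# Import names of the packages installed above
# (assumed to be identical to their pip names).
REQUIRED = ["numpy", "scipy", "librosa", "unidecode", "inflect"]

def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```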
Load the Tacotron2 model pre-trained on the LJ Speech dataset and prepare it for inference:
import torch
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()
Load the pre-trained WaveGlow model:
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()
Now, let's make the model say:
text = "Hello world, I missed you so much."
Format the input using the utility methods:
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])
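Conceptually, this step turns each text into a sequence of integer symbol IDs and pads the batch to a common length. The sketch below illustrates that idea in plain Python. The symbol table, `encode_text`, and `pad_batch` are hypothetical and do not reproduce the actual `nvidia_tts_utils` implementation, which presumably also normalizes the text and returns tensors rather than lists.

```python
# Hypothetical symbol table; ID 0 is reserved for padding.
SYMBOLS = " abcdefghijklmnopqrstuvwxyz!',.?"
SYMBOL_TO_ID = {ch: i for i, ch in enumerate(SYMBOLS, start=1)}

def encode_text(text):
    """Map characters to integer IDs, lowercasing and skipping unknown symbols."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one; also return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

batch, batch_lengths = pad_batch([encode_text("Hello world"), encode_text("Hi")])
print(batch)
print(batch_lengths)
```

Returning the true lengths alongside the padded batch lets a model ignore the padding positions.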
Run the chained models:
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050
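As a rough sanity check, the length of the generated clip follows from the number of samples and the sample rate. The sketch below also relates mel frames to audio samples. The hop length of 256 samples per frame and the frame count are assumptions made for illustration and are not stated on this page, so treat the result as approximate.

```python
SAMPLE_RATE = 22050  # Hz, the rate used when saving the audio
HOP_LENGTH = 256     # assumed number of audio samples per mel frame

def duration_seconds(n_samples, sample_rate=SAMPLE_RATE):
    """Clip length in seconds for a given number of audio samples."""
    return n_samples / sample_rate

def expected_samples(n_mel_frames, hop_length=HOP_LENGTH):
    """Approximate number of audio samples produced from n_mel_frames frames."""
    return n_mel_frames * hop_length

n_frames = 200  # hypothetical mel spectrogram length
n_samples = expected_samples(n_frames)
print(n_samples, "samples, about", round(duration_seconds(n_samples), 2), "seconds")
```

With the real output, `duration_seconds(len(audio_numpy))` gives the length of the synthesized clip.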
You can save it to a file and listen to it:
from scipy.io.wavfile import write
write("audio.wav", rate, audio_numpy)
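`scipy.io.wavfile.write` stores floating-point arrays as floating-point WAV data, which some players may not handle. If that is a concern, one option is to convert to 16-bit PCM first. The following is a minimal sketch using only the standard library; `write_pcm16` is a hypothetical helper, and it assumes the samples lie in [-1, 1], clipping anything outside that range.

```python
import array
import math
import os
import tempfile
import wave

def write_pcm16(path, samples, sample_rate):
    """Write an iterable of floats in [-1, 1] as a mono 16-bit PCM WAV file."""
    pcm = array.array("h")  # signed 16-bit integers
    for x in samples:
        clipped = max(-1.0, min(1.0, float(x)))  # guard against overflow
        pcm.append(int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())

# Demo: one second of a 440 Hz tone written to the temporary directory.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
demo_path = os.path.join(tempfile.gettempdir(), "tone_pcm16.wav")
write_pcm16(demo_path, tone, 22050)
```

For the synthesized speech, the call would be `write_pcm16("audio.wav", audio_numpy, rate)`.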
Alternatively, you can play it right away in a notebook with IPython:
from IPython.display import Audio
Audio(audio_numpy, rate=rate)
More detailed information on model inputs and outputs, training recipes, inference, and performance is available on GitHub and/or NGC.