-
For alignment: we first calculate the log-probability of z_p for every phoneme at every timestep in the spectrogram, using the projections m_p and logs_p. Then we use MAS (Monotonic Alignment Search) to find a path that maximizes the total log-probability of the chosen z_p over all timesteps. After that, each timestep has a single m_p and logs_p assigned to it. This spectrogram-length sequence of (m_p, logs_p) defines the distribution our z_p is supposed to be sampled from — hence the KL loss.
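A minimal NumPy sketch of the two steps above — building the per-(phoneme, frame) Gaussian log-likelihood grid, then running the MAS dynamic program. Function and variable names here are my own for illustration, not taken from the VITS codebase:

```python
import numpy as np

def gaussian_log_prob(z, m, logs):
    """Log-density of each frame of z under N(m_p, exp(logs_p)^2),
    summed over channels.
    z: (C, T_spec); m, logs: (C, T_text) -> returns (T_text, T_spec)."""
    z = z[:, None, :]        # (C, 1, T_spec)
    m = m[:, :, None]        # (C, T_text, 1)
    logs = logs[:, :, None]  # (C, T_text, 1)
    ll = -0.5 * np.log(2 * np.pi) - logs \
         - 0.5 * ((z - m) ** 2) * np.exp(-2 * logs)
    return ll.sum(axis=0)    # (T_text, T_spec)

def monotonic_alignment_search(log_p):
    """Dynamic program over the (phoneme, frame) grid.
    Finds the monotonic, non-skipping path that maximizes the
    total log-probability. log_p: (T_text, T_spec), T_text <= T_spec."""
    T_text, T_spec = log_p.shape
    Q = np.full((T_text, T_spec), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for t in range(1, T_spec):
        # a path can reach phoneme s at frame t only if s <= t
        for s in range(min(t + 1, T_text)):
            stay = Q[s, t - 1]                              # repeat phoneme s
            advance = Q[s - 1, t - 1] if s > 0 else -np.inf # move to next phoneme
            Q[s, t] = log_p[s, t] + max(stay, advance)
    # backtrack from the last phoneme at the last frame
    path = np.zeros((T_text, T_spec), dtype=np.int64)
    s = T_text - 1
    for t in range(T_spec - 1, -1, -1):
        path[s, t] = 1
        # step down when forced (s == t) or when it scores better
        if t > 0 and s > 0 and (s == t or Q[s - 1, t - 1] >= Q[s, t - 1]):
            s -= 1
    return path  # hard alignment: path[s, t] == 1 iff frame t belongs to phoneme s
```

With the resulting 0/1 `path`, `m_p @ path.T` and `logs_p @ path.T`-style gathers expand the phoneme statistics to spectrogram length, giving the per-frame Gaussian parameters that the KL term is computed against.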
-
I have drawn a simple diagram explaining the VITS architecture.
Following VITS, NaturalSpeech-1 uses a very similar architecture (https://arxiv.org/pdf/2205.04421.pdf).
The DurationPredictor of VITS is also a very interesting architecture in its own right. If anyone has questions or wants to go in depth, I'm happy to help.