Predicted phase not in range [-pi .. pi], but in range [-1 .. 1] #16

Open
kgoba opened this issue Apr 26, 2023 · 2 comments

@kgoba

kgoba commented Apr 26, 2023

The generator's phase output can currently only range from -1 to 1. That is not enough, because stft.inverse() later expects the full phase in radians (either 0..2*pi or -pi..pi).

The paper says, somewhat cryptically, that "we apply a sine activation function to represent the periodic characteristics of the phase spectrogram". In any case, the current implementation is faulty, since it cannot represent the full range of possible phases:

phase = torch.sin(x[:, self.post_n_fft // 2 + 1:, :])
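To put a number on the limitation, here is a minimal sketch using only the standard library. The helper name `reachable_fraction` is made up for this illustration and is not part of the repository; it measures how much of the unit circle is reachable when the phase is confined to a symmetric interval.

```python
import math

def reachable_fraction(max_abs_phase: float) -> float:
    """Fraction of the unit circle covered by phases in [-max_abs_phase, +max_abs_phase]."""
    covered = min(2.0 * max_abs_phase, 2.0 * math.pi)
    return covered / (2.0 * math.pi)

# sin(.) is bounded by 1, so the phase handed to the inverse STFT is at most
# 1 radian in magnitude.
print(f"sin output only: {reachable_fraction(1.0):.1%} of the circle")
print(f"scaled by pi:    {reachable_fraction(math.pi):.1%} of the circle")

# With |phase| <= 1 rad, the real part cos(phase) of exp(j*phase) never drops
# below cos(1), so every predicted bin points into the right half-plane.
min_real_part = math.cos(1.0)
print(f"smallest possible cos(phase): {min_real_part:.3f}")
```

In other words, a phase restricted to [-1, 1] radians covers a bit less than a third of the circle and can never produce a bin with a negative real part.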

As a suggestion, either try scaling the output by pi (so that it covers -pi..pi), or predict sin(phase) and cos(phase) directly in the generator (the predicted pair can be normalized by dividing both values by sqrt(sin(phase)**2 + cos(phase)**2), which puts them back on the unit circle).
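A rough sketch of the two options, assuming PyTorch. The function names and tensor shapes are hypothetical and only for illustration; this is not code from the repository, and nothing here says either variant trains well.

```python
import math
import torch

def phase_scaled_sin(raw: torch.Tensor) -> torch.Tensor:
    """Option 1: keep the sine activation but rescale it to [-pi, pi]."""
    return math.pi * torch.sin(raw)

def phase_from_sin_cos(raw_sin: torch.Tensor, raw_cos: torch.Tensor, eps: float = 1e-8):
    """Option 2: treat two outputs as an unnormalized (sin, cos) pair.

    Dividing by the Euclidean norm projects the pair onto the unit circle;
    atan2 then recovers a phase in [-pi, pi].
    """
    norm = torch.clamp(torch.sqrt(raw_sin ** 2 + raw_cos ** 2), min=eps)
    sin_p, cos_p = raw_sin / norm, raw_cos / norm
    phase = torch.atan2(sin_p, cos_p)
    return phase, sin_p, cos_p

torch.manual_seed(0)
# Hypothetical shape: (batch, frequency bins, frames).
raw = torch.randn(2, 9, 5) * 3.0
phase_a = phase_scaled_sin(raw)
phase_b, sin_p, cos_p = phase_from_sin_cos(torch.randn(2, 9, 5), torch.randn(2, 9, 5))

print(phase_a.abs().max().item(), phase_b.abs().max().item())
```

With the second option the explicit phase may not even be needed: given a magnitude `mag`, the real and imaginary parts of the spectrogram are simply `mag * cos_p` and `mag * sin_p`.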

@SynthAether

This is a good point, and I have looked into it in my own implementation of iSTFTNet, which initially output a phase in [-pi:+pi] and which I then switched to output a phase in [-1:+1]. In my case, using pi didn't make the synthesis sound better. In fact, I noticed a small degradation compared to [-1:+1], although that could have been down to chance in training. This was very puzzling. I even went as far as adding a trainable scale factor so that the network would learn the optimal range, which in my case stabilized at [-2.5:+2.5], but again it was hard to hear any improvement. I should stress that this was tested on a different, though similar, implementation. I don't know how it applies to Rishikesh's excellent implementation.
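For readers wondering what a trainable scale factor could look like, here is a minimal sketch assuming PyTorch. The module name `ScaledSinPhase` and its `init_scale` argument are hypothetical, and this is not the code used in the experiment described above.

```python
import torch
import torch.nn as nn

class ScaledSinPhase(nn.Module):
    """Sine phase activation with a single learnable scale (hypothetical sketch)."""

    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        # One scalar shared by all bins and frames; updated by the optimizer.
        self.scale = nn.Parameter(torch.tensor(float(init_scale)))

    def forward(self, raw: torch.Tensor) -> torch.Tensor:
        # Output lies in [-|scale|, +|scale|].
        return self.scale * torch.sin(raw)

torch.manual_seed(0)
head = ScaledSinPhase(init_scale=1.0)
raw = torch.randn(2, 9, 5)  # hypothetical (batch, bins, frames)
phase = head(raw)

# The scale receives a gradient like any other parameter.
phase.sum().backward()
print(head.scale.item(), head.scale.grad.item())
```

Starting from `init_scale=1.0` reproduces the plain sine activation, so training can begin from the configuration that is reported to work and drift from there.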

@yl4579

yl4579 commented Jun 16, 2023

This is insanely weird. I have tried training with the phase multiplied by torch.pi, but it fails to converge, whereas the range from -1 to 1 works very well: I could obtain human-level quality on LJSpeech when combining it with AdaIN and Snake activation functions for StyleTTS 2. I have no explanation for why this happens. It makes no sense to me. If anyone has come up with a reason, please let us know.

3 participants