Objective evaluation #3

Open
yxlu-0102 opened this issue Oct 20, 2023 · 3 comments

Comments

@yxlu-0102

I synthesised waveforms with your official checkpoint on the test set of VCTK-Corpus-0.92, which contains the audio clips of the last 8 speakers.

I calculated the LSD and SNR scores between the generated and reference test-set audio, but the resulting metrics are not as good as those in your paper.

Additionally, the LSD calculation in util.util.compute_metrics seems strange: n_fft should be 2048, while your default setting is 1024.
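To make the n_fft point concrete, here is a minimal NumPy sketch of one common LSD definition (per-frame RMS difference of log10 power spectra, averaged over frames). It is not the repository's util.util.compute_metrics; the function names, the Hann window, the hop size, and the eps floor are all assumptions chosen for illustration.

```python
import numpy as np


def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram |STFT|^2 with a Hann window.

    Returns an array of shape (frames, n_fft // 2 + 1).
    Window type and hop size are assumptions, not the repo's settings.
    """
    x = np.asarray(x, dtype=np.float64)
    if len(x) < n_fft:
        # Zero-pad short signals so that at least one frame exists.
        x = np.pad(x, (0, n_fft - len(x)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def lsd(reference, estimate, n_fft=2048, hop=512, eps=1e-10):
    """Log-spectral distance between two waveforms (lower is better)."""
    # Trim both signals to a common length before framing.
    n = min(len(reference), len(estimate))
    ref_log = np.log10(stft_power(reference[:n], n_fft, hop) + eps)
    est_log = np.log10(stft_power(estimate[:n], n_fft, hop) + eps)
    # RMS over frequency bins per frame, then mean over frames.
    per_frame = np.sqrt(np.mean((ref_log - est_log) ** 2, axis=1))
    return float(np.mean(per_frame))
```

Calling lsd(ref, est, n_fft=1024) and lsd(ref, est, n_fft=2048) on the same pair of signals generally gives different values, so scores are only comparable across systems when n_fft, hop size, and window match.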

@neoncloud
Owner

Thank you for your interest in our work. Could you please elaborate on your reproduction process, including:

  • How did you calculate the LSD and other metrics? Did you use the method we provided, or another library or software?

  • You mentioned using the "last 8 speakers", which doesn't seem to match the test set we used. Could you please elaborate on your test set partitioning method?

  • If possible, could you provide your reproduction results, including file names and scores?

@yxlu-0102
Author

  1. I used the metric_calculator you provided, but I changed n_fft to 2048 for a fair comparison with other systems.

  2. The systems you compared with in your paper (e.g., NU-wave2 and UDM+) used VCTK-0.92 as the dataset, and their test sets contain the last 8 speakers, so I used the same test set for a fair comparison.

  3. For example, for the 24kHz to 48kHz experiment, the metrics I calculated are LSD of 0.72 and SNR of 25.86. Your metrics in the paper are LSD of 0.61 and SNR of 26.26.
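For reference, the SNR figures above depend on the exact definition used, which this thread does not spell out. The sketch below assumes the plain waveform-domain definition (reference energy over error energy, in dB); the function name and the eps guard are hypothetical and are not taken from the repository.

```python
import numpy as np


def snr_db(reference, estimate, eps=1e-10):
    """Waveform SNR in dB, treating (reference - estimate) as the noise."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Trim to a common length so the sample-wise difference is defined.
    n = min(len(reference), len(estimate))
    reference, estimate = reference[:n], estimate[:n]
    noise = reference - estimate
    signal_energy = np.sum(reference ** 2) + eps
    noise_energy = np.sum(noise ** 2) + eps
    return float(10.0 * np.log10(signal_energy / noise_energy))
```

Because this measure is sample-aligned, any time offset or gain mismatch between the generated and reference files lowers the score, which is worth ruling out before comparing against published numbers.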

@yd8175618

Hello, does this model support real-time voice super-resolution? For example, by splitting long speech into multiple 16 ms chunks for processing and merging them at the output end.
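The split-and-merge scheme described in the question could be sketched as follows. This only illustrates the chunking logic; whether the model actually supports it is not answered in this thread. The names split_into_chunks and process_stream are hypothetical, and process_chunk stands in for whatever per-chunk inference call would be used.

```python
def split_into_chunks(samples, sample_rate, chunk_ms=16):
    """Split a sample sequence into consecutive chunks of chunk_ms milliseconds.

    The final chunk may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    return [samples[i : i + chunk_len] for i in range(0, len(samples), chunk_len)]


def process_stream(samples, sample_rate, process_chunk, chunk_ms=16):
    """Run process_chunk on each chunk in order and concatenate the outputs.

    process_chunk is a placeholder for a per-chunk model call (an assumption).
    """
    output = []
    for chunk in split_into_chunks(samples, sample_rate, chunk_ms):
        output.extend(process_chunk(chunk))
    return output
```

At a 48 kHz sample rate, a 16 ms chunk is 768 samples. Processing chunks independently like this can introduce discontinuities at chunk boundaries, so a practical setup would likely need overlapping chunks or some carried-over context, which this sketch does not cover.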
