Hello, I’ve run into some issues while trying to reproduce the results from this repository. Specifically, I trained with the provided script (train_example.sh) and evaluated the resulting models using the following metrics:
- WER
- PESQ
- STOI
- SDR
- MelLoss
- UTMOS
- Co-occurrence map between the first-level codebook and phonemes
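To make sure we are comparing the same quantities, here is a minimal sketch of the SDR definition I used (the plain, non-scale-invariant form; if the repository's evaluation uses SI-SDR or a BSS-Eval variant, the numbers would not be directly comparable — this formula is my assumption):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Plain signal-to-distortion ratio in dB.

    reference, estimate: 1-D waveforms of equal length.
    The distortion is simply the residual (reference - estimate);
    no scaling or filtering is applied to the estimate first.
    """
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))
```

For example, an estimate whose residual carries 1/100 of the reference energy scores 20 dB under this definition.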
The final validation losses were:
- Dev Mel Error: 0.357
- Dev Distill Loss: 0.604
However, the results of my retrained model differ significantly from those of the released checkpoint. Below is a summary of the performance metrics:
| Codec | WER(%) ↓ | PESQ ↑ | STOI ↑ | SDR ↑ | MelLoss ↓ | UTMOS ↑ |
|---|---|---|---|---|---|---|
| Ground Truth | 2.87 | - | - | - | - | 4.04 |
| SpeechTokenizer* n_q = 1 | 10.63 | 1.16 | 0.572 | -14.58 | 1.813 | 1.26 |
| SpeechTokenizer n_q = 1 | 44.65 | 1.14 | 0.654 | -3.16 | 1.484 | 1.30 |
| SpeechTokenizer* n_q = 8 | 4.34 | 1.98 | 0.843 | 1.52 | 0.842 | 3.87 |
| SpeechTokenizer n_q = 8 | 4.01 | 2.32 | 0.880 | 6.56 | 0.791 | 3.62 |
(Note: “SpeechTokenizer*” denotes the official released weights; rows without an asterisk are my retrained model.)
Additionally, the first-level codebook of my retrained model does not align well with phonemes, which suggests that the distillation process may not have been set up correctly. Here are the codebook co-occurrence maps for comparison:
Retrained model: [co-occurrence map image]

Official checkpoint: [co-occurrence map image]
I’d greatly appreciate any guidance on reproducing the performance of the official checkpoint, particularly on how the distillation process should be configured. If there are any additional steps or details not covered in the repository that might affect the results, please let me know. Thank you in advance for your help!