Hello, I’ve run into some issues while trying to reproduce the results from this repository. Specifically, I trained with the provided script (train_example.sh) and evaluated the resulting models using the following metrics:
- WER
- PESQ
- STOI
- SDR
- MelLoss
- UTMOS
- Co-occurrence map between the first-level codebook and phonemes
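To make sure we are comparing the same quantities, here is a minimal sketch of the SDR definition I used (the plain, non-scale-invariant form; if the repository's evaluation uses SI-SDR or a BSS-Eval variant, the numbers would not be directly comparable — this formula is my assumption):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Plain signal-to-distortion ratio in dB.

    reference, estimate: 1-D waveforms of equal length.
    The distortion is simply the residual (reference - estimate);
    no scaling or filtering is applied to the estimate first.
    """
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))
```

For example, an estimate whose residual carries 1/100 of the reference energy scores 20 dB under this definition.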
The final validation losses were:
- Dev Mel Error: 0.357
- Dev Distill Loss: 0.604
However, the results of my retrained model differ significantly from those of the released checkpoint. Below is a summary of the performance metrics:
| Codec | WER(%) ↓ | PESQ ↑ | STOI ↑ | SDR ↑ | MelLoss ↓ | UTMOS ↑ |
|---|---|---|---|---|---|---|
| Ground Truth | 2.87 | - | - | - | - | 4.04 |
| SpeechTokenizer* n_q = 1 | 10.63 | 1.16 | 0.572 | -14.58 | 1.813 | 1.26 |
| SpeechTokenizer n_q = 1 | 44.65 | 1.14 | 0.654 | -3.16 | 1.484 | 1.30 |
| SpeechTokenizer* n_q = 8 | 4.34 | 1.98 | 0.843 | 1.52 | 0.842 | 3.87 |
| SpeechTokenizer n_q = 8 | 4.01 | 2.32 | 0.880 | 6.56 | 0.791 | 3.62 |
(Note: “SpeechTokenizer*” denotes the official released weights; rows without an asterisk are my retrained model.)
Additionally, the first-level codebook of my retrained model does not align well with phonemes, which suggests that the distillation process may not have been set up correctly. Here are the codebook co-occurrence maps for comparison:
Retrained model: [co-occurrence map image]

Official checkpoint: [co-occurrence map image]
I’d greatly appreciate any guidance on reproducing the performance of the official checkpoint, particularly on how the distillation process should be configured. If there are any additional steps or details not covered in the repository that might affect the results, please let me know. Thank you in advance for your help!