Better HF-to-CT2 conversion for Whisper model #1546
Conversation
please rebase for tests to go through

Inline review comment on these lines of the converter:

    config.suppress_ids_begin = model.config.begin_suppress_tokens
    config.alignment_heads = _WHISPER_ALIGNMENT_HEADS.get(model.name_or_path)

    non_lang_special_tokens = [
Can you make a method for this logic?
If generation_config.json is used, then what do you think of also using it to get the lang_ids (or to filter the lang tokens if the ids are not aligned), falling back to your method if the generation_config is not available?
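A minimal sketch of that suggestion (the helper name `get_lang_ids` and the plain-dict inputs are assumptions, not the PR's actual code; released Whisper checkpoints do ship a `lang_to_id` mapping in their generation_config.json):

```python
import re

# Pattern for Whisper language tokens such as <|en|> or <|yue|>.
_LANG_TOKEN_RE = re.compile(r"^<\|[a-z]{2,3}\|>$")


def get_lang_ids(generation_config, special_tokens_to_ids):
    """Hypothetical helper: prefer the lang_to_id mapping from
    generation_config.json; fall back to pattern-matching the
    tokenizer's additional special tokens when it is unavailable.

    generation_config: parsed generation_config.json as a dict, or None.
    special_tokens_to_ids: mapping from special token string to token id.
    """
    if generation_config and "lang_to_id" in generation_config:
        return sorted(generation_config["lang_to_id"].values())
    # Fallback: detect language tokens by shape, so <|en|> is never
    # dropped even if the checkpoint reorders its special tokens.
    return sorted(
        token_id
        for token, token_id in special_tokens_to_ids.items()
        if _LANG_TOKEN_RE.match(token)
    )
```

The fallback deliberately matches on token shape rather than position, so non-language specials like `<|translate|>` or `<|notimestamps|>` are excluded regardless of their ids.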
Force-pushed from 2de8b2b to b1d21db
@vince62s rebased

can you fix the tests?

@vince62s just seen the error, should be fixed now

@vince62s all fixed, good to merge?
On an attempt to convert the Whisper-v3 model from HF to CT2 format, I encountered two issues:

1. Because the `additional_tokens_ids` lists resulting from the tokenizer checkpoints differ between `large-v3` and previous versions, the first language token `<|en|>` was not included in the `lang_ids` list. This made the model unable to transcribe English text: every such transcription became a translation into some other language.
2. The `generation_config.json` of the HF checkpoint was not used, which resulted in auto-generated alignment heads and a missing set of suppressed tokens.

Those issues are fixed in this PR. The logic for selecting language IDs got more complicated, but it is less prone to bugs stemming from any future updates of HF checkpoints.