Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better HF-to-CT2 conversion for Whisper model #1546

Merged
merged 3 commits into from
Nov 20, 2023

Conversation

flyingleafe
Copy link
Contributor

On attempt to convert Whisper-v3 model from HF to ct2 format I encountered two issues:

  • Due to the difference in the additional_tokens_ids lists resulting from the tokenizer checkpoints in large-v3 and previous versions, the first language token <|en|> was not included in the lang_ids list, which resulted in the inability of the model to transcribe English text - every such transcription became a translation to some other language;
  • The converter ignored the information contained in generation_config.json of the HF checkpoint, which resulted in auto-generated alignment heads and non-present set of suppressed tokens.

Those issues are fixed in this PR. The logic of selecting language IDs got more complicated, but less prone to bugs stemming from any future updates of HF checkpoints.

@vince62s
Copy link
Member

please rebase for tests to go through

config.suppress_ids_begin = model.config.begin_suppress_tokens
config.alignment_heads = _WHISPER_ALIGNMENT_HEADS.get(model.name_or_path)

non_lang_special_tokens = [
Copy link
Contributor

@funboarder13920 funboarder13920 Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make a method for this logic ?
If the generation_config.json is used, then what do you thing of also using it to get the lang_ids (or filter the lang tokens if the ids are not aligned), and fallback with your method if the generation_config is not available ?

@flyingleafe flyingleafe force-pushed the fix-whisper-converter branch from 2de8b2b to b1d21db Compare November 17, 2023 07:54
@flyingleafe
Copy link
Contributor Author

@vince62s rebased
@funboarder13920 your comment made sense, did that

@vince62s
Copy link
Member

can you fix the tests ?

@flyingleafe
Copy link
Contributor Author

@vince62s just seen the error, should be fixed now

@flyingleafe
Copy link
Contributor Author

@vince62s all fixed, good to merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants