Better HF-to-CT2 conversion for Whisper model #1546
Conversation
please rebase for tests to go through

Inline review comment on these lines of the converter:

    config.suppress_ids_begin = model.config.begin_suppress_tokens
    config.alignment_heads = _WHISPER_ALIGNMENT_HEADS.get(model.name_or_path)

    non_lang_special_tokens = [
Can you make a method for this logic?
If generation_config.json is used, then what do you think of also using it to get the lang_ids (or to filter the lang tokens if the ids are not aligned), falling back to your method if the generation_config is not available?
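A minimal sketch of that suggestion (the helper name `get_lang_ids` and the plain-dict inputs are assumptions, not the PR's actual code; released Whisper checkpoints do ship a `lang_to_id` mapping in their generation_config.json):

```python
import re

# Pattern for Whisper language tokens such as <|en|> or <|yue|>.
_LANG_TOKEN_RE = re.compile(r"^<\|[a-z]{2,3}\|>$")


def get_lang_ids(generation_config, special_tokens_to_ids):
    """Hypothetical helper: prefer the lang_to_id mapping from
    generation_config.json; fall back to pattern-matching the
    tokenizer's additional special tokens when it is unavailable.

    generation_config: parsed generation_config.json as a dict, or None.
    special_tokens_to_ids: mapping from special token string to token id.
    """
    if generation_config and "lang_to_id" in generation_config:
        return sorted(generation_config["lang_to_id"].values())
    # Fallback: detect language tokens by shape, so <|en|> is never
    # dropped even if the checkpoint reorders its special tokens.
    return sorted(
        token_id
        for token, token_id in special_tokens_to_ids.items()
        if _LANG_TOKEN_RE.match(token)
    )
```

The fallback deliberately matches on token shape rather than position, so non-language specials like `<|translate|>` or `<|notimestamps|>` are excluded regardless of their ids.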
Force-pushed from 2de8b2b to b1d21db
@vince62s rebased

can you fix the tests?

@vince62s just seen the error, should be fixed now

@vince62s all fixed, good to merge?
On an attempt to convert the Whisper-v3 model from HF to CT2 format, I encountered two issues:

1. Because the `additional_tokens_ids` lists resulting from the tokenizer checkpoints differ between `large-v3` and previous versions, the first language token `<|en|>` was not included in the `lang_ids` list. This made the model unable to transcribe English text: every such transcription became a translation into some other language.
2. The `generation_config.json` of the HF checkpoint was not used, which resulted in auto-generated alignment heads and a missing set of suppressed tokens.

Those issues are fixed in this PR. The logic for selecting language IDs got more complicated, but it is less prone to bugs stemming from any future updates of HF checkpoints.