[Whisper] Add large-v3 version support #27336
Conversation
Thanks, this can be extremely helpful!
Hey! Thanks for the PR, the conversion script is fixed in #26834, which should be merged today, cc @sanchit-gandhi.
Otherwise good to add the checkpoint path, and #27338 will add the tokenizer.
@ArthurZucker Thanks! I did not see the download-fixing PR, and you were quite fast with tokenizer support, congrats!
Yes for sure!
@ArthurZucker Added feature extractor export.
Nice, just merged #27338, can you rebase?
@ArthurZucker merged, I can instead rebase/force-push if that's preferable.
Merging should be fine, reviewing now!
Thanks a lot for the prompt PR and reactivity 🔥, let's try to isolate the changes (so keep the downloading utils that were just merged) and try to match the mel creation. I think @sanchit-gandhi is having a look at that as well. Otherwise LGTM.
@ArthurZucker removed everything related to downloading the pre-computed filters; they are indeed equivalent to the constructed ones.
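For reference, the equivalence can be sanity-checked with something like the snippet below (a sketch, assuming `librosa` is installed and using large-v3's 128 mel bins with the slaney-normalized parameters the feature extractor uses):

```python
import numpy as np
import librosa
from transformers.audio_utils import mel_filter_bank

# Filters as OpenAI pre-computes them (librosa, slaney norm by default).
librosa_filters = librosa.filters.mel(sr=16000, n_fft=400, n_mels=128)

# Filters as WhisperFeatureExtractor constructs them internally.
hf_filters = mel_filter_bank(
    num_frequency_bins=1 + 400 // 2,
    num_mel_filters=128,
    min_frequency=0.0,
    max_frequency=8000.0,
    sampling_rate=16000,
    norm="slaney",
    mel_scale="slaney",
)

# mel_filter_bank returns (num_frequency_bins, num_mel_filters); transpose to compare.
print(np.allclose(librosa_filters, hf_filters.T, atol=1e-6))
```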
Good, `num_mel_bins` is properly set to `dimensions["n_mels"]` in the config and in the feature extractor. Those should be the only places where it's needed. LGTM, we usually add integration tests to make sure the converted checkpoints match the original model in `test_modeling_whisper`. Not sure if you have the hardware to do this?
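For reference, an end-to-end sanity check along those lines could look roughly like the sketch below; the converted-checkpoint folder name is a placeholder, and it assumes the original openai-whisper package (with large-v3 support) and `datasets` are installed. The real integration tests in `test_modeling_whisper.py` compare against hard-coded expected outputs instead.

```python
# Rough end-to-end check, not the actual test: transcribe the same clip with the
# converted checkpoint and with the original OpenAI model, then compare the outputs.
import numpy as np
import whisper  # the original openai-whisper package
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"].astype(np.float32)

# "whisper-large-v3-converted" is a placeholder for the conversion output folder.
processor = WhisperProcessor.from_pretrained("whisper-large-v3-converted")
model = WhisperForConditionalGeneration.from_pretrained("whisper-large-v3-converted")

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
hf_text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

openai_text = whisper.load_model("large-v3").transcribe(audio)["text"]
print(hf_text.strip(), "|", openai_text.strip())
```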
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for adding this @flyingleafe! The only difference we likely now have is in the generation config. After an offline discussion with @ArthurZucker we concluded that we previously hard-coded these arguments. What might be best is loading the appropriate generation config from the existing ones on the Hub, e.g. from `openai/whisper-medium.en` for English, `openai/whisper-large-v2` for multilingual v1 and v2, and `openai/whisper-large-v3` (coming soon) for v3:

```python
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")
model.generation_config = generation_config
...
```
```
@@ -186,6 +188,13 @@ def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):

    model.save_pretrained(pytorch_dump_folder_path)

    # Export the feature extractor
    feature_extractor = WhisperFeatureExtractor(
```
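For readers of the truncated diff hunk above: the essential point is that the feature extractor's mel dimension follows the checkpoint. A sketch of the idea, not the literal PR code (the folder name is a placeholder, the other arguments are the `WhisperFeatureExtractor` defaults):

```python
from transformers import WhisperFeatureExtractor

# feature_size follows dimensions["n_mels"] from the conversion script:
# 80 for earlier checkpoints, 128 for large-v3.
n_mels = 128
feature_extractor = WhisperFeatureExtractor(
    feature_size=n_mels,
    sampling_rate=16000,
    hop_length=160,
    chunk_length=30,
    n_fft=400,
)
feature_extractor.save_pretrained("whisper-large-v3-converted")  # placeholder folder
```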
Super small request from me would be to also save the `WhisperProcessor`:

```python
from transformers import WhisperProcessor

processor = WhisperProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
processor.save_pretrained(pytorch_dump_folder_path)
```
Okay LGTM with the changes to save the processor as a whole now! 🔥 🚀
@ArthurZucker @sanchit-gandhi Your comment about the full preprocessor export is addressed. Since the preprocessor has the tokenizer as its constituent part, I renamed the … I also took the liberty of removing additional parameters of …

@sanchit-gandhi I implemented fetching the generation config from the HF Hub based on the number of supported languages, as you suggested, but it is kind of a chicken-and-egg situation. The alignment heads can be hardcoded into the dictionary, as OpenAI does that, and the other parameters are either derived from the tokenizer or hardcoded as well. The only setting I don't quite understand how to derive is the set of suppressed tokens; if you give me a hint on that, I can remove the dependency on downloading extra stuff from the HF Hub completely.
Thanks, it's cleaner and relies on the same logic as OpenAI for the number of languages! 🔥
Co-authored-by: Arthur <[email protected]>
Thanks for your contribution @flyingleafe! Looks pretty much ready to go from me. Just one small comment about the generation config below.
```
@@ -51,6 +60,20 @@
}


def _get_generation_config(is_multilingual: bool, num_languages: int = 100) -> GenerationConfig:
```
Thanks for adding this! The only generation config attribute that is checkpoint-specific is the alignment heads: https://gist.github.com/hollance/42e32852f24243b748ae6bc1f985b13a

The alignment heads can only really be worked out by looking at the cross-attention plots: https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb

=> since it's checkpoint-specific, I think we should remove this attribute from the generation config. The user will then be prompted to set it themselves if they require word-level timestamps:

```python
if not hasattr(generation_config, "alignment_heads"):
```

This just requires adding the following lines of code before we return the generation config:

```python
generation_config = GenerationConfig.from_pretrained(repo)

if hasattr(generation_config, "alignment_heads"):
    delattr(generation_config, "alignment_heads")

return generation_config
```
WDYT @flyingleafe @ArthurZucker?
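Putting the two pieces together (the repo selection suggested earlier and the removal of `alignment_heads`), a rough sketch of what the helper could look like; this is an illustration under those assumptions, not the final code in the PR:

```python
from transformers import GenerationConfig


def _get_generation_config(is_multilingual: bool, num_languages: int = 100) -> GenerationConfig:
    # Pick the closest official checkpoint on the Hub to copy the generation config from.
    if not is_multilingual:
        repo = "openai/whisper-medium.en"
    elif num_languages < 100:
        repo = "openai/whisper-large-v2"
    else:
        repo = "openai/whisper-large-v3"

    generation_config = GenerationConfig.from_pretrained(repo)

    # alignment_heads is checkpoint-specific, so drop it; users who need
    # word-level timestamps set it themselves.
    if hasattr(generation_config, "alignment_heads"):
        delattr(generation_config, "alignment_heads")

    return generation_config
```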
Agreed 😉 but it's also kinda specific to word timestamps, so we can add a comment.
@sanchit-gandhi The alignment heads can be copy-pasted right from the OpenAI repository, without looking at the cross-attention plots. They are provided there in quite a compact way.
Let us set up the alignment heads directly from this dictionary if the user provides the version of the OpenAI model, and skip setting them (with a warning to the user) if the checkpoint is custom.
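A purely illustrative sketch of that idea; the dictionary keys and (layer, head) pairs below are placeholders, not the real values, which would be copied verbatim from the OpenAI repository:

```python
import warnings
from typing import Optional

# Placeholder entries only; the actual alignment heads live in the OpenAI repository.
_ALIGNMENT_HEADS = {
    "large-v3": [[1, 2], [3, 4]],  # NOT the real values
}


def _maybe_set_alignment_heads(generation_config, openai_version: Optional[str]):
    if openai_version in _ALIGNMENT_HEADS:
        generation_config.alignment_heads = _ALIGNMENT_HEADS[openai_version]
    else:
        warnings.warn(
            "Custom or unknown checkpoint: alignment_heads was not set, so "
            "word-level timestamps will require setting it manually."
        )
```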
Sure! If you have a clean way of determining whether the checkpoint is 'official' or 'custom', this works!
@sanchit-gandhi

(force-pushed from 7a6e18f to 495bc9f)
@sanchit-gandhi Your point is valid: why do extra work if we are downloading generation configs from the HF Hub anyway?
@sanchit-gandhi People in downstream community projects complain that they expect tokenizer files in the fast format. I added a couple of lines here for conversion and export of the fast tokenizer as well. Only you and your colleagues can add that to the official checkpoint, though.
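A sketch of one way to do that slow-to-fast conversion (the folder name is a placeholder; loading the already-saved slow tokenizer files through `WhisperTokenizerFast` triggers the conversion, and `save_pretrained` then writes the fast `tokenizer.json`):

```python
from transformers import WhisperTokenizerFast

# Assumes the slow tokenizer was already saved to this folder by the conversion script.
fast_tokenizer = WhisperTokenizerFast.from_pretrained("whisper-large-v3-converted")
fast_tokenizer.save_pretrained("whisper-large-v3-converted")
```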
@sanchit-gandhi bump, is this good to merge?
LGTM @flyingleafe! Just one super minor update, then good to merge!
```
@@ -154,6 +201,9 @@ def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):
    tie_embeds = True
    ffn_dim = state_dict["decoder.layers.0.fc1.weight"].shape[0]

    # a hacky way to properly set up the bos/eos/pad token ids in the model
    endoftext_id = 50257 if dimensions["n_vocab"] > 51865 else 50256

    config = WhisperConfig(
```
Nice! The only missing config to update here is the `decoder_start_token_id` (`endoftext_id + 1`).
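Concretely, the token-id wiring in the config would then look something like the sketch below; it is based on the diff above rather than the literal PR code, with the vocabulary size written out for illustration:

```python
from transformers import WhisperConfig

# <|endoftext|> id computed as in the diff above; <|startoftranscript|> immediately
# follows it, hence the "+ 1" for decoder_start_token_id.
n_vocab = 51866  # large-v3 vocabulary size, dimensions["n_vocab"] in the script
endoftext_id = 50257 if n_vocab > 51865 else 50256

config = WhisperConfig(
    vocab_size=n_vocab,
    bos_token_id=endoftext_id,
    eos_token_id=endoftext_id,
    pad_token_id=endoftext_id,
    decoder_start_token_id=endoftext_id + 1,
)
```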
@sanchit-gandhi done
@sanchit-gandhi Your last suggestion was addressed three days ago; let's merge if it's good to go.
Thanks for bearing with both of us 😉
What does this PR do?
Adds the ability to download and convert the fresh `large-v3` version of Whisper (https://github.com/openai/whisper/pull/1761/files). Closes #27331.

The usage of the `_download` method in `convert_openai_to_hf.py` turned out to be broken; that was fixed. I also plan to add automatic file export for the processor (feature extractor + tokenizer) today and to make sure the subtle changes in language tag tokenization are supported, hence the draft status.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.