
[Whisper] Add large-v3 version support #27336

Merged: 17 commits into huggingface:main on Nov 20, 2023

Conversation

@flyingleafe (Contributor) commented Nov 7, 2023

What does this PR do?

Adds the ability to download and convert the freshly released large-v3 version of Whisper (https://github.com/openai/whisper/pull/1761/files).
Closes #27331.

The usage of the _download method in convert_openai_to_hf.py turned out to be broken; that has been fixed.
I also plan to add automatic export of the processor files (feature extractor + tokenizer) today and to make sure the subtle changes in language-tag tokenization are supported, hence the draft status.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SantiDianaClibrain

Thanks, this can be extremely helpful!

@ArthurZucker (Collaborator) left a comment

Hey! Thanks for the PR, the conversion script is fixed in #26834, which should be merged today, cc @sanchit-gandhi.
Otherwise it's good to add the checkpoint path, and #27338 will add the tokenizer.

@flyingleafe marked this pull request as ready for review on November 7, 2023 at 12:52
@flyingleafe (Contributor, Author)

@ArthurZucker Thanks! I did not see the download-fixing PR, and you were quite fast with the tokenizer support, congrats :)
Besides the tokenizer, the feature extractor parameters should also be fetched and exported, especially given that v3 uses a different number of mel banks. I can handle that in this PR if you do not already have it implemented somewhere locally.

@ArthurZucker (Collaborator)

Yes for sure!

@flyingleafe (Contributor, Author) commented Nov 7, 2023

@ArthurZucker Added the feature extractor export.
I reused the pre-computed mel filters from the openai/whisper repository; doing so required slight changes to the WhisperFeatureExtractor logic. I anticipate that the auto-computed filters should be equivalent to the ones saved in openai/whisper, but I am not 100% sure, so I think this is a more reliable way to obtain full functional equivalence.
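
For reference, the export step described here boils down to something like the following (a minimal sketch: 128 mel bins is the value used by large-v3, earlier checkpoints use 80, and the output directory name is hypothetical):

from transformers import WhisperFeatureExtractor

# large-v3 uses 128 mel bins; earlier Whisper checkpoints use 80.
# All other arguments keep the library defaults.
feature_extractor = WhisperFeatureExtractor(feature_size=128)
feature_extractor.save_pretrained("whisper-large-v3-converted")  # hypothetical output dir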

@ArthurZucker (Collaborator)

Nice, I just merged #27338, can you rebase?

@flyingleafe (Contributor, Author)

@ArthurZucker Merged; I can rebase/force-push instead if that's preferable.

@ArthurZucker (Collaborator)

Merging should be fine, reviewing now!

@ArthurZucker (Collaborator) left a comment

Thanks a lot for the prompt PR and reactivity 🔥. Let's try to isolate the changes (so keep the downloading utils that were just merged) and try to match the mel creation; I think @sanchit-gandhi is having a look at that as well. Otherwise LGTM.

Three resolved review comments on src/transformers/models/whisper/convert_openai_to_hf.py (outdated).
@flyingleafe (Contributor, Author)

@ArthurZucker Removed everything related to downloading the pre-computed filters; they are indeed equivalent to the constructed ones (np.allclose == True).
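
For reference, an equivalence check along these lines might look as follows (a sketch only: it assumes mel_filters.npz was fetched beforehand from the openai/whisper assets and contains a "mel_128" array, and that the two implementations may store the matrix transposed relative to each other):

import numpy as np
from transformers import WhisperFeatureExtractor

# Filters computed on the fly by transformers for 128 mel bins.
computed = WhisperFeatureExtractor(feature_size=128).mel_filters

# Pre-computed filters shipped with openai/whisper (downloaded beforehand).
reference = np.load("mel_filters.npz")["mel_128"]

# Guard against a transposed layout before comparing.
if computed.shape != reference.shape:
    computed = computed.T

print(np.allclose(computed, reference))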

@ArthurZucker (Collaborator) left a comment

Good, num_mel_bins is properly set to dimensions["n_mels"] in the config and in the feature extractor; those should be the only places where it's needed. LGTM. We usually add integration tests in test_modeling_whisper to make sure the converted checkpoints match the original model. Not sure if you have the hardware to do this?
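
Such an integration check is typically of the following shape (an illustrative sketch, not the actual test in test_modeling_whisper; input_features and the expected logits slice would come from running the original OpenAI implementation):

import torch
from transformers import WhisperForConditionalGeneration

def check_converted_checkpoint(converted_dir, input_features, expected_slice):
    # Run the converted model on a fixed input and compare a slice of the
    # logits against values produced by the original OpenAI checkpoint.
    model = WhisperForConditionalGeneration.from_pretrained(converted_dir).eval()
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(input_features, decoder_input_ids=decoder_input_ids).logits
    assert torch.allclose(logits[0, 0, : expected_slice.shape[0]], expected_slice, atol=1e-4)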

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@sanchit-gandhi (Contributor) left a comment

Thanks for adding this @flyingleafe! The only difference we likely now have is in the generation config. After an offline discussion with @ArthurZucker, we concluded that we previously hard-coded these arguments. What might be best is loading the appropriate generation config from the existing ones on the Hub, e.g. from openai/whisper-medium.en for English, openai/whisper-large-v2 for multilingual v1 and v2, and openai/whisper-large-v3 (coming soon) for v3:

from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v2")
model.generation_config = generation_config

...

@@ -186,6 +188,13 @@ def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):

model.save_pretrained(pytorch_dump_folder_path)

# Export the feature extractor
feature_extractor = WhisperFeatureExtractor(
A Contributor commented on this diff:

Super small request from me would be to also save the WhisperProcessor:

from transformers import WhisperProcessor

processor = WhisperProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
processor.save_pretrained(pytorch_dump_folder_path)

@ArthurZucker (Collaborator) left a comment

Okay LGTM with the changes to save the processor as a whole now! 🔥 🚀

@flyingleafe (Contributor, Author) commented Nov 8, 2023

@ArthurZucker @sanchit-gandhi Your comment about the full preprocessor export is addressed. Since the preprocessor has the tokenizer as a constituent part, I renamed the --convert_tokenizer option to --convert_preprocessor.

I also took the liberty of removing the additional --whisper_version and --multilingual parameters, since the actual number of supported languages can be derived from the vocabulary size, which is part of the OpenAI model checkpoint.

@sanchit-gandhi I implemented fetching the generation config from the HF Hub based on the number of supported languages, as you suggested, but it is something of a chicken-and-egg situation. The alignment heads can be hardcoded into a dictionary, as OpenAI does, and the other parameters are either derived from the tokenizer or hardcoded as well. The only setting I don't quite understand how to derive is the set of suppressed tokens; if you give me a hint on that, I can remove the dependency on downloading extra files from HF completely.
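
For context, the vocabulary-size derivation mentioned above reduces to something like the following (a sketch based on the vocab sizes of the released OpenAI checkpoints; the mapping is illustrative, not the exact code merged in this PR):

# Vocab sizes observed in the released OpenAI checkpoints:
#   51864 -> English-only (no language tokens)
#   51865 -> multilingual v1/v2 (99 language tokens)
#   51866 -> large-v3 (100 language tokens, Cantonese added)
LANGUAGE_SUPPORT_BY_VOCAB_SIZE = {
    51864: (False, 0),
    51865: (True, 99),
    51866: (True, 100),
}

def infer_language_support(n_vocab):
    # n_vocab comes from the "dims" section of the OpenAI checkpoint.
    return LANGUAGE_SUPPORT_BY_VOCAB_SIZE[n_vocab]

is_multilingual, num_languages = infer_language_support(51866)  # large-v3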

@ArthurZucker (Collaborator) left a comment

Thanks, it's cleaner and relies on the same logic as OpenAI for the number of languages! 🔥

One resolved review comment on src/transformers/models/whisper/convert_openai_to_hf.py (outdated).
Co-authored-by: Arthur <[email protected]>
@sanchit-gandhi (Contributor) left a comment

Thanks for your contribution @flyingleafe! It looks pretty much ready to go from my side. Just one small comment about the generation config below.

@@ -51,6 +60,20 @@
}


def _get_generation_config(is_multilingual: bool, num_languages: int = 100) -> GenerationConfig:
@sanchit-gandhi (Contributor) commented on Nov 8, 2023:

Thanks for adding this! The only generation config attribute that is checkpoint specific is the alignment heads: https://gist.github.com/hollance/42e32852f24243b748ae6bc1f985b13a

The alignment heads can only really be worked out by looking at the cross-attention plots: https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb

Since they are checkpoint specific, I think we should remove this attribute from the generation config. The user will then be prompted to set it themselves if they require word-level timestamps:

if not hasattr(generation_config, "alignment_heads"):

This just requires adding the following three lines of code before we return the generation config:

generation_config = GenerationConfig.from_pretrained(repo)
if hasattr(generation_config, "alignment_heads"):
    delattr(generation_config, "alignment_heads")

return generation_config

WDYT @flyingleafe @ArthurZucker?
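
Putting the two suggestions together, the helper could look roughly like this (a sketch only; the repo mapping follows the earlier comment and the exact merged implementation may differ):

from transformers import GenerationConfig

def _get_generation_config(is_multilingual: bool, num_languages: int = 100) -> GenerationConfig:
    # Pick the Hub checkpoint whose generation config matches the converted model.
    if not is_multilingual:
        repo = "openai/whisper-medium.en"
    elif num_languages < 100:
        repo = "openai/whisper-large-v2"
    else:
        repo = "openai/whisper-large-v3"
    generation_config = GenerationConfig.from_pretrained(repo)
    # Alignment heads are checkpoint specific; drop them so the user sets them
    # manually if they need word-level timestamps.
    if hasattr(generation_config, "alignment_heads"):
        delattr(generation_config, "alignment_heads")
    return generation_config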

@ArthurZucker (Collaborator) commented on Nov 8, 2023:

Agreed 😉, but it's also kind of specific to word timestamps, so we can add a comment.

@flyingleafe (Contributor, Author) replied:

@sanchit-gandhi The alignment heads can be copy-pasted straight from the OpenAI repository, without looking at the cross-attention plots; they are provided there in quite a compact way.

Let us set the alignment heads directly from this dictionary if the user provided the version of the OpenAI model, and skip setting them (with a warning to the user) if the checkpoint is custom.

A Contributor replied:

Sure! If you have a clean way of determining whether the checkpoint is 'official' or 'custom', this works!

@flyingleafe (Contributor, Author)

@sanchit-gandhi
Basically, what I did is set the alignment heads appropriately if the user provided a Whisper model version instead of a local checkpoint, and not set them (with a warning) otherwise.
It would also be possible to detect whether a local checkpoint is equivalent to the official one by checking its hash, but that is probably a non-issue; I cannot think of a genuine use case where the user has the OpenAI checkpoint saved locally but is unable or unwilling to simply re-download it.

@flyingleafe (Contributor, Author)

@sanchit-gandhi Your point is valid: why do extra work if we are downloading the generation configs from the HF Hub anyway?
I removed all logic related to that and now simply preserve the alignment heads in the config when the original checkpoint is downloaded.

@flyingleafe (Contributor, Author)

@sanchit-gandhi People in downstream community projects complain that they expect the tokenizer files in the fast format (tokenizer.json) to also be present in the HF checkpoint.

I added a couple of lines here to convert and export the fast tokenizer as well. Only you and your colleagues can add that to the official checkpoint, though.
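
The fast-tokenizer export amounts to a couple of lines of this shape (a sketch: the directory name is hypothetical and assumed to already contain the slow tokenizer files):

from transformers import WhisperTokenizerFast

# Loading from a directory with only the slow tokenizer files triggers the
# conversion; saving then writes tokenizer.json alongside them.
fast_tokenizer = WhisperTokenizerFast.from_pretrained("whisper-large-v3-converted")
fast_tokenizer.save_pretrained("whisper-large-v3-converted")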

@flyingleafe (Contributor, Author)

@sanchit-gandhi Bump, is this good to merge?

@sanchit-gandhi (Contributor) left a comment

LGTM @flyingleafe! Just one super minor update, then it's good to merge!

@@ -154,6 +201,9 @@ def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):
tie_embeds = True
ffn_dim = state_dict["decoder.layers.0.fc1.weight"].shape[0]

# a hacky way to properly set up the bos/eos/pad token ids in the model
endoftext_id = 50257 if dimensions["n_vocab"] > 51865 else 50256

config = WhisperConfig(
A Contributor commented on this diff:

Nice! The only missing config to update here is the decoder_start_token_id (endoftext_id + 1)
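
For illustration, the token-id wiring implied by this suggestion looks roughly as follows (values shown for a multilingual vocab; the remaining WhisperConfig arguments are omitted and the vocab size shown is large-v3's):

from transformers import WhisperConfig

endoftext_id = 50257  # 50256 for English-only checkpoints
config = WhisperConfig(
    vocab_size=51866,                         # large-v3 vocabulary size
    bos_token_id=endoftext_id,
    eos_token_id=endoftext_id,
    pad_token_id=endoftext_id,
    decoder_start_token_id=endoftext_id + 1,  # <|startoftranscript|>
)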


@flyingleafe (Contributor, Author)

@sanchit-gandhi Your last suggestion was addressed three days ago; let's merge if it's good to go.

@ArthurZucker (Collaborator)

Thanks for bearing with both of us 😉

@ArthurZucker merged commit 87e217d into huggingface:main on Nov 20, 2023
3 checks passed
Successfully merging this pull request may close the following issue: Add OpenAI Whisper Large-v3 weights.
5 participants