feat: code for whisper-large-v3 #548

Closed
wants to merge 9 commits
Changes from 7 commits
21 changes: 21 additions & 0 deletions .gitignore
@@ -13,3 +13,24 @@ venv/
 # Ignore IDE, Editor Files
 .idea/
 .vscode/
+
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
3 changes: 2 additions & 1 deletion faster_whisper/tokenizer.py
@@ -108,7 +108,7 @@ def decode_with_timestamps(self, tokens: List[int]) -> str:
     def split_to_word_tokens(
         self, tokens: List[int]
     ) -> Tuple[List[str], List[List[int]]]:
-        if self.language_code in {"zh", "ja", "th", "lo", "my"}:
+        if self.language_code in {"zh", "ja", "th", "lo", "my", "yue"}:
             # These languages don't typically use spaces, so it is difficult to split words
             # without morpheme analysis. Here, we instead split words at any
             # position where the tokens are decoded as valid unicode points
@@ -274,4 +274,5 @@ def split_tokens_on_spaces(
     "yi",
     "yo",
     "zh",
+    "yue",
 )
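
For context, the unicode-point splitting strategy that the comment above describes can be sketched as follows. This is a simplified illustration, not the library's implementation (the real logic lives in split_to_word_tokens and also handles replacement characters that genuinely occur in the text); `decode` stands in for the tokenizer's decode method:

from typing import Callable, List, Tuple


def split_at_valid_unicode(
    tokens: List[int], decode: Callable[[List[int]], str]
) -> Tuple[List[str], List[List[int]]]:
    # Accumulate tokens until they decode to valid text. A token sequence that
    # ends mid-character decodes to the replacement character U+FFFD, so a
    # clean decode marks a safe split position.
    words: List[str] = []
    word_tokens: List[List[int]] = []
    current: List[int] = []
    for token in tokens:
        current.append(token)
        decoded = decode(current)
        if "\ufffd" not in decoded:
            words.append(decoded)
            word_tokens.append(current)
            current = []
    return words, word_tokens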
28 changes: 25 additions & 3 deletions faster_whisper/transcribe.py
@@ -1,4 +1,5 @@
 import itertools
+import json
 import logging
 import os
 import zlib
@@ -92,8 +93,8 @@ def __init__(

         Args:
           model_size_or_path: Size of the model to use (tiny, tiny.en, base, base.en,
-            small, small.en, medium, medium.en, large-v1, large-v2, or large), a path to a converted
-            model directory, or a CTranslate2-converted Whisper model ID from the Hugging Face Hub.
+            small, small.en, medium, medium.en, large-v1, large-v2, large-v3, or large), a path to a
+            converted model directory, or a CTranslate2-converted Whisper model ID from the HF Hub.
             When a size or a model ID is configured, the converted model is downloaded
             from the Hugging Face Hub.
           device: Device to use for computation ("cpu", "cuda", "auto").
@@ -113,6 +114,9 @@ def __init__(
             are saved in the standard Hugging Face cache directory.
           local_files_only: If True, avoid downloading the file and return the path to the
             local cached file if it exists.
+          feature_size: Number of mel filters to use for feature extraction. If not set,
+            the number of mel filters is inferred from the model version. The first release
+            used 80 bins, but the large-v3 model uses 128 bins.
         """
         self.logger = get_logger()

Review comment on the feature_size docstring: Not used anymore.

@@ -142,7 +146,25 @@ def __init__(
             "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
         )

-        self.feature_extractor = FeatureExtractor()
+        feature_extractor_file = os.path.join(model_path, "preprocessor_config.json")
+        if os.path.isfile(feature_extractor_file):
+            with open(feature_extractor_file, "r") as f:
+                config = json.load(f)
+            feat_kwargs = {
+                k: config[k]
+                for k in [
+                    "n_fft",
+                    "hop_length",
+                    "feature_size",
+                    "sampling_rate",
+                    "chunk_length",
+                ]
+                if k in config
+            }
+        else:
+            feat_kwargs = {}
+
+        self.feature_extractor = FeatureExtractor(**feat_kwargs)
         self.num_samples_per_token = self.feature_extractor.hop_length * 2
         self.frames_per_second = (
             self.feature_extractor.sampling_rate // self.feature_extractor.hop_length

Review comment on the preprocessor_config.json loading: Maybe move that into a specific method? And use the n_mels from ct2 as a fallback?

Author: I'll try to get to this tomorrow; my bandwidth is extremely limited with the American holidays at the moment.

Collaborator, on lines +155 to +161: Minor remark: could you make these parameters less hard-coded in this new method? They come from this class: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/feature_extractor.py#L8-L12
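
A sketch of what the suggested refactor could look like: the config loading moves into its own helper, and the accepted keys are derived from FeatureExtractor's constructor signature instead of being hard-coded. The helper name is hypothetical, and this assumes the five parameters above are declared as keyword arguments on FeatureExtractor.__init__:

import inspect
import json
import os

from faster_whisper.feature_extractor import FeatureExtractor


def _load_feature_extractor_kwargs(model_path: str) -> dict:
    # Hypothetical helper: read preprocessor_config.json if present and keep
    # only the keys that FeatureExtractor's constructor actually accepts.
    config_file = os.path.join(model_path, "preprocessor_config.json")
    if not os.path.isfile(config_file):
        return {}
    with open(config_file, "r") as f:
        config = json.load(f)
    valid_keys = set(inspect.signature(FeatureExtractor.__init__).parameters)
    valid_keys.discard("self")
    return {k: v for k, v in config.items() if k in valid_keys}


# Inside WhisperModel.__init__, the construction then reduces to:
# self.feature_extractor = FeatureExtractor(**_load_feature_extractor_kwargs(model_path))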
4 changes: 3 additions & 1 deletion faster_whisper/utils.py
@@ -21,6 +21,7 @@
     "large-v1": "guillaumekln/faster-whisper-large-v1",
     "large-v2": "guillaumekln/faster-whisper-large-v2",
     "large": "guillaumekln/faster-whisper-large-v2",
+    "large-v3": "bababababooey/faster-whisper-large-v3",
 }

Review comment on the large-v3 entry: Who owns that? Could Systran create its own hub, to keep ownership of the models and to easily update old models and upload new ones?

Contributor:
> who is owning that?
Some user posted that link on the issues. Best would be to keep the model under an official account like "systran", and maybe move all the models there.

nguyendc-systran (Collaborator), Nov 22, 2023: Good point. We are waiting for the release on CTranslate2, and will then push the new converted model tomorrow (with the fix from OpenNMT/CTranslate2#1546) to the Systran organization.

Collaborator: FYI, several models are now available under the Systran organization: https://huggingface.co/Systran (including large-v3, converted by the latest CTranslate2, 3.22.0).

Contributor: @nguyendc-systran thank you! So essentially we are just waiting for @stillmatic to make the last fixes you mentioned before the merge?

Reviewer: @Purfview I think you accidentally pasted the same URL.

Contributor: @AvivSham Fixed it.

blackpolarz, Nov 23, 2023: Not sure if I missed anything, but in tokenizer.json there is a difference at token 50363: "nospeech" vs. "nocaptions". The same difference is in vocabulary.json. The CTranslate2 large-v2 model uses "nocaptions", which matches what flyingleaf is using, while hf-large-v3 uses "nospeech", which matches what Systran is using.

nguyendc-systran (Collaborator), Nov 23, 2023:
> @nguyendc-systran thank you! So essentially we are just waiting for @stillmatic to make the last fixes you mentioned before the merge?
IMHO, yes. Another point that may be interesting/relevant is this benchmark: #548 (comment). Not sure whether @funboarder13920 has had a chance to look at that?

Reviewer: Yep, the inference results between openai/hf/faster_whisper are not exactly the same, but they are very similar. I guess I was witnessing the differences between v2 and v3.
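
A quick way to verify the token-50363 difference described above, assuming both converted models have been downloaded locally and that vocabulary.json is a plain list indexed by token id (the directory names are placeholders):

import json

for model_dir in ("faster-whisper-large-v2", "faster-whisper-large-v3"):
    with open(f"{model_dir}/vocabulary.json", "r", encoding="utf-8") as f:
        vocab = json.load(f)
    # Expected: "nocaptions" for large-v2 and "nospeech" for large-v3.
    print(model_dir, vocab[50363])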


@@ -50,7 +51,7 @@ def download_model(

    Args:
      size_or_id: Size of the model to download from https://huggingface.co/guillaumekln
        (tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2,
-        large), or a CTranslate2-converted model ID from the Hugging Face Hub
+        large, large-v3), or a CTranslate2-converted model ID from the Hugging Face Hub
        (e.g. guillaumekln/faster-whisper-large-v2).
      output_dir: Directory where the model should be saved. If not set, the model is saved in
        the cache directory.
@@ -76,6 +77,7 @@ def download_model(

     allow_patterns = [
         "config.json",
+        "preprocessor_config.json",
         "model.bin",
         "tokenizer.json",
         "vocabulary.*",
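
Putting the pieces together: once the converted model is available under the alias above, using large-v3 should look identical to the existing models. A sketch based on the library's documented usage (the audio path is a placeholder):

from faster_whisper import WhisperModel

# "large-v3" resolves through the _MODELS table to a CTranslate2 conversion on
# the Hugging Face Hub; preprocessor_config.json is now downloaded as well, so
# the 128-bin mel features are configured automatically.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))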