[FIX] Explicit type-casting the vocab into a dictionary object #20
base: main
Conversation
- type-casting the `vocab` into dict
Hey @Udayk02! 👋 Thanks for opening a PR 🚀

I tried this out, and while it's not giving the error anymore, the encoding still doesn't match the original tokenizer. A simple test to replicate this would be:

```python
from autotiktokenizer import AutoTikTokenizer

tik = AutoTikTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
tik.encode("hi")
# Output: [249021]
```

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
tok.encode("hi", add_special_tokens=False).ids
# Output: [1274]
```

Still trying to understand where the Cohere tokenizer is failing at the moment; please let me know if you're able to find the cause for this. Thanks! 😊
Hey @bhavnicksm, I will check this and let you know if I find anything. Thank you!
I think the underlying issue is not related to this change. The same issue is present for every Cohere tokenizer, so I think there is something missing here.

```python
from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
tokenizer.encode("hi")
# Output: [4048]
```

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
tokenizer.encode("hi", add_special_tokens=False).ids
# Output: [7632]
```

If we look at the actual tokens behind these ids, they are not the same; extra special characters are added in the process. Checking on this further.

Update: the issue persists not only with the Cohere models; tokenization is inconsistent for other models too. Another observation is that the two paths split the same text differently, so I guess somehow we are not keeping the longest-first matching strategy alive here.
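A quick way to see what each path actually maps "hi" to is sketched below. This assumes `AutoTikTokenizer.from_pretrained` returns a `tiktoken.Encoding`, as the snippets above suggest; `Encoding.tokens` (from `tokenizers`) and `decode_single_token_bytes` (from `tiktoken`) are the standard ways to inspect the token strings behind the ids.

```python
from tokenizers import Tokenizer
from autotiktokenizer import AutoTikTokenizer

model = "Cohere/Cohere-embed-english-v3.0"

# tokenizers path: .tokens shows the token strings that produced the ids
hf_tok = Tokenizer.from_pretrained(model)
enc = hf_tok.encode("hi", add_special_tokens=False)
print(enc.ids, enc.tokens)  # ids from the thread: [7632]

# AutoTikTokenizer path: decode each id back to bytes to see what it matched
tik = AutoTikTokenizer.from_pretrained(model)
ids = tik.encode("hi")
print(ids, [tik.decode_single_token_bytes(i) for i in ids])  # ids from the thread: [4048]
```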
I think I found the issue here. It is the way we split the text before mapping the tokens.

Firstly, let us look at the WordPiece tokenizers: the static regex currently used divides the words in a way that won't align with BPE tokenizing. In BPE, words are generally split with a pattern that pairs a word with the preceding whitespace (a sketch of both patterns is at the end of this comment). That allows a match with the full "highlights" sequence if it appears in the vocabulary. With the pattern currently being used, it instead matches up with "##hi" or "hi", which is a sub-token; that is valid given the current pattern, but it does not align with what the original tokenizer produces.

To handle this, I assumed we have to reverse-play, and changed the conversion code to:

```python
if tokenizer_type == "wordpiece":
    if token.startswith("##"):
        token = "Ġ" + token[2:]   # treat continuation pieces as word-initial
    else:
        token = token             # leave word-initial pieces untouched
```

In other words, I exchanged the mapping between sub-tokens and individual tokens. This is of course wrong in general, and it fell apart on other examples where individual tokens appear as sub-tokens.
I have still not been able to come up with a valid solution here. Maybe we can try coming up with a pattern for WordPiece in such a way that it aligns with the BPE tokenizing.

Secondly, the tokenizer itself carries its own pre-tokenization pattern; I took that instead of the static one, and the whole flow worked perfectly. Maybe, based on the models' usage and requirements, we can take those patterns from the tokenizers directly instead of hard-coding them.
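For reference, here is a rough sketch of both points. The first pattern is the standard GPT-2 pre-tokenization regex, used purely for illustration (the static pattern actually hard-coded in the project may differ); the second part asks a `tokenizers.Tokenizer` for its own pre-tokenization instead of hard-coding one.

```python
import regex  # the third-party `regex` module is needed for \p{...} classes
from tokenizers import Tokenizer

# Standard GPT-2 style BPE split: a word is paired with the whitespace that
# precedes it, so " highlights" stays one piece and can match a vocab entry.
GPT2_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
print(regex.findall(GPT2_PAT, "what highlights"))
# ['what', ' highlights']

# A whitespace-stripping split (roughly what a WordPiece-style pattern does)
# loses the leading space instead:
print(regex.findall(r"\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+", "what highlights"))
# ['what', 'highlights']

# Instead of hard-coding a pattern, ask the tokenizer how it pre-tokenizes:
tok = Tokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
if tok.pre_tokenizer is not None:
    print(tok.pre_tokenizer.pre_tokenize_str("what highlights"))
    # list of (piece, (start, end)) pairs, split exactly as the model does it
```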
An explicit type-cast of the `vocab` variable into a dictionary will handle this issue whenever a `tokenizer.json` presents the `vocab` as a list of lists.
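A minimal sketch of what that cast amounts to (the helper name and values are illustrative, not the exact code in this PR):

```python
# Illustrative only: normalize a vocab that may arrive either as a mapping
# or as a list of [token, id] pairs, as some tokenizer.json files ship it.
def as_vocab_dict(vocab):
    if isinstance(vocab, dict):
        return vocab
    # dict() accepts any iterable of 2-item sequences, e.g. [["hi", 7632], ...]
    return dict(vocab)

print(as_vocab_dict([["hi", 7632], ["##hi", 1234]]))
# {'hi': 7632, '##hi': 1234}
```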