
[FIX] Explicit type-casting the vocab into a dictionary object #20

Open
wants to merge 1 commit into base: main

Conversation

@Udayk02 commented Jan 6, 2025

Explicitly type-casting the vocab variable into a dictionary handles this issue when a tokenizer.json presents the vocab as a list of lists.

- type-casting the `vocab` into dict
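
A minimal sketch of the idea behind the change (the actual one-line diff may differ; the access path below is illustrative, not autotiktokenizer's real variable name):

# tokenizer.json can store the vocab either as a {token: id} mapping or as a
# list of pairs (e.g. Unigram-style [["<unk>", 0.0], ["hi", -8.3], ...]),
# so normalise it to a dict before building the tiktoken ranks.
raw_vocab = tokenizer_config["model"]["vocab"]  # hypothetical access path

if isinstance(raw_vocab, list):
    # assign ranks by position when the vocab is a list of [token, score] pairs
    vocab = {token: rank for rank, (token, _score) in enumerate(raw_vocab)}
else:
    vocab = dict(raw_vocab)
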
@bhavnicksm
Collaborator

Hey @Udayk02! 👋

Thanks for opening a PR 🚀

I tried this out, and while it no longer raises the error, the encodings for Cohere/Cohere-embed-multilingual-v3.0 don't match what they should be. Do you know why that could be happening?

A simple test to replicate this would be,

from autotiktokenizer import AutoTikTokenizer
tik = AutoTikTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
tik.encode("hi")

# Output: [ 249021 ]

from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
tok.encode("hi", add_special_tokens=False).ids

# Output: [ 1274 ]

I'm still trying to understand where the Cohere tokenizer is failing at the moment; please let me know if you're able to find the cause.

Thanks! 😊

@Udayk02
Author

Udayk02 commented Jan 10, 2025

Hey @bhavnicksm, I will check this and let you know if I find anything. Thank you!

@Udayk02
Author

Udayk02 commented Jan 11, 2025

I think the underlying issue is not really related to the vocab.

The same issue is present for every Cohere tokenizer. I think there is something missing here.

from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
tokenizer.encode("hi")

# Output: [4048]

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
tokenizer.encode("hi", add_special_tokens=False).ids

# Output: [7632]

If we look at the actual tokens for these token_ids in the vocab dictionary,
4048: ##hi
7632: hi

which are not the same; special characters are being added in the process. Checking on this further.
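
For reference, these lookups can be reproduced directly on the HF tokenizer (a quick check, assuming the same model as in the snippet above):

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
print(tok.id_to_token(4048))  # '##hi'
print(tok.id_to_token(7632))  # 'hi'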

Update: the issue is not limited to the Cohere models; tokenization is inconsistent with other models too.

Another observation: while tokenizing through autotiktokenizer, the tiktoken encoding splits the text into multiple smaller sub-tokens instead of taking the longer matching token, even though that token is present in the vocabulary, which is incorrect behaviour.

For example,
for the same input: "today's highlights"

while using sentence-transformers/all-MiniLM-L6-v2

autotiktokenizer: [b'to', b'day', b"'", b's', b'hi', b'gh', b'li', b'gh', b'ts']
hf tokenizers: ['today', "'", 's', 'highlights']

while using meta-llama/Llama-3.2-1B-Instruct

autotiktokenizer: [b't', b'oday', b"'", b's', b' highlights']
hf tokenizers: ['today', "'s", 'Ġhighlights']

I guess that, somehow, we are not keeping the longestFirst (longest-match-first) strategy alive here.
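
A small sketch of how the comparison above can be reproduced, assuming AutoTikTokenizer returns a tiktoken Encoding (as the earlier snippets suggest) and using the HF tokenizers API:

from autotiktokenizer import AutoTikTokenizer
from tokenizers import Tokenizer

model = "sentence-transformers/all-MiniLM-L6-v2"
text = "today's highlights"

# autotiktokenizer side: decode each id back to its raw bytes
tik = AutoTikTokenizer.from_pretrained(model)
print([tik.decode_single_token_bytes(t) for t in tik.encode(text)])

# HF tokenizers side: token strings straight from the encoding
tok = Tokenizer.from_pretrained(model)
print(tok.encode(text, add_special_tokens=False).tokens)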

@Udayk02
Author

Udayk02 commented Jan 18, 2025

I think I found the issue here. It is the way we pick the pre_tokenizer patterns.

Firstly, let us look at WordPiece tokenizers: the static regex you use splits the words something like this:

"today", "'", "s", " ", "highlights"

This won't align with BPE tokenization. Generally, in BPE, words are split with a regex something like this:

"today", "'s", " highlights"

A word is paired with the preceding whitespace, which allows " highlights" to match as a single token if it appears in the vocabulary.

Instead, it matches "##hi" or "hi", a sub-token, which is actually valid if you take the currently used pattern into consideration.
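
To illustrate the difference, here is a rough comparison of the two split styles (the exact patterns inside autotiktokenizer may differ; the BPE pattern below is the well-known GPT-2 one, and the word-level pattern is only a rough stand-in for the current static split):

import regex

BPE_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
WORD_LEVEL_PAT = r"""\p{L}+|\p{N}+|[^\s\p{L}\p{N}]"""

text = "today's highlights"
print(regex.findall(BPE_PAT, text))         # ['today', "'s", ' highlights']
print(regex.findall(WORD_LEVEL_PAT, text))  # ['today', "'", 's', 'highlights']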

To handle this, I assumed we had to play it in reverse. I changed the code in _get_mergeable_ranks as below:

if tokenizer_type == "wordpiece":
  # treat WordPiece continuation tokens ("##foo") as if they were
  # BPE space-prefixed tokens ("Ġfoo"); all other tokens are left as-is
  if token.startswith("##"):
      token = "Ġ" + token[2:]

In effect, I swapped the mapping between the sub-token and the standalone token. This is extremely wrong.
It worked initially, as you can see here:

[2651, 1005, 1055, 11637]
[b'today', b"'", b's', b'highlights']

But it fell apart with other examples, where standalone tokens appear as sub-tokens.

[2651, 1005, 1055, 11637, 1998, 2928, 2091, 1012]
[b'today', b"'", b's', b'highlights', b'and', b'mark', b'down', b'.']

Here, down is given the standalone down token_id instead of the ##down token_id, which is expected given the swap.

"down": 2091
"##down": 7698

I still haven't been able to come up with a valid solution here. Maybe we can try coming up with a pattern for WordPiece that aligns with BPE tokenization.


Secondly, the pattern for type "bpe" is always the static GPT-4 pattern. This shouldn't be the case.
The meta-llama/Llama-3.2-1B-Instruct tokenizer carries its own pattern, which is different from the GPT-4 one.

I took that pattern instead, and the whole flow worked perfectly.

[31213, 596, 22020, 323, 51594, 13]
[b'today', b"'s", b' highlights', b' and', b' markdown', b'.']
[40, 1097, 279, 32797, 323, 420, 374, 856, 19607, 13]
[b'I', b' am', b' the', b' senator', b' and', b' this', b' is', b' my', b' secretary', b'.']

Maybe, based on the model and its requirements, we can take those patterns from the tokenizers directly instead of hard-coding them, as sketched below.
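
A rough sketch of what "taking the pattern from the tokenizer directly" could look like, reading the Split regex out of tokenizer.json. The JSON layout assumed here (a "Sequence" of pre-tokenizers containing a "Split" with a "Regex" pattern) is what Llama-3-style tokenizers appear to use; other tokenizers may need extra handling:

import json

def extract_split_regex(tokenizer_json_path: str):
    """Return the first Split/Regex pre-tokenizer pattern, or None if absent."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        config = json.load(f)

    pre = config.get("pre_tokenizer") or {}
    # Normalise to a flat list of pre-tokenizer entries.
    if pre.get("type") == "Sequence":
        entries = pre.get("pretokenizers", [])
    else:
        entries = [pre]

    for entry in entries:
        if entry.get("type") == "Split":
            pattern = entry.get("pattern", {})
            if "Regex" in pattern:
                return pattern["Regex"]
    return None  # caller can fall back to the current hard-coded pattern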
