
[FIX] Explicit type-casting the vocab into a dictionary object #20

Open
wants to merge 1 commit into base: main

Conversation

@Udayk02 commented Jan 6, 2025

Explicitly type-casting the vocab variable into a dictionary handles this issue when a tokenizer.json presents the vocab as a list of lists.

- type-casting the `vocab` into dict
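
A minimal sketch of the idea behind the change (the actual one-line diff may differ; the access path below is illustrative, not autotiktokenizer's real variable name):

# tokenizer.json can store the vocab either as a {token: id} mapping or as a
# list of pairs (e.g. Unigram-style [["<unk>", 0.0], ["hi", -8.3], ...]),
# so normalise it to a dict before building the tiktoken ranks.
raw_vocab = tokenizer_config["model"]["vocab"]  # hypothetical access path

if isinstance(raw_vocab, list):
    # assign ranks by position when the vocab is a list of [token, score] pairs
    vocab = {token: rank for rank, (token, _score) in enumerate(raw_vocab)}
else:
    vocab = dict(raw_vocab)
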
@bhavnicksm
Collaborator

Hey @Udayk02! 👋

Thanks for opening a PR 🚀

I tried this out, and while it no longer raises the error, the encodings for Cohere/Cohere-embed-multilingual-v3.0 don't match what they should be. Do you know why that could be happening?

A simple test to replicate this would be,

from autotiktokenizer import AutoTikTokenizer
tik = AutoTikTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
tik.encode("hi")

# Output: [ 249021 ]

from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
tok.encode("hi", add_special_tokens=False).ids

# Output: [ 1274 ]

I'm still trying to understand where the Cohere tokenizer is failing at the moment; please let me know if you're able to find the cause.

Thanks! 😊

@Udayk02
Author

Udayk02 commented Jan 10, 2025

Hey @bhavnicksm, I will check this and let you know if I find anything. Thank you!

@Udayk02
Author

Udayk02 commented Jan 11, 2025

I think the underlying issue is not really related to the vocab.

The same issue is present for every Cohere tokenizer. I think there is something missing here.

from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
tokenizer.encode("hi")

# Output: [4048]

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
tokenizer.encode("hi", add_special_tokens=False).ids

# Output: [7632]

If we look at the actual tokens for these token_ids in the vocab dictionary,
4048: ##hi
7632: hi

which are not the same; special characters are being added in the process. Checking on this further.
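
For reference, these lookups can be reproduced directly on the HF tokenizer (a quick check, assuming the same model as in the snippet above):

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("Cohere/Cohere-embed-english-v3.0")
print(tok.id_to_token(4048))  # '##hi'
print(tok.id_to_token(7632))  # 'hi'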

Update: the issue is not limited to the Cohere models; tokenization is inconsistent with other models too.

Another observation: while tokenizing through autotiktokenizer, the tiktoken encoding splits the text into multiple smaller sub-tokens instead of taking the longer matching token, even though that token is present in the vocabulary, which is incorrect behaviour.

For example,
for the same input: "today's highlights"

while using sentence-transformers/all-MiniLM-L6-v2

autotiktokenizer: [b'to', b'day', b"'", b's', b'hi', b'gh', b'li', b'gh', b'ts']
hf tokenizers: ['today', "'", 's', 'highlights']

while using meta-llama/Llama-3.2-1B-Instruct

autotiktokenizer: [b't', b'oday', b"'", b's', b' highlights']
hf tokenizers: ['today', "'s", 'Ġhighlights']

I guess that, somehow, we are not keeping the longestFirst (longest-match-first) strategy alive here.
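
A small sketch of how the comparison above can be reproduced, assuming AutoTikTokenizer returns a tiktoken Encoding (as the earlier snippets suggest) and using the HF tokenizers API:

from autotiktokenizer import AutoTikTokenizer
from tokenizers import Tokenizer

model = "sentence-transformers/all-MiniLM-L6-v2"
text = "today's highlights"

# autotiktokenizer side: decode each id back to its raw bytes
tik = AutoTikTokenizer.from_pretrained(model)
print([tik.decode_single_token_bytes(t) for t in tik.encode(text)])

# HF tokenizers side: token strings straight from the encoding
tok = Tokenizer.from_pretrained(model)
print(tok.encode(text, add_special_tokens=False).tokens)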

@Udayk02
Author

Udayk02 commented Jan 18, 2025

I think I found the issue here. It is the way we pick the pre_tokenizer patterns.

Firstly, let us look at WordPiece tokenizers: the static regex you use splits the words something like this:

"today", "'", "s", " ", "highlights"

This won't align with BPE tokenization. Generally, in BPE, words are split with a regex something like this:

"today", "'s", " highlights"

A word is paired with the preceding whitespace, which allows " highlights" to match as a single token if it appears in the vocabulary.

Instead, it matches "##hi" or "hi", a sub-token, which is actually valid if you take the currently used pattern into consideration.
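
To illustrate the difference, here is a rough comparison of the two split styles (the exact patterns inside autotiktokenizer may differ; the BPE pattern below is the well-known GPT-2 one, and the word-level pattern is only a rough stand-in for the current static split):

import regex

BPE_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
WORD_LEVEL_PAT = r"""\p{L}+|\p{N}+|[^\s\p{L}\p{N}]"""

text = "today's highlights"
print(regex.findall(BPE_PAT, text))         # ['today', "'s", ' highlights']
print(regex.findall(WORD_LEVEL_PAT, text))  # ['today', "'", 's', 'highlights']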

To handle this, I assumed we had to play it in reverse. I changed the code in _get_mergeable_ranks as below:

if tokenizer_type == "wordpiece":
  # treat WordPiece continuation tokens ("##foo") as if they were
  # BPE space-prefixed tokens ("Ġfoo"); all other tokens are left as-is
  if token.startswith("##"):
      token = "Ġ" + token[2:]

In effect, I swapped the mapping between the sub-token and the standalone token. This is extremely wrong.
It worked initially, as you can see here:

[2651, 1005, 1055, 11637]
[b'today', b"'", b's', b'highlights']

But it fell apart with other examples, where standalone tokens appear as sub-tokens.

[2651, 1005, 1055, 11637, 1998, 2928, 2091, 1012]
[b'today', b"'", b's', b'highlights', b'and', b'mark', b'down', b'.']

Here, down is given the standalone down token_id instead of the ##down token_id, which is expected given the swap.

"down": 2091
"##down": 7698

I still haven't been able to come up with a valid solution here. Maybe we can try coming up with a pattern for WordPiece that aligns with BPE tokenization.


Secondly, the pattern for type "bpe" is always the static GPT-4 pattern. This shouldn't be the case.
The meta-llama/Llama-3.2-1B-Instruct tokenizer carries its own pattern, which is different from the GPT-4 one.

I took that pattern instead, and the whole flow worked perfectly.

[31213, 596, 22020, 323, 51594, 13]
[b'today', b"'s", b' highlights', b' and', b' markdown', b'.']
[40, 1097, 279, 32797, 323, 420, 374, 856, 19607, 13]
[b'I', b' am', b' the', b' senator', b' and', b' this', b' is', b' my', b' secretary', b'.']

Maybe, based on the model and its requirements, we can take those patterns from the tokenizers directly instead of hard-coding them, as sketched below.
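
A rough sketch of what "taking the pattern from the tokenizer directly" could look like, reading the Split regex out of tokenizer.json. The JSON layout assumed here (a "Sequence" of pre-tokenizers containing a "Split" with a "Regex" pattern) is what Llama-3-style tokenizers appear to use; other tokenizers may need extra handling:

import json

def extract_split_regex(tokenizer_json_path: str):
    """Return the first Split/Regex pre-tokenizer pattern, or None if absent."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        config = json.load(f)

    pre = config.get("pre_tokenizer") or {}
    # Normalise to a flat list of pre-tokenizer entries.
    if pre.get("type") == "Sequence":
        entries = pre.get("pretokenizers", [])
    else:
        entries = [pre]

    for entry in entries:
        if entry.get("type") == "Split":
            pattern = entry.get("pattern", {})
            if "Regex" in pattern:
                return pattern["Regex"]
    return None  # caller can fall back to the current hard-coded pattern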
