
fix huggingface tokenizer default length function #30185

Open · wants to merge 1 commit into master

Conversation

keshavshrikant

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. Ɑ: text splitters Related to text splitters package 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 9, 2025
@ccurme (Collaborator) left a comment


Hesitant to do this for two reasons:

  1. `.encode` is documented as "Same as doing `self.convert_tokens_to_ids(self.tokenize(text))`". Is this a bug in that tokenizer?
  2. Would this change behavior out from under users of bge-m3? All HF tokenizers?

Is this just an issue with bge-m3, or with all HF tokenizers?

@keshavshrikant (Author) replied:

`.encode` has an `add_special_tokens` parameter that defaults to `True`. Setting it to `False` makes the behaviour match `.tokenize`.

I tried it with sentence-transformers/all-mpnet-base-v2 and got matching output:

```python
from transformers import AutoTokenizer

tokenizer_mpnet = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

tokenizer_mpnet.tokenize("What is my name")
# ['what', 'is', 'my', 'name']

tokenizer_mpnet.encode("What is my name")
# [0, 2058, 2007, 2030, 2175, 2]

tokenizer_mpnet.encode("What is my name", add_special_tokens=False)
# [2058, 2007, 2030, 2175]
```

This behaviour is consistent with the bug I raised in #30184: changing `add_special_tokens` from `True` to `False` makes the chunks exactly the same.
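To illustrate the counting discrepancy without downloading a model, here is a minimal self-contained sketch. The `StubTokenizer` is a hypothetical stand-in (not the real transformers API beyond the `.tokenize`/`.encode` names and the `add_special_tokens` parameter), and `hf_token_length` is an assumed name for the kind of length function this PR proposes:

```python
# Hypothetical stub mimicking a HF-style tokenizer, with BOS/EOS ids
# and a tiny vocabulary taken from the example above.
class StubTokenizer:
    CLS_ID, SEP_ID = 0, 2
    VOCAB = {"what": 2058, "is": 2007, "my": 2030, "name": 2175}

    def tokenize(self, text):
        # Lowercased whitespace split, standing in for real subword tokenization.
        return [w.lower() for w in text.split()]

    def encode(self, text, add_special_tokens=True):
        ids = [self.VOCAB[t] for t in self.tokenize(text)]
        if add_special_tokens:
            # Default behaviour adds special tokens, inflating the count.
            return [self.CLS_ID] + ids + [self.SEP_ID]
        return ids

def hf_token_length(tokenizer, text):
    # Proposed counting: exclude special tokens so the result
    # matches len(tokenizer.tokenize(text)).
    return len(tokenizer.encode(text, add_special_tokens=False))

tok = StubTokenizer()
print(len(tok.encode("What is my name")))      # 6, includes specials
print(hf_token_length(tok, "What is my name")) # 4, matches .tokenize
```

The off-by-two here is exactly why chunks produced with the default `len(encode(...))` differ from chunks counted with `tokenize`.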
