
fix huggingface tokenizer default length function #30185

Open · wants to merge 1 commit into master

Conversation

keshavshrikant

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. Ɑ: text splitters Related to text splitters package 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 9, 2025
@ccurme (Collaborator) left a comment


Hesitant to do this for two reasons:

  1. `.encode` is documented as "Same as doing `self.convert_tokens_to_ids(self.tokenize(text))`". Is this a bug in that tokenizer?
  2. Would this change behavior out from under users of bge-m3? All HF tokenizers?

Is this just an issue with bge-m3, or with all HF tokenizers?

@keshavshrikant (Author) replied:

`.encode` has an `add_special_tokens` parameter that defaults to `True`. Setting it to `False` makes the behaviour match `.tokenize`.

I tried it with sentence-transformers/all-mpnet-base-v2 and got matching output:

```python
from transformers import AutoTokenizer

tokenizer_mpnet = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

tokenizer_mpnet.tokenize("What is my name")
# ['what', 'is', 'my', 'name']

tokenizer_mpnet.encode("What is my name")
# [0, 2058, 2007, 2030, 2175, 2]

tokenizer_mpnet.encode("What is my name", add_special_tokens=False)
# [2058, 2007, 2030, 2175]
```

This behaviour is consistent with the bug I raised in #30184: changing `add_special_tokens` from `True` to `False` makes the chunks exactly the same.
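To illustrate the counting discrepancy without downloading a model, here is a minimal self-contained sketch. The `StubTokenizer` is a hypothetical stand-in (not the real transformers API beyond the `.tokenize`/`.encode` names and the `add_special_tokens` parameter), and `hf_token_length` is an assumed name for the kind of length function this PR proposes:

```python
# Hypothetical stub mimicking a HF-style tokenizer, with BOS/EOS ids
# and a tiny vocabulary taken from the example above.
class StubTokenizer:
    CLS_ID, SEP_ID = 0, 2
    VOCAB = {"what": 2058, "is": 2007, "my": 2030, "name": 2175}

    def tokenize(self, text):
        # Lowercased whitespace split, standing in for real subword tokenization.
        return [w.lower() for w in text.split()]

    def encode(self, text, add_special_tokens=True):
        ids = [self.VOCAB[t] for t in self.tokenize(text)]
        if add_special_tokens:
            # Default behaviour adds special tokens, inflating the count.
            return [self.CLS_ID] + ids + [self.SEP_ID]
        return ids

def hf_token_length(tokenizer, text):
    # Proposed counting: exclude special tokens so the result
    # matches len(tokenizer.tokenize(text)).
    return len(tokenizer.encode(text, add_special_tokens=False))

tok = StubTokenizer()
print(len(tok.encode("What is my name")))      # 6, includes specials
print(hf_token_length(tok, "What is my name")) # 4, matches .tokenize
```

The off-by-two here is exactly why chunks produced with the default `len(encode(...))` differ from chunks counted with `tokenize`.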
