
Fix gemma3 token vocab mismatch #128


Open · wants to merge 1 commit into `main`

Conversation

@Saibo-creator (Collaborator) commented Apr 9, 2025

The Gemma 3 tokenizer has an inconsistency between its `vocab_size` value (262144) and the size of the vocab dictionary (262145).

It has a special token, `<image_soft_token>`, with token id 262144; this token appears in the vocab dictionary returned by `tokenizer.get_vocab()` but is not counted in `vocab_size`.
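The mismatch can be reproduced in miniature. The sketch below mimics the shape of the problem with a toy vocab dictionary rather than loading the real Gemma 3 tokenizer (which would require downloading the model); the token names other than `<image_soft_token>` are placeholders.

```python
# Toy reproduction of the mismatch: vocab_size reports 262144, but the
# vocab dictionary holds 262145 entries, because <image_soft_token>
# occupies id 262144 without being counted.
vocab_size = 262144
vocab = {f"tok_{i}": i for i in range(vocab_size)}  # stand-in for get_vocab()
vocab["<image_soft_token>"] = 262144  # in the dict, but not in vocab_size

assert len(vocab) == vocab_size + 1

# The offending entries are exactly those whose ids fall at or beyond
# the reported vocab_size.
extra = {tok: tid for tok, tid in vocab.items() if tid >= vocab_size}
print(extra)  # {'<image_soft_token>': 262144}
```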

@urroxyz (Contributor) commented Apr 10, 2025

I think #126 is more robust because it aligns the sizes after analyzing the mismatch rather than always truncating, so it isn't just a fix for Gemma 3 but also covers other models with their own niche vocabulary problems.
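The align-after-analyzing idea could be sketched roughly as below; `align_vocab` is a hypothetical helper written for illustration, not code taken from #126.

```python
def align_vocab(vocab: dict, reported_size: int) -> int:
    """Pick an effective vocab size by inspecting the mismatch,
    instead of unconditionally truncating to reported_size.
    (Hypothetical helper, for illustration only.)"""
    actual = len(vocab)
    if actual == reported_size:
        return reported_size  # no mismatch, nothing to do
    if actual > reported_size:
        # Extra tokens (e.g. <image_soft_token>) sit at ids >= reported_size;
        # grow the effective size to cover them rather than dropping them.
        return max(vocab.values()) + 1
    # Fewer entries than reported (holes in the vocab): trust the
    # reported size so ids stay addressable.
    return reported_size


# Miniature example: one uncounted token at id 2, reported size 2.
print(align_vocab({"a": 0, "b": 1, "<img>": 2}, 2))  # 3
```

For Gemma 3 this would grow the effective size to 262145 so `<image_soft_token>` survives, while a model whose dictionary already matches its `vocab_size` passes through untouched.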
