
Fix gemma3 token vocab mismatch #128


Open · wants to merge 1 commit into `main`

Conversation

@Saibo-creator (Collaborator) commented Apr 9, 2025

The Gemma 3 tokenizer has an inconsistency between its `vocab_size` value (262144) and the size of the vocab dictionary (262145).

It has a special token, `<image_soft_token>`, with token id 262144; this token appears in the vocab dictionary returned by `tokenizer.get_vocab()` but is not counted in `vocab_size`.
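The mismatch can be reproduced in miniature. The sketch below mimics the shape of the problem with a toy vocab dictionary rather than loading the real Gemma 3 tokenizer (which would require downloading the model); the token names other than `<image_soft_token>` are placeholders.

```python
# Toy reproduction of the mismatch: vocab_size reports 262144, but the
# vocab dictionary holds 262145 entries, because <image_soft_token>
# occupies id 262144 without being counted.
vocab_size = 262144
vocab = {f"tok_{i}": i for i in range(vocab_size)}  # stand-in for get_vocab()
vocab["<image_soft_token>"] = 262144  # in the dict, but not in vocab_size

assert len(vocab) == vocab_size + 1

# The offending entries are exactly those whose ids fall at or beyond
# the reported vocab_size.
extra = {tok: tid for tok, tid in vocab.items() if tid >= vocab_size}
print(extra)  # {'<image_soft_token>': 262144}
```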

@urroxyz (Contributor) commented Apr 10, 2025

I think #126 is more robust because it aligns the sizes after analyzing the mismatch rather than always truncating, so it isn't just a fix for Gemma 3 but also covers other models with their own niche vocabulary problems.
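The align-after-analyzing idea could be sketched roughly as below; `align_vocab` is a hypothetical helper written for illustration, not code taken from #126.

```python
def align_vocab(vocab: dict, reported_size: int) -> int:
    """Pick an effective vocab size by inspecting the mismatch,
    instead of unconditionally truncating to reported_size.
    (Hypothetical helper, for illustration only.)"""
    actual = len(vocab)
    if actual == reported_size:
        return reported_size  # no mismatch, nothing to do
    if actual > reported_size:
        # Extra tokens (e.g. <image_soft_token>) sit at ids >= reported_size;
        # grow the effective size to cover them rather than dropping them.
        return max(vocab.values()) + 1
    # Fewer entries than reported (holes in the vocab): trust the
    # reported size so ids stay addressable.
    return reported_size


# Miniature example: one uncounted token at id 2, reported size 2.
print(align_vocab({"a": 0, "b": 1, "<img>": 2}, 2))  # 3
```

For Gemma 3 this would grow the effective size to 262145 so `<image_soft_token>` survives, while a model whose dictionary already matches its `vocab_size` passes through untouched.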
