-
I'm not sure using sentencepiece directly would solve your issue here. These models most probably rely on HF tokenizers, which add several layers of customization on top of the subword methods.
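A minimal sketch of what that layering looks like in practice, assuming the checkpoint name and the local `tokenizer.model` path, which are illustrative only: the raw SentencePiece pieces and the HF tokenizer's output for the same text can differ because of normalization, special tokens, and byte fallback.

```python
import sentencepiece as spm
from transformers import AutoTokenizer

text = "Ia ora na i te ao nei"

# Raw SentencePiece pieces from the model's own subword model (path assumed)
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode(text, out_type=str))

# The HF tokenizer for the same checkpoint (name assumed) layers normalization,
# special-token handling, and byte fallback on top of that subword model
hf = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.1")
print(hf.tokenize(text))
```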
-
Hi, I need to use SentencePiece to tokenize the prompts on the fly, not onmt_tokenize.
We need the transforms (including tokenization) because without them the model is probably falling back to character-level tokenization. That would explain why I'm seeing only single-character tokens instead of multi-character Tahitian tokens, why I'm getting NaN values during generation, and why the model outputs UNK tokens.
We essentially have a mismatch between:
- the SentencePiece tokenizer that TowerInstruct was trained with, and
- Eole's current tokenization capabilities.

A sketch of the on-the-fly tokenization follows below.
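A minimal sketch of tokenizing prompts on the fly with the model's original SentencePiece model before handing them to Eole; the model path is an assumption, and it presumes the downstream pipeline splits pre-tokenized input on whitespace.

```python
import sentencepiece as spm

# Load the SentencePiece model TowerInstruct was trained with (path assumed)
sp = spm.SentencePieceProcessor(model_file="towerinstruct/tokenizer.model")

def pretokenize(prompt: str) -> str:
    # Encode to subword pieces and space-join them, so a whitespace-splitting
    # pipeline sees exactly the tokens the model was trained on
    return " ".join(sp.encode(prompt, out_type=str))

print(pretokenize("Ia ora na i te ao nei"))
```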