-
I'm not sure using sentencepiece directly would solve your issue here. These models most probably rely on HF tokenizers, which add several layers of customization on top of the subword methods.
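A minimal sketch of what that layering looks like in practice, assuming the checkpoint name and the local `tokenizer.model` path, which are illustrative only: the raw SentencePiece pieces and the HF tokenizer's output for the same text can differ because of normalization, special tokens, and byte fallback.

```python
import sentencepiece as spm
from transformers import AutoTokenizer

text = "Ia ora na i te ao nei"

# Raw SentencePiece pieces from the model's own subword model (path assumed)
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode(text, out_type=str))

# The HF tokenizer for the same checkpoint (name assumed) layers normalization,
# special-token handling, and byte fallback on top of that subword model
hf = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.1")
print(hf.tokenize(text))
```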
-
Hi, I need to use SentencePiece to tokenize the prompts on the fly, not onmt_tokenize.
We need the transforms (including tokenization) because without them the model is probably falling back to character-level tokenization. That would explain why I'm seeing only single-character tokens instead of multi-character Tahitian tokens, why I'm getting NaN values during generation, and why the model outputs UNK tokens.
We essentially have a mismatch between:
- the SentencePiece tokenizer that TowerInstruct was trained with, and
- Eole's current tokenization capabilities.

A sketch of the on-the-fly tokenization follows below.
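A minimal sketch of tokenizing prompts on the fly with the model's original SentencePiece model before handing them to Eole; the model path is an assumption, and it presumes the downstream pipeline splits pre-tokenized input on whitespace.

```python
import sentencepiece as spm

# Load the SentencePiece model TowerInstruct was trained with (path assumed)
sp = spm.SentencePieceProcessor(model_file="towerinstruct/tokenizer.model")

def pretokenize(prompt: str) -> str:
    # Encode to subword pieces and space-join them, so a whitespace-splitting
    # pipeline sees exactly the tokens the model was trained on
    return " ".join(sp.encode(prompt, out_type=str))

print(pretokenize("Ia ora na i te ao nei"))
```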