Question about retraining/fine-tuning EncoderModel with new words in t5.get_tokenizer() #358

Open
@Kevin7720

Description

I have added some new words to t5.get_tokenizer(), as shown below:

from transformers import T5Tokenizer

def get_tokenizer(name):
    # MAX_LENGTH is defined elsewhere at module level
    tokenizer = T5Tokenizer.from_pretrained(name, model_max_length=MAX_LENGTH)
    new_words = ['XXX', 'OOO', ......]
    tokenizer.add_tokens(new_words)
    return tokenizer

I would like to understand whether I need to retrain or fine-tune the EncoderModel after adding these new words to the tokenizer, and how this modification will affect the model's performance or behavior.
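For context on why this matters: `tokenizer.add_tokens()` assigns the new words ids beyond the pretrained vocabulary, so the encoder's embedding matrix no longer covers every token id. In Hugging Face `transformers` the model side is usually updated with `model.resize_token_embeddings(len(tokenizer))`, and the newly added rows start out randomly initialized, so they carry no learned meaning until they are trained. The following is a minimal NumPy sketch of that mechanism (the vocabulary size 32100 and `d_model=512` are assumptions for illustration, not values read from any model):

```python
import numpy as np

# Assumed sizes for illustration only.
old_vocab, d_model = 32100, 512
new_words = ['XXX', 'OOO']  # placeholder tokens from the question

# The pretrained embedding matrix covers only the original vocabulary.
embeddings = np.random.randn(old_vocab, d_model).astype(np.float32)

# add_tokens() hands the new words ids old_vocab, old_vocab + 1, ...
# Looking those ids up in the old matrix would be out of range, so the
# matrix must grow (transformers: model.resize_token_embeddings).
new_rows = np.random.randn(len(new_words), d_model).astype(np.float32)
embeddings = np.vstack([embeddings, new_rows])

# The matrix now covers the enlarged vocabulary, but the appended rows
# are random: until they are fine-tuned, the encoder output for 'XXX'
# and 'OOO' encodes nothing about what those words mean.
assert embeddings.shape == (old_vocab + len(new_words), d_model)
```

This is why fine-tuning (at least of the new embedding rows) is generally expected after adding tokens; a frozen pretrained encoder will otherwise map the new words to arbitrary vectors.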

This question is related to the Imagen project, and I want to ensure that I am following the correct approach when incorporating new words into the tokenizer.
