-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update for Issue #623 #628
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My current understanding of how the tokenizer is chosen is as follows (assuming a new experiment):
- If the config file of the experiment says to update the tokenizer, initialize a new tokenizer to be updated.
- Else if there is a parent model, use the tokenizer from the parent.
- Otherwise initialize a new tokenizer
Is that correct? I don't immediately have an opinion about this, but I'm curious if there was a discussion around when to use the parent tokenizer vs a new one?
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
Thank you for bringing it up! And yes, your understanding is correct. Since this will be used for Laura's project, I'm not sure if she will prefer to specify a tokenizer. That's why I keep the option of specifying a tokenizer in the config file. However, in my testing, I didn't specify it, so all tokenizers were from the parent directory. I'll edit my summary a bit. Thank you! @isaac091 |
Hmm, that's a good question. I think that most of the time it makes sense to use the tokenizer from the parent model (since the parent model has already been trained with that tokenizer). Maybe it's worth keeping the option to specify another tokenizer, if that's the desired behavior, though? |
Hi @laura-burdick-sil, I "designed" it in this way because I was not 100% sure about your project, so I prioritize config in case you may need to use a specific tokenizer or different source of languages. However, I personally would prefer to prioritize parent tokenizer because as you said, the model was trained with it. If you think it is better to just use the parent tokenizer all the time, I can change it. It will just be one line of code so don't worry! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can move forward with what you currently have. I think it is logical to allow you to specify a new tokenizer to override the parent tokenizer. We can always change it later, if we find that it doesn't work for us. This is excellent work.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @AmeWenJ)
… is no BEST checkpoint when using a parent model
Thank you for all your reviews! I just made another tiny change in the checkpoint loading part. Now it will load the LAST checkpoint if there is not a BEST one. I've also added @laura-burdick-sil as a reviewer. Once she believes this update can support the project going on, it should be good to merge! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @laura-burdick-sil)
Update the
hugging_face_config.py
:SilSeq2SeqTrainer
to store the model prefix (when the model is from a parent experiment, the model name will be a path, so we can't get directly a prefix).This modification has been tested on:
The remaining potential issues:
Model name becomes a path when the parent is specified, so it is not safe to directly use it anymore (such as in methods that need to get a prefix from the model name). That is why I edited the trainer class. I'm not sure if there are other methods that will be influenced, but I went over the related methods and think it's all good now.
If the parent model prefix is different from the model specified in the config, it will throw an exception. Based on my intuition, since we are supposed to use the parent model, the model specified in the config should be ignored? It feels weird that it is specified, so I see it as a way to double check something like "This is the model I want! It should be an NLLB instead of T5".
Please let me know if there is any further update or modification needed.
This change is