
Update for Issue #623 #628

Open · wants to merge 3 commits into master
Conversation

@AmeWenJ (Collaborator) commented Jan 15, 2025

Updates to hugging_face_config.py (a rough sketch of this logic follows the list):

  • Load the model from the parent experiment if a parent is specified in the config.
  • Use the same tokenizer as the parent experiment unless one is specified in the config.
  • Add an optional parameter to SilSeq2SeqTrainer to store the model prefix (when the model comes from a parent experiment, the model name is a path, so a prefix can't be derived from it directly).
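
For illustration, here is a minimal sketch of the loading behavior described above, assuming a flat config and a `MT/experiments` directory layout. The names (`resolve_model_and_tokenizer`, `EXPERIMENTS_DIR`, the `parent` and `tokenizer` keys) are hypothetical, not the actual silnlp API:

```python
from pathlib import Path

EXPERIMENTS_DIR = Path("MT/experiments")  # assumed experiments root

def resolve_model_and_tokenizer(config: dict) -> tuple[str, str]:
    """Pick the model and tokenizer sources for a new experiment run."""
    parent = config.get("parent")  # e.g. "my_parent_experiment"
    if parent:
        # With a parent, the model "name" is really a path into the parent
        # experiment's folder, not a Hugging Face model ID.
        model = str(EXPERIMENTS_DIR / parent / "run")
    else:
        model = config["model"]    # e.g. "facebook/nllb-200-distilled-600M"
    # The tokenizer follows the model (and hence the parent) unless the
    # config names one explicitly.
    tokenizer = config.get("tokenizer") or model
    return model, tokenizer
```

For example, `resolve_model_and_tokenizer({"parent": "exp_base"})` would return the parent's run directory for both the model and the tokenizer.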

This modification has been tested on:

  • Training without a parent or LoRA.
  • Training without a parent but with LoRA.
  • Training with a parent and LoRA.
  • Training with a parent but without LoRA.

Remaining potential issues:

  • The model name becomes a path when a parent is specified, so it is no longer safe to use it directly (for example, in methods that need to derive a prefix from the model name). That is why I edited the trainer class. I'm not sure whether other methods are affected, but I went over the related ones and believe they are all fine now.

  • If the parent model's prefix differs from the model specified in the config, an exception is thrown. My intuition is that since we are supposed to use the parent model, the model specified in the config could arguably be ignored. It feels odd to specify it at all, so I treat it as a double check, along the lines of "This is the model I want! It should be an NLLB, not a T5." (A sketch of such a check follows this list.)
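
To make that double check concrete, here is one way such a guard could look. This is an illustrative sketch only; the helper name `check_model_prefix` and the prefix convention are assumptions, not the code in this PR:

```python
def check_model_prefix(parent_prefix: str, config_model: str) -> None:
    """Raise if the parent model family disagrees with the configured model.

    `parent_prefix` is the prefix stored for the parent experiment's model
    (e.g. "nllb"); `config_model` is the model named in the config.
    """
    if parent_prefix not in config_model:
        raise ValueError(
            f"Parent model prefix '{parent_prefix}' does not match the model "
            f"'{config_model}' specified in the config."
        )

check_model_prefix("nllb", "facebook/nllb-200-distilled-600M")  # OK
# check_model_prefix("nllb", "google-t5/t5-base")               # raises ValueError
```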

Please let me know if any further updates or modifications are needed.



@isaac091 (Collaborator) left a comment

My current understanding of how the tokenizer is chosen is as follows (assuming a new experiment):

  1. If the config file of the experiment says to update the tokenizer, initialize a new tokenizer to be updated.
  2. Else if there is a parent model, use the tokenizer from the parent.
  3. Otherwise, initialize a new tokenizer.

Is that correct? I don't immediately have an opinion about this, but I'm curious whether there was a discussion about when to use the parent tokenizer vs. a new one. (This precedence is sketched just after this comment.)

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
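
The precedence described in this comment could be sketched as follows. The helper names (`choose_tokenizer`, `init_new_tokenizer`) and the `update_tokenizer` config key are hypothetical, chosen only to mirror the three steps above:

```python
from transformers import AutoTokenizer

def init_new_tokenizer(config: dict):
    # Assumed helper: build a fresh tokenizer from the configured model.
    return AutoTokenizer.from_pretrained(config["model"])

def choose_tokenizer(config: dict, parent_tokenizer=None):
    if config.get("update_tokenizer"):   # 1. config asks to update the tokenizer
        return init_new_tokenizer(config)
    if parent_tokenizer is not None:     # 2. a parent model exists
        return parent_tokenizer
    return init_new_tokenizer(config)    # 3. otherwise start fresh
```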

@AmeWenJ (Collaborator, Author) commented Jan 15, 2025

> My current understanding of how the tokenizer is chosen is as follows (assuming a new experiment):
>
>   1. If the config file of the experiment says to update the tokenizer, initialize a new tokenizer to be updated.
>   2. Else if there is a parent model, use the tokenizer from the parent.
>   3. Otherwise, initialize a new tokenizer.
>
> Is that correct? I don't immediately have an opinion about this, but I'm curious whether there was a discussion about when to use the parent tokenizer vs. a new one.

Thank you for bringing it up! And yes, your understanding is correct. Since this will be used for Laura's project, I'm not sure whether she will prefer to specify a tokenizer; that's why I kept the option of specifying one in the config file. However, in my testing I didn't specify it, so all tokenizers came from the parent experiment's directory.

I'll edit my summary a bit. Thank you! @isaac091

@laura-burdick-sil (Collaborator) commented

Hmm, that's a good question. I think that most of the time it makes sense to use the tokenizer from the parent model (since the parent model has already been trained with that tokenizer). Maybe it's worth keeping the option to specify another tokenizer, if that's the desired behavior, though?

@AmeWenJ (Collaborator, Author) commented Jan 16, 2025

> Hmm, that's a good question. I think that most of the time it makes sense to use the tokenizer from the parent model (since the parent model has already been trained with that tokenizer). Maybe it's worth keeping the option to specify another tokenizer, if that's the desired behavior, though?

Hi @laura-burdick-sil, I designed it this way because I wasn't 100% sure about your project, so I prioritized the config in case you need to use a specific tokenizer or a different set of source languages. However, I would personally prefer to prioritize the parent tokenizer because, as you said, the model was trained with it. If you think it is better to always use the parent tokenizer, I can change it; it's just one line of code, so don't worry!

@ddaspit (Collaborator) left a comment

I think we can move forward with what you currently have. It is logical to allow specifying a new tokenizer to override the parent tokenizer, and we can always change it later if we find that it doesn't work for us. This is excellent work.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @AmeWenJ)

@AmeWenJ (Collaborator, Author) commented Jan 17, 2025

Thank you for all your reviews! I just made another small change in the checkpoint loading: it will now load the LAST checkpoint if there is no BEST one (a rough sketch of that fallback is below). I've also added @laura-burdick-sil as a reviewer. Once she confirms this update can support her project going forward, it should be good to merge!
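
A minimal sketch of that fallback, assuming Hugging Face-style `checkpoint-<step>` directories plus a `checkpoint-best` folder; the layout and the `find_checkpoint` name are assumptions, not the exact silnlp implementation:

```python
from pathlib import Path
from typing import Optional

def find_checkpoint(run_dir: Path) -> Optional[Path]:
    """Prefer the BEST checkpoint; otherwise fall back to the LAST one."""
    best = run_dir / "checkpoint-best"
    if best.is_dir():
        return best
    # No BEST checkpoint (e.g. when training from a parent model without a
    # saved "best"), so take the checkpoint with the highest step number.
    steps = [p for p in run_dir.glob("checkpoint-*")
             if p.name.rsplit("-", 1)[-1].isdigit()]
    return max(steps, key=lambda p: int(p.name.rsplit("-", 1)[-1]), default=None)
```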

@ddaspit (Collaborator) left a comment

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @laura-burdick-sil)

Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:

  • Use a different experiment folder's checkpoints as the base model when running an silnlp experiment on clearml