
Update for Issue #623 #628

Open · wants to merge 3 commits into master
Conversation

@AmeWenJ (Collaborator) commented Jan 15, 2025

Updates to hugging_face_config.py (a rough sketch of this logic follows the list):

  • Load the model from the parent experiment if a parent is specified in the config.
  • Use the same tokenizer as the parent experiment unless one is specified in the config.
  • Add an optional parameter to SilSeq2SeqTrainer to store the model prefix (when the model comes from a parent experiment, the model name is a path, so a prefix can't be derived from it directly).
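
For illustration, here is a minimal sketch of the loading behavior described above, assuming a flat config and a `MT/experiments` directory layout. The names (`resolve_model_and_tokenizer`, `EXPERIMENTS_DIR`, the `parent` and `tokenizer` keys) are hypothetical, not the actual silnlp API:

```python
from pathlib import Path

EXPERIMENTS_DIR = Path("MT/experiments")  # assumed experiments root

def resolve_model_and_tokenizer(config: dict) -> tuple[str, str]:
    """Pick the model and tokenizer sources for a new experiment run."""
    parent = config.get("parent")  # e.g. "my_parent_experiment"
    if parent:
        # With a parent, the model "name" is really a path into the parent
        # experiment's folder, not a Hugging Face model ID.
        model = str(EXPERIMENTS_DIR / parent / "run")
    else:
        model = config["model"]    # e.g. "facebook/nllb-200-distilled-600M"
    # The tokenizer follows the model (and hence the parent) unless the
    # config names one explicitly.
    tokenizer = config.get("tokenizer") or model
    return model, tokenizer
```

For example, `resolve_model_and_tokenizer({"parent": "exp_base"})` would return the parent's run directory for both the model and the tokenizer.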

This modification has been tested on:

  • Training without a parent or LoRA.
  • Training without a parent but with LoRA.
  • Training with a parent and LoRA.
  • Training with a parent but without LoRA.

Remaining potential issues:

  • The model name becomes a path when a parent is specified, so it is no longer safe to use it directly (for example, in methods that need to derive a prefix from the model name). That is why I edited the trainer class. I'm not sure whether other methods are affected, but I went over the related ones and believe they are all fine now.

  • If the parent model's prefix differs from the model specified in the config, an exception is thrown. My intuition is that since we are supposed to use the parent model, the model specified in the config could arguably be ignored. It feels odd to specify it at all, so I treat it as a double check, along the lines of "This is the model I want! It should be an NLLB, not a T5." (A sketch of such a check follows this list.)
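
To make that double check concrete, here is one way such a guard could look. This is an illustrative sketch only; the helper name `check_model_prefix` and the prefix convention are assumptions, not the code in this PR:

```python
def check_model_prefix(parent_prefix: str, config_model: str) -> None:
    """Raise if the parent model family disagrees with the configured model.

    `parent_prefix` is the prefix stored for the parent experiment's model
    (e.g. "nllb"); `config_model` is the model named in the config.
    """
    if parent_prefix not in config_model:
        raise ValueError(
            f"Parent model prefix '{parent_prefix}' does not match the model "
            f"'{config_model}' specified in the config."
        )

check_model_prefix("nllb", "facebook/nllb-200-distilled-600M")  # OK
# check_model_prefix("nllb", "google-t5/t5-base")               # raises ValueError
```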

Please let me know if any further updates or modifications are needed.



@isaac091 (Collaborator) left a comment

My current understanding of how the tokenizer is chosen is as follows (assuming a new experiment):

  1. If the config file of the experiment says to update the tokenizer, initialize a new tokenizer to be updated.
  2. Else if there is a parent model, use the tokenizer from the parent.
  3. Otherwise, initialize a new tokenizer.

Is that correct? I don't immediately have an opinion about this, but I'm curious whether there was a discussion about when to use the parent tokenizer vs. a new one. (This precedence is sketched just after this comment.)

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
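
The precedence described in this comment could be sketched as follows. The helper names (`choose_tokenizer`, `init_new_tokenizer`) and the `update_tokenizer` config key are hypothetical, chosen only to mirror the three steps above:

```python
from transformers import AutoTokenizer

def init_new_tokenizer(config: dict):
    # Assumed helper: build a fresh tokenizer from the configured model.
    return AutoTokenizer.from_pretrained(config["model"])

def choose_tokenizer(config: dict, parent_tokenizer=None):
    if config.get("update_tokenizer"):   # 1. config asks to update the tokenizer
        return init_new_tokenizer(config)
    if parent_tokenizer is not None:     # 2. a parent model exists
        return parent_tokenizer
    return init_new_tokenizer(config)    # 3. otherwise start fresh
```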

@AmeWenJ (Collaborator, Author) commented Jan 15, 2025

> My current understanding of how the tokenizer is chosen is as follows (assuming a new experiment):
>
>   1. If the config file of the experiment says to update the tokenizer, initialize a new tokenizer to be updated.
>   2. Else if there is a parent model, use the tokenizer from the parent.
>   3. Otherwise, initialize a new tokenizer.
>
> Is that correct? I don't immediately have an opinion about this, but I'm curious whether there was a discussion about when to use the parent tokenizer vs. a new one.

Thank you for bringing it up! And yes, your understanding is correct. Since this will be used for Laura's project, I'm not sure whether she will prefer to specify a tokenizer; that's why I kept the option of specifying one in the config file. However, in my testing I didn't specify it, so all tokenizers came from the parent experiment's directory.

I'll edit my summary a bit. Thank you! @isaac091

@laura-burdick-sil (Collaborator) commented

Hmm, that's a good question. I think that most of the time it makes sense to use the tokenizer from the parent model (since the parent model has already been trained with that tokenizer). Maybe it's worth keeping the option to specify another tokenizer, if that's the desired behavior, though?

@AmeWenJ (Collaborator, Author) commented Jan 16, 2025

> Hmm, that's a good question. I think that most of the time it makes sense to use the tokenizer from the parent model (since the parent model has already been trained with that tokenizer). Maybe it's worth keeping the option to specify another tokenizer, if that's the desired behavior, though?

Hi @laura-burdick-sil, I designed it this way because I wasn't 100% sure about your project, so I prioritized the config in case you need to use a specific tokenizer or a different set of source languages. However, I would personally prefer to prioritize the parent tokenizer because, as you said, the model was trained with it. If you think it is better to always use the parent tokenizer, I can change it; it's just one line of code, so don't worry!

@ddaspit (Collaborator) left a comment

I think we can move forward with what you currently have. It is logical to allow specifying a new tokenizer to override the parent tokenizer, and we can always change it later if we find that it doesn't work for us. This is excellent work.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @AmeWenJ)

@AmeWenJ (Collaborator, Author) commented Jan 17, 2025

Thank you for all your reviews! I just made another small change in the checkpoint loading: it will now load the LAST checkpoint if there is no BEST one (a rough sketch of that fallback is below). I've also added @laura-burdick-sil as a reviewer. Once she confirms this update can support her project going forward, it should be good to merge!
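
A minimal sketch of that fallback, assuming Hugging Face-style `checkpoint-<step>` directories plus a `checkpoint-best` folder; the layout and the `find_checkpoint` name are assumptions, not the exact silnlp implementation:

```python
from pathlib import Path
from typing import Optional

def find_checkpoint(run_dir: Path) -> Optional[Path]:
    """Prefer the BEST checkpoint; otherwise fall back to the LAST one."""
    best = run_dir / "checkpoint-best"
    if best.is_dir():
        return best
    # No BEST checkpoint (e.g. when training from a parent model without a
    # saved "best"), so take the checkpoint with the highest step number.
    steps = [p for p in run_dir.glob("checkpoint-*")
             if p.name.rsplit("-", 1)[-1].isdigit()]
    return max(steps, key=lambda p: int(p.name.rsplit("-", 1)[-1]), default=None)
```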

@ddaspit (Collaborator) left a comment

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @laura-burdick-sil)

Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:

  • Use a different experiment folder's checkpoints as the base model when running an silnlp experiment on clearml