dynamic scoring for LM #188
base: main
Conversation
```diff
@@ -92,6 +92,9 @@ def tokenize_string(self, string, side="src", is_train=False):
             kwargs = {"max_length": self.max_length, "truncation": True}
         else:
             kwargs = {}
+        string = string.replace(DefaultTokens.SEP, "\n").replace(
+            DefaultTokens.MASK_BEFORE, self.tokenizers[side].pad_token
+        )
```
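The replacement added in this hunk can be sketched in isolation. This is a minimal, self-contained approximation: the actual values of `DefaultTokens.SEP` and `DefaultTokens.MASK_BEFORE`, and the pad token, are framework-specific, so the constants below are placeholders only.

```python
# Illustrative stand-ins for the framework's special tokens; the real
# values come from DefaultTokens and the HF tokenizer's pad_token.
class DefaultTokens:
    SEP = "｟newline｠"
    MASK_BEFORE = "｟_mask_before_｠"

pad_token = "<pad>"  # stands in for self.tokenizers[side].pad_token

def normalize(string):
    # Map the framework's separator back to a literal newline, and the
    # mask-before marker to the tokenizer's pad token, before tokenizing.
    return string.replace(DefaultTokens.SEP, "\n").replace(
        DefaultTokens.MASK_BEFORE, pad_token
    )
```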
Shouldn't we handle it the same way for the other tokenizers?
I have only tested for that one (with eurollm) at this point; an error will be raised with the others.
It seems a bit weird to have such very specific changes in `scoring_utils` (`response_patterns` handling mostly), but I'm not sure how this can be properly factorized with the current structure. Let's keep it for now and we might reconsider later.
```python
if is_seq2seq:
    predictor = Translator.from_config(  # we need to review opt/config stuff in translator
        model,
        self.vocabs,
        predict_config,
        model_config,
        device_id=gpu_rank,
        global_scorer=scorer,
        report_align=predict_config.report_align,
        report_score=False,
        logger=None,
    )
else:
    predictor = GeneratorLM.from_config(
        model,
        self.vocabs,
        predict_config,
        model_config,
        device_id=gpu_rank,
        global_scorer=scorer,
        report_align=predict_config.report_align,
        report_score=False,
        logger=None,
    )
```
Maybe cleaner to just define a `predictor_class` in the condition and call `predictor_class(*)` once, since they should have the same signature.
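The reviewer's suggestion can be sketched as follows. `Translator` and `GeneratorLM` stand for the classes in the diff above; the stubs and the `build_predictor` helper here are illustrative only, not the project's actual API.

```python
# Minimal stubs standing in for the real Translator / GeneratorLM classes,
# which share the same from_config signature.
class Translator:
    @classmethod
    def from_config(cls, model, **kwargs):
        return cls()

class GeneratorLM:
    @classmethod
    def from_config(cls, model, **kwargs):
        return cls()

def build_predictor(is_seq2seq, model, **common_kwargs):
    # Pick the class in the condition, then keep a single call site,
    # avoiding the duplicated argument list in the original diff.
    predictor_class = Translator if is_seq2seq else GeneratorLM
    return predictor_class.from_config(model, **common_kwargs)
```

This removes the duplicated keyword-argument block, so future signature changes only need to be made in one place.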
This PR adapts the `ScoringPreparator` for the LM architecture and fixes the `ignore_prompt` method of the loss for LM validation with left padding.

We noticed that the `filtertoolong` transform should not be used along with the `huggingface_tokenize` transform. See #191
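The left-padding issue behind the `ignore_prompt` fix can be illustrated with a small sketch: with left padding, the prompt no longer starts at position 0, so a mask over "the first N tokens" must be shifted by the pad length. All names below are illustrative, not the project's actual implementation.

```python
PAD = 0  # illustrative pad token id

def ignore_prompt_mask(token_ids, prompt_len):
    """Return True where the loss should be computed (response tokens only).

    With left padding, the sequence looks like [pad...][prompt][response],
    so the ignored span starts after both the pads and the prompt.
    """
    n_pad = 0
    for t in token_ids:
        if t == PAD:
            n_pad += 1
        else:
            break
    start = n_pad + prompt_len
    return [i >= start for i in range(len(token_ids))]
```

Masking a fixed prefix of length `prompt_len` without the `n_pad` offset would wrongly ignore pad positions and include the tail of the prompt in the loss.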