dynamic scoring for LM #188

Open · wants to merge 7 commits into main
Conversation

@l-k-11235 (Contributor) commented Jan 17, 2025

This PR adapts the ScoringPreparator for the LM architecture and fixes the ignore_prompt method of the loss for LM validation with left padding.
We noticed that the filtertoolong transform should not be used along with the huggingface_tokenize transform. See #191.
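
For orientation, here is a minimal, hypothetical sketch of what ignoring the prompt in a left-padded LM validation batch can look like. The name ignore_prompt matches the PR description, but the tensor layout, the sentinel convention, and all argument names are assumptions, not the actual patch:

    import torch

    def ignore_prompt(tgt, pad_idx, sentinel_idx):
        """Mask prompt tokens so they do not contribute to the validation loss.

        Assumes tgt is (batch, seq_len), left-padded, and that a sentinel token
        (cf. DefaultTokens.MASK_BEFORE in the diff below) marks where the
        prompt ends. Everything up to and including the sentinel is replaced
        with the padding index, which the loss already ignores.
        """
        # Index of the first sentinel occurrence per row (0 if absent; a real
        # implementation would guard against that case).
        sentinel_pos = (tgt == sentinel_idx).int().argmax(dim=1, keepdim=True)
        positions = torch.arange(tgt.size(1), device=tgt.device).unsqueeze(0)
        # With left padding, padding and prompt both sit to the left of the
        # response, so a single "<= sentinel position" mask covers them.
        prompt_mask = positions <= sentinel_pos
        return tgt.masked_fill(prompt_mask, pad_idx)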

@@ -92,6 +92,9 @@ def tokenize_string(self, string, side="src", is_train=False):
             kwargs = {"max_length": self.max_length, "truncation": True}
         else:
             kwargs = {}
+        string = string.replace(DefaultTokens.SEP, "\n").replace(
+            DefaultTokens.MASK_BEFORE, self.tokenizers[side].pad_token
+        )
Contributor:
shouldn't we handle it the same way for other tokenizers ?

@l-k-11235 (author):
I have only tested that one (with eurollm) at this point; an error will be raised for the others.
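
For illustration, a defensive guard (hypothetical, not part of the PR) could make that failure explicit for tokenizers without a pad token:

    pad_token = getattr(self.tokenizers[side], "pad_token", None)
    if pad_token is None:
        # Only the eurollm HuggingFace tokenizer has been tested; others
        # may not define a pad_token, so fail with a clear message.
        raise ValueError(
            f"MASK_BEFORE replacement needs a pad_token; tokenizer "
            f"{type(self.tokenizers[side]).__name__} does not define one."
        )
    string = string.replace(DefaultTokens.SEP, "\n").replace(
        DefaultTokens.MASK_BEFORE, pad_token
    )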

@francoishernandez (Member) left a comment:

It seems a bit weird to have such very specific changes in scoring_utils (mostly the response_patterns handling), but I'm not sure how this can be properly factorized with the current structure. Let's keep it for now; we may reconsider later.

Comment on lines +80 to +103
if is_seq2seq:
predictor = Translator.from_config( # we need to review opt/config stuff in translator
model,
self.vocabs,
predict_config,
model_config,
device_id=gpu_rank,
global_scorer=scorer,
report_align=predict_config.report_align,
report_score=False,
logger=None,
)
else:
predictor = GeneratorLM.from_config(
model,
self.vocabs,
predict_config,
model_config,
device_id=gpu_rank,
global_scorer=scorer,
report_align=predict_config.report_align,
report_score=False,
logger=None,
)
Member:

Maybe cleaner to just define a predictor_class in the condition, and call predictor_class(*) once, since they should have the same signature.
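
As a sketch of that suggestion, assuming both classes expose the same from_config signature (as the diff above suggests):

    # Pick the class once; both share the same from_config signature.
    predictor_class = Translator if is_seq2seq else GeneratorLM
    predictor = predictor_class.from_config(
        model,
        self.vocabs,
        predict_config,
        model_config,
        device_id=gpu_rank,
        global_scorer=scorer,
        report_align=predict_config.report_align,
        report_score=False,
        logger=None,
    )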
