
Jasper and Stella: distillation of SOTA embedding models #67

Merged: 9 commits merged into master on Jan 6, 2025

Conversation

NLPJCL (Member) commented Dec 28, 2024:

  1. Support the [infgrad/jasper_en_vision_language_v1](https://huggingface.co/infgrad/jasper_en_vision_language_v1) embedding distillation method.

Todo:

  • Support two teacher embeddings. (✓)

```python
return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
```


Reviewer: What if the teacher model doesn't need a prompt? You could also make this code comment more detailed.

NLPJCL (Member, Author) replied:


Thanks for the reminder; I'll add a note about this to the README.
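
For reference, one way to make the instruction optional is to skip the wrapper entirely when no task description is given; a minimal sketch, where `build_query` and its arguments are illustrative names, not code from this PR:

```python
def build_query(query: str, task_description: str | None = None) -> str:
    """Wrap a query with an instruction only when the teacher expects one."""
    if task_description is None:
        # Teachers trained without instructions get the raw query unchanged.
        return query
    return f'Instruct: {task_description}\nQuery: {query}'
```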

```python
dic = set()
with open(train_data_path) as f:
    for line in tqdm.tqdm(f):
        data_dic = json.loads(line.strip())
```


Reviewer:

```python
data_dict = json.loads(line)
```

`json.loads` does not need `strip`, and spelling variable names out in full is never a mistake.



```python
if 'pos' in data_dic:
    for text_pos in data_dic['pos']:
```


Reviewer:

```python
for text_pos in data_dic.get('pos', []):
```
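
Putting the review suggestions together, the loading loop might end up looking like this; a sketch only, renaming `dic` to `texts` for clarity and assuming the set collects positive passages:

```python
import json

import tqdm

texts = set()
with open(train_data_path) as f:
    for line in tqdm.tqdm(f):
        data_dict = json.loads(line)  # json.loads already tolerates surrounding whitespace
        # A missing 'pos' key simply yields zero iterations.
        for text_pos in data_dict.get('pos', []):
            texts.add(text_pos)
```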

```python
trust_remote_code=True,
device="cuda:7",
model_kwargs={
    "torch_dtype": torch.bfloat16,  # fp16 easily produces NaN values
```


Reviewer: Remove the Chinese characters.

Also remove this comment; that experience does not hold for every model.

NLPJCL (Member, Author) replied:


OK, thanks.
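
In that spirit, the dtype could be surfaced as a parameter rather than documented in a comment; a sketch assuming the teacher is loaded with sentence_transformers (function and argument names are illustrative):

```python
import torch
from sentence_transformers import SentenceTransformer

def load_teacher(model_name: str, dtype: torch.dtype = torch.bfloat16) -> SentenceTransformer:
    # Let callers choose the dtype per model instead of baking in bf16-vs-fp16 advice.
    return SentenceTransformer(
        model_name,
        trust_remote_code=True,
        device="cuda:7",
        model_kwargs={"torch_dtype": dtype},
    )
```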

```python
        return loss

    def pair_inbatch_similarity_loss(
        self,
```


Reviewer: If you want to improve this further, consider adding type hints (typing).

NLPJCL (Member, Author) replied:


Will add them in a follow-up.
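
For illustration, the type-hinted signature might read as follows (shapes copied from the existing comments; a sketch, not the follow-up itself):

```python
from torch import Tensor

class DistillLoss:
    def pair_inbatch_similarity_loss(
        self,
        student_embeddings: Tensor,  # [batch_size, dim]
        teacher_similarity: Tensor,  # [batch_size, dim]
    ) -> Tensor:
        ...
```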

```python
        student_embeddings,  # [batch_size, dim]
        teacher_similarity,  # [batch_size, dim]
    ):
        loss_fct = nn.MSELoss()
```


Reviewer: Does a new nn.MSELoss() really need to be instantiated every time the loss is computed? Unless the runtime optimizes this away, I'd suggest using F.mse_loss.

NLPJCL (Member, Author) replied:


Fixed.
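
The functional form presumably looks something like this (a sketch; the tensors here are illustrative stand-ins for the actual similarity matrices):

```python
import torch
import torch.nn.functional as F

student_similarity = torch.randn(4, 4)  # illustrative [batch_size, batch_size] matrix
teacher_similarity = torch.randn(4, 4)

# Stateless functional API: no nn.MSELoss module is constructed per call.
loss = F.mse_loss(student_similarity, teacher_similarity)
```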

@NLPJCL mentioned this pull request on Jan 5, 2025.

@NLPJCL merged commit 801c8f1 into master on Jan 6, 2025.