
Jasper and Stella: distillation of SOTA embedding models #67

Merged: 9 commits merged into master on Jan 6, 2025

Conversation

NLPJCL (Member) commented Dec 28, 2024:

  1. Support the [infgrad/jasper_en_vision_language_v1](https://huggingface.co/infgrad/jasper_en_vision_language_v1) embedding distillation method.

Todo:

  • Support two teacher embeddings. (✓)

```python
return f'Instruct: {task_description}\nQuery: {query}'

# Each query must come with a one-sentence instruction that describes the task
```


Reviewer: What if the teacher model doesn't need a prompt? You could also make this code comment more detailed.

NLPJCL (Member, Author) replied:


Thanks for the reminder; I'll add a note about this to the README.
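
For reference, one way to make the instruction optional is to skip the wrapper entirely when no task description is given; a minimal sketch, where `build_query` and its arguments are illustrative names, not code from this PR:

```python
def build_query(query: str, task_description: str | None = None) -> str:
    """Wrap a query with an instruction only when the teacher expects one."""
    if task_description is None:
        # Teachers trained without instructions get the raw query unchanged.
        return query
    return f'Instruct: {task_description}\nQuery: {query}'
```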

```python
dic = set()
with open(train_data_path) as f:
    for line in tqdm.tqdm(f):
        data_dic = json.loads(line.strip())
```


Reviewer:

```python
data_dict = json.loads(line)
```

`json.loads` does not need `strip`, and spelling variable names out in full is never a mistake.



```python
if 'pos' in data_dic:
    for text_pos in data_dic['pos']:
```


Reviewer:

```python
for text_pos in data_dic.get('pos', []):
```
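
Putting the review suggestions together, the loading loop might end up looking like this; a sketch only, renaming `dic` to `texts` for clarity and assuming the set collects positive passages:

```python
import json

import tqdm

texts = set()
with open(train_data_path) as f:
    for line in tqdm.tqdm(f):
        data_dict = json.loads(line)  # json.loads already tolerates surrounding whitespace
        # A missing 'pos' key simply yields zero iterations.
        for text_pos in data_dict.get('pos', []):
            texts.add(text_pos)
```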

```python
trust_remote_code=True,
device="cuda:7",
model_kwargs={
    "torch_dtype": torch.bfloat16,  # fp16 easily produces NaN values
```


Reviewer: Remove the Chinese characters.

Also remove this comment; that experience does not hold for every model.

NLPJCL (Member, Author) replied:


OK, thanks.
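
In that spirit, the dtype could be surfaced as a parameter rather than documented in a comment; a sketch assuming the teacher is loaded with sentence_transformers (function and argument names are illustrative):

```python
import torch
from sentence_transformers import SentenceTransformer

def load_teacher(model_name: str, dtype: torch.dtype = torch.bfloat16) -> SentenceTransformer:
    # Let callers choose the dtype per model instead of baking in bf16-vs-fp16 advice.
    return SentenceTransformer(
        model_name,
        trust_remote_code=True,
        device="cuda:7",
        model_kwargs={"torch_dtype": dtype},
    )
```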

```python
        return loss

    def pair_inbatch_similarity_loss(
        self,
```


Reviewer: If you want to improve this further, consider adding type hints (typing).

NLPJCL (Member, Author) replied:


Will add them in a follow-up.
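
For illustration, the type-hinted signature might read as follows (shapes copied from the existing comments; a sketch, not the follow-up itself):

```python
from torch import Tensor

class DistillLoss:
    def pair_inbatch_similarity_loss(
        self,
        student_embeddings: Tensor,  # [batch_size, dim]
        teacher_similarity: Tensor,  # [batch_size, dim]
    ) -> Tensor:
        ...
```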

```python
        student_embeddings,  # [batch_size, dim]
        teacher_similarity,  # [batch_size, dim]
    ):
        loss_fct = nn.MSELoss()
```


Reviewer: Does a new nn.MSELoss() really need to be instantiated every time the loss is computed? Unless the runtime optimizes this away, I'd suggest using F.mse_loss.

NLPJCL (Member, Author) replied:


Fixed.
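
The functional form presumably looks something like this (a sketch; the tensors here are illustrative stand-ins for the actual similarity matrices):

```python
import torch
import torch.nn.functional as F

student_similarity = torch.randn(4, 4)  # illustrative [batch_size, batch_size] matrix
teacher_similarity = torch.randn(4, 4)

# Stateless functional API: no nn.MSELoss module is constructed per call.
loss = F.mse_loss(student_similarity, teacher_similarity)
```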

@NLPJCL mentioned this pull request on Jan 5, 2025.

@NLPJCL merged commit 801c8f1 into master on Jan 6, 2025.