convert_to parameters #18

NohTow · 2024-06-24T13:59:43Z

This PR corrects the behavior of the convert_to_numpy and convert_to_tensor parameters of the encode function: it now returns either a list of numpy arrays or a list of tensors, since the outputs cannot be stacked when documents do not have the same length.
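
To illustrate why the per-document outputs cannot be stacked, here is a minimal sketch using NumPy only. The shapes, the random data, and the variable names are assumptions made for illustration; they are not taken from the encode function itself.

```python
import numpy as np

# Hypothetical token embeddings: one (num_tokens, dim) array per document.
rng = np.random.default_rng(0)
doc_embeddings = [
    rng.standard_normal((3, 4)).astype(np.float32),  # document with 3 tokens
    rng.standard_normal((5, 4)).astype(np.float32),  # document with 5 tokens
]

# Stacking fails because the number of tokens differs between documents.
try:
    np.stack(doc_embeddings)
    stacked_ok = True
except ValueError:
    stacked_ok = False

# Keeping a plain list preserves each document's own shape.
shapes = [e.shape for e in doc_embeddings]
```

Returning a list leaves it to the caller to decide whether and how to pad.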

I also adjusted the parts of the code that rely on the encode function, and this does not seem to introduce any regression.
I also added the padding option parameter, but I am still unsure about it, as we create a big tensor only to split it into a list, when it will almost certainly be used as a tensor in the end, so the overhead is a bit painful.
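
The pad-then-split pattern described above can be sketched roughly as follows. This is a NumPy illustration under stated assumptions: `pad_and_split` and `pad_value` are hypothetical names, and the real padding option in encode may work differently.

```python
import numpy as np

def pad_and_split(embeddings, pad_value=0.0):
    """Pad (num_tokens, dim) arrays to a common length, then split the
    padded batch back into a list. Hypothetical helper for illustration."""
    max_len = max(e.shape[0] for e in embeddings)
    dim = embeddings[0].shape[1]
    # One big (num_docs, max_len, dim) array filled with the padding value.
    batch = np.full((len(embeddings), max_len, dim), pad_value, dtype=np.float32)
    for i, e in enumerate(embeddings):
        batch[i, : e.shape[0]] = e
    # Splitting yields one (max_len, dim) array per document.
    return batch, [batch[i] for i in range(len(embeddings))]

docs = [np.ones((2, 3), dtype=np.float32), np.ones((4, 3), dtype=np.float32)]
batch, padded_list = pad_and_split(docs)
```

A caller that wants a single tensor could use `batch` directly and skip the split, which is the overhead concern raised here.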

@raphaelsty if you could please have a look and tell me what you think about this.

raphaelsty

Small review with details

raphaelsty · 2024-06-24T14:40:42Z

giga_cherche/indexes/WeaviateIndex.py

@@ -93,7 +93,7 @@ def add_documents(
            # TODO: use dynamic batching insert
            data_objects = [
                wvc.data.DataObject(
-                    properties={"doc_id": doc_id}, vector=token_embedding
+                    properties={"doc_id": doc_id}, vector=token_embedding.tolist()


I would replace doc_id with document_id overall. I think it's fine to use plain English for variable names (not a blocker for merge, just a detail).

raphaelsty · 2024-06-25T12:17:15Z

giga_cherche/scores/colbert_score.py

@@ -22,8 +22,16 @@ def colbert_score(
    Returns:
        Tensor: Matrix with res[i][j] = colbert_score(a[i], b[j])
    """
-    a = _convert_to_batch_tensor(a)
-    b = _convert_to_batch_tensor(b)
+    if not isinstance(a, Tensor):


    import numpy as np
    import torch
    from torch import Tensor

    def convert_to_tensor(data):
        if not isinstance(data, Tensor):
            if isinstance(data[0], np.ndarray):
                data = torch.from_numpy(np.array(data, dtype=np.float32))
            else:
                data = torch.stack(data)
        return data

    a = convert_to_tensor(a)
    b = convert_to_tensor(b)
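
For context on the `res[i][j]` matrix mentioned in the docstring, here is a minimal sketch of the usual ColBERT late-interaction (MaxSim) score, written with NumPy. It assumes equal-length (already padded) inputs and ignores padding masks; `maxsim_scores` is a hypothetical name, and the actual colbert_score implementation in this PR is not shown here and may differ.

```python
import numpy as np

def maxsim_scores(queries, documents):
    """queries: (Q, q_tokens, dim), documents: (D, d_tokens, dim).
    Returns a (Q, D) matrix of MaxSim scores (illustrative sketch)."""
    # Token-level dot products, shape (Q, D, q_tokens, d_tokens).
    sim = np.einsum("qid,kjd->qkij", queries, documents)
    # For each query token take the best-matching document token, then sum.
    return sim.max(axis=-1).sum(axis=-1)

queries = np.array([[[1.0, 0.0], [0.0, 1.0]]])    # 1 query, 2 tokens
documents = np.array([
    [[1.0, 0.0], [1.0, 0.0]],                     # matches one query token
    [[1.0, 0.0], [0.0, 1.0]],                     # matches both query tokens
])
res = maxsim_scores(queries, documents)
```

Both inputs must be single batched arrays here, which is why a conversion step such as convert_to_tensor is needed when encode returns lists.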

NohTow · 2024-06-26T08:31:31Z

I made the change for the convert_to_tensor function. I am deferring the variable-name cleanup to later, when we do a big cleaning pass, to avoid having to run too many regression tests.

NohTow added 3 commits June 24, 2024 13:37

Add padding option to encode and correctly return a list of either te…

35d20ac

…nsors or nparrays

Adjusting the behavior of index, retriever, reranker and scoring to t…

65e051d

…he new output formats

Adding a comment about the padding

36e45f7

NohTow requested a review from raphaelsty June 24, 2024 14:37

raphaelsty reviewed Jun 25, 2024

NohTow mentioned this pull request Jun 26, 2024

Clean variable names #19

Closed

NohTow added 2 commits June 26, 2024 07:43

Adding output models and dataset in gitignore

39cb164

Mutualizing the convert_to_tensor operation in a function

970f0fe

NohTow merged commit e57ea3a into main Jun 26, 2024
1 check passed

raphaelsty deleted the convert_to_parameters branch August 22, 2024 10:31
