convert_to parameters #18
Conversation
Small review with details
```diff
@@ -93,7 +93,7 @@ def add_documents(
     # TODO: use dynamic batching insert
     data_objects = [
         wvc.data.DataObject(
-            properties={"doc_id": doc_id}, vector=token_embedding
+            properties={"doc_id": doc_id}, vector=token_embedding.tolist()
```
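The fix above can be illustrated in isolation: `.tolist()` turns a tensor into nested plain-Python lists, which a JSON-based client API can serialize, while a raw tensor cannot be. A minimal sketch (the embedding values are made up):

```python
import torch

# Hypothetical per-token embedding, standing in for the encoder output.
token_embedding = torch.tensor([0.1, 0.2, 0.3])

# A raw tensor is not a plain Python structure; .tolist() converts it
# into a list of floats that a JSON-based client can serialize.
vector = token_embedding.tolist()

assert isinstance(vector, list)
assert len(vector) == 3
```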
I would replace `doc_id` with `document_id`. Overall I think it's fine to use plain English as a variable name (not a blocker for merge, just a detail).
giga_cherche/scores/colbert_score.py
Outdated
```diff
@@ -22,8 +22,16 @@ def colbert_score(
     Returns:
         Tensor: Matrix with res[i][j] = colbert_score(a[i], b[j])
     """
-    a = _convert_to_batch_tensor(a)
-    b = _convert_to_batch_tensor(b)
+    if not isinstance(a, Tensor):
```
```python
import numpy as np
import torch
from torch import Tensor


def convert_to_tensor(data):
    if not isinstance(data, Tensor):
        if isinstance(data[0], np.ndarray):
            data = torch.from_numpy(np.array(data, dtype=np.float32))
        else:
            data = torch.stack(data)
    return data


a = convert_to_tensor(a)
b = convert_to_tensor(b)
```
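The suggested helper accepts both input kinds; a self-contained sketch of it in use (the sample arrays are made up, and the helper assumes equal lengths within one batch):

```python
import numpy as np
import torch
from torch import Tensor


def convert_to_tensor(data):
    # Pass tensors through; convert lists of numpy arrays or lists of
    # tensors into a single stacked tensor.
    if not isinstance(data, Tensor):
        if isinstance(data[0], np.ndarray):
            data = torch.from_numpy(np.array(data, dtype=np.float32))
        else:
            data = torch.stack(data)
    return data


# Works for a list of numpy arrays...
a = convert_to_tensor([np.zeros(4, dtype=np.float32), np.ones(4, dtype=np.float32)])
# ...and for a list of tensors of equal length.
b = convert_to_tensor([torch.zeros(4), torch.ones(4)])

assert a.shape == b.shape == (2, 4)
```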
Did the change for the
This PR corrects the behavior of the `convert_to_numpy` and `convert_to_tensor` parameters of the encode function, by returning either a list of numpy arrays or a list of tensors (we cannot stack everything, since documents might not have the same length). I also adjusted the different parts of the code relying on the encode function, and it does not seem to bring regressions.
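The "cannot stack" constraint is easy to reproduce: `torch.stack` requires every tensor to have the same shape, so variable-length documents must stay in a list. A minimal sketch with made-up shapes:

```python
import torch

# Two documents with different token counts (3 vs 5) but the same dim.
doc_a = torch.randn(3, 4)
doc_b = torch.randn(5, 4)

try:
    torch.stack([doc_a, doc_b])  # stack requires identical shapes
    stacked = True
except RuntimeError:
    stacked = False  # mismatched lengths: return a list instead

assert stacked is False
```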
I also added the padding option parameters, but I am still unsure about it, as we create a big tensor only to then split it into a list, when it will certainly be used as a tensor in the end, so the overhead is a bit painful.
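The pad-then-split overhead described above can be sketched as follows (hypothetical names, not the PR's actual implementation): `torch.nn.utils.rnn.pad_sequence` builds the big batch tensor, which is then immediately split back into a list of per-document tensors.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical variable-length document embeddings (num_tokens x dim).
embeddings = [torch.randn(3, 4), torch.randn(5, 4)]
lengths = [e.shape[0] for e in embeddings]

# Pad into one (batch, max_tokens, dim) tensor...
padded = pad_sequence(embeddings, batch_first=True)  # shape (2, 5, 4)

# ...then split it back into a list, trimming the padding again.
unpadded = [padded[i, :n] for i, n in enumerate(lengths)]

assert all(torch.equal(u, e) for u, e in zip(unpadded, embeddings))
```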
@raphaelsty if you could please have a look and tell me what you think about this.