Adding document pooling option to the encode function #12

Merged
merged 1 commit into from
Jun 17, 2024


NohTow
Collaborator

@NohTow NohTow commented Jun 17, 2024

Following our work with Benjamin, this PR adds an option to pool the document embeddings, keeping 1/pool_factor of the original document tokens.
Our results show that documents can be pooled with a pool_factor of 2 without degrading performance.
Larger pool factors can be used for different memory usage/performance trade-offs.
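
For a rough sense of the memory side of this trade-off, here is a minimal back-of-the-envelope sketch. The helper names, the rounding rule (ceiling of `n_tokens / pool_factor`), and the example numbers are assumptions for illustration only; they are not taken from the PR's code.

```python
import math


def pooled_token_count(n_tokens: int, pool_factor: int) -> int:
    """Embeddings kept per document, assuming ceil(n_tokens / pool_factor)."""
    if pool_factor < 1:
        raise ValueError("pool_factor must be >= 1")
    return math.ceil(n_tokens / pool_factor)


def index_size_bytes(
    n_docs: int,
    tokens_per_doc: int,
    dim: int,
    bytes_per_value: int,
    pool_factor: int = 1,
) -> int:
    """Uncompressed size of a multi-vector index (hypothetical helper)."""
    kept = pooled_token_count(tokens_per_doc, pool_factor)
    return n_docs * kept * dim * bytes_per_value


# Illustrative numbers only: 1000 documents of 300 tokens, 128-dim fp16 vectors.
baseline = index_size_bytes(1000, 300, 128, 2)
pooled = index_size_bytes(1000, 300, 128, 2, pool_factor=2)
```

Under these assumptions the stored size scales roughly with 1/pool_factor, since only the number of vectors per document changes.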

@NohTow NohTow requested a review from raphaelsty June 17, 2024 14:43
@raphaelsty
Collaborator

In the future we could create a folder dedicated to pooling, but it's fine to put it here for now. Amazing feature 👍

@raphaelsty raphaelsty merged commit 85e97bb into main Jun 17, 2024
1 check passed
@tomaarsen
Collaborator

Nice work! I wasn't aware of this pooling approach for ColBERT. Does this allow for faster inference and lower storage costs?

@NohTow
Collaborator Author

NohTow commented Jun 18, 2024

It's not surprising that you hadn't heard about it: it's a project we have been working on with Benjamin, and we have not communicated about it yet (so please do not leak it, although we have already merged it into the main ColBERT lib). We'll soon release a blog post and then submit a paper.

Basically, we found that you can pool the document token embeddings based on their similarity without degrading search performance (up to a certain factor). This allows storing half of the tokens (or fewer) and thus greatly reduces the storage cost of ColBERT models (even more than PLAID). It does indeed also reduce the number of tokens to score.
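
To make the idea concrete, here is a minimal pure-Python sketch of similarity-based pooling. It is not the code merged in this PR: the function name `pool_token_embeddings`, the greedy centroid-merging strategy, and the use of mean pooling within each group are all assumptions chosen for illustration, and the loop is deliberately naive rather than efficient.

```python
import math


def _normalize(vec):
    """Return a unit-length copy of vec (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(_normalize(a), _normalize(b)))


def pool_token_embeddings(embeddings, pool_factor):
    """Greedily merge the most similar token groups, then mean-pool each group.

    Assumption: the target count is ceil(n_tokens / pool_factor).
    """
    if pool_factor < 1:
        raise ValueError("pool_factor must be >= 1")
    target = max(1, math.ceil(len(embeddings) / pool_factor))
    # Start with one group per token embedding.
    clusters = [[list(e)] for e in embeddings]
    while len(clusters) > target:
        best_pair, best_sim = None, -math.inf
        for i in range(len(clusters)):
            centroid_i = _mean(clusters[i])
            for j in range(i + 1, len(clusters)):
                sim = _cosine(centroid_i, _mean(clusters[j]))
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        i, j = best_pair
        # Merge group j into group i (j > i, so index i stays valid).
        clusters[i].extend(clusters.pop(j))
    return [_mean(group) for group in clusters]


# Toy example: two pairs of near-duplicate token vectors.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pooled = pool_token_embeddings(tokens, pool_factor=2)
```

In this toy example the four vectors collapse into two, one per group of similar tokens, so late-interaction scoring would have half as many document vectors to compare against each query token.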

@tomaarsen
Collaborator

I won't share, no worries.
That sounds quite promising, good stuff.

@raphaelsty raphaelsty deleted the add_pooling branch August 22, 2024 10:31

3 participants