How to index a large corpus which cannot be loaded into memory? #234
Comments
Do you know where it crashes exactly? How much RAM do you have?
My RAM size is 216 GB. I think the problem happens in … The log from the console is as follows: …
Oh, this seems simple. How many processes are you launching? Each of them loads the full TSV, so it seems like your run crashes just from storing the strings in memory.
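(Rough arithmetic, under the assumption that each process keeps its own copy of the passage strings: with a ~76 GB TSV, three processes already need roughly 3 × 76 GB ≈ 228 GB, which is more than the 216 GB of RAM mentioned above, so an out-of-memory kill would be expected.)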
Sorry, but how do I check how many processes I launch? The main part of my code follows the README:

    with Run().context(RunConfig(nranks=args.nranks, experiment=args.experiment_name)):
        config = ColBERTConfig(doc_maxlen=args.doc_maxlen, nbits=args.nbits)
        indexer = Indexer(checkpoint=args.checkpoint, config=config)
        indexer.index(name=args.index_name, collection=collection, overwrite=True)

I already set … Btw, thank you so much for the prompt reply!!
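For reference, a sketch of my own (not from this thread): the number of indexing processes is controlled by the nranks argument of RunConfig in the snippet above. A single-process run, which keeps only one copy of the collection's strings in memory, would look roughly like this, with placeholder values standing in for the argparse arguments:

    # Minimal single-process indexing sketch; paths, names, and config values
    # below are placeholders, not taken from this thread.
    from colbert import Indexer
    from colbert.infra import Run, RunConfig, ColBERTConfig

    if __name__ == "__main__":
        # nranks=1 launches a single indexing process, so only one copy of
        # the collection is ever resident in memory.
        with Run().context(RunConfig(nranks=1, experiment="my-experiment")):
            config = ColBERTConfig(doc_maxlen=300, nbits=2)  # example values
            indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)  # placeholder checkpoint
            indexer.index(name="my-index", collection="collection.tsv", overwrite=True)  # placeholder paths

This trades indexing speed for memory, of course; it only helps if a single copy of the collection fits in RAM.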
My branch allows for in-memory indexing. Instead of passing a list of documents, you could probably also pass an iterator that iteratively loads documents from disk. It most likely won't work right off the bat, and you will most likely have to change a bit more code in the k-means clustering part, but my branch may be a good starting point.
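For illustration only (this generator is my own sketch, not part of the branch mentioned above), a lazy iterator over a TSV collection of the form pid&lt;TAB&gt;passage could look like the following; as the comment above says, ColBERT's indexing code would still need further changes before it could consume it:

    # Sketch: lazily yield passages from a "pid\tpassage" TSV without ever
    # materializing the whole collection in memory.
    def iter_passages(tsv_path):
        with open(tsv_path, encoding="utf-8") as f:
            for line in f:
                pid, passage = line.rstrip("\n").split("\t", 1)
                yield passage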
Hey @fschlatt, please share code to train an index using a CSV file containing, say, a few million rows that can't be loaded into RAM. I tried using iterators, generators, and even other data structures, but I couldn't make it run; all it accepts is a list in memory. If you have made it run on a CSV or TXT file on disk, please share the code for that. Thanks.
My branch doesn't support an iterator, but it adds support for in-memory collections. With some minor modifications, you should be able to get it working with an iterator: https://github.com/fschlatt/ColBERT A good place to start is here: https://github.com/fschlatt/ColBERT/blob/541b9e73edbe61c7a86e789a87980c4f09bf6053/colbert/data/collection.py#L18
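As a purely hypothetical starting point (the class name and layout are mine, not part of fschlatt's branch), a Collection-like object could hold only byte offsets in memory and read each passage from disk on demand:

    # Hypothetical sketch: store one byte offset per line instead of the
    # passages themselves. ColBERT's indexer would still need the
    # modifications discussed above to work with this end to end.
    class OnDiskCollection:
        def __init__(self, tsv_path):
            self.tsv_path = tsv_path
            self.offsets = []
            with open(tsv_path, "rb") as f:
                offset = f.tell()
                line = f.readline()
                while line:
                    self.offsets.append(offset)
                    offset = f.tell()
                    line = f.readline()

        def __len__(self):
            return len(self.offsets)

        def __getitem__(self, idx):
            # Re-opening the file per lookup is slow but keeps the sketch simple.
            with open(self.tsv_path, "rb") as f:
                f.seek(self.offsets[idx])
                line = f.readline().decode("utf-8")
            pid, passage = line.rstrip("\n").split("\t", 1)
            return passage

The offset index costs a few bytes per passage instead of the passage text itself, which is what keeps memory bounded.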
Hi,
Thank you so much for sharing and maintaining the code!
I'm using the off-the-shelf ColBERT v2 for the retrieval part of my system, so I mainly call the index() API: indexer.index(name=index_name, collection=collection, overwrite=True). It works well on a small corpus, but when I try to move to a large corpus (the TSV collection is about 76 GB), the indexer cannot cope: the process gets killed when its memory consumption exceeds the maximum memory of my server. Is there a way to use ColBERT for a very large corpus?
I went through previous issues and found that #64 is related. However, I cannot find information about batch retrieval in the README. Is it still supported, or did I miss something?
Thank you so much in advance!!