Embeddings for large KG #154
❓ Question

Hello,

I'm currently trying to generate embeddings for a graph with ~13 million entities and 15 million walks. I'm using a machine with 128 GB of RAM; however, the random walks kept in memory overflow it. Is there a way to store them on disk and load them in batches, as is done in image-processing pipelines, for example?
Yes, most definitely! Within the fit() function, the extract_walks() function is called, which returns a list of walks that are later fed to Word2Vec: https://github.com/IBCNServices/pyRDF2Vec/blob/main/pyrdf2vec/rdf2vec.py#L107. You could call this function on different chunks of entities to extract the walks iteratively. We also have some mechanisms to speed up extraction and/or reduce memory usage.
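As an illustration of the chunked extraction suggested above, here is a minimal sketch. The `extract_walks` callable is a placeholder for pyRDF2Vec's walk-extraction step referred to in the linked source (its exact name and signature may differ between versions); the chunk size and one-walk-per-line file format are assumptions.

```python
def chunks(seq, size):
    """Yield successive slices of `seq` containing at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def dump_walks(extract_walks, kg, entities, path, chunk_size=50_000):
    """Extract walks chunk by chunk and write them to a text file on disk.

    `extract_walks(kg, chunk)` is assumed to return a list of walks, where each
    walk is a sequence of string tokens (entities/predicates). Only one chunk's
    walks are held in memory at a time; the full corpus lives on disk.
    """
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks(entities, chunk_size):
            for walk in extract_walks(kg, chunk):
                f.write(" ".join(walk) + "\n")
```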
However, in the end, you will need to feed all of these walks at once to Word2Vec or another word-embedding network, which could be the main bottleneck. For this, you could write custom data loaders that are responsible for preparing one or more batches for the network. This seems to be supported in gensim (and it most definitely is for Keras/Torch/...): https://stackoverflow.com/questions/63459657/how-to-load-large-dataset-to-gensim-word2vec-model. So you could read a file from disk in that data loader and serve it to Word2Vec.
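For the data-loading side, here is a minimal sketch of streaming a walk file into gensim's Word2Vec (assuming gensim 4.x and the one-walk-per-line file produced by the sketch above; the file name and hyperparameters are placeholders):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence lazily streams a whitespace-tokenised, one-sentence-per-line file,
# so the walk corpus is re-read from disk on every pass instead of held in RAM.
corpus = LineSentence("walks.txt")

model = Word2Vec(
    sentences=corpus,
    vector_size=200,  # placeholder hyperparameters; tune for your KG
    window=5,
    min_count=1,
    workers=4,
    epochs=10,
)

# Keep only the trained keyed vectors; the entity IRIs are the keys.
model.wv.save("kg_embeddings.kv")
```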
Thank you for the fast answer! Is it not possible to use is_update=True on chunks of the data, because the underlying model would also keep storing the walks for the old entities?
It's not ideal, as Word2Vec doesn't support iterative updating that well (it is possible, but suboptimal). I think the custom data loader will give better results!
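For completeness, this is roughly what the "possible but suboptimal" incremental route looks like in plain gensim (a sketch, assuming gensim 4.x; `walk_chunks` is a hypothetical iterable of walk lists). Each update only trains on the new chunk, so the resulting vectors are generally worse than training once on the full walk corpus, which is why the streaming data loader above is preferred.

```python
from gensim.models import Word2Vec

def train_incrementally(walk_chunks, vector_size=200, window=5, workers=4):
    """Train a Word2Vec model chunk by chunk via vocabulary updates (suboptimal)."""
    model = None
    for chunk in walk_chunks:  # each chunk: a list of walks (lists of str tokens)
        if model is None:
            model = Word2Vec(vector_size=vector_size, window=window,
                             min_count=1, workers=workers)
            model.build_vocab(chunk)
        else:
            # Add any new tokens to the vocabulary without resetting trained weights.
            model.build_vocab(chunk, update=True)
        model.train(chunk, total_examples=len(chunk), epochs=model.epochs)
    return model
```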