Embeddings for large KG #154
❓ Question

Hello,

I'm currently trying to generate embeddings for a graph with ~13 million entities and 15 million walks. I'm using a machine with 128 GB of RAM; however, the random walks kept in memory overflow it. Is there a way to store them on disk and load them in batches, as is done in image-processing pipelines, for example?
Yes, most definitely! Within the fit() function, the extract_walks() function is called, which returns a list of walks that are later fed to Word2Vec: https://github.com/IBCNServices/pyRDF2Vec/blob/main/pyrdf2vec/rdf2vec.py#L107. You could call this function on different chunks of entities to extract the walks iteratively. We also have some mechanisms to speed up extraction and/or reduce memory usage.
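As an illustration of the chunked extraction suggested above, here is a minimal sketch. The `extract_walks` callable is a placeholder for pyRDF2Vec's walk-extraction step referred to in the linked source (its exact name and signature may differ between versions); the chunk size and one-walk-per-line file format are assumptions.

```python
def chunks(seq, size):
    """Yield successive slices of `seq` containing at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def dump_walks(extract_walks, kg, entities, path, chunk_size=50_000):
    """Extract walks chunk by chunk and write them to a text file on disk.

    `extract_walks(kg, chunk)` is assumed to return a list of walks, where each
    walk is a sequence of string tokens (entities/predicates). Only one chunk's
    walks are held in memory at a time; the full corpus lives on disk.
    """
    with open(path, "w", encoding="utf-8") as f:
        for chunk in chunks(entities, chunk_size):
            for walk in extract_walks(kg, chunk):
                f.write(" ".join(walk) + "\n")
```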
However, in the end, you will need to feed all of these walks at once to Word2Vec or another word-embedding network, which could be the main bottleneck. For this, you could write custom data loaders that are responsible for preparing one or more batches for the network. This seems to be supported in gensim (and it most definitely is for Keras/Torch/...): https://stackoverflow.com/questions/63459657/how-to-load-large-dataset-to-gensim-word2vec-model. So you could read a file from disk in that data loader and serve it to Word2Vec.
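For the data-loading side, here is a minimal sketch of streaming a walk file into gensim's Word2Vec (assuming gensim 4.x and the one-walk-per-line file produced by the sketch above; the file name and hyperparameters are placeholders):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence lazily streams a whitespace-tokenised, one-sentence-per-line file,
# so the walk corpus is re-read from disk on every pass instead of held in RAM.
corpus = LineSentence("walks.txt")

model = Word2Vec(
    sentences=corpus,
    vector_size=200,  # placeholder hyperparameters; tune for your KG
    window=5,
    min_count=1,
    workers=4,
    epochs=10,
)

# Keep only the trained keyed vectors; the entity IRIs are the keys.
model.wv.save("kg_embeddings.kv")
```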
Thank you for the fast answer! Is it not possible to use is_update=True on chunks of the data, because the underlying model would also keep storing the walks for the old entities?
It's not ideal, as Word2Vec doesn't support iterative updating that well (it is possible, but suboptimal). I think the custom data loader will give better results!
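For completeness, this is roughly what the "possible but suboptimal" incremental route looks like in plain gensim (a sketch, assuming gensim 4.x; `walk_chunks` is a hypothetical iterable of walk lists). Each update only trains on the new chunk, so the resulting vectors are generally worse than training once on the full walk corpus, which is why the streaming data loader above is preferred.

```python
from gensim.models import Word2Vec

def train_incrementally(walk_chunks, vector_size=200, window=5, workers=4):
    """Train a Word2Vec model chunk by chunk via vocabulary updates (suboptimal)."""
    model = None
    for chunk in walk_chunks:  # each chunk: a list of walks (lists of str tokens)
        if model is None:
            model = Word2Vec(vector_size=vector_size, window=window,
                             min_count=1, workers=workers)
            model.build_vocab(chunk)
        else:
            # Add any new tokens to the vocabulary without resetting trained weights.
            model.build_vocab(chunk, update=True)
        model.train(chunk, total_examples=len(chunk), epochs=model.epochs)
    return model
```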