Commit 3cc300d

Update README.md
1 parent 20821ca commit 3cc300d

1 file changed

+74
-9
lines changed

README.md

Lines changed: 74 additions & 9 deletions
@@ -21,23 +21,42 @@ _**Cleora** is a genus of moths in the family **Geometridae**. Their scientific
2121

2222
Cleora is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data.
2323

24-
**Cleora** is now available as a python package _pycleora_. Key improvements compared to the previous version:
25-
* _performance optimizations_: 10x faster embedding times
26-
* _performance optimizations_: reduced memory usage
27-
* _latest research_: significantly improved embedding quality
24+
# Introducing Cleora 2.0.0 - Python native
25+
26+
**Installation**
27+
```
28+
pip install pycleora
29+
```
30+
31+
**Build instructions**
32+
```
33+
# prepare python env
34+
pip install maturin
35+
36+
# Install pycleora in current env (meant for development)
37+
maturin develop
38+
39+
# Usage example below. More examples in examples/ folder.
40+
```
41+
## Changelog
42+
43+
**Cleora** is now available as a Python package _pycleora_. Key improvements compared to the previous version:
44+
* _performance optimizations_: ~10x faster embedding times
45+
* _performance optimizations_: significantly reduced memory usage
46+
* _latest research_: improved embedding quality
2847
* _new feature_: can create graphs from a Python iterator in addition to tsv files
2948
* _new feature_: seamless integration with _NumPy_
3049
* _new feature_: item attributes support via custom embeddings initialization
3150
* _new feature_: adjustable vector projection / normalization after each propagation step
3251

3352
**Breaking changes:**
34-
* _transient_ modifier not supported any more - creating _complex::reflexive_ columns for hypergraph embeddings, grouped by the transient entity gives better results.
53+
* The _transient_ modifier is no longer supported. Creating `complex::reflexive` columns for hypergraph embeddings, _grouped by_ the transient entity, gives better results.
3554

3655

37-
**Example usage:**
56+
# Usage example:
3857

3958
```
40-
import pycleora
59+
from pycleora import SparseMatrix
4160
import numpy as np
4261
import pandas as pd
4362
import random
@@ -61,7 +80,7 @@ customer_products = df.groupby('customer')['product'].apply(list).values
6180
cleora_input = map(lambda x: ' '.join(x), customer_products)
6281
6382
# Create Markov transition matrix for the hypergraph
64-
mat = pycleora.SparseMatrix.from_iterator(cleora_input, columns='complex::reflexive::product')
83+
mat = SparseMatrix.from_iterator(cleora_input, columns='complex::reflexive::product')
6584
6685
# Look at entity ids in the matrix, corresponding to embedding vectors
6786
print(mat.entity_ids)
@@ -95,6 +114,49 @@ print(np.dot(embeddings[0], embeddings[1]))
95114
print(np.dot(embeddings[0], embeddings[2]))
96115
print(np.dot(embeddings[0], embeddings[3]))
97116
```
117+
# FAQ
118+
119+
**Q: What should I embed?**
120+
121+
A: Typically products, stores, urls, locations, or any entity that people interact with.
122+
123+
**Q: How should I construct the input?**
124+
125+
A: What works best is grouping entities that co-occur in a similar context and feeding them in as whitespace-separated lines using the `complex::reflexive` modifier. E.g. if you have product data, you can group the products by shopping baskets or by users. If you have urls, you can group them by browser sessions, or by (user, time window) pairs. Check out the usage example above. Grouping products by customers is just one possibility.
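
As a rough illustration of the grouping step, the sketch below turns a list of (context, entity) pairs into whitespace-separated lines. The `interactions` data and the `group_into_lines` helper are made up for this example; the resulting lines are the kind of iterator that the usage example passes to `SparseMatrix.from_iterator(..., columns='complex::reflexive::product')`.

```python
from collections import defaultdict

# Hypothetical interaction log: (basket_id, product_id) pairs.
interactions = [
    ("basket_1", "milk"),
    ("basket_1", "bread"),
    ("basket_2", "bread"),
    ("basket_2", "butter"),
    ("basket_2", "milk"),
    ("basket_3", "coffee"),
]


def group_into_lines(pairs):
    """Group entities by context and emit one whitespace-separated line per context."""
    groups = defaultdict(list)
    for context, entity in pairs:
        groups[context].append(entity)
    # Whitespace separates entities, so entity ids themselves should not contain it.
    return [" ".join(entities) for entities in groups.values()]


cleora_input = group_into_lines(interactions)
print(cleora_input)  # ['milk bread', 'bread butter milk', 'coffee']
```

Swapping `basket_id` for a user id or a (user, time window) key changes the context without changing the rest of the pipeline.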
126+
127+
**Q: Can I embed users and products simultaneously, to compare them with cosine similarity?**
128+
129+
A: No, this is a methodologically wrong approach, stemming from outdated matrix factorization approaches. What you should do is come up with good product embeddings first, then create user embeddings from them. Feeding two columns, e.g. `user product`, into Cleora results in a bipartite graph. Similar products will be close to each other, similar users will be close to each other, but users and products will not necessarily be similar to each other.
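
One simple way to derive user embeddings from product embeddings is to average the vectors of the products a user interacted with and L2-normalize the result. This is only a sketch of one possible aggregation, not a method prescribed by Cleora; the `product_embeddings` values and the `user_embedding` helper are hypothetical.

```python
import math

# Hypothetical product embeddings (in practice: rows of the matrix produced by Cleora).
product_embeddings = {
    "milk": [1.0, 0.0],
    "bread": [0.0, 1.0],
    "butter": [1.0, 1.0],
}


def l2_normalize(vec):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def user_embedding(products, embeddings):
    """Average the embeddings of the user's known products, then L2-normalize."""
    vectors = [embeddings[p] for p in products if p in embeddings]
    if not vectors:
        return None  # the user has no products with an embedding
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return l2_normalize(mean)


alice = user_embedding(["milk", "bread"], product_embeddings)
print(alice)
```

Weighted averages (e.g. by recency or purchase count) fit the same shape; only the `mean` computation changes.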
130+
131+
**Q: What embedding dimensionality to use?**
132+
133+
A: The more, the better, but we typically work with _1024_ to _4096_ dimensions. Memory is cheap and machines are powerful, so don't skimp on embedding size.
134+
135+
**Q: How many iterations of Markov propagation should I use?**
136+
137+
A: It depends on what you want to achieve. A low number of iterations (3) tends to approximate the co-occurrence matrix, while a high number (7+) tends to capture contextual similarity (think skip-gram, but much more accurate and faster).
138+
139+
**Q: How do I incorporate external information, e.g. entity metadata, images, texts into the embeddings?**
140+
141+
A: Just initialize the embedding matrix with your own vectors coming from a ViT, sentence-transformers, or a random projection of your numeric features. In that scenario low numbers of Markov iterations (1 to 3) tend to work best.
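
For the random-projection option, a minimal sketch could look like the following. The `random_projection` helper and the `features` values are assumptions made for illustration; the rows of the resulting matrix would need to follow the same order as `mat.entity_ids` before being used as the initial embeddings.

```python
import random


def random_projection(features, out_dim, seed=0):
    """Project each feature vector to out_dim dimensions with one fixed Gaussian matrix."""
    in_dim = len(features[0])
    rng = random.Random(seed)
    # A single shared projection matrix (in_dim x out_dim) for all entities.
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(out_dim)] for _ in range(in_dim)]
    return [
        [sum(row[i] * matrix[i][j] for i in range(in_dim)) for j in range(out_dim)]
        for row in features
    ]


# Hypothetical numeric attributes per entity, e.g. (price, weight, rating).
features = [
    [9.99, 0.5, 4.2],
    [19.99, 1.2, 3.8],
]
initial_vectors = random_projection(features, out_dim=8)
print(len(initial_vectors), len(initial_vectors[0]))  # 2 8
```

Fixing the seed keeps the projection reproducible across runs, so the same features always map to the same initial vectors.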
142+
143+
**Q: My embeddings don't fit in memory, what do I do?**
144+
145+
A: Cleora operates on dimensions independently. Initialize your embeddings with a smaller number of dimensions, run Cleora, persist to disk, then repeat. You can then concatenate the resulting embedding vectors, but remember to normalize them after concatenation!
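
The concatenate-then-normalize step can be sketched as below. The chunk values are made up, and `concat_and_normalize` is a hypothetical helper; it assumes every chunk stores its rows in the same entity order.

```python
import math


def concat_and_normalize(chunks):
    """Concatenate per-entity vectors from several runs and L2-normalize each row."""
    n_entities = len(chunks[0])
    result = []
    for row in range(n_entities):
        full = [x for chunk in chunks for x in chunk[row]]
        norm = math.sqrt(sum(x * x for x in full))
        result.append([x / norm for x in full] if norm > 0 else full)
    return result


# Two hypothetical runs, each producing 2-dimensional embeddings
# for the same 2 entities, in the same row order.
chunk_a = [[1.0, 0.0], [0.6, 0.8]]
chunk_b = [[0.0, 1.0], [0.8, 0.6]]

embeddings = concat_and_normalize([chunk_a, chunk_b])
print(len(embeddings[0]))  # 4
```

Each chunk is unit-length on its own, but the concatenated row is not, which is why the final normalization matters for cosine or dot-product comparisons.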
146+
147+
**Q: Is there a minimum number of entity occurrences?**
148+
149+
A: No, an entity `A` co-occurring just 1 time with some other entity `B` will get a proper embedding, i.e. `B` will be the most similar to `A`. The other way around, `A` will be highly ranked among nearest neighbors of `B`, which may or may not be desirable, depending on your use case. Feel free to prune your input to Cleora to eliminate low-frequency items.
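
If you do want to prune, a minimal sketch is to count entity occurrences over the input lines and drop anything below a threshold. The `prune_rare_entities` helper and the `min_count` default are assumptions for this example, not part of pycleora.

```python
from collections import Counter


def prune_rare_entities(lines, min_count=2):
    """Drop entities that occur fewer than min_count times across all lines."""
    counts = Counter(entity for line in lines for entity in line.split())
    pruned = []
    for line in lines:
        kept = [e for e in line.split() if counts[e] >= min_count]
        if kept:  # skip lines that became empty
            pruned.append(" ".join(kept))
    return pruned


lines = ["milk bread", "bread butter milk", "coffee milk"]
print(prune_rare_entities(lines, min_count=2))
# ['milk bread', 'bread milk', 'milk']
```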
150+
151+
**Q: Are there any edge cases where Cleora can fail?**
152+
153+
A: Cleora works best for relatively sparse hypergraphs. If all your hyperedges contain some very common entity `X`, e.g. a _shopping bag_, then it will degrade the quality of embeddings by degenerating shortest paths in the random walk. It is a good practice to remove such entities from the hypergraph.
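
One way to find and remove such entities is to count in how many hyperedges (input lines) each entity appears and drop those above a chosen fraction. The `drop_ubiquitous_entities` helper and the `max_fraction` threshold are hypothetical; a suitable cut-off depends on your data.

```python
from collections import Counter


def drop_ubiquitous_entities(lines, max_fraction=0.9):
    """Remove entities that occur in more than max_fraction of all hyperedges."""
    # Count each entity at most once per line (hyperedge).
    line_freq = Counter(e for line in lines for e in set(line.split()))
    limit = max_fraction * len(lines)
    cleaned = []
    for line in lines:
        kept = [e for e in line.split() if line_freq[e] <= limit]
        if kept:  # skip lines that became empty
            cleaned.append(" ".join(kept))
    return cleaned


baskets = [
    "shopping_bag milk bread",
    "shopping_bag butter",
    "shopping_bag milk coffee",
]
print(drop_ubiquitous_entities(baskets))
# ['milk bread', 'butter', 'milk coffee']
```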
154+
155+
**Q: How can Cleora be so fast and accurate at the same time?**
156+
157+
A: Not using negative sampling is a great boon. By constructing the (sparse) Markov transition matrix, Cleora explicitly performs all possible random walks in a hypergraph in one big step (a single matrix multiplication). That's what we call a single _iteration_. We perform 3+ such iterations. Thanks to a highly efficient implementation in Rust, with special care for concurrency, memory layout and cache coherence, it is blazingly fast. Negative sampling or randomly selecting random walks tend to introduce a lot of noise - Cleora is free of those burdens.
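
To make the idea of an _iteration_ concrete, here is a toy, dense, pure-Python version: build a row-normalized transition matrix from co-occurrences, multiply the embeddings by it, and normalize the rows. This is a simplified illustration under our own assumptions (plain pairwise co-occurrence counts, L2 normalization, random initialization); it is not Cleora's actual weighting scheme or its sparse Rust implementation.

```python
import math
import random


def build_transition_matrix(lines):
    """Row-normalized co-occurrence matrix over entities sharing a line."""
    entities = sorted({e for line in lines for e in line.split()})
    index = {e: i for i, e in enumerate(entities)}
    n = len(entities)
    counts = [[0.0] * n for _ in range(n)]
    for line in lines:
        members = line.split()
        for a in members:
            for b in members:
                if a != b:
                    counts[index[a]][index[b]] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total > 0 else 0.0 for c in row])
    return entities, matrix


def propagate(matrix, embeddings):
    """One iteration: multiply by the transition matrix, then L2-normalize rows."""
    n, dim = len(matrix), len(embeddings[0])
    out = []
    for i in range(n):
        vec = [sum(matrix[i][k] * embeddings[k][j] for k in range(n)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        out.append([x / norm for x in vec] if norm > 0 else vec)
    return out


lines = ["milk bread", "bread butter milk", "coffee tea"]
entities, matrix = build_transition_matrix(lines)

rng = random.Random(0)
embeddings = [[rng.uniform(-1, 1) for _ in range(4)] for _ in entities]
for _ in range(3):  # three iterations of propagation
    embeddings = propagate(matrix, embeddings)
```

Every step is deterministic given the initial vectors: there is no sampling of walks or negatives, only repeated matrix multiplication.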
158+
159+
# Science
98160

99161
**Read the whitepaper ["Cleora: A Simple, Strong and Scalable Graph Embedding Scheme"](https://arxiv.org/abs/2102.02302)**
100162

@@ -106,6 +168,7 @@ Types of data which can be embedded include for example:
106168
- text and other categorical array data
107169
- any combination of the above
108170

171+
**!!! Disclaimer: the numbers below are for Cleora 1.x. The new version is significantly faster, but we have yet to re-run the benchmarks.**
109172

110173
Key competitive advantages of Cleora:
111174
* more than **197x faster than DeepWalk**
@@ -239,6 +302,8 @@ The technical properties described above imply good production-readiness of Cleo
239302

240303
## Documentation
241304

305+
**!!! Disclaimer: the documentation below is for Cleora 1.x, to be updated for 2.x**
306+
242307
More information can be found in [the full documentation](https://cleora.readthedocs.io/).
243308

244309
For details contact us at [email protected]
@@ -263,4 +328,4 @@ Synerise Cleora is MIT licensed, as found in the [LICENSE](LICENSE) file.
263328

264329
## How to Contribute
265330

266-
You are welcomed to contribute to this open-source toolbox. The detailed instructions will be released soon as issues.
331+
Pull requests are welcome.