* _new feature_: can create graphs from a Python iterator in addition to TSV files
* _new feature_: seamless integration with _NumPy_
* _new feature_: support for item attributes via custom embedding initialization
* _new feature_: adjustable vector projection / normalization after each propagation step
**Breaking changes:**
* _transient_ modifier is no longer supported - creating `complex::reflexive` columns for hypergraph embeddings, _grouped by_ the transient entity, gives better results.

**Q: What kind of entities can be embedded?**

A: Typically products, stores, URLs, locations, or any entity that people interact with.
**Q: How should I construct the input?**
A: What works best is grouping entities that co-occur in a similar context and feeding them in as whitespace-separated lines using the `complex::reflexive` modifier. E.g. if you have product data, you can group the products by shopping baskets or by users. If you have URLs, you can group them by browser sessions, or by (user, time window) pairs. Check out the usage example above. Grouping products by customers is just one possibility.
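
As an illustration, here is a minimal sketch of basket-grouped input; the product IDs and names below are entirely made up:

```python
# Minimal sketch with made-up product IDs: one shopping basket per line,
# each line a whitespace-separated list of co-occurring entities.
baskets = [
    "p15 p832 p4",      # basket 1: three products bought together
    "p832 p91",         # basket 2
    "p15 p91 p77 p12",  # basket 3
]

# Such lines can be stored in a TSV file or, in Cleora 2.x,
# supplied directly as a Python iterator when building the graph.
for line in baskets:
    print(line)
```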
**Q: Can I embed users and products simultaneously, to compare them with cosine similarity?**
A: No, this is a methodologically wrong approach, a holdover from outdated matrix factorization methods. What you should do is come up with good product embeddings first, then create user embeddings from them. Feeding two columns, e.g. `user product`, into Cleora will result in a bipartite graph: similar products will be close to each other, and similar users will be close to each other, but users and products will not necessarily be similar to each other.
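
As a rough sketch of that recipe (product embeddings first, users derived from them), one simple choice is to average a user's product vectors; the matrix, mapping, and function below are hypothetical:

```python
import numpy as np

# Hypothetical inputs: a matrix of Cleora product embeddings and a
# mapping from product ID to its row in that matrix.
product_emb = np.random.randn(1000, 1024).astype(np.float32)
product_row = {f"p{i}": i for i in range(1000)}

def user_embedding(purchased: list[str]) -> np.ndarray:
    """Average the embeddings of the products a user interacted with."""
    vec = product_emb[[product_row[p] for p in purchased]].mean(axis=0)
    return vec / np.linalg.norm(vec)  # unit length, so cosine similarity is a dot product

u = user_embedding(["p1", "p42", "p7"])
```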
**Q: What embedding dimensionality to use?**
A: The more, the better, but we typically work with dimensionalities from _1024_ to _4096_. Memory is cheap and machines are powerful, so don't skimp on embedding size.
**Q: How many iterations of Markov propagation should I use?**
A: It depends on what you want to achieve. A low number of iterations (3) tends to approximate the co-occurrence matrix, while a high number (7+) tends to give contextual similarity (think skip-gram, but much more accurate and faster).
**Q: How do I incorporate external information, e.g. entity metadata, images, texts into the embeddings?**
A: Just initialize the embedding matrix with your own vectors coming from a ViT, sentence-transformers, or a random projection of your numeric features. In that scenario, low numbers of Markov iterations (1 to 3) tend to work best.
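
For the random-projection route, a minimal NumPy sketch might look like this; the feature matrix and all sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numeric features: one row per entity, 64 features each.
features = rng.normal(size=(10_000, 64)).astype(np.float32)

# Randomly project the features up to the embedding dimensionality (e.g. 1024).
projection = rng.normal(size=(64, 1024)).astype(np.float32) / np.sqrt(64)
init_emb = features @ projection

# Row-normalize before using the matrix as Cleora's initial embeddings.
init_emb /= np.linalg.norm(init_emb, axis=-1, keepdims=True)
```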
**Q: My embeddings don't fit in memory, what do I do?**
A: Cleora operates on dimensions independently. Initialize your embeddings with a smaller number of dimensions, run Cleora, persist the result to disk, then repeat. You can concatenate the resulting embedding vectors afterwards, but remember to re-normalize them!
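
A sketch of the stitching step, assuming each run saved its slice of dimensions to a hypothetical `embeddings_part{i}.npy` file:

```python
import numpy as np

# Hypothetical files: each holds the same rows but a different slice of
# dimensions, produced by a separate smaller-dimensional Cleora run.
chunks = [np.load(f"embeddings_part{i}.npy") for i in range(3)]

emb = np.concatenate(chunks, axis=-1)               # stitch the slices together
emb /= np.linalg.norm(emb, axis=-1, keepdims=True)  # re-normalize the full vectors
```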
**Q: Is there a minimum number of entity occurrences?**
A: No, an entity `A` co-occurring just once with some other entity `B` will get a proper embedding, i.e. `B` will be the most similar to `A`. The other way around, `A` will be highly ranked among the nearest neighbors of `B`, which may or may not be desirable, depending on your use case. Feel free to prune your input to Cleora to eliminate low-frequency items.
**Q: Are there any edge cases where Cleora can fail?**
A: Cleora works best for relatively sparse hypergraphs. If all your hyperedges contain some very common entity `X`, e.g. a _shopping bag_, it will degrade the quality of the embeddings by degenerating shortest paths in the random walks. It is good practice to remove such entities from the hypergraph.
**Q: How can Cleora be so fast and accurate at the same time?**
A: Not using negative sampling is a great boon. By constructing the (sparse) Markov transition matrix, Cleora explicitly performs all possible random walks in a hypergraph in one big step (a single matrix multiplication) - that's what we call a single _iteration_. We perform 3+ such iterations. Thanks to a highly efficient implementation in Rust, with special care for concurrency, memory layout, and cache coherence, it is blazingly fast. Negative sampling and randomly sampled walks tend to introduce a lot of noise - Cleora is free of those burdens.
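
To make the idea concrete, here is a toy NumPy/SciPy re-implementation of the propagation loop; the random transition matrix below is filler, not Cleora's actual construction:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Toy stand-in for the sparse Markov transition matrix of a hypergraph.
n = 1_000
T = sp.random(n, n, density=0.01, format="csr", random_state=0)

# Make the matrix row-stochastic (rows sum to 1; guard against empty rows).
row_sums = np.asarray(T.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0
T = sp.diags(1.0 / row_sums) @ T

emb = rng.normal(size=(n, 128)).astype(np.float32)  # initial embeddings

for _ in range(3):  # each iteration: one big sparse matrix multiplication
    emb = T @ emb
    emb /= np.linalg.norm(emb, axis=-1, keepdims=True)  # re-normalize after each step
```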
# Science
**Read the whitepaper ["Cleora: A Simple, Strong and Scalable Graph Embedding Scheme"](https://arxiv.org/abs/2102.02302)**
Types of data which can be embedded include for example:
- text and other categorical array data
- any combination of the above
**!!! Disclaimer: the numbers below are for Cleora 1.x; the new version is significantly faster, but we have yet to re-run the benchmarks**
Key competitive advantages of Cleora:
* more than **197x faster than DeepWalk**

The technical properties described above imply good production-readiness of Cleora.
## Documentation
**!!! Disclaimer: the documentation below is for Cleora 1.x, to be updated for 2.x**
More information can be found in [the full documentation](https://cleora.readthedocs.io/).