
Commit d3e26e8

Add tunable HMH impl (#2)
Add tunable HyperMinHash impl
1 parent b99941c commit d3e26e8

30 files changed

+3288
-538
lines changed

Diff for: NOTICE

+2
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,2 @@
1+
This product includes software developed by The Apache Software
2+
Foundation (http://www.apache.org/).

Diff for: README.md

+51-16
@@ -1,11 +1,26 @@
11
[![Build Status](https://travis-ci.org/LiveRamp/HyperMinHash-java.svg?branch=master)](https://travis-ci.org/LiveRamp/HyperMinHash-java)
22

33
# HyperMinHash-java
4-
A Java implementation of the HyperMinHash algorithm, presented in [Yu and Weber](https://arxiv.org/pdf/1710.08436.pdf). HyperMinHash allows approximating cardinalities, intersections, and Jaccard indices of sets with very high accuracy, in loglog space, and in a streaming fashion.
4+
A Java implementation of the HyperMinHash algorithm, presented by
5+
[Yu and Weber](https://arxiv.org/pdf/1710.08436.pdf).
6+
HyperMinHash allows approximating set unions, intersections, Jaccard Indices,
7+
and cardinalities of very large sets with high accuracy using only loglog space.
8+
It also supports streaming updates and merging sketches, just as
9+
HyperLogLog does.
510

6-
This library uses [Loglog-Beta](https://arxiv.org/pdf/1612.02284.pdf) for the underlying LogLog implementation. Loglog-beta is almost identical in accuracy to HyperLogLog++, except it performs better on cardinality estimations for small datasets (n <= 200k). Since we use Loglog-Beta, we refer to our implementation as BetaMinHash.
11+
This repo implements two flavors of HyperMinHash:
12+
1) **HyperMinHash**: An implementation based on HyperLogLog with the
13+
addition of the bias correction seen in HyperLogLog++.
14+
2) **BetaMinHash**: An implementation which uses [LogLog-Beta](http://cse.seu.edu.cn/PersonalPage/csqjxiao/csqjxiao_files/papers/INFOCOM17.pdf)
15+
for the underlying LogLog implementation. Loglog-beta is almost identical in
16+
accuracy to HyperLogLog++, except it performs better on cardinality
17+
estimations for small datasets (n <= 200k). Since we use Loglog-Beta,
18+
we refer to our implementation as BetaMinHash. However, our implementation
19+
currently only supports a fixed precision `p=14`.
720

8-
In addition to the features described above, this library adds the ability to do many-way intersections between sets, a new feature not described in the original paper (though, credit to the authors, easy to deduce from their examples). We also provide an implementation of the Hadoop Writable interface for easy use with MapReduce.
21+
Both implementations are equipped with serialization/deserialization
22+
capabilities out of the box for sending sketches over the wire or
23+
persisting them to disk.
924

1025
## Demo Usage
1126

@@ -17,27 +32,47 @@ for (byte[] element : mySet){
1732
sketch.add(element);
1833
}
1934
20-
sketch.cardinality();
21-
```
35+
long estimatedCardinality = sketch.cardinality();
36+
```
2237

2338

24-
### Merging sketches
39+
### Merging (unioning) sketches
40+
```
41+
Collection<BetaMinHash> sketches = getSketches();
42+
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();
43+
BetaMinHash combined = combiner.union(sketches);
44+
45+
// to get cardinality of the union
46+
long unionCardinality = combined.cardinality();
47+
48+
// using HyperMinHash instead of BetaMinHash
49+
Collection<HyperMinHash> sketches = getSketches();
50+
SketchCombiner<HyperMinHash> combiner = HyperMinHashCombiner.getInstance();
51+
HyperMinHash combined = combiner.union(sketches);
2552
```
26-
BetaMinHash[] sketches = getSketches();
27-
BetaMinHash.merge(sketches);
28-
```
2953

3054
### Cardinality of unions
3155
```
32-
BetaMinHash[] sketches = getSketches();
33-
BetaMinHash.union(sketches);
34-
```
56+
BetaMinHash combined = combiner.union(sketches);
57+
long estimatedCardinality = combined.cardinality();
58+
```
3559

3660
### Cardinality of intersection
3761
```
38-
BetaMinHash[] sketches = getSketches();
39-
BetaMinHash.intersection(sketches);
40-
```
62+
Collection<BetaMinHash> sketches = getSketches();
63+
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();
64+
long intersectionCardinality = combiner.intersectionCardinality(sketches);
65+
```
66+
67+
### Serializing a sketch
68+
To get a byte[] representation of a sketch, use the `IntersectionSketch.SerDe` interface:
69+
```
70+
HyperMinHash sketch = new
71+
HyperMinHashSerde serde = new HyperMinHashSerde();
72+
```
4173

4274
## Acknowledgements
43-
Thanks to Seif Lotfy for implementing a [Golang version of HyperMinHash](http://github.com/axiomhq/hyperminhash). We use some of his tests in our library, and the decision to use LogLog-Beta was due to the example he set.
75+
Thanks to Seif Lotfy for implementing a
76+
[Golang version of HyperMinHash](http://github.com/axiomhq/hyperminhash).
77+
We use some of his tests in our library, and our BetaMinHash implementation
78+
references his implementation.
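
The README changes above describe unioning sketches and estimating intersection cardinality, but the combiner classes themselves are not part of this excerpt. The following is a minimal sketch of the general idea, not the library's code. It assumes that a union keeps the larger packed register per index (the same keep-the-larger rule `addHash` applies), and it uses a simplified Jaccard estimate that skips the collision correction described in the HyperMinHash paper. The class `RegisterUnionSketch` and all of its methods are hypothetical names introduced for illustration.

```java
import java.util.List;

/** Hypothetical helper, not part of the library: works on raw 16-bit registers. */
final class RegisterUnionSketch {

  /** Element-wise maximum of equally sized register arrays; inputs are not modified. */
  static short[] union(List<short[]> registerSets) {
    if (registerSets.isEmpty()) {
      throw new IllegalArgumentException("need at least one register array");
    }
    short[] merged = registerSets.get(0).clone();
    for (short[] registers : registerSets.subList(1, registerSets.size())) {
      if (registers.length != merged.length) {
        throw new IllegalArgumentException("register arrays differ in length");
      }
      for (int i = 0; i < merged.length; i++) {
        // Same signed comparison as addHash: keep the larger packed value.
        if (merged[i] < registers[i]) {
          merged[i] = registers[i];
        }
      }
    }
    return merged;
  }

  /**
   * Simplified Jaccard estimate (assumption: no collision correction): the share of
   * registers that are non-empty in at least one sketch and identical in both.
   */
  static double naiveJaccard(short[] a, short[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("register arrays differ in length");
    }
    int nonEmpty = 0;
    int matching = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 || b[i] != 0) {
        nonEmpty++;
        if (a[i] == b[i]) {
          matching++;
        }
      }
    }
    return nonEmpty == 0 ? 0.0 : (double) matching / nonEmpty;
  }

  /** Intersection size estimated as Jaccard index times union cardinality. */
  static long intersectionEstimate(double jaccard, long unionCardinality) {
    return Math.round(jaccard * unionCardinality);
  }
}
```

The union cardinality passed to `intersectionEstimate` would come from a cardinality estimator such as `combined.cardinality()` in the README examples; that estimator is outside the scope of this sketch.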

Diff for: pom.xml

+8-12
@@ -1,6 +1,7 @@
11
<?xml version="1.0" encoding="UTF-8"?>
2-
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"
3-
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
2+
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
3+
xmlns="http://maven.apache.org/POM/4.0.0"
4+
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
45
<modelVersion>4.0.0</modelVersion>
56

67
<artifactId>hyperminhash</artifactId>
@@ -43,21 +44,16 @@
4344
</plugins>
4445
</build>
4546
<dependencies>
46-
<dependency>
47-
<groupId>org.apache.hadoop</groupId>
48-
<artifactId>hadoop-common</artifactId>
49-
<version>${hadoop.version}</version>
50-
</dependency>
51-
<dependency>
52-
<groupId>com.adgear</groupId>
53-
<artifactId>metrohash</artifactId>
54-
<version>1.0.0</version>
55-
</dependency>
5647
<dependency>
5748
<groupId>junit</groupId>
5849
<artifactId>junit</artifactId>
5950
<version>4.12</version>
6051
</dependency>
52+
<dependency>
53+
<groupId>org.apache.commons</groupId>
54+
<artifactId>commons-math3</artifactId>
55+
<version>3.4.1</version>
56+
</dependency>
6157
</dependencies>
6258

6359
<repositories>
+94-104
@@ -1,157 +1,147 @@
11
package com.liveramp.hyperminhash;
22

33
import java.nio.ByteBuffer;
4-
5-
import util.hash.MetroHash128;
4+
import java.util.Arrays;
65

76
/**
87
* Implementation of HyperMinHash described in Yu and Weber: https://arxiv.org/pdf/1710.08436.pdf.
9-
* This class implements LogLog-Beta described in Qin, Kim, et al. here: https://arxiv.org/pdf/1612.02284.pdf.
10-
* Loglog-Beta is almost identical in accuracy to HyperLogLog and HyperLogLog++ except it performs better on cardinality
11-
* estimations for small datasets (n <= 200_000). It's also much simpler to implement.
8+
* This class implements LogLog-Beta described in Qin, Kim, et al. here:
9+
* https://arxiv.org/pdf/1612.02284.pdf. Loglog-Beta is almost identical in accuracy to HyperLogLog
10+
* and HyperLogLog++ except it performs better on cardinality estimations for small datasets (n <=
11+
* 200_000). It's also much simpler to implement.
12+
* <p>
13+
* The log log implementation uses the values of p and beta coefficients tested in the Loglog-beta
14+
* paper. It's possible to use different values of P but we'd need to recompute the beta
15+
* coefficients which is a computationally intensive process. So for now, this impl doesn't support
16+
* using different values of P. This being said the current value of P works with high accuracy for
17+
* very large cardinalities and small jaccard indices. See the paper for more details.
1218
* <p>
13-
* The log log implementation uses the values of p and beta coefficients tested in the Loglog-beta paper. It's possible
14-
* to use different values of P but we'd need to recompute the beta coefficients which is a computationally intensive
15-
* process. So for now, this impl doesn't support using different values of P. This being said the current value of P
16-
* works with high accuracy for very large cardinalities and small jaccard indices. See the paper for more details.
1719
* <p>
20+
* Similarly, we use values of Q and R suggested in the HyperMinHash paper. Those are theoretically
21+
* changeable, but the current values should provide sufficient accuracy for set cardinalities up to
22+
* 2^89 (see Hyperminhash paper for reference).
1823
* <p>
19-
* Similarly, we use values of Q and R suggested in the HyperMinHash paper. Those are theoretically changeable, but the
20-
* current values should provide sufficient accuracy for set cardinalities up to 2^89 (see Hyperminhash paper for
21-
* reference).
24+
* If you want to be able to combine multiple BetaMinHash instances, or compute their intersection,
25+
* you can use {@link BetaMinHashCombiner}.
2226
* <p>
2327
* If you'd like this class to support custom Q or R or P values, please open a github issue.
2428
* <p>
2529
*/
26-
public class BetaMinHash {
27-
// HLL Precision parameter
28-
public final static int P = 14;
29-
public final static int NUM_REGISTERS = (int)Math.pow(2, P);
30+
public class BetaMinHash implements IntersectionSketch<BetaMinHash> {
3031

32+
// HLL Precision parameter
33+
public static final int P = 14;
34+
public static final int NUM_REGISTERS = (int) Math.pow(2, P);
3135

3236
// TODO add actual validation if necessary
3337
// Q + R must always be <= 16 since we're packing values into 16 bit registers
34-
public final static int Q = 6;
35-
public final static int R = 10;
38+
public static final int Q = 6;
39+
public static final int R = 10;
40+
41+
private static final int HASH_SEED = 1337;
42+
static final byte VERSION = 1;
3643

3744
final short[] registers;
3845

3946
public BetaMinHash() {
4047
registers = new short[NUM_REGISTERS];
4148
}
4249

43-
BetaMinHash(short[] registers) {
44-
this();
45-
System.arraycopy(registers, 0, this.registers, 0, registers.length);
46-
}
47-
48-
public BetaMinHash(BetaMinHash other) {
49-
this(other.registers);
50+
private BetaMinHash(short[] registers) {
51+
this.registers = registers;
5052
}
5153

52-
public void add(byte[] val) {
53-
MetroHash128 hash = new MetroHash128(1337).apply(ByteBuffer.wrap(val));
54-
ByteBuffer buf = ByteBuffer.allocate(16);
55-
hash.writeBigEndian(buf);
56-
addHash(buf);
57-
}
5854

59-
/**
60-
* @param _128BitHash
61-
*/
62-
private void addHash(ByteBuffer _128BitHash) {
63-
if (_128BitHash.array().length != 16) {
64-
throw new IllegalArgumentException("input hash should be 16 bytes");
55+
static BetaMinHash deepCopyFromRegisters(short[] registers) {
56+
if (registers.length != NUM_REGISTERS) {
57+
throw new IllegalArgumentException(String.format(
58+
"Expected exactly %d registers, but there are %d",
59+
NUM_REGISTERS,
60+
registers.length));
6561
}
6662

67-
long hashLeftHalf = _128BitHash.getLong(0);
68-
long hashRightHalf = _128BitHash.getLong(8);
69-
70-
int registerIndex = getLeftmostPBits(hashLeftHalf);
71-
short rBits = getRightmostRBits(hashLeftHalf);
63+
final short[] registersCopy = new short[NUM_REGISTERS];
64+
System.arraycopy(registers, 0, registersCopy, 0, NUM_REGISTERS);
7265

73-
byte leftmostOneBitPosition = getLeftmostOneBitPosition(hashRightHalf);
74-
75-
short packedRegister = packIntoRegister(leftmostOneBitPosition, rBits);
76-
if (registers[registerIndex] < packedRegister) {
77-
registers[registerIndex] = packedRegister;
78-
}
66+
return wrapRegisters(registersCopy);
7967
}
8068

81-
private int getLeftmostPBits(long hash) {
82-
return (int)(hash >>> (Long.SIZE - P));
69+
static BetaMinHash wrapRegisters(short[] registers) {
70+
return new BetaMinHash(registers);
8371
}
8472

85-
/**
86-
* Finds the position of the leftmost one-bit in the first (2^Q)-1 bits.
87-
*
88-
* @param hash
89-
* @return
90-
*/
91-
private byte getLeftmostOneBitPosition(long hash) {
92-
// To find the position of the leftmost 1-bit in the first (2^Q)-1 bits
93-
// We zero out all bits to the right of the first (2^Q)-1 bits then add a
94-
// 1-bit in the 2^Qth position of the bits to search. This way if the bits we're
95-
// searching are all 0, we take the position of the leftmost 1-bit to be 2^Q
96-
int _2q = (1 << Q) - 1;
97-
int shiftAmount = (Long.SIZE - _2q);
98-
99-
// zero all bits to the right of the first (2^Q)-1 bits
100-
long _2qSearchBits = ((hash >>> shiftAmount) << shiftAmount);
101-
102-
// add a 1-bit in the 2^Qth position
103-
_2qSearchBits += (1 << (shiftAmount - 1));
104-
105-
return (byte)(Long.numberOfLeadingZeros(_2qSearchBits) + 1);
73+
@Override
74+
public long cardinality() {
75+
return BetaMinHashCardinalityGetter.cardinality(this);
10676
}
10777

108-
private short getRightmostRBits(long hash) {
109-
return (short)(hash << (Long.SIZE - R) >>> Long.SIZE - R);
78+
@Override
79+
public boolean offer(byte[] val) {
80+
long[] _128BitHash = Murmur3.hash128(val);
81+
ByteBuffer buf = ByteBuffer.allocate(16);
82+
buf.putLong(_128BitHash[0]);
83+
buf.putLong(_128BitHash[1]);
84+
return addHash(buf);
11085
}
11186

112-
/**
113-
* Creates a new tuple/register value for the LL-Beta by bit-packing the number
114-
* of leading zeros with the rightmost R bits.
115-
*
116-
* @param leftmostOnebitPosition
117-
* @param rightmostRBits
118-
* @return
119-
*/
120-
private short packIntoRegister(byte leftmostOnebitPosition, short rightmostRBits) {
121-
// Q is at most 6, which means that with R<=10, we should be able to store these two
122-
// numbers in the same register
123-
return (short)((leftmostOnebitPosition << R) | rightmostRBits);
87+
@Override
88+
public boolean equals(Object o) {
89+
if (this == o) {
90+
return true;
91+
}
92+
if (!(o instanceof BetaMinHash)) {
93+
return false;
94+
}
95+
BetaMinHash that = (BetaMinHash) o;
96+
return Arrays.equals(registers, that.registers);
12497
}
12598

126-
public long cardinality() {
127-
return BetaMinHashCardinalityGetter.cardinality(this);
99+
@Override
100+
public int hashCode() {
101+
return Arrays.hashCode(registers);
128102
}
129103

130-
/**
131-
* @return Merged sketch representing the input sketches
132-
*/
133-
public static BetaMinHash merge(BetaMinHash... sketches) {
134-
return BetaMinHashMergeGetter.merge(sketches);
104+
@Override
105+
public BetaMinHash deepCopy() {
106+
return deepCopyFromRegisters(this.registers);
135107
}
136108

137109
/**
138-
* @return Union cardinality estimation
110+
* @param _128BitHash 16-byte buffer holding the 128-bit hash of the offered value
139111
*/
140-
public static long union(BetaMinHash... sketches) {
141-
return merge(sketches).cardinality();
142-
}
112+
private boolean addHash(ByteBuffer _128BitHash) {
113+
if (_128BitHash.array().length != 16) {
114+
throw new IllegalArgumentException("input hash should be 16 bytes");
115+
}
143116

144-
/**
145-
* @return Intersection cardinality estimation
146-
*/
147-
public static long intersection(BetaMinHash... sketches) {
148-
return BetaMinHashIntersectionGetter.getIntersection(sketches);
117+
long hashLeftHalf = _128BitHash.getLong(0);
118+
int registerIndex = (int) BitHelper.getLeftmostBits(hashLeftHalf, P);
119+
short leftmostOneBitPosition = BitHelper.getLeftmostOneBitPosition(_128BitHash.array(), P, Q);
120+
/* We take the rightmost bits as what's called h_hat3 in the paper. Note that this differs from
121+
* the diagram in the paper which draws a parallel to a mantissa in a floating point
122+
* representation, but still satisfies the criterion of serving as an independent hash function
123+
* by selecting a set of independent bits from a larger hash. This is slightly simpler to
124+
* implement. */
125+
short rBits = (short) BitHelper.getRightmostBits(_128BitHash.array(), R);
126+
127+
short packedRegister = packIntoRegister(leftmostOneBitPosition, rBits);
128+
if (registers[registerIndex] < packedRegister) {
129+
registers[registerIndex] = packedRegister;
130+
return true;
131+
}
132+
133+
return false;
149134
}
150135

151136
/**
152-
* @return Jaccard index estimation
137+
* Creates a new tuple/register value for the LL-Beta by bit-packing the number of leading zeros
138+
* with the rightmost R bits.
153139
*/
154-
public static double similarity(BetaMinHash... sketches) {
155-
return BetaMinHashSimilarityGetter.similarity(sketches);
140+
private short packIntoRegister(short leftmostOnebitPosition, short rightmostRBits) {
141+
// Q is at most 6, which means that with R<=10, we should be able to store these two
142+
// numbers in the same register
143+
final int exponent = leftmostOnebitPosition << R;
144+
final int packedRegister = (exponent | rightmostRBits);
145+
return (short) packedRegister;
156146
}
157147
}
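
The register layout used by `addHash` and `packIntoRegister` can be illustrated in isolation. The following is a minimal sketch, assuming the constants `P = 14`, `Q = 6`, and `R = 10` from the class above. `RegisterPackingSketch` and its `exponentOf` and `mantissaOf` helpers are hypothetical and do not appear in the commit; they only show how the two packed fields could be read back.

```java
/** Hypothetical illustration of the 16-bit register layout; not part of the commit. */
final class RegisterPackingSketch {
  static final int P = 14; // leftmost hash bits that select a register
  static final int Q = 6;  // bits reserved for the leftmost-one-bit position
  static final int R = 10; // bits kept from the right end of the hash

  /** The leftmost P bits of the first hash half select the register index. */
  static int registerIndex(long hashLeftHalf) {
    return (int) (hashLeftHalf >>> (Long.SIZE - P));
  }

  /** Packs the position into the high bits and the R-bit remainder into the low bits. */
  static short pack(int leftmostOneBitPosition, int rBits) {
    int mantissa = rBits & ((1 << R) - 1);
    return (short) ((leftmostOneBitPosition << R) | mantissa);
  }

  /** Reads the position back; masks to 16 bits first because short is signed. */
  static int exponentOf(short register) {
    return (register & 0xFFFF) >>> R;
  }

  /** Reads the low R bits back. */
  static int mantissaOf(short register) {
    return register & ((1 << R) - 1);
  }

  public static void main(String[] args) {
    short packed = pack(3, 5);
    System.out.println(exponentOf(packed) + " " + mantissaOf(packed));
  }
}
```

A position of 32 or more sets the sign bit of the `short`, which is why the unpacking helpers mask to an unsigned value before shifting.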
