Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 32 additions & 18 deletions java/com/google/re2j/Inst.java
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@
*/
class Inst {

private static final int RUNES_LINER_SEARCH_SIZE = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LINEAR?

Move this constant into the sole function that uses it.


enum Op {
ALT,
ALT_MATCH,
Expand Down Expand Up @@ -57,25 +59,37 @@ boolean matchRune(int r) {
// Special case: single-rune slice is from literal string, not char
// class.
if (runes.length == 1) {
int r0 = runes[0];
if (r == r0) {
return true;
}
if ((arg & RE2.FOLD_CASE) != 0) {
for (int r1 = Unicode.simpleFold(r0);
r1 != r0;
r1 = Unicode.simpleFold(r1)) {
if (r == r1) {
return true;
}
return singleMatchRune(r);
}

return multiMatchRune(r);
}

private boolean singleMatchRune(int r) {
int r0 = runes[0];
if (r == r0) {
return true;
}

if ((arg & RE2.FOLD_CASE) != 0) {
int[] folds = Unicode.optimizedFoldOrbit(r0);

if (folds == null) {
return (r == Unicode.optimizedFoldNonOrbit(r0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you benchmarked this against an implementation based on Character.toLowerCase(r) == r? Ideally with a few changes like that, we could throw away all the Unicode tables and use the ones built into java.lang.

Be sure to use the int (code point) not char (UTF-16 code) variants of those functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you are talking about the optimizedFoldNonOrbit? no I have not, was trying to be the least impactful as possible :) will have a look. From what I understand the Orbit exception will still need to be there ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the orbit case will still need special handling. I suggest you do this experiment as a separate change.

} else {
for(int i = 0; i < folds.length; i++) {
if (folds[i] == r) return true;
}
}
return false;
}
return false;
}

// Peek at the first few pairs.
private boolean multiMatchRune(int r) {
// Peek at the first 5 pairs.
// Should handle ASCII well.
for (int j = 0; j < runes.length && j <= 8; j += 2) {
int length = Math.min(runes.length, RUNES_LINER_SEARCH_SIZE);
for (int j = 0; j < length; j += 2) {
if (r < runes[j]) {
return false;
}
Expand All @@ -84,12 +98,12 @@ boolean matchRune(int r) {
}
}

// Otherwise binary search.
for (int lo = 0, hi = runes.length / 2; lo < hi; ) {
// Otherwise binary search on rest of the array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're here, please document:

// Invariant: lo, hi, m are even.

for (int lo = 0, hi = (runes.length - RUNES_LINER_SEARCH_SIZE) / 2; lo < hi; ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worth the fuss to subtract the portion initially scanned. It's a logarithmic algorithm and this is not the common case. I'm sure it's fast enough to search the complete array. Did you find evidence to the contrary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, will remove it

int m = lo + (hi - lo) / 2;
int c = runes[2 * m];
int c = runes[2 * m + RUNES_LINER_SEARCH_SIZE];
if (c <= r) {
if (r <= runes[2 * m + 1]) {
if (r <= runes[2 * m + 1 + RUNES_LINER_SEARCH_SIZE]) {
return true;
}
lo = m + 1;
Expand Down
95 changes: 95 additions & 0 deletions java/com/google/re2j/Unicode.java
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@

package com.google.re2j;

import java.util.Arrays;

/**
* Utilities for dealing with Unicode better than Java does.
*
Expand Down Expand Up @@ -227,6 +229,99 @@ static int simpleFold(int r) {
return toUpper(r);
}

static int[] optimizedFoldOrbit(int r) {
if (r >= ORBIT_MIN_VALUE) {
return FOLD_MAP.get(r);
} else return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use {...} around every block, or outdent the else block:

if (r < ORBIT_MIN_VALUE) {
    return null;
}
return FOLD_MAP.get(r);

}

static int optimizedFoldNonOrbit(int r) {
int l = toLower(r);
if (l != r) {
return l;
}
return toUpper(r);
}

private static final FoldMap FOLD_MAP = new FoldMap(UnicodeTables.CASE_ORBIT.length * 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whence 4? Is it a load factor based on MAX_DISTANCE?

I don't think this should be a parameter. FoldMap is instantiated exactly once; specialize it.

private static final int ORBIT_MIN_VALUE = UnicodeTables.CASE_ORBIT[0][0];

static {
for(int i = 0; i < UnicodeTables.CASE_ORBIT.length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (int[][] orbit : UnicodeTables.CASE_ORBIT)

int r0 = UnicodeTables.CASE_ORBIT[i][0];
int[] folds = new int[0];
int r1 = r0;
while((r1 = simpleFold(r1)) != r0) {
folds = Arrays.copyOf(folds, folds.length + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't an orbit necessarily of length at least 3? I realize they are small and few, but it seems perverse to copy the entire array each time we add an element. Use a temporary large enough for the longest orbit (MAX_DISTANCE?) and copy it once before each call to FOLD_MAP.put.

folds[folds.length - 1] = r1;
}
FOLD_MAP.put(r0, folds);
}
}

/*
associate a rune to it's unfolded case equivalent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"A FoldMap maps a rune to an array of runes that are equivalent to it in a case-insensitive pattern."

*/
private static class FoldMap {
private static final int MAX_DISTANCE = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document:

// number of closed-hashing probes before giving up

private final int[] keys;
private final int[][] values;
private final int mask;
private final int maxDistance;

FoldMap(int size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FoldMap serves exactly one purpose. It doesn't need a parameter, and it doesn't need to be separated from the logic that instantiates and populates it.

int s = findNextPositivePowerOfTwo(size);
keys = new int[s];
Arrays.fill(keys, -1);
values = new int[s][];
mask = s - 1;
maxDistance = Math.min(keys.length, MAX_DISTANCE);
}

public void put(int k, int[] fold) {
if (k == -1) throw new IllegalArgumentException("-1 is the empty marker");

int hash = hashCode(k);
int i = 0;
do {
int index = (hash + i) & mask;
if (keys[index] == -1) { // empty slot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two spaces before -1.

values[index] = fold;
keys[index] = k;
return;
}
i++;
} while (i < maxDistance);

throw new IllegalStateException("Map is full");
}

public int[] get(int k) {
int hash = hashCode(k);

int i = 0;
do {
int index = (hash + i) & mask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two spaces before &.

int key = keys[index];
if (key == -1) return null; // empty slot
if (key == k) return values[index];
i++;
} while (i < maxDistance);

return null;
}
private static final int INT_PHI = 0x9E3779B9;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to name this constant.


private int hashCode(int k) {
final int h = k * INT_PHI;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for final here.

return h ^ (h >> 16);
}

public static int findNextPositivePowerOfTwo(final int value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for final here.
Does this need to be public?

return 1 << (32 - Integer.numberOfLeadingZeros(value - 1));
}
}

private Unicode() {} // uninstantiable

}
48 changes: 48 additions & 0 deletions javatests/com/google/re2j/RunesTest.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
package com.google.re2j;

import org.junit.Test;

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class RunesTest {

@Test
public void testRunes() {
RE2 compile = RE2.compile("[0-13-46-78-9b-ce-fh-i]");
assertTrue(compile.match("0"));
assertTrue(compile.match("1"));
assertTrue(compile.match("3"));
assertTrue(compile.match("4"));
assertTrue(compile.match("6"));
assertTrue(compile.match("7"));
assertTrue(compile.match("8"));
assertTrue(compile.match("9"));
assertTrue(compile.match("b"));
assertTrue(compile.match("c"));
assertTrue(compile.match("e"));
assertTrue(compile.match("f"));
assertTrue(compile.match("h"));
assertTrue(compile.match("i"));

assertFalse(compile.match("2"));
assertFalse(compile.match("5"));
assertFalse(compile.match("a"));
assertFalse(compile.match("d"));
assertFalse(compile.match("g"));
assertFalse(compile.match("j"));
}
@Test
public void testRunesWithFold() {
Pattern pattern = Pattern.compile("ak", Pattern.CASE_INSENSITIVE);
String upperCase = "AK";
String withKelvin = "AK";
assertFalse(withKelvin.equals(upperCase));// check we use the kelvin sign
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space before //


assertTrue(pattern.matches("ak"));
assertTrue(pattern.matches(upperCase));
assertTrue(pattern.matches("aK"));
assertTrue(pattern.matches("Ak"));
assertTrue(pattern.matches(withKelvin));
}
}
21 changes: 21 additions & 0 deletions javatests/com/google/re2j/UnicodeTest.java
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@

package com.google.re2j;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

import org.junit.Test;
Expand Down Expand Up @@ -38,4 +39,24 @@ public void testFoldConstants() {
// int toLower(int r);
// int simpleFold(int r);

@Test
public void testOptimizedFold() {
// check that the new optimized fold alg will have the same result as using simple fold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check (capital C)
algorithm (in full)
fold. (period)

s/will have/gives/

for (int i = 0; i <= Unicode.MAX_RUNE; i++) {
int[] orbit = Unicode.optimizedFoldOrbit(i);
if (orbit != null) {
int r = Unicode.simpleFold(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoist this statement out of the if block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed you meant to create a method.

int j = 0;
while (r != i) {
assertEquals(r, orbit[j]);
j++;
r = Unicode.simpleFold(r);
}
} else {
int r = Unicode.simpleFold(i);
assertEquals(r, Unicode.optimizedFoldNonOrbit(i)); // first fold map the same
assertEquals(i, Unicode.simpleFold(r)); // second fold always go back to first
}
}
}
}