
Commit 0de538c

Auto merge of rust-lang#136693 - oxalica:feat/shift-dfa-utf8, r=<try>
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

Take 2 of rust-lang#107760 (cc `@thomcc`)

### Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in rust-lang#107760,

> For prior art: shift-DFAs are now used for UTF-8 validation in [PostgreSQL](https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1753), and seems to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level [1](rust-lang#107760 (comment)).

### Rationales

1. Performance: this algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.
2. Generality: it uses no SIMD instructions and does not rely on the branch predictor for good performance, so it works well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only chunks, which benefits from auto-vectorization if the target supports it.

### Implementation details

I use the ordinary UTF-8 language definition from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629#section-4) and translate it directly into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata.

The main algorithm consists of the following parts (sketched in Rust after the lists below):

1. Main loop: take a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check whether the state is ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I chose `ASCII_CHUNK_SIZE = 16` to align with the current implementation, which takes 16 bytes at a time when checking for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA byte by byte, stop on error as soon as possible, and calculate the error/valid location. To keep things simple, if any error is encountered in the main loop, it discards the erroneous chunk and `break`s into this path to find the precise error location. That is, the erroneous chunk, if one exists, is traversed twice, in exchange for a tighter and more efficient hot loop.

There are also some small tricks in use:

1. Since i686-linux is in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. This shows a 200%+ speedup compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte, as in `utf8_char_width`. I merge that previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. This does introduce an extra 32-bit shift, which I believe is almost free, though I have not benchmarked it yet.
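To make the core technique concrete, here is a minimal sketch (mine, not the PR's actual code) of the shift-DFA transition and a plain per-byte validation loop. It assumes a `trans_table: [u32; 256]` whose rows are the per-byte-class words printed by `solve_dfa.py` (included in this diff); each state is encoded as a bit offset into the row, so one shift plus a 5-bit mask is a whole table-driven transition:

```rust
// A sketch of the technique, not the PR's actual code. OFFSETS comes from
// the solve_dfa.py output included in this diff; `trans_table` is assumed to
// map each byte to its class's packed transition word.
const OFFSETS: [u32; 9] = [0, 6, 16, 19, 1, 25, 11, 18, 24];
const ERROR: u32 = OFFSETS[0]; // state 0, packed at bit offset 0
const ACCEPT: u32 = OFFSETS[1]; // state 1 (initial), packed at bit offset 6

/// One DFA step: shift this byte's row right by the current state's offset
/// and keep 5 bits; the result is directly the next state's offset.
#[inline(always)]
fn next_state(trans_table: &[u32; 256], state: u32, byte: u8) -> u32 {
    (trans_table[byte as usize] >> state) & 31
}

/// Plain per-byte validation, without chunking or the ASCII bypass.
fn is_utf8(trans_table: &[u32; 256], bytes: &[u8]) -> bool {
    let mut state = ACCEPT;
    for &b in bytes {
        state = next_state(trans_table, state, b);
        if state == ERROR {
            return false; // fast exit only: the all-zero row makes ERROR absorbing
        }
    }
    state == ACCEPT
}
```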
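The chunked hot loop with the ASCII bypass then looks roughly like this, again as a sketch reusing `next_state`, `ACCEPT` and `ERROR` from above; the real code also reports the error position, which is omitted here:

```rust
// Rough shape of the hot loop; reuses next_state/ACCEPT/ERROR from the sketch
// above. Error-position recovery (rescanning the bad chunk byte by byte) is
// omitted. Assumes the two chunk sizes are equal, as chosen in this PR.
const MAIN_CHUNK_SIZE: usize = 16;
const ASCII_CHUNK_SIZE: usize = 16;

fn validate_chunked(trans_table: &[u32; 256], bytes: &[u8]) -> bool {
    let mut state = ACCEPT;
    let mut i = 0;
    while i + MAIN_CHUNK_SIZE <= bytes.len() {
        if state == ACCEPT {
            // ASCII bypass: not mid-sequence, so an all-ASCII chunk needs no
            // DFA at all. The OR-reduction is written to auto-vectorize.
            if bytes[i..i + ASCII_CHUNK_SIZE].iter().fold(0, |acc, &b| acc | b) < 0x80 {
                i += ASCII_CHUNK_SIZE;
                continue;
            }
        }
        // Run the DFA over the chunk; test for ERROR only once per chunk.
        for &b in &bytes[i..i + MAIN_CHUNK_SIZE] {
            state = next_state(trans_table, state, b);
        }
        if state == ERROR {
            return false; // the real code rescans this chunk for the exact position
        }
        i += MAIN_CHUNK_SIZE;
    }
    // Trailing bytes: step the DFA one byte at a time.
    for &b in &bytes[i..] {
        state = next_state(trans_table, state, b);
    }
    state == ACCEPT
}
```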
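For the two tricks, the bit layout below is my assumption, not necessarily the PR's exact one. Because the solver constrains every state offset to be below `32 - 5`, the 5 transition bits never straddle the 32-bit boundary of a `u64` row, so the shift can pick a 32-bit half with a conditional move instead of doing a real 64-bit shift, and the high half is free to carry the merged `utf8_char_width` value:

```rust
// Bit layout assumed for illustration: low 32 bits of each u64 row hold the
// packed transitions, the high bits hold the merged char-width value.

/// Emulate the 64-bit shift with a conditional move plus a 32-bit shift.
/// Valid here because every offset is < 27, so the 5 extracted bits never
/// straddle the 32-bit boundary.
fn shift_row(row: u64, shift: u32) -> u32 {
    // LLVM lowers this select to a CMOV on x86, so i686 never needs SHRD.
    let half = if (shift & 32) == 0 { row as u32 } else { (row >> 32) as u32 };
    half >> (shift & 31)
}

/// State transition: all transition offsets live in the low half.
fn step(row: u64, state: u32) -> u32 {
    shift_row(row, state) & 31
}

/// Width lookup merged into the high bits: the "extra 32-bit shift".
fn char_width_from_row(row: u64) -> usize {
    (shift_row(row, 32) & 0b111) as usize // exact width bit position is assumed
}
```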
### Benchmarks

I made an [out-of-tree implementation repository](https://github.com/oxalica/shift-dfa-utf8) for easier testing and benchmarking. It also tests various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations.

Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article "William Shakespeare" in [en](https://en.wikipedia.org/wiki/William_Shakespeare), [es](https://es.wikipedia.org/wiki/William_Shakespeare) and [zh](https://zh.wikipedia.org/wiki/%E5%A8%81%E5%BB%89%C2%B7%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A).

In short: with m=16, a=16, shift-DFA gives -43% on en, +53% on es and +133% on zh; with m=16, a=32, it gives -9% on en, +26% on es and +33% on zh. This is expected: the larger the ASCII bypass chunk, the better it performs on pure ASCII, but the worse on mixed content like es, because the bypass branch keeps flipping back and forth. To me, the difference between 27GiB/s and 47GiB/s on en is minimal in absolute time, 144.61ns - 79.86ns = 64.75ns, compared to 476.05ns - 392.44ns = 83.61ns on es. So I currently chose m=16, a=16 in this PR.

On x86_64-linux, Ryzen 7 5700G @ 3.775GHz:

| Algorithm         | Input language | Throughput / (GiB/s) |
|-------------------|----------------|----------------------|
| std               | en             | 47.768 ±0.301        |
| shift-dfa-m16-a16 | en             | 27.337 ±0.002        |
| shift-dfa-m16-a32 | en             | 43.627 ±0.006        |
| std               | es             | 6.339 ±0.010         |
| shift-dfa-m16-a16 | es             | 9.721 ±0.014         |
| shift-dfa-m16-a32 | es             | 8.013 ±0.009         |
| std               | zh             | 1.463 ±0.000         |
| shift-dfa-m16-a16 | zh             | 3.401 ±0.002         |
| shift-dfa-m16-a32 | zh             | 3.407 ±0.001         |

### Unresolved

- [ ] Benchmark on aarch64-darwin, another Tier 1 target. I don't have a machine to play with.
- [ ] Decide the chunk size parameters. I'm currently picking m=16, a=16.
- [ ] Should we also replace the implementation of [lossy conversion](https://github.com/oxalica/rust/blob/c0639b8cad126d886ddd88964f729dd33fb90e67/library/core/src/str/lossy.rs#L194) with a call to the new validation function? It contains very similar code doing almost the same thing.
2 parents 1ff2135 + 486b0d1 commit 0de538c

File tree

2 files changed: +290 -121 lines


Diff for: library/core/src/str/solve_dfa.py (+78 lines)
@@ -0,0 +1,78 @@
#!/usr/bin/env python3
# Use z3 to solve the UTF-8 validation DFA for offsets and the transition
# table, in order to encode the transition table into u32.
# We minimize the output variables in the solution to make it deterministic.
# Ref: <https://gist.github.com/dougallj/166e326de6ad4cf2c94be97a204c025f>
# See a more detailed explanation in `./validations.rs`.
#
# It is expected to find a solution in <30s on a modern machine, and the
# solution is appended to the end of this file.
from z3 import *

STATE_CNT = 9

# The transition table.
# A value X on column Y means state Y should transition to state X on some
# input bytes. We assign state 0 as ERROR and state 1 as ACCEPT (initial).
# Eg. first line: for input byte 00..=7F, transition S1 -> S1, others -> S0.
TRANSITIONS = [
    # 0  1  2  3  4  5  6  7  8
    # First bytes
    ((0, 1, 0, 0, 0, 0, 0, 0, 0), "00-7F"),
    ((0, 2, 0, 0, 0, 0, 0, 0, 0), "C2-DF"),
    ((0, 3, 0, 0, 0, 0, 0, 0, 0), "E0"),
    ((0, 4, 0, 0, 0, 0, 0, 0, 0), "E1-EC, EE-EF"),
    ((0, 5, 0, 0, 0, 0, 0, 0, 0), "ED"),
    ((0, 6, 0, 0, 0, 0, 0, 0, 0), "F0"),
    ((0, 7, 0, 0, 0, 0, 0, 0, 0), "F1-F3"),
    ((0, 8, 0, 0, 0, 0, 0, 0, 0), "F4"),
    # Continuation bytes
    ((0, 0, 1, 0, 2, 2, 0, 4, 4), "80-8F"),
    ((0, 0, 1, 0, 2, 2, 4, 4, 0), "90-9F"),
    ((0, 0, 1, 2, 2, 0, 4, 4, 0), "A0-BF"),
    # Illegal
    ((0, 0, 0, 0, 0, 0, 0, 0, 0), "C0-C1, F5-FF"),
]

o = Optimize()
offsets = [BitVec(f"o{i}", 32) for i in range(STATE_CNT)]
trans_table = [BitVec(f"t{i}", 32) for i in range(len(TRANSITIONS))]

# Add some guiding constraints to make solving faster.
o.add(offsets[0] == 0)
o.add(trans_table[-1] == 0)

for i in range(len(offsets)):
    # Do not over-shift. It's not necessary but makes solving faster.
    o.add(offsets[i] < 32 - 5)
    for j in range(i):
        o.add(offsets[i] != offsets[j])
for trans, (targets, _) in zip(trans_table, TRANSITIONS):
    for src, tgt in enumerate(targets):
        o.add((LShR(trans, offsets[src]) & 31) == offsets[tgt])

# Minimize ordered outputs to get a unique solution.
goal = Concat(*offsets, *trans_table)
o.minimize(goal)
print(o.check())
print("Offset[]= ", [o.model()[i].as_long() for i in offsets])
print("Transitions:")
for (_, label), v in zip(TRANSITIONS, [o.model()[i].as_long() for i in trans_table]):
    print(f"{label:14} => {v:#10x}, // {v:032b}")

# Output should be deterministic:
# sat
# Offset[]= [0, 6, 16, 19, 1, 25, 11, 18, 24]
# Transitions:
# 00-7F          =>      0x180, // 00000000000000000000000110000000
# C2-DF          =>      0x400, // 00000000000000000000010000000000
# E0             =>      0x4c0, // 00000000000000000000010011000000
# E1-EC, EE-EF   =>       0x40, // 00000000000000000000000001000000
# ED             =>      0x640, // 00000000000000000000011001000000
# F0             =>      0x2c0, // 00000000000000000000001011000000
# F1-F3          =>      0x480, // 00000000000000000000010010000000
# F4             =>      0x600, // 00000000000000000000011000000000
# 80-8F          => 0x21060020, // 00100001000001100000000000100000
# 90-9F          => 0x20060820, // 00100000000001100000100000100000
# A0-BF          =>   0x860820, // 00000000100001100000100000100000
# C0-C1, F5-FF   =>        0x0, // 00000000000000000000000000000000
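As a quick sanity check of the printed solution (my addition, not part of the script): tracing `"é"` (bytes `0xC3 0xA9`) through the solved constants in Rust confirms the round trip back to ACCEPT:

```rust
// Hand-trace of the solved table over "é" = [0xC3, 0xA9].
// ACCEPT is state 1, packed at offset 6 (see Offset[] above).
fn main() {
    let accept = 6u32;
    // 0xC3 falls in the C2-DF class, whose solved row is 0x400.
    let s1 = (0x400u32 >> accept) & 31;
    assert_eq!(s1, 16); // state 2 (offset 16): one continuation byte expected
    // 0xA9 falls in the A0-BF class, whose solved row is 0x860820.
    let s2 = (0x860820u32 >> s1) & 31;
    assert_eq!(s2, accept); // back to ACCEPT: valid UTF-8
}
```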
