
Rewrite UTF-8 validation in shift-based DFA for 70%~135% performance increase on non-ASCII strings #136693

Open
wants to merge 8 commits into base: master from feat/shift-dfa-utf8

Conversation

oxalica
Contributor

@oxalica oxalica commented Feb 7, 2025

Take 2 of #107760 (cc @thomcc)

Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in #107760,

For prior art: shift-DFAs are now used for UTF-8 validation in PostgreSQL, and seems to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level1.

Rationales

  1. Performance: This algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.

  2. Generality: It does not use SIMD instructions and does not rely on the branch predictor to achieve good performance, making it a good general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

Implementation details

I use the ordinary UTF-8 language definition from RFC 3629 and directly translate it into a 9-state DFA. The compressed state is thus 64-bit, resulting in a table of [u64; 256], or 2KiB of rodata.
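
To make the mechanics concrete, here is a minimal, self-contained sketch of the shift-based DFA technique on a toy 3-state automaton (rejecting "--"), not this PR's 9-state UTF-8 table; the 6-bit packing and the names below are illustrative assumptions for the example only:

    // Each state is encoded as a shift amount; every table row packs the
    // next state for all current states, 6 bits per state, indexed by byte.
    const S_ACCEPT: u64 = 0; // last byte was not '-'
    const S_DASH: u64 = 6;   // last byte was '-'
    const S_ERROR: u64 = 12; // saw "--"

    fn build_table() -> [u64; 256] {
        let mut table = [0u64; 256];
        for b in 0..=255u8 {
            let (from_accept, from_dash) = if b == b'-' {
                (S_DASH, S_ERROR) // a '-' right after a '-' is an error
            } else {
                (S_ACCEPT, S_ACCEPT)
            };
            table[b as usize] =
                (from_accept << S_ACCEPT) | (from_dash << S_DASH) | (S_ERROR << S_ERROR);
        }
        table
    }

    fn no_double_dash(input: &[u8], table: &[u64; 256]) -> bool {
        let mut state = S_ACCEPT;
        for &b in input {
            // One load, one shift, one mask per byte; no data-dependent branches.
            state = (table[b as usize] >> state) & 63;
        }
        state != S_ERROR
    }

    fn main() {
        let table = build_table();
        assert!(no_double_dash(b"a-b-c", &table));
        assert!(!no_double_dash(b"a--b", &table));
    }

The actual table in this PR encodes the 9 states of the UTF-8 DFA in the same spirit, plus the extra tricks described below.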

The main algorithm consists of the following parts:

  1. Main loop: take a chunk of MAIN_CHUNK_SIZE = 16 bytes on each iteration, execute the DFA on the chunk, and check whether the state is ERROR once per chunk.
  2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII bytes and stop at the first chunk containing any non-ASCII byte. I choose ASCII_CHUNK_SIZE = 16 to align with the current implementation: checking 16 bytes at a time for non-ASCII, to encourage LLVM to auto-vectorize it (a sketch follows this list).
  3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep things simple, if any error is encountered in the main loop, it discards the erroneous chunk and breaks into this path to find the precise error location. That is, the erroneous chunk, if it exists, is traversed twice, in exchange for a tighter and more efficient hot loop.
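
A rough sketch of the ASCII-bypass step from item 2, under assumed names (skip_ascii_chunks is not the PR's actual function; the real loop is quoted in a review comment further down):

    const ASCII_CHUNK_SIZE: usize = 16;

    /// Starting at a char boundary (state == ACCEPT), return how many bytes of
    /// leading all-ASCII chunks can be skipped before the DFA must take over.
    fn skip_ascii_chunks(bytes: &[u8]) -> usize {
        let mut skipped = 0;
        while let Some(chunk) = bytes.get(skipped..skipped + ASCII_CHUNK_SIZE) {
            // Always scan the whole chunk (no short-circuit) so LLVM can vectorize it.
            let has_non_ascii = chunk.iter().fold(false, |acc, &b| acc | (b >= 0x80));
            if has_non_ascii {
                break;
            }
            skipped += ASCII_CHUNK_SIZE;
        }
        skipped
    }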

There are also some small tricks being used:

  1. Since i686-linux is a Tier 1 target and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. This shows a 200%+ speedup compared to the 64-bit-shift version.
  2. We still need to get the UTF-8 encoded length from the first byte in utf8_char_width. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables (a sketch of this layout follows the list). This does introduce an extra 32-bit shift; I believe it's almost free but have not benchmarked it yet.
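
A hedged sketch of what trick 2 looks like; WIDTH_SHIFT and the 3-bit field are illustrative assumptions, not the PR's actual bit layout:

    // Assumed layout: the low bits of each row hold the packed DFA transitions,
    // and a few otherwise-unused high bits store the encoded length (0..=4) of a
    // sequence starting with that byte, so one [u64; 256] table serves both uses.
    const WIDTH_SHIFT: u32 = 61; // illustrative position of the 3-bit width field

    fn utf8_char_width(first_byte: u8, table: &[u64; 256]) -> usize {
        ((table[first_byte as usize] >> WIDTH_SHIFT) & 0b111) as usize
    }

    fn main() {
        // Dummy table where only the width field is populated, for demonstration.
        let mut table = [0u64; 256];
        table[0xE4] = 3 << WIDTH_SHIFT; // 0xE4 starts a 3-byte sequence
        assert_eq!(utf8_char_width(0xE4, &table), 3);
    }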

Benchmarks

I made an out-of-tree implementation repository for easier testing and benching. It also tests various MAIN_CHUNK_SIZE (m) and ASCII_CHUNK_SIZE (a) configurations. Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article "William Shakespeare" in the en, es and zh languages.

In short: with m=16, a=16, shift-DFA gives -45% on en, +69% on es, +135% on zh; with m=8, a=32, it gives +5% on en, +22% on es, +136% on zh. This is expected: the larger the ASCII bypass chunk, the better it performs on ASCII but the worse on mixed content like "es", because the taken branch keeps flipping back and forth.
To me, the difference on "en" is minimal in absolute time because the throughput is already high enough, compared to the not-as-fast "es". So I'm currently picking m=16, a=16 in the PR to lean towards "es".

x86_64-linux results

On a Ryzen 7 5700G @3.775GHz (turbo disabled, with cpuset, with a stack layout randomizer):

Note: the zh input consists solely of non-ASCII codepoints and runs fully on the DFA path; its performance is the same as for purely-emoji inputs.

| Algorithm         | Input language | Throughput / (GiB/s) |
|-------------------|----------------|----------------------|
| std               | en             | 48.791 +-0.128       |
| shift-dfa-m16-a16 | en             | 26.815 +-0.006       |
| shift-dfa-m8-a32  | en             | 51.314 +-0.018       |
| std               | es             | 5.925 +-0.069        |
| shift-dfa-m16-a16 | es             | 10.033 +-0.007       |
| shift-dfa-m8-a32  | es             | 7.254 +-0.035        |
| std               | zh             | 1.450 +-0.008        |
| shift-dfa-m16-a16 | zh             | 3.414 +-0.002        |
| shift-dfa-m8-a32  | zh             | 3.421 +-0.002        |

Before (486b0d1):

    test string::from_utf8_lossy_100_ascii                   ... bench:          40.96 ns/iter (+/- 6.47)
    test string::from_utf8_lossy_100_invalid                 ... bench:       1,738.28 ns/iter (+/- 10.53)
    test string::from_utf8_lossy_100_multibyte               ... bench:          61.22 ns/iter (+/- 0.38)
    test string::from_utf8_lossy_invalid                     ... bench:         125.26 ns/iter (+/- 1.85)

    test str::str_validate_emoji                             ... bench:       3,279.32 ns/iter (+/- 54.01)

After with m16 a16 (ceb82dd971b7aef47493298255bab732bdc67b5e):

    test string::from_utf8_lossy_100_ascii                   ... bench:          23.92 ns/iter (+/- 0.10)
    test string::from_utf8_lossy_100_invalid                 ... bench:       1,962.22 ns/iter (+/- 42.19)
    test string::from_utf8_lossy_100_multibyte               ... bench:          41.19 ns/iter (+/- 0.75)
    test string::from_utf8_lossy_invalid                     ... bench:         155.69 ns/iter (+/- 3.46)

    test str::str_validate_emoji                             ... bench:       1,909.92 ns/iter (+/- 94.34)

After with m8 a32 (ceb82dd971b7aef47493298255bab732bdc67b5e):

    test string::from_utf8_lossy_100_ascii                   ... bench:          20.14 ns/iter (+/- 0.03)
    test string::from_utf8_lossy_100_invalid                 ... bench:       2,017.46 ns/iter (+/- 9.39)
    test string::from_utf8_lossy_100_multibyte               ... bench:          54.19 ns/iter (+/- 0.07)
    test string::from_utf8_lossy_invalid                     ... bench:         152.43 ns/iter (+/- 2.46)

    test str::str_validate_emoji                             ... bench:       2,715.99 ns/iter (+/- 164.68)

Unresolved

  • Benchmark on aarch64-darwin, another tier 1 target.
    See this comment.

  • Decide the chunk size parameters. I'm currently picking m=16, a=16.

  • Should we also replace the implementation of lossy conversion by calling the new validation function? It has very similar code doing almost the same thing.

    It now also uses the new validation algorithm. Benchmarks of lossy conversions are included above.

@rustbot
Collaborator

rustbot commented Feb 7, 2025

r? @cuviper

rustbot has assigned @cuviper.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 7, 2025

Contributor

@hanna-kruppe hanna-kruppe left a comment

Shift-based DFAs are a really cool trick, great to see that precise error reporting and ASCII fast paths can be added on top of them. I don't want to steal an actual reviewer's show and I've only given this a quick cursory read, but there's one thing I was wondering about and one small suggestion.

@the8472
Member

the8472 commented Feb 7, 2025

or 2KiB rodata.

I suspect that this is going to eat into cache capacity, especially in code that interleaves string validation with other things, e.g. CBOR decoding. Have you benchmarked this for short inputs and with mixed workloads?

In coretests we have TINY/SMALL/MEDIUM/LARGE/HUGE for that reason

Additionally you may want to benchmark 4-byte characters such as emoji or maybe some exotic scripts on the supplementary plane.

@thaliaarchi
Contributor

Should we also replace the implementation of lossy conversion by calling the new validation function?
It has a very similar code doing almost the same thing.

FYI, in #136677, I fixed the handling of invalid UTF-8 in Display for OsStr/Path, so there's a new user of lossy conversion. It needs to count the number of (potentially invalid) characters in a byte string and truncate it to some character width. I extracted a single iteration of the loop in Utf8Chunks::next to reuse it there. If Utf8Chunks is more amenable to using your optimization, OsStr might need a separate API.

@hanna-kruppe
Contributor

I suspect that this is going to eat into cache capacity, especially in code that interleaves string validation with other things, e.g. CBOR decoding.

This is a great point and should be benchmarked. Note that it's mitigated in many common cases by the table accesses depending only on the input bytes, so e.g. strings that are all ASCII touch at most half the table. But that's still a lot of cache lines.

@cuviper
Member

cuviper commented Feb 7, 2025

I don't want to steal an actual reviewer's show

No worries on my part as the current assignee -- in fact, if another reviewer has more context here, I'd be happy to yield this one. Maybe @the8472 wants to take it?

@oxalica
Contributor Author

oxalica commented Feb 7, 2025

@the8472

or 2KiB rodata.

I suspect that this is going to eat into cache capacity, especially in code that interleaves string validation with other things, e.g. CBOR decoding. Have you benchmarked this for short inputs and with mixed workloads?

Could you please explain how to test "short inputs and with mixed workloads"? Should I bench the performance of a CBOR library like ciborium using the new std on this branch? I'll give it a check later.

In coretests we have TINY/SMALL/MEDIUM/LARGE/HUGE for that reason

It seems they are only used in a few places like str::char_count and str::debug. str::char() is untouched by this PR, because it uses the UTF-8 decoder next_code_point, not the validator. str as Debug is unrelated to validation either. This PR only changes the implementation of core::str::from_utf8 and its derivations.

Additionally you may want to benchmark 4-byte characters such as emoji or maybe some exotic scripts on the supplementary plane.

It should perform the same as the "zh" benchmark. Since the DFA is agnostic to the input, as long as the input is purely non-ASCII so that it takes the DFA path, the performance should be the same. The correctness of the DFA can be audited by checking that the state transition table matches RFC 3629.
Nevertheless, here's the benchmark (./x.py bench --stage 0 library/coretests -- str::str_validate_emoji):

Before:
str::str_validate_emoji                                         2742.95ns/iter   +/- 46.44
After:
str::str_validate_emoji                                         1529.48ns/iter    +/- 7.52

@hanna-kruppe
Contributor

No worries on my part as the current assignee -- in fact, if another reviewer has more context here, I'd be happy to yield this one.

I just meant that I, personally, couldn't take over the review even if I wanted to since I haven't had bors permissions in years 😄

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 7, 2025

Slightly silly idea, free to a good home: mapping the input bytes into equivalence classes before feeding them to the DFA, as done in Flexible and Economical UTF-8 Decoder, would reduce the rodata size to 256B + 8B * (# classes). That's 256B + 96B with the classes from the linked implementation, but idk if that number can be lower or has to be larger in this context.

The critical path for the DFA transition is not affected by the input remapping, so this would still run pretty fast. On Skylake-ish hardware, idealized peak throughput goes from 1 cycle per byte to 1.5cpb because the extra load becomes the bottleneck (that's for a loop that does nothing but compute the final state, the version in this PR doesn't seem to hit 1cpb anyway). On CPUs that can sustain more than two loads per cycle, it may not even be any slower for long inputs.
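
For concreteness, a toy sketch of the remapping idea on the same 3-state "no --" automaton as the earlier sketch (not UTF-8; two classes suffice here, whereas UTF-8 needs more):

    // Remap bytes to equivalence classes before the DFA: the per-byte table
    // shrinks to [u8; 256] class ids plus one packed u64 row per class,
    // at the cost of one extra (state-independent) load.
    const S_ACCEPT: u64 = 0; // last byte was not '-'
    const S_DASH: u64 = 6;   // last byte was '-'
    const S_ERROR: u64 = 12; // saw "--"

    fn no_double_dash(input: &[u8]) -> bool {
        // Class 0: any byte other than '-'; class 1: '-'.
        let mut classes = [0u8; 256];
        classes[b'-' as usize] = 1;
        // One packed transition row per class instead of per byte.
        let rows: [u64; 2] = [
            (S_ACCEPT << S_ACCEPT) | (S_ACCEPT << S_DASH) | (S_ERROR << S_ERROR),
            (S_DASH << S_ACCEPT) | (S_ERROR << S_DASH) | (S_ERROR << S_ERROR),
        ];
        let mut state = S_ACCEPT;
        for &b in input {
            // byte -> class -> row does not depend on the previous state,
            // so only the final shift sits on the state-to-state critical path.
            state = (rows[classes[b as usize] as usize] >> state) & 63;
        }
        state != S_ERROR
    }

    fn main() {
        assert!(no_double_dash(b"a-b-c"));
        assert!(!no_double_dash(b"a--b"));
    }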

@the8472
Member

the8472 commented Feb 7, 2025

It seems they are only used in a few places like str::char_count and str::debug. str::char() is untouched by this PR

Sorry, I didn't mean that this was specifically for utf8 validation, I was just gesturing at more cases that can be useful for string benchmarking in general to cover common scenarios.
Maybe you can look at earlier PRs that touched validation, sometimes other people have implemented their own external benchmark suites if I recall correctly.

Could you please explain how to test "short inputs and with mixed workloads"? Should I bench the performance of a CBOR library like ciborium using the new std on this branch? I'll give it a check later.

That's the tricky part. As hanna-kruppe says, such effects are most visible in complex codebases and are difficult to test with micro-benchmarks.

I suggested CBOR because it should contain short strings (e.g. dictionary keys) mixed with other non-string parsing and some implementations might already have their own benchmarks that can be reused. So it could be a benchmark of intermediate complexity.
You can try using perf stat to see if an existing benchmark is already experiencing cache pressure which might be exacerbated by the lookup table. Though I guess modern CPU caches may have some strategy that can deal with streaming workloads without clobbering the parts, so maybe it won't be complex enough.

I'll kick off a perf run to see if it impacts rustc itself. But we have already eliminated a bunch of string validations in the compiler, so it likely isn't a good benchmark either.

@bors try @rust-timer queue

Nevertheless, here's the benchmark (./x.py bench --stage 0 library/coretests -- str::str_validate_emoji):

👍


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 7, 2025
@the8472 the8472 assigned the8472 and unassigned cuviper Feb 7, 2025
@bors
Contributor

bors commented Feb 7, 2025

⌛ Trying commit 33c076e with merge 71c0147...

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 7, 2025
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

@oxalica
Contributor Author

oxalica commented Feb 7, 2025

@hanna-kruppe

Slightly silly idea, free to a good home: mapping the input bytes into equivalence classes before feeding them to the DFA, as done in Flexible and Economical UTF-8 Decoder, would reduce the rodata size to 256B + 8B * (# classes). That's 256B + 96B with the classes from the linked implementation, but idk if that number can be lower or has to be larger in this context.
The critical path for the DFA transition is not affected by the input remapping

Not really. As I mentioned in the 32-bit-shift part, the bottleneck of the DFA path is latency, since the DFA has a data dependency from each state to its preceding state. The next state can only be calculated after the previous result is out, no matter how many ALUs you have. So,

goes from 1 cycle per byte to 1.5cpb

... literally means +50% latency and -50% performance. I tested it by introducing an identity lookup table [u8; 256] before each TRANS_TABLE lookup; throughput-zh-aligned/shift-dfa-m16-a16 then goes down from 3.401GiB/s to 2.424GiB/s, which is -29%. So I don't think it's worth it.

But yes, the table must be fully in cache to achieve these numbers; mixed workloads are a challenge. I would also prefer a smaller table size, but I haven't yet found a way to shrink it without introducing a massive performance drop.

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 7, 2025

@oxalica Remapping inputs before looking up the DFA transition for them doesn't add latency in any way that matters -- if it did, the extra latency of a load on the critical path would immediately take it to 4-5cpb or more. It's slower because it hits other bottlenecks, such as issue width or loads per cycle (you might want to check if your change unexpectedly added more instructions than just the loads). In any case, yes, its latency is significantly worse than the big LUT. But by your numbers, it's still 1.65x faster than the current implementation in std on the zh input. So it's another point on the Pareto curve if rodata size is a concern -- often it's not, but the discussion about it prompted this idea.

@bors
Contributor

bors commented Feb 7, 2025

☀️ Try build successful - checks-actions
Build commit: 71c0147 (71c014723ade15a2bc2280eede6c97f96aa383fe)


@oxalica
Contributor Author

oxalica commented Feb 7, 2025

@the8472

Could you please explain how to test "short inputs and with mixed workloads"? Should I bench the performance of a CBOR library like ciborium using the new std on this branch? I'll give it a check later.

That's the tricky part. As hanna-kruppe says, such effects are most visible in complex codebases and are difficult to test with micro-benchmarks.

I suggested CBOR because it should contain short strings (e.g. dictionary keys) mixed with other non-string parsing and some implementations might already have their own benchmarks that can be reused. So it could be a benchmark of intermediate complexity.

I wrote some CBOR deserialization benchmarks in my out-of-tree repo. Running them against this PR's and master's rustc sysroots, the results show this PR gives a +0.3%~1.4% performance improvement. On the small-string-heavy twitter.json.dagcbor the difference is negligible. I suspect these numbers are mostly just noise and there is no significant change in performance.

@joboet
Member

joboet commented Feb 7, 2025

I think it's worth mentioning that in #107760 @thomcc managed to find a way to pack the 9 states of the DFA into 32-bits, which would help address the cache size, binary size and portability concerns. Perhaps you could ask him if he'd provide you with the SMT-solver code he used to find the state values?

@rust-timer
Collaborator

Finished benchmarking commit (71c0147): comparison URL.

Overall result: ❌ regressions - please read the text below

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)    | 3.3% | [3.3%, 3.3%] | 1     |
| Regressions ❌ (secondary)  | -    | -            | 0     |
| Improvements ✅ (primary)   | -    | -            | 0     |
| Improvements ✅ (secondary) | -    | -            | 0     |
| All ❌✅ (primary)          | 3.3% | [3.3%, 3.3%] | 1     |

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results (primary 3.3%, secondary 2.5%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)    | 3.3% | [3.3%, 3.3%] | 1     |
| Regressions ❌ (secondary)  | 2.5% | [1.4%, 2.9%] | 9     |
| Improvements ✅ (primary)   | -    | -            | 0     |
| Improvements ✅ (secondary) | -    | -            | 0     |
| All ❌✅ (primary)          | 3.3% | [3.3%, 3.3%] | 1     |

Binary size

Results (primary 0.2%, secondary 0.5%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)    | 0.2% | [0.1%, 0.5%] | 14    |
| Regressions ❌ (secondary)  | 0.5% | [0.3%, 0.6%] | 38    |
| Improvements ✅ (primary)   | -    | -            | 0     |
| Improvements ✅ (secondary) | -    | -            | 0     |
| All ❌✅ (primary)          | 0.2% | [0.1%, 0.5%] | 14    |

Bootstrap: 779.143s -> 778.734s (-0.05%)
Artifact size: 329.04 MiB -> 329.08 MiB (0.01%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 8, 2025
@oxalica
Contributor Author

oxalica commented Feb 9, 2025

@joboet

I think it's worth mentioning that in #107760 @thomcc managed to find a way to pack the 9 states of the DFA into 32-bits, which would help address the cache size, binary size and portability concerns. Perhaps you could ask him if he'd provide you with the SMT-solver code he used to find the state values?

Thanks for pointing this out; I updated the patch to use a u32 table now. I checked the script mentioned in the PostgreSQL comments and rewrote it for our use case:

  1. Use the state assignment mentioned in this PR previously, which is a sequential assignment based on the language definition, not using their newly-invented state names.
  2. Optimize the solution numbers to get the minimal solution, so it's deterministic and reproducible. It should also be stable across the SAT solver used and/or its version.

I put the generating Python script in-tree and appended the output solution to it for reference. I'm not sure if there is a better place to put it, but I found there is library/core/src/unicode/printable.py. So I guess putting it next to the relevant Rust code is fine?

Performance-wise, u32 transitions make the algorithm on i686 run almost as fast as on x86_64, but they do not make much difference on x86_64 itself. They should reduce cache pressure, though that may not be easy to benchmark.

@the8472
I'm not sure how to interpret the numbers in the perf report. Is it considered a severe regression? I think the slower compilation time is mainly caused by slower ASCII checking, as rustc itself mostly processes ASCII and most Rust source code is ASCII. If we want to optimize for this, I can change the chunking parameters to MAIN_CHUNK_SIZE=8, ASCII_CHUNK_SIZE=32.

All benchmark results are updated in my repo: x86_64, i686, CBOR deserialization compiled by rustc from this PR.

@the8472
Member

the8472 commented Feb 9, 2025

I'm not sure how to interpret the numbers in the perf report. Is it considered a severe regression?

Syn has been noisy, so that one can be ignored which means instructions look neutral.

But cycles look like they regressed across a few places. tuple-stress is spending more cycles across several variations of the benchmark... and yet instruction counts are unchanged. Weird. But it's a stress test so as long as it's not a huge impact it's probably not relevant.

Runtime benchmarks also aren't showing anything that would obviously point at validation. Brotli should be munging bytes, not strings.

If you want to doublecheck you could try running those benchmarks under perf diff locally and see if anything string-related shows up, but I doubt that it will.

Anyway, since the impl changed, let's see rerun perf.

@bors try @rust-timer queue

All benchmark results are updated in my repo: x86_64, i686, CBOR deserialization compiled by rustc from this PR.

Have you run them several times to check if there's variance between runs? ASLR, cpu clock boosting, differences between CPU cores etc. can lead to stable within-run but varying cross-run results.

Assuming they are stable, it looks like

  • throughput is generally up in microbenchmarks
  • but throughput is a bit worse for cbor-de-citm_catalog.json
  • on 32bit latency is worse than std for short strings
  • on 64bit latency looks mixed...
  • cbor-de-trivial_helloworld (another short string case) ends up worse too

That there are some latency improvements at all on x86-64 is tantalizing. But they appear to evaporate once it's embedded in additional code (for the single context, CBOR, that's being tested here). So maybe the DFA algorithm should only be used for strings above a certain length?


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 9, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 9, 2025
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

@bors
Contributor

bors commented Feb 9, 2025

⌛ Trying commit 486b0d1 with merge 0de538c...

@kornelski
Contributor

@bors
Contributor

bors commented Feb 9, 2025

☀️ Try build successful - checks-actions
Build commit: 0de538c (0de538c8d4bb71779021767a5bd1e3a72da0b09b)


@oxalica
Contributor Author

oxalica commented Feb 9, 2025

@the8472

Syn has been noisy, so that one can be ignored which means instructions look neutral.

But cycles look like they regressed across a few places. tuple-stress is spending more cycles across several variations of the benchmark... and yet instruction counts are unchanged. Weird. But it's a stress test so as long as it's not a huge impact it's probably not relevant.

Runtime benchmarks also aren't showing anything that would obviously point at validation. Brotli should be munging bytes, not strings.

If you want to doublecheck you could try running those benchmarks under perf diff locally and see if anything string-related shows up, but I doubt that it will.

Anyway, since the impl changed, let's see rerun perf.

Thanks for the explanation. I'll play around with it locally.

Have you run them several times to check if there's variance between runs? ASLR, cpu clock boosting, differences between CPU cores etc. can lead to stable within-run but varying cross-run results.

All my benchmarks are produced under my cbench tool, which does control most of the points you mentioned: disable ASLR, disable CPU boosting, lock cpufreq, cpuset on a single core and forbid other programs from running on it, set IRQ affinity to avoid that core, and disable its hyper-threading sibling core. The result is pretty stable within a single run (as seen from the small deviation), but it can still fluctuate a little (mostly <1%) across different runs, which may be due to memory layout changes as mentioned in this criterion issue.

  • throughput is generally up in microbenchmarks

    • but throughput is a bit worse for cbor-de-citm_catalog.json

    • on 32bit latency is worse than std for short strings

    • on 64bit latency looks mixed...

    • cbor-de-trivial_helloworld (another short string case) ends up worse too

I'm not expecting a significant performance increase in these mixed workloads, where CBOR parsing (branchy) should take most of the time. But I'll look into these, especially the helloworld one.

@kornelski

Benchmark on Apple M3 Max

Thanks for the data. The results seem quite similar:

  1. The ASCII case shows a smaller difference between chunking parameters, but the relative trend is quite similar.
  2. Aligned and unaligned performance is also similar enough to be ignored, as on x86_64.
  3. The latency results are a lot better than on x86_64, so the absolute difference is smaller (a 1-2ns diff). But this patch generally behaves worse than std now, probably because the branch is better predicted in std?

I came up with an idea to move the partial-chunk processing from the tail to the head, hoping it helps in the small-string cases. I'll test it locally first.

@rust-timer
Collaborator

Finished benchmarking commit (0de538c): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results (primary 0.8%, secondary -1.9%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range            | count |
|-----------------------------|-------|------------------|-------|
| Regressions ❌ (primary)    | 0.8%  | [0.8%, 0.8%]     | 1     |
| Regressions ❌ (secondary)  | -     | -                | 0     |
| Improvements ✅ (primary)   | -     | -                | 0     |
| Improvements ✅ (secondary) | -1.9% | [-1.9%, -1.9%]   | 1     |
| All ❌✅ (primary)          | 0.8%  | [0.8%, 0.8%]     | 1     |

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary 0.2%, secondary 0.6%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)    | 0.2% | [0.1%, 0.5%] | 12    |
| Regressions ❌ (secondary)  | 0.6% | [0.3%, 0.7%] | 38    |
| Improvements ✅ (primary)   | -    | -            | 0     |
| Improvements ✅ (secondary) | -    | -            | 0     |
| All ❌✅ (primary)          | 0.2% | [0.1%, 0.5%] | 12    |

Bootstrap: 781.536s -> 778.817s (-0.35%)
Artifact size: 329.13 MiB -> 329.11 MiB (-0.01%)

@rustbot rustbot removed S-waiting-on-perf Status: Waiting on a perf run to be completed. perf-regression Performance regression. labels Feb 9, 2025
@the8472
Member

the8472 commented Feb 9, 2025

This is looking quite neutral now. Combined with the throughput improvements in microbenchmarks that should be good enough for inclusion as far as perf aspects go.

@oxalica
Contributor Author

oxalica commented Feb 13, 2025

I updated lossy parsing/conversion (core::str::Utf8Chunks::next) to use the new validation algorithm. It took some extra work to perform relatively well on error latency without suffering a 200% regression due to LLVM's bad register allocation (bug?) and mildly suboptimal codegen (#136972).

On d2030aa, it shows a +11%~49% speedup on the valid to almost-valid paths, and a -13% regression on the worst path (all bytes invalid). The numbers seem acceptable to me.

Member

@joboet joboet left a comment

Thank you for pushing this further! I have some style nits...

Also, I think (correct me if I'm wrong) you could get rid of most of the unsafe stuff by rewriting the main loop around an iterator over as_rchunks() and computing the position for resolve_error_location using the length of the remaining slice like so:

let (remainder, chunks) = bytes.as_rchunks();
// ... check the remainder ...

let mut chunks = chunks.iter();
while let Some(mut chunk) = chunks.next() {
    if st == ST_ACCEPT && chunk[0].is_ascii() {
        // Skip over fully-ASCII chunks.
        while chunk.iter().all(|b| b.is_ascii()) {
            chunk = match chunks.next() {
                Some(next) => next,
                None => break,
            };
        }
    } else {
        // ... check chunk ...

        if error {
            let i = bytes.len() - chunks.as_slice().as_flattened().len() - CHUNK_SIZE;
            // handle error
        }
    }
}

This is completely equivalent to the current version, doesn't introduce bounds checks, and the only remaining unsafe would be in run_with_error_handling.

Comment on lines 306 to 319
let pos = ascii_chunks
    .position(|chunk| {
        // NB. Always traverse the whole chunk to enable vectorization, instead of `.any()`.
        // LLVM is afraid of memory traps and falls back if the loop short-circuits.
        #[expect(clippy::unnecessary_fold)]
        let has_non_ascii = chunk.iter().fold(false, |acc, &b| acc || (b >= 0x80));
        has_non_ascii
    })
    .unwrap_or(ascii_rest_chunk_cnt);
i += pos * ASCII_CHUNK_SIZE;
if i + MAIN_CHUNK_SIZE > bytes.len() {
    break;
Member

You can immediately break from the loop if position returns None as that means that all chunks have been traversed.

@oxalica
Contributor Author

oxalica commented Feb 13, 2025

Also, I think (correct me if I'm wrong) you could get rid of most of the unsafe stuff by rewriting the main loop around an iterator over as_rchunks() and computing the position for resolve_error_location using the length of the remaining slice like so:

It only works if MAIN_CHUNK_SIZE == ASCII_CHUNK_SIZE, I think? Currently they are both 16, but I think m8 a16 is another candidate with different pros & cons, and I haven't really decided yet.

@oxalica oxalica changed the title Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings Rewrite UTF-8 validation in shift-based DFA for 70%~135% performance increase on non-ASCII strings Feb 17, 2025
@oxalica oxalica requested a review from joboet February 28, 2025 18:19
@bors
Contributor

bors commented Mar 7, 2025

☔ The latest upstream changes (presumably #138155) made this pull request unmergeable. Please resolve the merge conflicts.

oxalica added 8 commits March 7, 2025 22:50
This gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.

The shift-based DFA algorithm does not use SIMD instructions and does not rely on the branch predictor to achieve good performance, so it works well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

We use z3 to find a state mapping that only needs a u32 transition table. This shrinks the table size to 1KiB, compared to u64 states, for less cache pressure, and produces faster code on platforms that only support 32-bit shifts. It does not affect throughput on 64-bit platforms, though, when the table is already fully in cache.

1. To reduce the cache footprint.
2. To avoid additional cost when accessing across pages.

Hopefully this gives better latency on short strings and/or the immediate-fail path.

When using `error_len: Option<u8>`, `Result<(), Utf8Error>` would be returned on the stack and produce suboptimal stack-shuffling operations, causing a 50%-200% latency increase on the error path.
@oxalica oxalica force-pushed the feat/shift-dfa-utf8 branch from a105390 to bc57db5 on March 8, 2025 03:56
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.