
Rewrite UTF-8 validation in shift-based DFA for 70%~135% performance increase on non-ASCII strings #136693

Open
wants to merge 8 commits into base: master from feat/shift-dfa-utf8

Conversation

oxalica
Contributor

@oxalica oxalica commented Feb 7, 2025

Take 2 of #107760 (cc @thomcc)

Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in #107760,

For prior art: shift-DFAs are now used for UTF-8 validation in PostgreSQL, and seems to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level1.

Rationales

  1. Performance: This algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.

  2. Generality: It does not use SIMD instructions and does not rely on the branch predictor to achieve good performance, making it a good general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

Implementation details

I use the ordinary UTF-8 language definition from RFC 3629 and directly translate it into a 9-state DFA. The compressed state is thus 64-bit, resulting in a table of [u64; 256], or 2KiB of rodata.
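
To make the mechanics concrete, here is a minimal, self-contained sketch of the shift-based DFA technique on a toy 3-state automaton (rejecting "--"), not this PR's 9-state UTF-8 table; the 6-bit packing and the names below are illustrative assumptions for the example only:

    // Each state is encoded as a shift amount; every table row packs the
    // next state for all current states, 6 bits per state, indexed by byte.
    const S_ACCEPT: u64 = 0; // last byte was not '-'
    const S_DASH: u64 = 6;   // last byte was '-'
    const S_ERROR: u64 = 12; // saw "--"

    fn build_table() -> [u64; 256] {
        let mut table = [0u64; 256];
        for b in 0..=255u8 {
            let (from_accept, from_dash) = if b == b'-' {
                (S_DASH, S_ERROR) // a '-' right after a '-' is an error
            } else {
                (S_ACCEPT, S_ACCEPT)
            };
            table[b as usize] =
                (from_accept << S_ACCEPT) | (from_dash << S_DASH) | (S_ERROR << S_ERROR);
        }
        table
    }

    fn no_double_dash(input: &[u8], table: &[u64; 256]) -> bool {
        let mut state = S_ACCEPT;
        for &b in input {
            // One load, one shift, one mask per byte; no data-dependent branches.
            state = (table[b as usize] >> state) & 63;
        }
        state != S_ERROR
    }

    fn main() {
        let table = build_table();
        assert!(no_double_dash(b"a-b-c", &table));
        assert!(!no_double_dash(b"a--b", &table));
    }

The actual table in this PR encodes the 9 states of the UTF-8 DFA in the same spirit, plus the extra tricks described below.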

The main algorithm consists of the following parts:

  1. Main loop: take a chunk of MAIN_CHUNK_SIZE = 16 bytes on each iteration, execute the DFA on the chunk, and check whether the state is ERROR once per chunk.
  2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII bytes and stop at the first chunk containing any non-ASCII byte. I choose ASCII_CHUNK_SIZE = 16 to align with the current implementation: checking 16 bytes at a time for non-ASCII, to encourage LLVM to auto-vectorize it (a sketch follows this list).
  3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep things simple, if any error is encountered in the main loop, it discards the erroneous chunk and breaks into this path to find the precise error location. That is, the erroneous chunk, if it exists, is traversed twice, in exchange for a tighter and more efficient hot loop.
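
A rough sketch of the ASCII-bypass step from item 2, under assumed names (skip_ascii_chunks is not the PR's actual function; the real loop is quoted in a review comment further down):

    const ASCII_CHUNK_SIZE: usize = 16;

    /// Starting at a char boundary (state == ACCEPT), return how many bytes of
    /// leading all-ASCII chunks can be skipped before the DFA must take over.
    fn skip_ascii_chunks(bytes: &[u8]) -> usize {
        let mut skipped = 0;
        while let Some(chunk) = bytes.get(skipped..skipped + ASCII_CHUNK_SIZE) {
            // Always scan the whole chunk (no short-circuit) so LLVM can vectorize it.
            let has_non_ascii = chunk.iter().fold(false, |acc, &b| acc | (b >= 0x80));
            if has_non_ascii {
                break;
            }
            skipped += ASCII_CHUNK_SIZE;
        }
        skipped
    }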

There are also some small tricks being used:

  1. Since i686-linux is a Tier 1 target and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. This shows a 200%+ speedup compared to the 64-bit-shift version.
  2. We still need to get the UTF-8 encoded length from the first byte in utf8_char_width. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables (a sketch of this layout follows the list). This does introduce an extra 32-bit shift; I believe it's almost free but have not benchmarked it yet.
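
A hedged sketch of what trick 2 looks like; WIDTH_SHIFT and the 3-bit field are illustrative assumptions, not the PR's actual bit layout:

    // Assumed layout: the low bits of each row hold the packed DFA transitions,
    // and a few otherwise-unused high bits store the encoded length (0..=4) of a
    // sequence starting with that byte, so one [u64; 256] table serves both uses.
    const WIDTH_SHIFT: u32 = 61; // illustrative position of the 3-bit width field

    fn utf8_char_width(first_byte: u8, table: &[u64; 256]) -> usize {
        ((table[first_byte as usize] >> WIDTH_SHIFT) & 0b111) as usize
    }

    fn main() {
        // Dummy table where only the width field is populated, for demonstration.
        let mut table = [0u64; 256];
        table[0xE4] = 3 << WIDTH_SHIFT; // 0xE4 starts a 3-byte sequence
        assert_eq!(utf8_char_width(0xE4, &table), 3);
    }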

Benchmarks

I made an out-of-tree implementation repository for easier testing and benching. It also tests various MAIN_CHUNK_SIZE (m) and ASCII_CHUNK_SIZE (a) configurations. Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article "William Shakespeare" in the en, es and zh languages.

In short: with m=16, a=16, shift-DFA gives -45% on en, +69% on es, +135% on zh; with m=8, a=32, it gives +5% on en, +22% on es, +136% on zh. This is expected: the larger the ASCII bypass chunk, the better it performs on ASCII but the worse on mixed content like "es", because the taken branch keeps flipping back and forth.
To me, the difference on "en" is minimal in absolute time because the throughput is already high enough, compared to the not-as-fast "es". So I'm currently picking m=16, a=16 in the PR to lean towards "es".

x86_64-linux results

On a Ryzen 7 5700G @3.775GHz (turbo disabled, with cpuset, with a stack layout randomizer):

Note: the zh input consists solely of non-ASCII codepoints and runs fully on the DFA path; its performance is the same as for purely-emoji inputs.

| Algorithm         | Input language | Throughput / (GiB/s) |
|-------------------|----------------|----------------------|
| std               | en             | 48.791 +-0.128       |
| shift-dfa-m16-a16 | en             | 26.815 +-0.006       |
| shift-dfa-m8-a32  | en             | 51.314 +-0.018       |
| std               | es             | 5.925 +-0.069        |
| shift-dfa-m16-a16 | es             | 10.033 +-0.007       |
| shift-dfa-m8-a32  | es             | 7.254 +-0.035        |
| std               | zh             | 1.450 +-0.008        |
| shift-dfa-m16-a16 | zh             | 3.414 +-0.002        |
| shift-dfa-m8-a32  | zh             | 3.421 +-0.002        |

Before (486b0d1):

    test string::from_utf8_lossy_100_ascii                   ... bench:          40.96 ns/iter (+/- 6.47)
    test string::from_utf8_lossy_100_invalid                 ... bench:       1,738.28 ns/iter (+/- 10.53)
    test string::from_utf8_lossy_100_multibyte               ... bench:          61.22 ns/iter (+/- 0.38)
    test string::from_utf8_lossy_invalid                     ... bench:         125.26 ns/iter (+/- 1.85)

    test str::str_validate_emoji                             ... bench:       3,279.32 ns/iter (+/- 54.01)

After with m16 a16 (ceb82dd971b7aef47493298255bab732bdc67b5e):

    test string::from_utf8_lossy_100_ascii                   ... bench:          23.92 ns/iter (+/- 0.10)
    test string::from_utf8_lossy_100_invalid                 ... bench:       1,962.22 ns/iter (+/- 42.19)
    test string::from_utf8_lossy_100_multibyte               ... bench:          41.19 ns/iter (+/- 0.75)
    test string::from_utf8_lossy_invalid                     ... bench:         155.69 ns/iter (+/- 3.46)

    test str::str_validate_emoji                             ... bench:       1,909.92 ns/iter (+/- 94.34)

After with m8 a32 (ceb82dd971b7aef47493298255bab732bdc67b5e):

    test string::from_utf8_lossy_100_ascii                   ... bench:          20.14 ns/iter (+/- 0.03)
    test string::from_utf8_lossy_100_invalid                 ... bench:       2,017.46 ns/iter (+/- 9.39)
    test string::from_utf8_lossy_100_multibyte               ... bench:          54.19 ns/iter (+/- 0.07)
    test string::from_utf8_lossy_invalid                     ... bench:         152.43 ns/iter (+/- 2.46)

    test str::str_validate_emoji                             ... bench:       2,715.99 ns/iter (+/- 164.68)

Unresolved

  • Benchmark on aarch64-darwin, another tier 1 target.
    See this comment.

  • Decide the chunk size parameters. I'm currently picking m=16, a=16.

  • Should we also replace the implementation of lossy conversion by calling the new validation function? It has very similar code doing almost the same thing.

    It now also uses the new validation algorithm. Benchmarks of lossy conversions are included above.

@rustbot
Collaborator

rustbot commented Feb 7, 2025

r? @cuviper

rustbot has assigned @cuviper.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 7, 2025

Contributor

@hanna-kruppe hanna-kruppe left a comment

Shift-based DFAs are a really cool trick, great to see that precise error reporting and ASCII fast paths can be added on top of them. I don't want to steal an actual reviewer's show and I've only given this a quick cursory read, but there's one thing I was wondering about and one small suggestion.

@the8472
Member

the8472 commented Feb 7, 2025

or 2KiB rodata.

I suspect that this is going to eat into cache capacity, especially in code that interleaves string validation with other things, e.g. CBOR decoding. Have you benchmarked this for short inputs and with mixed workloads?

In coretests we have TINY/SMALL/MEDIUM/LARGE/HUGE for that reason

Additionally you may want to benchmark 4-byte characters such as emoji or maybe some exotic scripts on the supplementary plane.

@thaliaarchi
Contributor

Should we also replace the implementation of lossy conversion by calling the new validation function?
It has a very similar code doing almost the same thing.

FYI, in #136677, I fixed the handling of invalid UTF-8 in Display for OsStr/Path, so there's a new user of lossy conversion. It needs to count the number of (potentially invalid) characters in a byte string and truncate it to some character width. I extracted a single iteration of the loop in Utf8Chunks::next to reuse it there. If Utf8Chunks is more amenable to using your optimization, OsStr might need a separate API.

@hanna-kruppe
Contributor

I suspect that this is going to eat into cache capacity, especially in code that interleaves string validation with other things, e.g. CBOR decoding.

This is a great point and should be benchmarked. Note that it's mitigated in many common cases by the table accesses depending only on the input bytes, so e.g. strings that are all ASCII touch at most half the table. But that's still a lot of cache lines.

@cuviper
Member

cuviper commented Feb 7, 2025

I don't want to steal an actual reviewer's show

No worries on my part as the current assignee -- in fact, if another reviewer has more context here, I'd be happy to yield this one. Maybe @the8472 wants to take it?

@oxalica
Contributor Author

oxalica commented Feb 7, 2025

@the8472

or 2KiB rodata.

I suspect that this is going to eat into cache capacity, especially in code that interleaves string validation with other things, e.g. CBOR decoding. Have you benchmarked this for short inputs and with mixed workloads?

Could you please explain how to test "short inputs and with mixed workloads"? Should I bench the performance of a CBOR library like ciborium using the new std on this branch? I'll give it a check later.

In coretests we have TINY/SMALL/MEDIUM/LARGE/HUGE for that reason

It seems they are only used in a few places like str::char_count and str::debug. str::char() is untouched by this PR, because it uses the UTF-8 decoder next_code_point, not the validator. str as Debug is unrelated to validation either. This PR only changes the implementation of core::str::from_utf8 and its derivations.

Additionally you may want to benchmark 4-byte characters such as emoji or maybe some exotic scripts on the supplementary plane.

It should perform the same as the "zh" benchmark. Since the DFA is agnostic to the input, as long as the input is purely non-ASCII so that it takes the DFA path, the performance should be the same. The correctness of the DFA can be audited by checking that the state transition table matches RFC 3629.
Nevertheless, here's the benchmark (./x.py bench --stage 0 library/coretests -- str::str_validate_emoji):

Before:
str::str_validate_emoji                                         2742.95ns/iter   +/- 46.44
After:
str::str_validate_emoji                                         1529.48ns/iter    +/- 7.52

@hanna-kruppe
Contributor

No worries on my part as the current assignee -- in fact, if another reviewer has more context here, I'd be happy to yield this one.

I just meant that I, personally, couldn't take over the review even if I wanted to since I haven't had bors permissions in years 😄

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 7, 2025

Slightly silly idea, free to a good home: mapping the input bytes into equivalence classes before feeding them to the DFA, as done in Flexible and Economical UTF-8 Decoder, would reduce the rodata size to 256B + 8B * (# classes). That's 256B + 96B with the classes from the linked implementation, but idk if that number can be lower or has to be larger in this context.

The critical path for the DFA transition is not affected by the input remapping, so this would still run pretty fast. On Skylake-ish hardware, idealized peak throughput goes from 1 cycle per byte to 1.5cpb because the extra load becomes the bottleneck (that's for a loop that does nothing but compute the final state, the version in this PR doesn't seem to hit 1cpb anyway). On CPUs that can sustain more than two loads per cycle, it may not even be any slower for long inputs.
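
For concreteness, a toy sketch of the remapping idea on the same 3-state "no --" automaton as the earlier sketch (not UTF-8; two classes suffice here, whereas UTF-8 needs more):

    // Remap bytes to equivalence classes before the DFA: the per-byte table
    // shrinks to [u8; 256] class ids plus one packed u64 row per class,
    // at the cost of one extra (state-independent) load.
    const S_ACCEPT: u64 = 0; // last byte was not '-'
    const S_DASH: u64 = 6;   // last byte was '-'
    const S_ERROR: u64 = 12; // saw "--"

    fn no_double_dash(input: &[u8]) -> bool {
        // Class 0: any byte other than '-'; class 1: '-'.
        let mut classes = [0u8; 256];
        classes[b'-' as usize] = 1;
        // One packed transition row per class instead of per byte.
        let rows: [u64; 2] = [
            (S_ACCEPT << S_ACCEPT) | (S_ACCEPT << S_DASH) | (S_ERROR << S_ERROR),
            (S_DASH << S_ACCEPT) | (S_ERROR << S_DASH) | (S_ERROR << S_ERROR),
        ];
        let mut state = S_ACCEPT;
        for &b in input {
            // byte -> class -> row does not depend on the previous state,
            // so only the final shift sits on the state-to-state critical path.
            state = (rows[classes[b as usize] as usize] >> state) & 63;
        }
        state != S_ERROR
    }

    fn main() {
        assert!(no_double_dash(b"a-b-c"));
        assert!(!no_double_dash(b"a--b"));
    }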

@the8472
Member

the8472 commented Feb 7, 2025

It seems they are only used in a few places like str::char_count and str::debug. str::char() is untouched by this PR

Sorry, I didn't mean that this was specifically for utf8 validation, I was just gesturing at more cases that can be useful for string benchmarking in general to cover common scenarios.
Maybe you can look at earlier PRs that touched validation, sometimes other people have implemented their own external benchmark suites if I recall correctly.

Could you please explain how to test "short inputs and with mixed workloads"? Should I bench the performance of a CBOR library like ciborium using the new std on this branch? I'll give it a check later.

That's the tricky part. As hanna-kruppe says, such effects are most visible in complex codebases and are difficult to test with micro-benchmarks.

I suggested CBOR because it should contain short strings (e.g. dictionary keys) mixed with other non-string parsing and some implementations might already have their own benchmarks that can be reused. So it could be a benchmark of intermediate complexity.
You can try using perf stat to see if an existing benchmark is already experiencing cache pressure which might be exacerbated by the lookup table. Though I guess modern CPU caches may have some strategy that can deal with streaming workloads without clobbering the parts, so maybe it won't be complex enough.

I'll kick off a perf run to see if it impacts rustc itself. But we have already eliminated a bunch of string validations in the compiler, so it likely isn't a good benchmark either.

@bors try @rust-timer queue

Nevertheless, here's the benchmark (./x.py bench --stage 0 library/coretests -- str::str_validate_emoji):

👍


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 7, 2025
@the8472 the8472 assigned the8472 and unassigned cuviper Feb 7, 2025
@bors
Contributor

bors commented Feb 7, 2025

⌛ Trying commit 33c076e with merge 71c0147...

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 7, 2025
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

@oxalica
Contributor Author

oxalica commented Feb 7, 2025

@hanna-kruppe

Slightly silly idea, free to a good home: mapping the input bytes into equivalence classes before feeding them to the DFA, as done in Flexible and Economical UTF-8 Decoder, would reduce the rodata size to 256B + 8B * (# classes). That's 256B + 96B with the classes from the linked implementation, but idk if that number can be lower or has to be larger in this context.
The critical path for the DFA transition is not affected by the input remapping

Not really. As I mentioned in the 32-bit-shift part, the bottleneck of the DFA path is latency, since the DFA has a data dependency from each state to its preceding state. The next state can only be calculated after the previous result is out, no matter how many ALUs you have. So,

goes from 1 cycle per byte to 1.5cpb

... literally means +50% latency and -50% performance. I tested it by introducing an identity lookup table [u8; 256] before each TRANS_TABLE lookup; throughput-zh-aligned/shift-dfa-m16-a16 then goes down from 3.401GiB/s to 2.424GiB/s, which is -29%. So I don't think it's worth it.

But yes, the table must be fully in cache to achieve these numbers; mixed workloads are a challenge. I would also prefer a smaller table size, but I haven't yet found a way to shrink it without introducing a massive performance drop.

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 7, 2025

@oxalica Remapping inputs before looking up the DFA transition for them doesn't add latency in any way that matters -- if it did, the extra latency of a load on the critical path would immediately take it to 4-5cpb or more. It's slower because it hits other bottlenecks, such as issue width or loads per cycle (you might want to check if your change unexpectedly added more instructions than just the loads). In any case, yes, its latency is significantly worse than the big LUT. But by your numbers, it's still 1.65x faster than the current implementation in std on the zh input. So it's another point on the Pareto curve if rodata size is a concern -- often it's not, but the discussion about it prompted this idea.

@bors
Contributor

bors commented Feb 7, 2025

☀️ Try build successful - checks-actions
Build commit: 71c0147 (71c014723ade15a2bc2280eede6c97f96aa383fe)


@oxalica
Contributor Author

oxalica commented Feb 7, 2025

@the8472

Could you please explain how to test "short inputs and with mixed workloads"? Should I bench the performance of a CBOR library like ciborium using the new std on this branch? I'll give it a check later.

That's the tricky part. As hanna-kruppe says, such effects are most visible in complex codebases and are difficult to test with micro-benchmarks.

I suggested CBOR because it should contain short strings (e.g. dictionary keys) mixed with other non-string parsing and some implementations might already have their own benchmarks that can be reused. So it could be a benchmark of intermediate complexity.

I wrote some CBOR deserialization benchmarks in my out-of-tree repo. Running them against this PR's and master's rustc sysroots, the results show this PR gives a +0.3%~1.4% performance improvement. On the small-string-heavy twitter.json.dagcbor the difference is negligible. I suspect these numbers are mostly just noise and there is no significant change in performance.

@joboet
Member

joboet commented Feb 7, 2025

I think it's worth mentioning that in #107760 @thomcc managed to find a way to pack the 9 states of the DFA into 32-bits, which would help address the cache size, binary size and portability concerns. Perhaps you could ask him if he'd provide you with the SMT-solver code he used to find the state values?

@rust-timer
Collaborator

Finished benchmarking commit (71c0147): comparison URL.

Overall result: ❌ regressions - please read the text below

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)    | 3.3% | [3.3%, 3.3%] | 1     |
| Regressions ❌ (secondary)  | -    | -            | 0     |
| Improvements ✅ (primary)   | -    | -            | 0     |
| Improvements ✅ (secondary) | -    | -            | 0     |
| All ❌✅ (primary)          | 3.3% | [3.3%, 3.3%] | 1     |

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

Results (primary 3.3%, secondary 2.5%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)    | 3.3% | [3.3%, 3.3%] | 1     |
| Regressions ❌ (secondary)  | 2.5% | [1.4%, 2.9%] | 9     |
| Improvements ✅ (primary)   | -    | -            | 0     |
| Improvements ✅ (secondary) | -    | -            | 0     |
| All ❌✅ (primary)          | 3.3% | [3.3%, 3.3%] | 1     |

Binary size

Results (primary 0.2%, secondary 0.5%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)    | 0.2% | [0.1%, 0.5%] | 14    |
| Regressions ❌ (secondary)  | 0.5% | [0.3%, 0.6%] | 38    |
| Improvements ✅ (primary)   | -    | -            | 0     |
| Improvements ✅ (secondary) | -    | -            | 0     |
| All ❌✅ (primary)          | 0.2% | [0.1%, 0.5%] | 14    |

Bootstrap: 779.143s -> 778.734s (-0.05%)
Artifact size: 329.04 MiB -> 329.08 MiB (0.01%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 8, 2025
@oxalica
Contributor Author

oxalica commented Feb 9, 2025

@joboet

I think it's worth mentioning that in #107760 @thomcc managed to find a way to pack the 9 states of the DFA into 32-bits, which would help address the cache size, binary size and portability concerns. Perhaps you could ask him if he'd provide you with the SMT-solver code he used to find the state values?

Thanks for pointing this out; I updated the patch to use a u32 table now. I checked the script mentioned in the PostgreSQL comments and rewrote it for our use case:

  1. Use the state assignment mentioned in this PR previously, which is a sequential assignment based on the language definition, not using their newly-invented state names.
  2. Optimize the solution numbers to get the minimal solution, so it's deterministic and reproducible. It should also be stable across the SAT solver used and/or its version.

I put the generating Python script in-tree and appended the output solution to it for reference. I'm not sure if there is a better place to put it, but I found there is library/core/src/unicode/printable.py. So I guess putting it next to the relevant Rust code is fine?

Performance-wise, u32 transitions make the algorithm on i686 run almost as fast as on x86_64, but they do not make much difference on x86_64 itself. They should reduce cache pressure, though that may not be easy to benchmark.

@the8472
I'm not sure how to interpret the numbers in the perf report. Is it considered a severe regression? I think the slower compilation time is mainly caused by slower ASCII checking, as rustc itself mostly processes ASCII and most Rust source code is ASCII. If we want to optimize for this, I can change the chunking parameters to MAIN_CHUNK_SIZE=8, ASCII_CHUNK_SIZE=32.

All benchmark results are updated in my repo: x86_64, i686, CBOR deserialization compiled by rustc from this PR.

@the8472
Member

the8472 commented Feb 9, 2025

I'm not sure how to interpret the numbers in the perf report. Is it considered a severe regression?

Syn has been noisy, so that one can be ignored which means instructions look neutral.

But cycles look like they regressed across a few places. tuple-stress is spending more cycles across several variations of the benchmark... and yet instruction counts are unchanged. Weird. But it's a stress test so as long as it's not a huge impact it's probably not relevant.

Runtime benchmarks also aren't showing anything that would obviously point at validation. Brotli should be munging bytes, not strings.

If you want to doublecheck you could try running those benchmarks under perf diff locally and see if anything string-related shows up, but I doubt that it will.

Anyway, since the impl changed, let's see rerun perf.

@bors try @rust-timer queue

All benchmark results are updated in my repo: x86_64, i686, CBOR deserialization compiled by rustc from this PR.

Have you run them several times to check if there's variance between runs? ASLR, cpu clock boosting, differences between CPU cores etc. can lead to stable within-run but varying cross-run results.

Assuming they are stable, it looks like

  • throughput is generally up in microbenchmarks
  • but throughput is a bit worse for cbor-de-citm_catalog.json
  • on 32bit latency is worse than std for short strings
  • on 64bit latency looks mixed...
  • cbor-de-trivial_helloworld (another short string case) ends up worse too

That there are some latency improvements at all on x86-64 is tantalizing. But they appear to evaporate once it's embedded in additional code (for the single context, CBOR, that's being tested here). So maybe the DFA algorithm should only be used for strings above a certain length?


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 9, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 9, 2025
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

@bors
Contributor

bors commented Feb 9, 2025

⌛ Trying commit 486b0d1 with merge 0de538c...

@kornelski
Contributor

@bors
Contributor

bors commented Feb 9, 2025

☀️ Try build successful - checks-actions
Build commit: 0de538c (0de538c8d4bb71779021767a5bd1e3a72da0b09b)


@oxalica
Contributor Author

oxalica commented Feb 9, 2025

@the8472

Syn has been noisy, so that one can be ignored which means instructions look neutral.

But cycles look like they regressed across a few places. tuple-stress is spending more cycles across several variations of the benchmark... and yet instruction counts are unchanged. Weird. But it's a stress test so as long as it's not a huge impact it's probably not relevant.

Runtime benchmarks also aren't showing anything that would obviously point at validation. Brotli should be munging bytes, not strings.

If you want to doublecheck you could try running those benchmarks under perf diff locally and see if anything string-related shows up, but I doubt that it will.

Anyway, since the impl changed, let's see rerun perf.

Thanks for the explanation. I'll play around with it locally.

Have you run them several times to check if there's variance between runs? ASLR, cpu clock boosting, differences between CPU cores etc. can lead to stable within-run but varying cross-run results.

All my benchmarks are produced under my cbench tool, which does control most of the points you mentioned: disable ASLR, disable CPU boosting, lock cpufreq, cpuset on a single core and forbid other programs from running on it, set IRQ affinity to avoid that core, and disable its hyper-threading sibling core. The result is pretty stable within a single run (as seen from the small deviation), but it can still fluctuate a little (mostly <1%) across different runs, which may be due to memory layout changes as mentioned in this criterion issue.

  • throughput is generally up in microbenchmarks

    • but throughput is a bit worse for cbor-de-citm_catalog.json

    • on 32bit latency is worse than std for short strings

    • on 64bit latency looks mixed...

    • cbor-de-trivial_helloworld (another short string case) ends up worse too

I'm not expecting a significant performance increase in these mixed workloads, where CBOR parsing (branchy) should take most of the time. But I'll look into these, especially the helloworld one.

@kornelski

Benchmark on Apple M3 Max

Thanks for the data. The results seem quite similar:

  1. The ASCII case shows a smaller difference between chunking parameters, but the relative trend is quite similar.
  2. Aligned and unaligned performance is also similar enough to be ignored, as on x86_64.
  3. The latency results are a lot better than on x86_64, so the absolute difference is smaller (a 1-2ns diff). But this patch generally behaves worse than std now, probably because the branch is better predicted in std?

I came up with an idea to move the partial-chunk processing from the tail to the head, hoping it helps in the small-string cases. I'll test it locally first.

@rust-timer
Collaborator

Finished benchmarking commit (0de538c): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results (primary 0.8%, secondary -1.9%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range            | count |
|-----------------------------|-------|------------------|-------|
| Regressions ❌ (primary)    | 0.8%  | [0.8%, 0.8%]     | 1     |
| Regressions ❌ (secondary)  | -     | -                | 0     |
| Improvements ✅ (primary)   | -     | -                | 0     |
| Improvements ✅ (secondary) | -1.9% | [-1.9%, -1.9%]   | 1     |
| All ❌✅ (primary)          | 0.8%  | [0.8%, 0.8%]     | 1     |

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary 0.2%, secondary 0.6%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean | range        | count |
|-----------------------------|------|--------------|-------|
| Regressions ❌ (primary)    | 0.2% | [0.1%, 0.5%] | 12    |
| Regressions ❌ (secondary)  | 0.6% | [0.3%, 0.7%] | 38    |
| Improvements ✅ (primary)   | -    | -            | 0     |
| Improvements ✅ (secondary) | -    | -            | 0     |
| All ❌✅ (primary)          | 0.2% | [0.1%, 0.5%] | 12    |

Bootstrap: 781.536s -> 778.817s (-0.35%)
Artifact size: 329.13 MiB -> 329.11 MiB (-0.01%)

@rustbot rustbot removed S-waiting-on-perf Status: Waiting on a perf run to be completed. perf-regression Performance regression. labels Feb 9, 2025
@the8472
Member

the8472 commented Feb 9, 2025

This is looking quite neutral now. Combined with the throughput improvements in microbenchmarks that should be good enough for inclusion as far as perf aspects go.

@oxalica
Contributor Author

oxalica commented Feb 13, 2025

I updated lossy parsing/conversion (core::str::Utf8Chunks::next) to use the new validation algorithm. It took some extra work to perform relatively well on error latency without suffering a 200% regression due to LLVM's bad register allocation (bug?) and mildly suboptimal codegen (#136972).

On d2030aa, it shows a +11%~49% speedup on the valid to almost-valid paths, and a -13% regression on the worst path (all bytes invalid). The numbers seem acceptable to me.

Member

@joboet joboet left a comment

Thank you for pushing this further! I have some style nits...

Also, I think (correct me if I'm wrong) you could get rid of most of the unsafe stuff by rewriting the main loop around an iterator over as_rchunks() and computing the position for resolve_error_location using the length of the remaining slice like so:

let (remainder, chunks) = bytes.as_rchunks();
// ... check the remainder ...

let mut chunks = chunks.iter();
while let Some(mut chunk) = chunks.next() {
    if st == ST_ACCEPT && chunk[0].is_ascii() {
        // Skip over fully-ASCII chunks.
        while chunk.iter().all(|b| b.is_ascii()) {
            chunk = match chunks.next() {
                Some(next) => next,
                None => break,
            };
        }
    } else {
        // ... check chunk ...

        if error {
            let i = bytes.len() - chunks.as_slice().as_flattened().len() - CHUNK_SIZE;
            // handle error
        }
    }
}

This is completely equivalent to the current version, doesn't introduce bounds checks, and the only remaining unsafe would be in run_with_error_handling.

Comment on lines 306 to 319
let pos = ascii_chunks
    .position(|chunk| {
        // NB. Always traverse the whole chunk to enable vectorization, instead of `.any()`.
        // LLVM is afraid of memory traps and falls back if the loop short-circuits.
        #[expect(clippy::unnecessary_fold)]
        let has_non_ascii = chunk.iter().fold(false, |acc, &b| acc || (b >= 0x80));
        has_non_ascii
    })
    .unwrap_or(ascii_rest_chunk_cnt);
i += pos * ASCII_CHUNK_SIZE;
if i + MAIN_CHUNK_SIZE > bytes.len() {
    break;
Member

You can immediately break from the loop if position returns None as that means that all chunks have been traversed.

@oxalica
Contributor Author

oxalica commented Feb 13, 2025

Also, I think (correct me if I'm wrong) you could get rid of most of the unsafe stuff by rewriting the main loop around an iterator over as_rchunks() and computing the position for resolve_error_location using the length of the remaining slice like so:

It only works if MAIN_CHUNK_SIZE == ASCII_CHUNK_SIZE, I think? Currently they are both 16, but I think m8 a16 is another candidate with different pros & cons, and I haven't really decided yet.

@oxalica oxalica changed the title Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings Rewrite UTF-8 validation in shift-based DFA for 70%~135% performance increase on non-ASCII strings Feb 17, 2025
@oxalica oxalica requested a review from joboet February 28, 2025 18:19
@bors
Contributor

bors commented Mar 7, 2025

☔ The latest upstream changes (presumably #138155) made this pull request unmergeable. Please resolve the merge conflicts.

oxalica added 8 commits March 7, 2025 22:50
This gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.

The shift-based DFA algorithm does not use SIMD instructions and does not rely on the branch predictor to achieve good performance, so it works well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

We use z3 to find a state mapping that only needs a u32 transition table. This shrinks the table size to 1KiB, compared to u64 states, for less cache pressure, and produces faster code on platforms that only support 32-bit shifts. It does not affect throughput on 64-bit platforms, though, when the table is already fully in cache.

1. To reduce the cache footprint.
2. To avoid additional cost when accessing across pages.

Hopefully this gives better latency on short strings and/or the immediate-fail path.

When using `error_len: Option<u8>`, `Result<(), Utf8Error>` would be returned on the stack and produce suboptimal stack-shuffling operations, causing a 50%-200% latency increase on the error path.
@oxalica oxalica force-pushed the feat/shift-dfa-utf8 branch from a105390 to bc57db5 on March 8, 2025 03:56
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.