Rewrite UTF-8 validation in shift-based DFA for 70%~135% performance increase on non-ASCII strings #136693
Shift-based DFAs are a really cool trick, great to see that precise error reporting and ASCII fast paths can be added on top of them. I don't want to steal an actual reviewer's show and I've only given this a quick cursory read, but there's one thing I was wondering about and one small suggestion.
I suspect that this is going to eat into cache capacity, especially in code that interleaves string validation with other things, e.g. CBOR decoding. Have you benchmarked this for short inputs and with mixed workloads? In coretests we have TINY/SMALL/MEDIUM/LARGE/HUGE for that reason. Additionally you may want to benchmark 4-byte characters such as emoji or maybe some exotic scripts on the supplementary plane.
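For concreteness, a short-input harness along these lines could look like the following sketch (illustrative only, not the in-tree coretests benches; the inputs are made up and it simply times `std::str::from_utf8` via the `criterion` crate):

```rust
// Sketch of a short-input validation benchmark (out-of-tree, criterion crate).
// The inputs below are illustrative stand-ins for short dictionary keys and
// small mixed/emoji-heavy strings.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_short_validation(c: &mut Criterion) {
    let inputs: &[(&str, &[u8])] = &[
        ("tiny_ascii", "key".as_bytes()),
        ("tiny_zh", "键名".as_bytes()),
        ("small_mixed", "id=42, name=élève, emoji=😀".as_bytes()),
        ("small_emoji", "😀😃😄😁🤖🦀".as_bytes()),
    ];
    for &(name, bytes) in inputs {
        c.bench_function(name, |b| {
            b.iter(|| std::str::from_utf8(black_box(bytes)).is_ok())
        });
    }
}

criterion_group!(benches, bench_short_validation);
criterion_main!(benches);
```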
FYI, in #136677, I fixed the handling of invalid UTF-8 in
This is a great point and should be benchmarked. Note that it's mitigated in many common cases by the accesses to the table depending only on the input bytes, so e.g. strings that are all ASCII touch at most half the table. But that's still a lot of cache lines.
No worries on my part as the current assignee -- in fact, if another reviewer has more context here, I'd be happy to yield this one. Maybe @the8472 wants to take it?
Could you please explain how to test "short inputs and with mixed workloads"? Should I bench the performance of a CBOR library like ciborium using the new std on this branch? I'll give it a check later.
It seems they are only used in a few places like
It should perform the same as the "zh" benchmark. Since the DFA is agnostic to the input, as long as the input is purely non-ASCII and stays on the DFA path, the performance should be the same. The correctness of the DFA can be audited by checking that the state transition table matches RFC 3629.
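As an illustration of that kind of audit (not code from this PR; the `validate` parameter is a stand-in for whichever DFA implementation is under test), one can also cross-check a candidate validator against the existing one over exhaustive short inputs:

```rust
// Cross-check a candidate validator against std's for every byte string of
// length 0..=3 (this exercises every lead-byte class; add longer random or
// structured inputs to cover the tails of 4-byte sequences). Takes a few
// seconds in release mode.
fn cross_check(validate: impl Fn(&[u8]) -> bool) {
    let mut buf = [0u8; 3];
    for len in 0..=3usize {
        let total = 1u64 << (8 * len);
        for counter in 0..total {
            for i in 0..len {
                buf[i] = (counter >> (8 * i)) as u8;
            }
            let bytes = &buf[..len];
            assert_eq!(
                validate(bytes),
                std::str::from_utf8(bytes).is_ok(),
                "mismatch on {:x?}",
                bytes
            );
        }
    }
}

fn main() {
    // Smoke test: std against itself. Swap in the shift-DFA validator to audit it.
    cross_check(|b| std::str::from_utf8(b).is_ok());
    println!("all inputs up to length 3 agree");
}
```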
I just meant that I, personally, couldn't take over the review even if I wanted to since I haven't had bors permissions in years 😄
Slightly silly idea, free to a good home: mapping the input bytes into equivalence classes before feeding them to the DFA, as done in Flexible and Economical UTF-8 Decoder, would reduce the rodata size to 256B + 8B * (# classes). That's 256B + 96B with the classes from the linked implementation, but idk if that number can be lower or has to be larger in this context. The critical path for the DFA transition is not affected by the input remapping, so this would still run pretty fast. On Skylake-ish hardware, idealized peak throughput goes from 1 cycle per byte to 1.5cpb because the extra load becomes the bottleneck (that's for a loop that does nothing but compute the final state, the version in this PR doesn't seem to hit 1cpb anyway). On CPUs that can sustain more than two loads per cycle, it may not even be any slower for long inputs.
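To make the suggested remapping concrete, the following rough sketch (a toy 3-state DFA with made-up names, not the UTF-8 tables from this PR) derives equivalence classes from a per-byte transition function and uses them in the shift-DFA hot loop:

```rust
// Sketch: derive byte equivalence classes for a shift-based DFA and use them
// in the hot loop. The DFA here is a toy one (it rejects inputs containing two
// consecutive 0xFF bytes); layout and names are illustrative only.

const STATES: usize = 3; // 0 = accept, 1 = "just saw 0xFF", 2 = error (absorbing)
const BITS: u64 = 6;     // 6-bit state fields, as in the shift-DFA trick

// Plain transition function of the toy DFA.
fn next(state: usize, byte: u8) -> usize {
    match (state, byte) {
        (2, _) => 2,
        (1, 0xFF) => 2,
        (_, 0xFF) => 1,
        _ => 0,
    }
}

fn main() {
    // 1. Group bytes whose whole transition column is identical -> equivalence classes.
    let mut class = [0u8; 256];
    let mut columns: Vec<[usize; STATES]> = Vec::new();
    for b in 0..=255u8 {
        let col: [usize; STATES] = std::array::from_fn(|s| next(s, b));
        let found = columns.iter().position(|c| *c == col);
        let id = match found {
            Some(i) => i,
            None => {
                columns.push(col);
                columns.len() - 1
            }
        };
        class[b as usize] = id as u8;
    }

    // 2. Pack one shifted-transition row per *class* instead of per byte:
    //    rodata becomes 256 B (class table) + 8 B per class, instead of 8 B * 256.
    let packed: Vec<u64> = columns
        .iter()
        .map(|col| {
            let mut row = 0u64;
            for (s, &ns) in col.iter().enumerate() {
                row |= (ns as u64 * BITS) << (s as u64 * BITS);
            }
            row
        })
        .collect();
    println!("{} classes -> {} bytes of rodata", columns.len(), 256 + 8 * columns.len());

    // 3. Hot loop: one extra load for the class remap, followed by the usual
    //    shift-based transition.
    let run = |input: &[u8]| -> bool {
        let mut st: u64 = 0; // bit offset of state 0 (accept)
        for &b in input {
            st = packed[class[b as usize] as usize] >> (st & 63);
        }
        (st & 63) == 0 // still at the accept state's bit offset
    };
    assert!(run(b"ok\xFFok".as_slice()));
    assert!(!run(b"bad\xFF\xFFbad".as_slice()));
}
```

The class lookup depends only on the input byte, not on the previous state, so it adds a load per byte without lengthening the state-to-state dependency chain.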
Sorry, I didn't mean that this was specifically for utf8 validation, I was just gesturing at more cases that can be useful for string benchmarking in general to cover common scenarios.
That's the tricky part. As hanna-kruppe says, such effects are most visible in complex codebases and are difficult to test with micro-benchmarks. I suggested CBOR because it should contain short strings (e.g. dictionary keys) mixed with other non-string parsing and some implementations might already have their own benchmarks that can be reused. So it could be a benchmark of intermediate complexity. I'll kick off a perf run to see if it impacts rustc itself. But we have already eliminated a bunch of string validations in the compiler, so it likely isn't a good benchmark either. @bors try @rust-timer queue
👍
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

Take 2 of rust-lang#107760 (cc `@thomcc`)

### Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in rust-lang#107760,

> For prior art: shift-DFAs are now used for UTF-8 validation in [PostgreSQL](https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1753), and seems to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level[1](rust-lang#107760 (comment)).

### Rationales

1. Performance: This algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.
2. Generality: It does not use SIMD instructions and does not rely on the branch predictor to get good performance, and is thus good as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

### Implementation details

I use the ordinary UTF-8 language definition from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629#section-4) and directly translate it into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata.

The main algorithm consists of the following parts:

1. Main loop: take a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check whether the state is in ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I choose `ASCII_CHUNK_SIZE = 16` to align with the current implementation: check 16 bytes at a time for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep it simple, if any error is encountered in the main loop, it discards the erroneous chunk and `break`s into this path to find the precise error location. That is, the erroneous chunk, if it exists, is traversed twice, in exchange for a tighter and more efficient hot loop.

There are also some small tricks being used:

1. Since we have i686-linux in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. It shows a 200%+ speed-up compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte in `utf8_char_width`. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. It did introduce an extra 32-bit shift. I believe it's almost free, but I have not benchmarked it yet.

### Benchmarks

I made an [out-of-tree implementation repository](https://github.com/oxalica/shift-dfa-utf8) for easier testing and benching. It also tested various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations.

Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article [William Shakespeare in en](https://en.wikipedia.org/wiki/William_Shakespeare), [es](https://es.wikipedia.org/wiki/William_Shakespeare) and [zh](https://zh.wikipedia.org/wiki/%E5%A8%81%E5%BB%89%C2%B7%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A).

In short: with m=16, a=16, shift-DFA gives -43% on en, +53% on es, +133% on zh; with m=16, a=32, it gives -9% on en, +26% on es, +33% on zh. That's expected: the larger the ASCII bypass chunk is, the better it performs on ASCII, but the worse on mixed content like es, because the taken branch keeps flipping.

To me, the difference between 27GB/s and 47GB/s on en is minimal in absolute time (144.61ns - 79.86ns = 64.75ns), compared to 476.05ns - 392.44ns = 83.61ns on es. So I currently chose m=16, a=16 in the PR.

On x86_64-linux, Ryzen 7 5700G @3.775GHz:

| Algorithm | Input language | Throughput / (GiB/s) |
|-------------------|----------------|----------------------|
| std | en | 47.768 +-0.301 |
| shift-dfa-m16-a16 | en | 27.337 +-0.002 |
| shift-dfa-m16-a32 | en | 43.627 +-0.006 |
| std | es | 6.339 +-0.010 |
| shift-dfa-m16-a16 | es | 9.721 +-0.014 |
| shift-dfa-m16-a32 | es | 8.013 +-0.009 |
| std | zh | 1.463 +-0.000 |
| shift-dfa-m16-a16 | zh | 3.401 +-0.002 |
| shift-dfa-m16-a32 | zh | 3.407 +-0.001 |

### Unresolved

- [ ] Benchmark on aarch64-darwin, another tier 1 target. I don't have a machine to play with.
- [ ] Decide the chunk size parameters. I'm currently picking m=16, a=16.
- [ ] Should we also replace the implementation of [lossy conversion](https://github.com/oxalica/rust/blob/c0639b8cad126d886ddd88964f729dd33fb90e67/library/core/src/str/lossy.rs#L194) by calling the new validation function? It has very similar code doing almost the same thing.
Not really. As I mentioned in the 32-bit-shift part, the bottleneck of the DFA path is latency, since the DFA has a data dependency from each state to its preceding state. The next state can only be calculated after the previous result is out, no matter how many ALUs you have. So,
... literally means +50% latency and -50% performance. I tested it by introducing an identity lookup table

But yes, the table must be fully in cache to achieve these numbers; mixed workloads are a challenge. I also prefer a smaller table size, but have not yet found a way to shrink it without introducing a massive performance drop.
@oxalica Remapping inputs before looking up the DFA transition for them doesn't add latency in any way that matters -- if it did, the extra latency of a load on the critical path would immediately take it to 4-5cpb or more. It's slower because it hits other bottlenecks, such as issue width or loads per cycle (you might want to check if your change unexpectedly added more instructions than just the loads). In any case, yes, its latency is significantly worse than the big LUT. But by your numbers, it's still 1.65x faster than the current implementation in std on the
☀️ Try build successful - checks-actions
I wrote a CBOR deserialization benchmark in my out-of-tree repo. Running it against sysroots built from this PR and from master, the result shows this PR gives a +0.3%~1.4% performance improvement. On the small-string-heavy
I think it's worth mentioning that in #107760 @thomcc managed to find a way to pack the 9 states of the DFA into 32 bits, which would help address the cache size, binary size and portability concerns. Perhaps you could ask him if he'd provide you with the SMT-solver code he used to find the state values?
Finished benchmarking commit (71c0147): comparison URL.

Overall result: ❌ regressions - please read the text below

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with
@bors rollup=never

Instruction count
This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

Max RSS (memory usage)
This benchmark run did not return any relevant results for this metric.

Cycles
Results (primary 3.3%, secondary 2.5%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size
Results (primary 0.2%, secondary 0.5%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 779.143s -> 778.734s (-0.05%)
Thanks for pointing this out; I've updated the patch to use a u32 table now. I checked the script mentioned in the postgres comments and rewrote it for our use case:
I put the generating Python script in-tree and appended the output solution to it for reference. I'm not sure if there is a better place to put it, but I found there is

Performance-wise, u32 transitions make the algorithm on i686 run almost as fast as on x86_64, but it does not change much on x86_64 itself. It should reduce cache pressure, but that may not be easily benchmarkable.

@the8472 All benchmark results are updated in my repo: x86_64, i686, CBOR deserialization compiled by rustc from this PR.
Syn has been noisy, so that one can be ignored, which means instruction counts look neutral. But cycles look like they regressed in a few places. tuple-stress is spending more cycles across several variations of the benchmark... and yet instruction counts are unchanged. Weird. But it's a stress test, so as long as it's not a huge impact it's probably not relevant. Runtime benchmarks also aren't showing anything that would obviously point at validation. Brotli should be munging bytes, not strings. If you want to double-check you could try running those benchmarks under perf diff locally and see if anything string-related shows up, but I doubt that it will. Anyway, since the impl changed, let's rerun perf. @bors try @rust-timer queue
Have you run them several times to check if there's variance between runs? ASLR, CPU clock boosting, differences between CPU cores, etc. can lead to stable within-run but varying cross-run results. Assuming they are stable, it looks like
That there are some latency improvements at all on x86-64 is tantalizing. But they appear to evaporate once it's embedded in additional code (for the single context, CBOR, that's being tested here). So maybe the DFA algorithm should only be used for strings above a certain length?
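Concretely, such a dispatch would amount to something like the following sketch (illustrative only; `SMALL_LIMIT`, `validate_small`, and `validate_shift_dfa` are hypothetical names, and the threshold would have to come from benchmarking):

```rust
// Hypothetical length-based dispatch: short inputs take a simple path that
// never touches the DFA table, long inputs take the shift-DFA path.
const SMALL_LIMIT: usize = 32; // placeholder; would need tuning

fn validate_utf8(bytes: &[u8]) -> bool {
    if bytes.len() < SMALL_LIMIT {
        validate_small(bytes)
    } else {
        validate_shift_dfa(bytes)
    }
}

// Stand-ins so the sketch compiles; both defer to std here.
fn validate_small(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
fn validate_shift_dfa(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    assert!(validate_utf8("héllo".as_bytes()));
    assert!(!validate_utf8(b"\x80"));
}
```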
☀️ Try build successful - checks-actions
Thanks for the explanation. I'll play around with it locally.
All my benchmarks are produced under my
I'm not expecting a significant performance increase in these mixed workloads, where CBOR parsing (branchy) should take most of the time. But I'll look into these, especially the helloworld one.
Thanks for the data. The result seems quite similar:
I came up with an idea: move the partial-chunk processing from the tail to the head, in the hope that it helps in small-string cases. I'll test it locally first.
Finished benchmarking commit (0de538c): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count
This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)
Results (primary 0.8%, secondary -1.9%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles
This benchmark run did not return any relevant results for this metric.

Binary size
Results (primary 0.2%, secondary 0.6%)
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 781.536s -> 778.817s (-0.35%)
This is looking quite neutral now. Combined with the throughput improvements in microbenchmarks, that should be good enough for inclusion as far as the perf aspects go.
I updated lossy parsing/conversion ( On d2030aa, it shows a +11%~49% speed-up on the valid and almost-valid paths, and a -13% regression on the worst path (all bytes are invalid). The numbers seem acceptable to me.
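For reference, the worst-case path mentioned here corresponds to inputs where every byte is an invalid start byte; a tiny illustration (not the PR's benchmark):

```rust
// Every 0x80 byte is an invalid UTF-8 start byte, so lossy conversion has to
// emit one U+FFFD replacement character per input byte -- the worst case for
// the error-handling path.
fn main() {
    let bad = vec![0x80u8; 4096];
    let lossy = String::from_utf8_lossy(&bad);
    assert_eq!(lossy.chars().count(), 4096);
    assert!(lossy.chars().all(|c| c == '\u{FFFD}'));
}
```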
Thank you for pushing this further! I have some style nits...
Also, I think (correct me if I'm wrong) you could get rid of most of the `unsafe` stuff by rewriting the main loop around an iterator over `as_rchunks()` and computing the position for `resolve_error_location` using the length of the remaining slice, like so:
```rust
let (remainder, chunks) = bytes.as_rchunks();
// ... check the remainder ...
let mut chunks = chunks.iter();
while let Some(mut chunk) = chunks.next() {
    if st == ST_ACCEPT && chunk[0].is_ascii() {
        while chunk_is_ascii {
            chunk = chunks.next();
        }
    } else {
        // ... check chunk ...
        if error {
            let i = bytes.len() - chunks.as_slice().as_flattened().len() - CHUNK_SIZE;
            // handle error
        }
    }
}
```
This is completely equivalent to the current version, doesn't introduce bounds checks, and the only `unsafe` remaining would be in `run_with_error_handling`.
library/core/src/str/validations.rs
```rust
let pos = ascii_chunks
    .position(|chunk| {
        // NB. Always traverse the whole chunk to enable vectorization, instead of `.any()`.
        // LLVM fears memory traps and falls back to a scalar loop if the loop can short-circuit.
        #[expect(clippy::unnecessary_fold)]
        let has_non_ascii = chunk.iter().fold(false, |acc, &b| acc || (b >= 0x80));
        has_non_ascii
    })
    .unwrap_or(ascii_rest_chunk_cnt);
i += pos * ASCII_CHUNK_SIZE;
if i + MAIN_CHUNK_SIZE > bytes.len() {
    break;
```
You can immediately `break` from the loop if `position` returns `None`, as that means that all chunks have been traversed.
It only works if
☔ The latest upstream changes (presumably #138155) made this pull request unmergeable. Please resolve the merge conflicts.
This gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content. The shift-based DFA algorithm does not use SIMD instructions and does not rely on the branch predictor to get good performance, and is thus good as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it. We use z3 to find a state mapping that needs only a u32 transition table. This shrinks the table to 1KiB, compared to u64 states, for less cache pressure, and produces faster code on platforms that only support 32-bit shifts. It does not, however, affect throughput on 64-bit platforms when the table is already fully in cache.
1. To reduce the cache footprint. 2. To avoid additional cost when accessing across pages.
Hopefully this gives better latency on short strings and/or the immediate-fail path.
When using `error_len: Option<u8>`, `Result<(), Utf8Error>` will be returned on the stack and produces suboptimal stack-shuffling operations. It causes a 50%-200% latency increase on the error path.
a105390 to bc57db5
Take 2 of #107760 (cc @thomcc)
Background
About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725
As stated in #107760,
Rationales
Performance: This algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.
Generality: It does not use SIMD instructions and does not rely on the branch predictor to get good performance, and is thus good as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.
Implementation details
I use the ordinary UTF-8 language definition from RFC 3629 and directly translate it into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata.

The main algorithm consists of the following parts:

1. Main loop: take a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check whether the state is in ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I choose `ASCII_CHUNK_SIZE = 16` to align with the current implementation: check 16 bytes at a time for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep it simple, if any error is encountered in the main loop, it discards the erroneous chunk and `break`s into this path to find the precise error location. That is, the erroneous chunk, if it exists, is traversed twice, in exchange for a tighter and more efficient hot loop.

There are also some small tricks being used:

1. Since we have i686-linux in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. It shows a 200%+ speed-up compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte in `utf8_char_width`. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. It did introduce an extra 32-bit shift. I believe it's almost free, but I have not benchmarked it yet.
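To make the state machine and the shift trick above concrete, the following is a minimal, self-contained sketch of a shift-based DFA UTF-8 validator (illustrative only: the state numbering, 64-bit table layout, and simple per-byte loop are assumptions for exposition, not this PR's actual encoding, chunking, ASCII bypass, or error reporting):

```rust
// Shift-based DFA UTF-8 validation, minimal sketch.
// Each state is a bit offset (a multiple of 6) into a u64. For every input
// byte b, TRANS[b] packs, at each state's offset, the 6-bit offset of that
// state's successor, so a transition is just `st = TRANS[b] >> (st & 63)`.

const BITS: u64 = 6;
// States (per RFC 3629): ACC = accept/start; CS1..CS3 = expect 1..3 generic
// continuation bytes (0x80..=0xBF); P3A/P3B/P4A/P4B = the constrained second
// bytes after E0/ED/F0/F4; ERR = absorbing error state.
const ACC: u64 = 0;
const CS1: u64 = 1;
const CS2: u64 = 2;
const CS3: u64 = 3;
const P3A: u64 = 4;
const P3B: u64 = 5;
const P4A: u64 = 6;
const P4B: u64 = 7;
const ERR: u64 = 8;

// Plain transition function, straight from the RFC 3629 byte ranges.
fn next(state: u64, byte: u8) -> u64 {
    match (state, byte) {
        (ACC, 0x00..=0x7F) => ACC,
        (ACC, 0xC2..=0xDF) => CS1,
        (ACC, 0xE0) => P3A,
        (ACC, 0xE1..=0xEC | 0xEE..=0xEF) => CS2,
        (ACC, 0xED) => P3B,
        (ACC, 0xF0) => P4A,
        (ACC, 0xF1..=0xF3) => CS3,
        (ACC, 0xF4) => P4B,
        (CS1, 0x80..=0xBF) => ACC,
        (CS2, 0x80..=0xBF) => CS1,
        (CS3, 0x80..=0xBF) => CS2,
        (P3A, 0xA0..=0xBF) => CS1,
        (P3B, 0x80..=0x9F) => CS1,
        (P4A, 0x90..=0xBF) => CS2,
        (P4B, 0x80..=0x8F) => CS2,
        _ => ERR,
    }
}

// Pack the 9 states x 256 bytes transition table into one u64 row per byte.
fn build_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for b in 0..=255u8 {
        let mut row = 0u64;
        for s in 0..=ERR {
            // The 6-bit field at state s's offset holds the *offset* of next(s, b).
            row |= (next(s, b) * BITS) << (s * BITS);
        }
        table[b as usize] = row;
    }
    table
}

fn validate(bytes: &[u8], table: &[u64; 256]) -> bool {
    let mut st = ACC * BITS; // current state, as a bit offset
    for &b in bytes {
        // The PR works on 16-byte chunks, checks for ERR once per chunk, and
        // has an ASCII fast path; this sketch just goes byte by byte.
        st = table[b as usize] >> (st & 63);
    }
    (st & 63) == ACC * BITS // valid iff we end in the accept state
}

fn main() {
    let table = build_table();
    assert!(validate("ASCII, café, 莎士比亚, 🦀".as_bytes(), &table));
    assert!(!validate(b"\xE2\x28\xA1", &table)); // bad continuation byte
    assert!(!validate(b"\xED\xA0\x80", &table)); // surrogate, rejected by RFC 3629
    assert!(!validate(b"\xF0", &table)); // truncated sequence
    // Cross-check against std for every 2-byte input.
    for hi in 0..=255u8 {
        for lo in 0..=255u8 {
            let s = [hi, lo];
            assert_eq!(validate(&s, &table), std::str::from_utf8(&s).is_ok());
        }
    }
}
```

The PR additionally merges `utf8_char_width` into spare bits of the table, uses a 32-bit state encoding found with an SMT solver, and processes the input in 16-byte chunks with an ASCII fast path, none of which is shown here.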
Benchmarks

I made an out-of-tree implementation repository for easier testing and benching. It also tested various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations. Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article William Shakespeare in the en, es and zh languages.

In short: with m=16, a=16, shift-DFA gives -45% on en, +69% on es, +135% on zh; with m=8, a=32, it gives +5% on en, +22% on es, +136% on zh. That's expected: the larger the ASCII bypass chunk is, the better it performs on ASCII, but the worse on mixed content like "es", because the taken branch keeps flipping.

To me, the difference on "en" is minimal in absolute time because the throughput is already high enough, compared to the not-as-fast "es". So I'm currently picking m=16, a=16 in the PR to lean towards "es".
x86_64-linux results
On a Ryzen 7 5700G @3.775GHz (turbo disabled, with cpuset, with stack layout randomizer):

Note: the zh input consists solely of non-ASCII codepoints and runs fully on the DFA path; its performance is the same as for purely-emoji input.
Before (486b0d1):
After with m16 a16 (ceb82dd971b7aef47493298255bab732bdc67b5e):
After with m8 a32 (ceb82dd971b7aef47493298255bab732bdc67b5e):
Unresolved
- Benchmark on aarch64-darwin, another tier 1 target. See this comment.
- Decide the chunk size parameters. I'm currently picking m=16, a=16.
- Should we also replace the implementation of lossy conversion by calling the new validation function? It has very similar code doing almost the same thing. It now also uses the new validation algorithm; benchmarks of lossy conversions are included above.