Vectorize base16 (hex) decoding #26
Base16 (hex) can be decoded similarly to base32hex, except first we’d But.. how many shuffles should be used? Haswell CPUs have only one shuffle execution port. An AVX2 port of the base32hex-style decoder would require 4 shuffles per 32 bytes of input. The linked solution by Geoff Langdale uses 2 shuffles and has only 1 more instruction in total, but it would require several more 32-byte constants from memory. |
Let us test it out! |
For base16, we'd need to keep some state as some record types (and generic RDATA representation) allow for spaces. In terms of the zone parser, that means calling |
The standard seems to require that bytes are encoded as pairs and that pairs are held together, so that the input is always an even number of characters. Are you saying that they don't do that? That'd be very strange. |
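For reference, a minimal scalar sketch of the strict interpretation discussed here: pairs are held together and the input has an even number of characters. The function names are hypothetical and are not part of simdzone.

```c
#include <stddef.h>
#include <stdint.h>

/* Map one ASCII character to its 4-bit value, or -1 if it is not a hex digit. */
static int hex_nibble(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode `len` hex characters from `src` into `dst` (len / 2 bytes).
 * Returns the number of bytes written, or -1 if `len` is odd or any
 * character is not a hex digit. Hypothetical helper, not simdzone API. */
long base16_decode_strict(uint8_t *dst, const char *src, size_t len) {
    if (len % 2 != 0) return -1;
    for (size_t i = 0; i < len; i += 2) {
        int hi = hex_nibble((unsigned char)src[i]);
        int lo = hex_nibble((unsigned char)src[i + 1]);
        if (hi < 0 || lo < 0) return -1;
        dst[i / 2] = (uint8_t)((hi << 4) | lo);
    }
    return (long)(len / 2);
}
```

A decoder like this rejects any input where whitespace splits a pair, which is exactly the case the question is about.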
Using GCC11 on an Ice Lake server, I get the following results:
My benchmark uses relatively short strings of random length. We can (and maybe should) change that to more closely match what we have in the actual data. It is important to note that the bulk of the instructions are in the setting up for both In my table It is possible we can improve our approach further. I did not quite understand @aqrit's description: I am still using a multiplication when combining the bytes. I can do it without a multiplication, but the approach I imagine requires several extra instructions (no big deal, but slightly slower...). Maybe there is a clever way that I don't see. The version in simdzone currently is
Here is the assembly for our version (note that there is a hot loop in the middle):
base16hex_simd(unsigned char*, unsigned char const*):
movdqu xmm1, XMMWORD PTR [rsi]
pcmpeqd xmm5, xmm5
mov rax, rsi
movdqa xmm7, XMMWORD PTR .LC1[rip]
paddb xmm1, xmm5
movdqa xmm6, XMMWORD PTR .LC0[rip]
movdqa xmm8, XMMWORD PTR .LC2[rip]
movdqa xmm2, xmm1
movdqa xmm3, xmm7
psrld xmm2, 4
movdqa xmm0, xmm8
pand xmm2, xmm6
pshufb xmm3, xmm2
pshufb xmm0, xmm2
paddb xmm0, xmm1
paddb xmm1, xmm3
pmovmskb edx, xmm1
test edx, edx
jne .L4
movdqa xmm4, XMMWORD PTR .LC3[rip]
movdqa xmm9, XMMWORD PTR .LC4[rip]
.L2:
add rax, 16
pmaddubsw xmm0, xmm4
movdqa xmm3, xmm7
pshufb xmm0, xmm9
movups XMMWORD PTR [rdi], xmm0
movdqu xmm1, XMMWORD PTR [rax]
movdqa xmm0, xmm8
add rdi, 8
paddb xmm1, xmm5
movdqa xmm2, xmm1
psrld xmm2, 4
pand xmm2, xmm6
pshufb xmm3, xmm2
pshufb xmm0, xmm2
paddb xmm0, xmm1
paddb xmm1, xmm3
pmovmskb edx, xmm1
test edx, edx
je .L2
.L4:
bsf rdx, rdx
test edx, edx
je .L3
movsx rdx, edx
mov ecx, 16
sub rcx, rdx
add rax, rdx
movdqu xmm1, XMMWORD PTR base16hex_simd(unsigned char*, unsigned char const*)::zero_masks[rcx]
pandn xmm1, xmm0
movdqa xmm0, XMMWORD PTR .LC3[rip]
pmaddubsw xmm1, xmm0
pshufb xmm1, XMMWORD PTR .LC4[rip]
movups XMMWORD PTR [rdi], xmm1
.L3:
sub rax, rsi
ret
Here is the assembly for Geoff's version (note the hot loop in the middle):
base16hex_simd_geoff(unsigned char*, unsigned char const*):
movdqa xmm3, XMMWORD PTR .LC5[rip]
mov rax, rsi
movdqu xmm1, XMMWORD PTR [rsi]
movdqa xmm4, XMMWORD PTR .LC6[rip]
movdqa xmm0, xmm3
movdqa xmm5, XMMWORD PTR .LC7[rip]
paddb xmm0, xmm1
movdqa xmm6, XMMWORD PTR .LC8[rip]
psubusb xmm0, xmm4
movdqa xmm8, XMMWORD PTR .LC9[rip]
pand xmm1, xmm5
movdqa xmm7, XMMWORD PTR .LC10[rip]
paddb xmm1, xmm6
movdqa xmm9, XMMWORD PTR .LC11[rip]
paddusb xmm1, xmm8
paddb xmm0, xmm7
pminub xmm0, xmm1
movdqa xmm1, xmm0
paddusb xmm0, xmm9
pmovmskb edx, xmm0
test edx, edx
jne .L10
movdqa xmm2, XMMWORD PTR .LC3[rip]
movdqa xmm10, XMMWORD PTR .LC4[rip]
.L8:
add rax, 16
pmaddubsw xmm1, xmm2
movdqa xmm0, xmm3
pshufb xmm1, xmm10
movups XMMWORD PTR [rdi], xmm1
movdqu xmm1, XMMWORD PTR [rax]
add rdi, 8
paddb xmm0, xmm1
pand xmm1, xmm5
paddb xmm1, xmm6
psubusb xmm0, xmm4
paddusb xmm1, xmm8
paddb xmm0, xmm7
pminub xmm0, xmm1
movdqa xmm1, xmm0
paddusb xmm0, xmm9
pmovmskb edx, xmm0
test edx, edx
je .L8
.L10:
bsf rdx, rdx
test edx, edx
je .L9
movsx rdx, edx
mov ecx, 16
sub rcx, rdx
add rax, rdx
movdqu xmm0, XMMWORD PTR base16hex_simd_geoff(unsigned char*, unsigned char const*)::zero_masks[rcx]
pandn xmm0, xmm1
movdqa xmm1, XMMWORD PTR .LC3[rip]
pmaddubsw xmm0, xmm1
pshufb xmm0, XMMWORD PTR .LC4[rip]
movups XMMWORD PTR [rdi], xmm0
.L9:
sub rax, rsi
ret
The source code is available at https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/07/26 |
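Both listings combine the decoded nibbles with `pmaddubsw` followed by `pshufb`. The following is a scalar model of that step, not the author's code. It assumes the constant loaded from `.LC3` holds alternating weights 16 and 1, which the listing does not show. The second function is the multiplication-free alternative (shift and OR) mentioned in the comment. Function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the multiply-add combine: each (high, low) nibble pair
 * becomes high * 16 + low, which is what pmaddubsw would compute per
 * 16-bit lane if its weights are {16, 1} (an assumption, see above). */
void combine_nibbles_madd(uint8_t *dst, const uint8_t *nibbles, size_t pairs) {
    for (size_t i = 0; i < pairs; i++) {
        uint16_t lane = (uint16_t)(nibbles[2 * i] * 16 + nibbles[2 * i + 1] * 1);
        dst[i] = (uint8_t)lane; /* the shuffle keeps the low byte of each lane */
    }
}

/* Same result without a multiplication: shift the high nibble and OR. */
void combine_nibbles_shift(uint8_t *dst, const uint8_t *nibbles, size_t pairs) {
    for (size_t i = 0; i < pairs; i++) {
        dst[i] = (uint8_t)((nibbles[2 * i] << 4) | nibbles[2 * i + 1]);
    }
}
```

In SIMD form the shift-and-OR variant usually needs a shift, a mask or OR, and a pack, which is where the "several extra instructions" would come from.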
Quite a few RR presentations don't require that pairs are held together. For instance DS is described in RFC 4034 as:
|
If you know it is white space, then you can trim it out as a first pass. Super fast if AVX-512 is available, slower but still fast otherwise. I can easily include that in the benchmark. Again… what matters is whether I am benchmarking the right problem. We need to make sure that I am. |
E.g., when I load the data, I can add a branch that checks for white space and prune it out dynamically as needed. |
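A scalar sketch of such a pruning pass, assuming "white space" means space, tab, CR and LF only (zone files also allow comments and parentheses, which this does not handle). The function name is hypothetical. It stores every byte and only advances the output cursor for non-space characters, so the loop body has no data-dependent branch.

```c
#include <stddef.h>

/* Copy `len` bytes from src to dst, dropping space, tab, CR and LF.
 * Returns the number of bytes kept. dst must have room for `len` bytes
 * and may be the same pointer as src (in-place pruning). */
size_t prune_whitespace(char *dst, const char *src, size_t len) {
    size_t kept = 0;
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        int is_space = (c == ' ') | (c == '\t') | (c == '\r') | (c == '\n');
        dst[kept] = c;              /* always store ... */
        kept += (size_t)!is_space;  /* ... but only keep non-space bytes */
    }
    return kept;
}
```

The pruned buffer can then be handed to a decoder that expects contiguous hex digits.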
Looks good @lemire! Note that for simdzone if there's white space we need to stop as for contiguous strings that's a delimiter and the next set of characters may not be complete. As for @aqrit's comment, I think he means we need to subtract 1 as |
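To make the subtract-1 step concrete, here is a scalar model of the classification the first listing appears to perform per byte: subtract 1, index two tables by the high nibble, add, and test the sign bit. The table values are an illustrative reconstruction, not the constants from the listing (`.LC1` and `.LC2` are not shown), and the function names are hypothetical. Subtracting 1 moves `'A'..'F'` and `'a'..'f'` to low nibbles 0..5 and pushes `'@'` and the backtick into neighbouring high-nibble groups, so one bound per group is enough. As in the listings, decoding stops at the first non-hex character and returns the number of characters consumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Tables indexed by the high nibble of (c - 1). Illustrative values only. */
static const uint8_t delta_table[16] = {
    0, 0, 0xD1, 0xD1, 0xCA, 0, 0xAA, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
static const uint8_t check_table[16] = {
    0x80, 0x80, 0xD1, 0x47, 0x3A, 0x80, 0x1A, 0x80, 0, 0, 0, 0, 0, 0, 0, 0
};

/* Returns 1 and stores the 4-bit value if c is a hex digit, 0 otherwise.
 * A set sign bit in `check` marks an invalid character, which is what
 * pmovmskb would collect across the lanes. */
static int classify(unsigned char c, uint8_t *value) {
    uint8_t v = (uint8_t)(c - 1);
    uint8_t hi = (uint8_t)(v >> 4);
    uint8_t check = (uint8_t)(v + check_table[hi]);
    *value = (uint8_t)(v + delta_table[hi]);
    return (check & 0x80) == 0;
}

/* Decode pairs until the first non-hex character (for example a space).
 * Returns the number of input characters consumed. */
size_t base16_decode_until_delim(uint8_t *dst, const char *src, size_t len) {
    size_t i = 0;
    while (i + 1 < len) {
        uint8_t hi, lo;
        if (!classify((unsigned char)src[i], &hi)) break;
        if (!classify((unsigned char)src[i + 1], &lo)) break;
        dst[i / 2] = (uint8_t)((hi << 4) | lo);
        i += 2;
    }
    return i;
}
```

For the input `2BB1 83` this consumes four characters and leaves the space for the caller to interpret as a delimiter.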
Last night, I managed to get...
as opposed to...
I don't think it is a win. |
Can you elaborate? E.g., is it the case that There are many ways to deal efficiently with the spaces, but how they appear matters a lot. The length of the contiguous base16 sequences matters too.
|
Examples would be great. |
I added a version (
|
I was incorrectly assuming a bswap would be required.
We might not hit a shuffle bottleneck here, but the Base64 decoder did get slower when I tried using an extra shuffle. Ice Lake seems to have two shuffle ports. Also note:
|
Yes In current usage base16 with whitespace is used in DS, CDS, ZONEMD and SSHFP to present digests that cap out at 64 bytes (SHA-512). They probably represent the general case, though the SMIMEA and TLSA RRs also use it for either a digest or a certificate (I'm not that familiar with those RRs, so that could be wrong). The spaces question is trickier because of parentheses. From RFC 1035:
So space may encompass traditional whitespace, newlines, comments, or nested parentheses. I don't have data on how the presentation format is used in the wild but anecdotally I don't believe I've ever seen base16 broken into whitespace on a nibble boundary outside of test cases and most tools seem to present base16 without whitespace. |
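If whitespace can fall on a nibble boundary, the decoder has to carry half a byte from one contiguous chunk to the next. A minimal scalar sketch of that state, with a hypothetical type and function name (not simdzone API):

```c
#include <stddef.h>
#include <stdint.h>

/* Decoder state carried across chunks (hypothetical, for illustration). */
typedef struct {
    int have_high;  /* 1 if a high nibble is waiting for its partner */
    uint8_t high;   /* the pending high nibble, valid when have_high is 1 */
} base16_state;

static int hex_value(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Feed one contiguous chunk of hex digits. Returns the number of bytes
 * written to dst, or -1 on a non-hex character. A chunk with an odd
 * number of digits leaves a pending nibble in the state. */
long base16_feed(base16_state *st, uint8_t *dst, const char *src, size_t len) {
    long written = 0;
    for (size_t i = 0; i < len; i++) {
        int v = hex_value((unsigned char)src[i]);
        if (v < 0) return -1;
        if (st->have_high) {
            dst[written++] = (uint8_t)((st->high << 4) | v);
            st->have_high = 0;
        } else {
            st->high = (uint8_t)v;
            st->have_high = 1;
        }
    }
    return written;
}
```

After the last chunk, a caller would treat `have_high == 1` as an error, since the field then ends on a dangling nibble.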
For base16, we always require contiguous strings. Some RRTYPEs allow for spaces, some don't. e.g. in It's hard to say how DNS operators present the fields. Since spaces are allowed, it's valid to do: As for examples:
There is another case where base16 can be used for all record types, and that's with generic encoding. RFC3597 introduced a generic format to present RRs unknown to the name server. Basically, you can present an A record in one of two ways:
or, in generic notation:
So, it's hard to say how long the sequence will be. Generic notation is rarely used (though it may be an interesting method for parsing the zone 🤔). It's probably best to focus on
I had not considered the second option, but it's certainly worth considering. Thinking about it some more, we could introduce a |
Blog post: Decoding base16 sequences quickly. Whether this can be used in simdzone productively is an open question. Typically, one would first test whether a fast path is possible, then use a routine such as these, and otherwise fall back on something slower. The check for a fast path needs to be cheap, but it also needs to cover a broad range of use cases. This will require domain expertise to set up. |
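One possible shape for such a dispatch, as a sketch only. The fast-path test here (the token has the hex length of a SHA-1, SHA-256 or SHA-384 digest) is an assumption made for illustration, `decode_contiguous` is a scalar stand-in for a vectorized routine, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

static int hex_value(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Cheap fast-path test (assumed for this sketch): 40, 64 or 96 hex digits. */
static int looks_like_digest(size_t len) {
    return len == 40 || len == 64 || len == 96;
}

/* Stand-in for a vectorized routine: contiguous pairs only. */
static long decode_contiguous(uint8_t *dst, const char *src, size_t len) {
    if (len % 2 != 0) return -1;
    for (size_t i = 0; i < len; i += 2) {
        int hi = hex_value((unsigned char)src[i]);
        int lo = hex_value((unsigned char)src[i + 1]);
        if ((hi | lo) < 0) return -1; /* either digit invalid */
        dst[i / 2] = (uint8_t)((hi << 4) | lo);
    }
    return (long)(len / 2);
}

/* Slow path: tolerates spaces and tabs between any two digits. */
static long decode_tolerant(uint8_t *dst, const char *src, size_t len) {
    long written = 0;
    int pending = -1; /* high nibble waiting for its partner, or -1 */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c == ' ' || c == '\t') continue;
        int v = hex_value(c);
        if (v < 0) return -1;
        if (pending < 0) {
            pending = v;
        } else {
            dst[written++] = (uint8_t)((pending << 4) | v);
            pending = -1;
        }
    }
    return pending < 0 ? written : -1; /* reject a dangling nibble */
}

/* Try the fast path when the cheap test passes, otherwise fall back. */
long decode_digest(uint8_t *dst, const char *src, size_t len) {
    if (looks_like_digest(len)) {
        long n = decode_contiguous(dst, src, len);
        if (n >= 0) return n;
    }
    return decode_tolerant(dst, src, len);
}
```

The length test costs a few comparisons, so common inputs pay almost nothing for the dispatch, while unusual inputs still decode correctly through the slow path.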
Note: the zero_masks table is probably not needed here, because output bytes are formed from only two input bytes. At worst, bad chars could be zero'd out using |
Nice write-up @lemire! It's good to have this work nicely documented.
Pun intended? 😅 But, yes. Zone files are weird for people not that familiar with DNS (that includes me 🙂). Actually, they're weird for people familiar with DNS too. I think that for 1433128 / 8924060 (DS / all) (2019-09-05) The older In both cases the base16 data is (SHA-1, SHA-256 or SHA-384 encoded digests):
We can probably do a base16 decoder just for @lemire, I can have a look next week unless you want to have a stab at it? It's fine with me either way. |
You probably don't want to use generic code. I suggest that the code should be tailored to the data you do have, with possible fallbacks when the data has an unexpected form. E.g., if you expect that the input should be |
I'll have a go at it later this week (probably). I'll test some different scenarios and see which one works. I kind of like the idea of having a separate |
RFC3597 section 5 (Handling of Unknown DNS Resource Record (RR) Types) states the following about hex-encoded sequences:
I figured the same would apply to the |
Base64 will have the same problem. Basically, you can wipe out all or most of the benefits of fast parsing if you optimize for arbitrary inputs instead of the inputs you do have in practice. If you go back to our original paper on base64, there is an appendix where we make the point that handling spaces at arbitrary locations can end up taking the bulk of the running time. It is just a fact that actual data often has low entropy: it comes in specific shapes most of the time, and that's what you want to optimize for. |
base16 encoding is used quite a lot, e.g. in DS records and to represent rdata for unknown record types (as outlined in RFC3597). Research was done to deserialize more efficiently by Geoff Langdale and Wojciech Muła. A couple of algorithms are outlined in this article. Other research may be available too.