-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add_UTF16_support + fix/test array OutOfIndex error #25
Conversation
Got to love these numbers. :-) |
benchmark/Benchmark.cs
Outdated
for (int i = 0; i < FileContent.Length; i++) | ||
{ | ||
string s = FileContent[i]; | ||
char[] base64 = s.ToCharArray(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect ToCharArray
is allocating. I suggest trying...
fixed (char* p = s)
{
// p now points to the first character of the string
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or possibly...
Span<char> span = s.AsSpan();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it:
The benchmarks show an improvement with using s.AsSpan() :
Eg Before:
Method | FileName | Mean | Error | StdDev | Speed (GB/s) |
---|---|---|---|---|---|
SSEDecodingRealDataUTF8 | data/dns/swedenzonebase.txt | 13,676.2 us | 4,527.54 us | 248.17 us | 2.57 |
SSEDecodingRealDataWithAllocUTF8 | data/dns/swedenzonebase.txt | 15,854.0 us | 2,258.94 us | 123.82 us | 2.21 |
SSEDecodingRealDataUTF16 | data/dns/swedenzonebase.txt | 21,441.2 us | 11,428.57 us | 626.44 us | 1.64 |
SSEDecodingRealDataWithAllocUTF16 | data/dns/swedenzonebase.txt | 23,101.2 us | 1,199.81 us | 65.77 us | 1.52 |
SSEDecodingRealDataUTF8 | data/email/ | 303.5 us | 94.11 us | 5.16 us | 6.51 |
SSEDecodingRealDataWithAllocUTF8 | data/email/ | 502.2 us | 188.52 us | 10.33 us | 3.94 |
SSEDecodingRealDataUTF16 | data/email/ | 1,231.8 us | 794.61 us | 43.56 us | 1.60 |
SSEDecodingRealDataWithAllocUTF16 | data/email/ | 1,807.2 us | 398.33 us | 21.83 us | 1.09 |
DotnetRuntimeSIMDBase64RealDataUTF8 | data/dns/swedenzonebase.txt | 11,092.6 us | 1,009.68 us | 55.34 us | 3.16 |
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 | data/dns/swedenzonebase.txt | 11,840.5 us | 88.56 us | 4.85 us | 2.96 |
DotnetRuntimeBase64RealDataUTF16 | data/dns/swedenzonebase.txt | 47,221.1 us | 9,269.31 us | 508.08 us | .74 |
DotnetRuntimeSIMDBase64RealDataUTF8 | data/email/ | 434.1 us | 10.26 us | 0.56 us | 4.55 |
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 | data/email/ | 585.7 us | 637.26 us | 34.93 us | 3.37 |
DotnetRuntimeBase64RealDataUTF16 | data/email/ | 2,210.4 us | 619.72 us | 33.97 us | .89 |
After:
Method | FileName | Mean | Error | StdDev | Speed (GB/s) |
---|---|---|---|---|---|
SSEDecodingRealDataUTF8 | data/dns/swedenzonebase.txt | 13,609.2 us | 598.55 us | 32.81 us | 2.58 |
SSEDecodingRealDataWithAllocUTF8 | data/dns/swedenzonebase.txt | 16,344.4 us | 11,092.34 us | 608.01 us | 2.15 |
SSEDecodingRealDataUTF16 | data/dns/swedenzonebase.txt | 14,942.3 us | 869.03 us | 47.63 us | 2.35 |
SSEDecodingRealDataWithAllocUTF16 | data/dns/swedenzonebase.txt | 24,094.8 us | 4,199.09 us | 230.17 us | 1.46 |
SSEDecodingRealDataUTF8 | data/email/ | 343.1 us | 67.03 us | 3.67 us | 5.76 |
SSEDecodingRealDataWithAllocUTF8 | data/email/ | 484.0 us | 451.40 us | 24.74 us | 4.08 |
SSEDecodingRealDataUTF16 | data/email/ | 338.4 us | 32.81 us | 1.80 us | 5.84 |
SSEDecodingRealDataWithAllocUTF16 | data/email/ | 1,762.7 us | 311.41 us | 17.07 us | 1.12 |
DotnetRuntimeSIMDBase64RealDataUTF8 | data/dns/swedenzonebase.txt | 11,020.2 us | 1,918.12 us | 105.14 us | 3.19 |
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 | data/dns/swedenzonebase.txt | 11,794.3 us | 1,063.94 us | 58.32 us | 2.98 |
DotnetRuntimeBase64RealDataUTF16 | data/dns/swedenzonebase.txt | 47,170.9 us | 2,594.95 us | 142.24 us | .74 |
DotnetRuntimeSIMDBase64RealDataUTF8 | data/email/ | 443.0 us | 74.44 us | 4.08 us | 4.46 |
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 | data/email/ | 551.8 us | 843.09 us | 46.21 us | 3.58 |
DotnetRuntimeBase64RealDataUTF16 | data/email/ | 2,270.3 us | 777.69 us | 42.63 us | .87 |
src/Base64SSE.cs
Outdated
|
||
while (leftover < 4 && src < srcEnd) | ||
{ | ||
byte val = toBase64[(byte)*src]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a bug in simdutf (C++) here. So when you do toBase64[something], if the something is not ascii, then you are truncating content.
So you need to do something like
if(val > 64 || isNot8Byte(*src)) {
...
}
Where isNot8Byte
can be as simple as isNot8Byte(something) { return something & 0xff00 != 0; }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This applies everywhere toBase64
is used with char
inputs.
while (idx < 4 && src < srcEnd) | ||
{ | ||
char c = (char)*src; | ||
byte code = toBase64[c]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comments, you may need to check that c
is in [0,256)
.
I recommend you add as a test, the string |
This is very good work. |
@Nick-Nuon There are some conflicts with recent changes, but I expect that they are not too troublesome. |
Wasn't too bad , I think this PR should be good to go. |
Let us merge. We can revisit as needed. :-) |
Merged. |
This adds support for UTF16.
These are the new benchmarks:
NB: I ended up copy pasting most of what we had and slightly modifying it.
While I did play a bit with generics, they seem less flexible than C++ templates(they also do their checks at runtime). The code would have IMO have been more concise but also less readable so I'm not sure they are really worth it in this case.