Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add_UTF16_support + fix/test array OutOfIndex error #25

Merged
merged 23 commits into from
Aug 19, 2024
Merged

Conversation

Nick-Nuon
Copy link
Collaborator

@Nick-Nuon Nick-Nuon commented Aug 16, 2024

This adds support for UTF16.

These are the new benchmarks:

Method FileName Mean Error StdDev Speed (GB/s)
SSEDecodingRealDataUTF8 data/dns/swedenzonebase.txt 20,036.6 us 3,980.39 us 218.18 us 1.75
SSEDecodingRealDataWithAllocUTF8 data/dns/swedenzonebase.txt 21,661.4 us 3,467.40 us 190.06 us 1.62
SSEDecodingRealDataUTF16 data/dns/swedenzonebase.txt 26,688.1 us 1,808.73 us 99.14 us 1.32
SSEDecodingRealDataWithAllocUTF16 data/dns/swedenzonebase.txt 29,825.3 us 5,466.67 us 299.65 us 1.18
SSEDecodingRealDataUTF8 data/email/ 331.2 us 59.03 us 3.24 us 5.97
SSEDecodingRealDataWithAllocUTF8 data/email/ 562.6 us 646.32 us 35.43 us 3.51
SSEDecodingRealDataUTF16 data/email/ 1,255.9 us 972.81 us 53.32 us 1.57
SSEDecodingRealDataWithAllocUTF16 data/email/ 1,866.5 us 825.29 us 45.24 us 1.06
DotnetRuntimeSIMDBase64RealDataUTF8 data/dns/swedenzonebase.txt 10,931.8 us 5,225.28 us 286.42 us 3.21
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 data/dns/swedenzonebase.txt 11,929.2 us 6,126.45 us 335.81 us 2.94
DotnetRuntimeBase64RealDataUTF16 data/dns/swedenzonebase.txt 56,194.0 us 14,758.67 us 808.97 us .62
DotnetRuntimeSIMDBase64RealDataUTF8 data/email/ 451.0 us 132.43 us 7.26 us 4.38
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 data/email/ 627.3 us 415.36 us 22.77 us 3.15
DotnetRuntimeBase64RealDataUTF16 data/email/ 2,267.6 us 962.06 us 52.73 us .87

NB: I ended up copy pasting most of what we had and slightly modifying it.

While I did play a bit with generics, they seem less flexible than C++ templates(they also do their checks at runtime). The code would have IMO have been more concise but also less readable so I'm not sure they are really worth it in this case.

@lemire
Copy link
Member

lemire commented Aug 16, 2024

Got to love these numbers. :-)

for (int i = 0; i < FileContent.Length; i++)
{
string s = FileContent[i];
char[] base64 = s.ToCharArray();
Copy link
Member

@lemire lemire Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect ToCharArray is allocating. I suggest trying...

        fixed (char* p = s)
        {
            // p now points to the first character of the string
        }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or possibly...

Span<char> span = s.AsSpan();

Copy link
Collaborator Author

@Nick-Nuon Nick-Nuon Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it:

The benchmarks show an improvement with using s.AsSpan() :

Eg Before:

Method FileName Mean Error StdDev Speed (GB/s)
SSEDecodingRealDataUTF8 data/dns/swedenzonebase.txt 13,676.2 us 4,527.54 us 248.17 us 2.57
SSEDecodingRealDataWithAllocUTF8 data/dns/swedenzonebase.txt 15,854.0 us 2,258.94 us 123.82 us 2.21
SSEDecodingRealDataUTF16 data/dns/swedenzonebase.txt 21,441.2 us 11,428.57 us 626.44 us 1.64
SSEDecodingRealDataWithAllocUTF16 data/dns/swedenzonebase.txt 23,101.2 us 1,199.81 us 65.77 us 1.52
SSEDecodingRealDataUTF8 data/email/ 303.5 us 94.11 us 5.16 us 6.51
SSEDecodingRealDataWithAllocUTF8 data/email/ 502.2 us 188.52 us 10.33 us 3.94
SSEDecodingRealDataUTF16 data/email/ 1,231.8 us 794.61 us 43.56 us 1.60
SSEDecodingRealDataWithAllocUTF16 data/email/ 1,807.2 us 398.33 us 21.83 us 1.09
DotnetRuntimeSIMDBase64RealDataUTF8 data/dns/swedenzonebase.txt 11,092.6 us 1,009.68 us 55.34 us 3.16
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 data/dns/swedenzonebase.txt 11,840.5 us 88.56 us 4.85 us 2.96
DotnetRuntimeBase64RealDataUTF16 data/dns/swedenzonebase.txt 47,221.1 us 9,269.31 us 508.08 us .74
DotnetRuntimeSIMDBase64RealDataUTF8 data/email/ 434.1 us 10.26 us 0.56 us 4.55
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 data/email/ 585.7 us 637.26 us 34.93 us 3.37
DotnetRuntimeBase64RealDataUTF16 data/email/ 2,210.4 us 619.72 us 33.97 us .89

After:

Method FileName Mean Error StdDev Speed (GB/s)
SSEDecodingRealDataUTF8 data/dns/swedenzonebase.txt 13,609.2 us 598.55 us 32.81 us 2.58
SSEDecodingRealDataWithAllocUTF8 data/dns/swedenzonebase.txt 16,344.4 us 11,092.34 us 608.01 us 2.15
SSEDecodingRealDataUTF16 data/dns/swedenzonebase.txt 14,942.3 us 869.03 us 47.63 us 2.35
SSEDecodingRealDataWithAllocUTF16 data/dns/swedenzonebase.txt 24,094.8 us 4,199.09 us 230.17 us 1.46
SSEDecodingRealDataUTF8 data/email/ 343.1 us 67.03 us 3.67 us 5.76
SSEDecodingRealDataWithAllocUTF8 data/email/ 484.0 us 451.40 us 24.74 us 4.08
SSEDecodingRealDataUTF16 data/email/ 338.4 us 32.81 us 1.80 us 5.84
SSEDecodingRealDataWithAllocUTF16 data/email/ 1,762.7 us 311.41 us 17.07 us 1.12
DotnetRuntimeSIMDBase64RealDataUTF8 data/dns/swedenzonebase.txt 11,020.2 us 1,918.12 us 105.14 us 3.19
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 data/dns/swedenzonebase.txt 11,794.3 us 1,063.94 us 58.32 us 2.98
DotnetRuntimeBase64RealDataUTF16 data/dns/swedenzonebase.txt 47,170.9 us 2,594.95 us 142.24 us .74
DotnetRuntimeSIMDBase64RealDataUTF8 data/email/ 443.0 us 74.44 us 4.08 us 4.46
DotnetRuntimeSIMDBase64RealDataWithAllocUTF8 data/email/ 551.8 us 843.09 us 46.21 us 3.58
DotnetRuntimeBase64RealDataUTF16 data/email/ 2,270.3 us 777.69 us 42.63 us .87

src/Base64SSE.cs Outdated

while (leftover < 4 && src < srcEnd)
{
byte val = toBase64[(byte)*src];
Copy link
Member

@lemire lemire Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a bug in simdutf (C++) here. So when you do toBase64[something], if the something is not ascii, then you are truncating content.

So you need to do something like

if(val > 64 || isNot8Byte(*src)) {
...
}

Where isNot8Byte can be as simple as isNot8Byte(something) { return something & 0xff00 != 0; }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies everywhere toBase64 is used with char inputs.

while (idx < 4 && src < srcEnd)
{
char c = (char)*src;
byte code = toBase64[c];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comments, you may need to check that c is in [0,256).

@lemire
Copy link
Member

lemire commented Aug 16, 2024

I recommend you add as a test, the string string s = "♡♡♡♡";. It should fail to decode.

@lemire
Copy link
Member

lemire commented Aug 16, 2024

This is very good work.

@lemire
Copy link
Member

lemire commented Aug 19, 2024

@Nick-Nuon There are some conflicts with recent changes, but I expect that they are not too troublesome.

@Nick-Nuon
Copy link
Collaborator Author

@Nick-Nuon There are some conflicts with recent changes, but I expect that they are not too troublesome.

Wasn't too bad , I think this PR should be good to go.

@Nick-Nuon Nick-Nuon changed the title Add_UTF16_support Add_UTF16_support + fix/test array OutOfIndex error Aug 19, 2024
@lemire
Copy link
Member

lemire commented Aug 19, 2024

Let us merge. We can revisit as needed. :-)

@lemire lemire merged commit 2c4ca94 into main Aug 19, 2024
1 check passed
@lemire
Copy link
Member

lemire commented Aug 19, 2024

Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants