-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding new loop_parse_if_digits function #291
Conversation
We can eliminate one of the subtractions from within is_made_of_eight_digits_fast and parse_eight_digits_unrolled by instead moving it out into when the value is collected. Additionally, we can also save some overhead by not collecting the value twice for each iteration of the loop within loop_parse_if_eight_digits.
I prefer the tests to be run before merging this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will not merge pull requests that skip CI.
I encourage you to include a rationale as to why you think that compilers will generate faster code. Be mindful that if you are recording timings on purely synthetic benchmarks, without performance counters, you can easily derive the wrong conclusions. The bar is pretty high for us to accept an optimization in this library as it is quite mature. It is fairly easy to reorganize the code and get better results under on compiler for one type of synthetic data set. Please consider there is a significant cost for us to accepting such as change as it may introduce a new bug, which would be terrible. So we need hard evidence that the result is definitively faster. For benchmarking, I use the following framework... https://github.com/lemire/simple_fastfloat_benchmark It includes realistic files (derived from actual content). I tried testing your PR, but it won't build. We do very much encourage pull requests which might improve performance, but, as I wrote, the bar is high. I will be definitively doing my own analysis, on different hardware and different compilers. |
PR #293 added benchmarks to the project. I invite you to sync your fork and run these benchmarks (before and after), ideally with performance counters. |
Co-authored-by: Anders Dalvander <[email protected]>
Alright thank you very much I will do that. |
I included that. As I said, given that we are no longer subtracting 0x30303030303030 twice, here: fast_float/include/fast_float/ascii_number.h Line 143 in d4c573d
and here: fast_float/include/fast_float/ascii_number.h Line 124 in d4c573d
And instead moving the subtraction out to here: fast_float/include/fast_float/ascii_number.h Line 244 in d4c573d
As well as no longer collecting the value twice as is done here: fast_float/include/fast_float/ascii_number.h Lines 227 to 230 in d4c573d
But instead only collect it once here: fast_float/include/fast_float/ascii_number.h Line 244 in d4c573d
This should produce faster code, no? |
Your mental model is that you are reducing the number of instructions. If so, then the next step is to validate that it is true. If you run the benchmark I have just included in privileged mode (so that you have access to hardware performance counters), you should see the number of instructions retired. E.g., the output should look like this...
See the This depends crucially on the compiler and on the targeted system. Alternatively, you can examine the binary output and manually check that there are fewer instructions being executed. There are other elements to consider... but if you can show a substantial reduction in the number of retired instructions per input byte, then it is a very good start. (This is not the only way to optimize the code.) |
The current code achieved the following results using your framework, on an i7 8700k: CLANG/UBUNTU:New Code: # Using hardware counters
####
# reading /home/chris/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64) : 726.38 MB/s (+/- 12.3 %) 41.74 Mfloat/s 21.22 i/B 387.24 i/f (+/- 0.0 %) 5.57 c/B 101.66 c/f (+/- 9.7 %) 3.81 i/c 60.44 b/f 11.78 bm/f 4.24 GHz
fastfloat (32) : 755.59 MB/s (+/- 7.7 %) 43.42 Mfloat/s 20.59 i/B 375.78 i/f (+/- 0.0 %) 5.38 c/B 98.10 c/f (+/- 5.5 %) 3.83 i/c 57.82 b/f 10.50 bm/f 4.26 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64) : 1429.88 MB/s (+/- 4.7 %) 41.09 Mfloat/s 10.29 i/B 375.51 i/f (+/- 0.0 %) 2.83 c/B 103.21 c/f (+/- 3.8 %) 3.64 i/c 59.46 b/f 10.94 bm/f 4.24 GHz
fastfloat (32) : 1500.15 MB/s (+/- 6.9 %) 43.10 Mfloat/s 9.98 i/B 364.06 i/f (+/- 0.0 %) 2.70 c/B 98.36 c/f (+/- 3.1 %) 3.70 i/c 56.84 b/f 10.02 bm/f 4.24 GHz
####
# reading /home/chris/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64) : 471.22 MB/s (+/- 3.5 %) 64.19 Mfloat/s 29.92 i/B 230.27 i/f (+/- 0.0 %) 8.39 c/B 64.56 c/f (+/- 2.0 %) 3.57 i/c 41.87 b/f 9.35 bm/f 4.14 GHz
fastfloat (32) : 450.11 MB/s (+/- 3.4 %) 61.32 Mfloat/s 32.52 i/B 250.30 i/f (+/- 0.0 %) 8.96 c/B 68.94 c/f (+/- 1.6 %) 3.63 i/c 42.50 b/f 9.99 bm/f 4.23 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64) : 990.81 MB/s (+/- 8.2 %) 67.49 Mfloat/s 14.95 i/B 230.09 i/f (+/- 0.0 %) 4.10 c/B 63.18 c/f (+/- 4.8 %) 3.64 i/c 41.50 b/f 9.73 bm/f 4.26 GHz
fastfloat (32) : 927.38 MB/s (+/- 28.7 %) 63.17 Mfloat/s 16.25 i/B 250.11 i/f (+/- 0.0 %) 4.39 c/B 67.51 c/f (+/- 24.2 %) 3.71 i/c 42.12 b/f 10.73 bm/f 4.26 GHz Old Code: # Using hardware counters
####
# reading /home/chris/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64) : 693.12 MB/s (+/- 3.9 %) 39.83 Mfloat/s 20.52 i/B 374.37 i/f (+/- 0.0 %) 5.84 c/B 106.61 c/f (+/- 2.3 %) 3.51 i/c 60.44 b/f 10.99 bm/f 4.25 GHz
fastfloat (32) : 755.23 MB/s (+/- 5.2 %) 43.40 Mfloat/s 19.89 i/B 362.92 i/f (+/- 0.0 %) 5.37 c/B 97.99 c/f (+/- 3.5 %) 3.70 i/c 57.82 b/f 9.96 bm/f 4.25 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64) : 1231.28 MB/s (+/- 8.4 %) 35.38 Mfloat/s 10.29 i/B 375.51 i/f (+/- 0.0 %) 3.30 c/B 120.44 c/f (+/- 5.8 %) 3.12 i/c 59.46 b/f 11.15 bm/f 4.26 GHz
fastfloat (32) : 1271.75 MB/s (+/- 7.5 %) 36.54 Mfloat/s 9.98 i/B 364.06 i/f (+/- 0.0 %) 3.19 c/B 116.48 c/f (+/- 4.0 %) 3.13 i/c 56.84 b/f 10.07 bm/f 4.26 GHz
####
# reading /home/chris/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64) : 489.71 MB/s (+/- 9.8 %) 66.71 Mfloat/s 29.15 i/B 224.37 i/f (+/- 0.0 %) 8.11 c/B 62.43 c/f (+/- 8.0 %) 3.59 i/c 41.87 b/f 9.49 bm/f 4.16 GHz
fastfloat (32) : 450.12 MB/s (+/- 33.4 %) 61.32 Mfloat/s 31.75 i/B 244.40 i/f (+/- 0.0 %) 8.82 c/B 67.92 c/f (+/- 32.1 %) 3.60 i/c 42.50 b/f 10.40 bm/f 4.16 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64) : 944.14 MB/s (+/- 19.5 %) 64.31 Mfloat/s 14.95 i/B 230.09 i/f (+/- 0.0 %) 4.30 c/B 66.26 c/f (+/- 15.2 %) 3.47 i/c 41.50 b/f 9.88 bm/f 4.26 GHz
fastfloat (32) : 842.46 MB/s (+/- 10.7 %) 57.38 Mfloat/s 16.25 i/B 250.11 i/f (+/- 0.0 %) 4.78 c/B 73.53 c/f (+/- 6.3 %) 3.40 i/c 42.12 b/f 10.07 bm/f 4.22 GHz MSVC/WINDOWS:New Code: ####
# reading C:/Users/Chris/source/repos/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64) : 366.17 MB/s (+/- 11.0 %) 21.04 Mfloat/s 47.52 ns/f
fastfloat (32) : 377.26 MB/s (+/- 8.8 %) 21.68 Mfloat/s 46.13 ns/f
UTF-16 volume = 3.86749 MB
fastfloat (64) : 708.95 MB/s (+/- 13.4 %) 20.37 Mfloat/s 49.09 ns/f
fastfloat (32) : 698.66 MB/s (+/- 15.9 %) 20.07 Mfloat/s 49.81 ns/f
####
# reading C:/Users/Chris/source/repos/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64) : 235.39 MB/s (+/- 8.0 %) 32.07 Mfloat/s 31.19 ns/f
fastfloat (32) : 217.98 MB/s (+/- 11.0 %) 29.69 Mfloat/s 33.68 ns/f
UTF-16 volume = 1.07202 MB
fastfloat (64) : 475.50 MB/s (+/- 6.0 %) 32.39 Mfloat/s 30.88 ns/f
fastfloat (32) : 435.04 MB/s (+/- 7.0 %) 29.63 Mfloat/s 33.75 ns/f Old Code: ####
# reading C:/Users/Chris/source/repos/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64) : 366.25 MB/s (+/- 13.7 %) 21.05 Mfloat/s
fastfloat (32) : 362.43 MB/s (+/- 17.4 %) 20.83 Mfloat/s
UTF-16 volume = 3.86749 MB
fastfloat (64) : 662.16 MB/s (+/- 23.5 %) 19.03 Mfloat/s
fastfloat (32) : 661.80 MB/s (+/- 12.0 %) 19.02 Mfloat/s
####
# reading C:/Users/Chris/source/repos/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64) : 228.54 MB/s (+/- 32.9 %) 31.13 Mfloat/s
fastfloat (32) : 212.88 MB/s (+/- 13.1 %) 29.00 Mfloat/s
UTF-16 volume = 1.07202 MB
fastfloat (64) : 457.15 MB/s (+/- 6.2 %) 31.14 Mfloat/s
fastfloat (32) : 437.25 MB/s (+/- 4.6 %) 29.78 Mfloat/s GCC/UBUNTU:New Code: # Using hardware counters
####
# reading /home/chris/fast_float/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64) : 930.93 MB/s (+/- 5.7 %) 53.50 Mfloat/s 16.50 i/B 301.01 i/f (+/- 0.0 %) 4.44 c/B 80.92 c/f (+/- 2.0 %) 3.72 i/c 49.29 b/f 11.44 bm/f 4.33 GHz
fastfloat (32) : 934.23 MB/s (+/- 6.4 %) 53.69 Mfloat/s 16.49 i/B 300.93 i/f (+/- 0.0 %) 4.45 c/B 81.22 c/f (+/- 2.3 %) 3.71 i/c 47.80 b/f 10.11 bm/f 4.36 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64) : 1592.76 MB/s (+/- 5.2 %) 45.77 Mfloat/s 8.57 i/B 312.87 i/f (+/- 0.0 %) 2.55 c/B 92.93 c/f (+/- 3.6 %) 3.37 i/c 50.27 b/f 11.29 bm/f 4.25 GHz
fastfloat (32) : 1647.42 MB/s (+/- 8.2 %) 47.34 Mfloat/s 8.57 i/B 312.79 i/f (+/- 0.0 %) 2.46 c/B 89.89 c/f (+/- 4.5 %) 3.48 i/c 48.78 b/f 10.16 bm/f 4.25 GHz
####
# reading /home/chris/fast_float/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64) : 737.92 MB/s (+/- 7.5 %) 100.52 Mfloat/s 21.67 i/B 166.80 i/f (+/- 0.0 %) 5.51 c/B 42.44 c/f (+/- 2.2 %) 3.93 i/c 28.38 b/f 9.04 bm/f 4.27 GHz
fastfloat (32) : 585.43 MB/s (+/- 3.0 %) 79.75 Mfloat/s 24.04 i/B 185.08 i/f (+/- 0.0 %) 6.95 c/B 53.49 c/f (+/- 0.9 %) 3.46 i/c 30.34 b/f 9.70 bm/f 4.27 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64) : 1357.46 MB/s (+/- 6.4 %) 92.46 Mfloat/s 11.19 i/B 172.19 i/f (+/- 0.0 %) 3.00 c/B 46.14 c/f (+/- 3.6 %) 3.73 i/c 28.76 b/f 9.40 bm/f 4.27 GHz
fastfloat (32) : 1202.43 MB/s (+/- 2.7 %) 81.90 Mfloat/s 12.37 i/B 190.47 i/f (+/- 0.0 %) 3.38 c/B 52.08 c/f (+/- 1.7 %) 3.66 i/c 30.71 b/f 9.26 bm/f 4.27 GHz Old Code: # Using hardware counters
####
# reading /home/chris/test/Build/benchmarks/data/canada.txt
####
# read 111126 lines
ASCII volume = 1.93374 MB
fastfloat (64) : 885.60 MB/s (+/- 5.7 %) 50.89 Mfloat/s 16.29 i/B 297.15 i/f (+/- 0.0 %) 4.69 c/B 85.58 c/f (+/- 1.7 %) 3.47 i/c 50.27 b/f 11.15 bm/f 4.36 GHz
fastfloat (32) : 971.87 MB/s (+/- 8.2 %) 55.85 Mfloat/s 16.28 i/B 297.07 i/f (+/- 0.0 %) 4.28 c/B 78.07 c/f (+/- 5.1 %) 3.81 i/c 48.78 b/f 11.27 bm/f 4.36 GHz
UTF-16 volume = 3.86749 MB
fastfloat (64) : 1565.78 MB/s (+/- 3.9 %) 44.99 Mfloat/s 8.49 i/B 309.89 i/f (+/- 0.0 %) 2.60 c/B 94.75 c/f (+/- 2.9 %) 3.27 i/c 50.27 b/f 11.12 bm/f 4.26 GHz
fastfloat (32) : 1629.05 MB/s (+/- 4.3 %) 46.81 Mfloat/s 8.49 i/B 309.81 i/f (+/- 0.0 %) 2.50 c/B 91.08 c/f (+/- 3.3 %) 3.40 i/c 48.78 b/f 10.01 bm/f 4.26 GHz
####
# reading /home/chris/test/Build/benchmarks/data/mesh.txt
####
# read 73019 lines
ASCII volume = 0.536009 MB
fastfloat (64) : 680.55 MB/s (+/- 4.7 %) 92.71 Mfloat/s 22.24 i/B 171.22 i/f (+/- 0.0 %) 5.98 c/B 46.01 c/f (+/- 2.3 %) 3.72 i/c 28.76 b/f 9.34 bm/f 4.27 GHz
fastfloat (32) : 620.27 MB/s (+/- 3.3 %) 84.50 Mfloat/s 24.62 i/B 189.49 i/f (+/- 0.0 %) 6.56 c/B 50.49 c/f (+/- 2.3 %) 3.75 i/c 30.71 b/f 9.29 bm/f 4.27 GHz
UTF-16 volume = 1.07202 MB
fastfloat (64) : 1241.73 MB/s (+/- 7.4 %) 84.58 Mfloat/s 11.43 i/B 175.95 i/f (+/- 0.0 %) 3.28 c/B 50.44 c/f (+/- 5.2 %) 3.49 i/c 28.76 b/f 9.35 bm/f 4.27 GHz
fastfloat (32) : 1120.19 MB/s (+/- 2.8 %) 76.30 Mfloat/s 12.62 i/B 194.23 i/f (+/- 0.0 %) 3.62 c/B 55.72 c/f (+/- 2.2 %) 3.49 i/c 30.71 b/f 9.49 bm/f 4.25 GHz It would seem that it's not exactly less instructions, although it seems to run faster, generally. |
Thank you. Don't worry about linting further. We can do that later. I will need to review this manually. |
include/fast_float/ascii_number.h
Outdated
@@ -32,7 +32,7 @@ template <typename UC> fastfloat_really_inline constexpr bool has_simd_opt() { | |||
// able to optimize it well. | |||
template <typename UC> | |||
fastfloat_really_inline constexpr bool is_integer(UC c) noexcept { | |||
return !(c > UC('9') || c < UC('0')); | |||
return static_cast<uint8_t>(c - UC('0')) < 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will overflow when sizeof(UC) > sizeof(uint8_t)
resulting in false positives for:
wchar_t ch = '0';
ch += 256;
is_integer(ch); // incorrectly true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will overflow when
sizeof(UC) > sizeof(uint8_t)
resulting in false positives for:wchar_t ch = '0'; ch += 256; is_integer(ch); // incorrectly true
How does it look now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks better, but conditional_t require c++14.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright it should be good now.
This should avoid using unnecessarily sized unsigned types, while also avoiding overflow.
Running tests. I will review manually a bit later this week. |
Don't worry about linting at this stage. |
Linux systemIce Lake process with GCC 11 canada.txtThis PR:
Base (main):
mesh.txtThis PR:
Base (main):
|
At this point, I am unconvinced that this pull request will lead to measurable gains. My internal testing does not suggest that there are gains. |
I'm guessing you've run the tests on systems representative of the population using this library? If so I will close it then, thanks for your time. |
@RealTimeChris I am not actually sure. I am just saying that I am unconvinced. We need more evidence. |
I'm curious what you mean exactly when you say "gains". |
@RealTimeChris Ideally, we want solid evidence that the new code is never slower, and evidence that it is sometimes measurably faster. When in doubt, it is best to keep the existing code. As you can see, in my own test, this PR gave poorer results (lower speed). |
"Never slower and sometimes measurably faster" is a fantastic ethos. Alright thanks. |
We can eliminate one of the subtractions from within is_made_of_eight_digits_fast and parse_eight_digits_unrolled by instead moving it out into when the value is collected. Additionally, we can also save some overhead by not collecting the value twice for each iteration of the loop within loop_parse_if_eight_digits.
The new function results in the following benchmark scores:
https://pastebin.com/raw/EQJvPF5G
with the following code:
https://github.com/RealTimeChris/BenchmarkSuite/blob/loop_parse_if_digits/StringComparison/main.cpp