-
Notifications
You must be signed in to change notification settings - Fork 144
Avoid per-byte loop in cstring{,Utf8} builders #569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
96880aa
to
266d6da
Compare
The emulated CI build failures are spurious/systemic, not related to the PR. If I add a couple of new benchmarks that use somewhat longer string literals in builders: --- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
, benchB' "UTF-8 String" () $ \() -> P.cstringUtf8 "hello world\0"#
, benchB' "String (naive)" "hello world!" fromString
, benchB' "String" () $ \() -> P.cstring "hello world!"#
+ , benchB' "AsciiLit64" () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+ , benchB' "Utf8Lit64" () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
]
, bgroup "Encoding wrappers" The relevant benchmark results (GHC 9.4.5) are:
The baseline master branch run was:
|
Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0. I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass. Your I will take a closer look later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The branching logic can potentially be simplified some. Currently we ask:
- Are we done?
- Is there a null to decode?
- Is the output buffer full?
- Are there any non-nulls to copy?
But we can also ask only:
- Is there a null to decode? (If we are done, the answer will be no.)
- Does the decoded string up to and including that null to decode fit in the output buffer? (If not, copy as much as possible and report a full buffer.)
That would mean we perform extra zero-length memcpy
s in some cases, particularly when there are consecutive (encoded) nulls, so it's not a clear win a priori. But it may be worth investigating.
9086b60
to
e6cc4a2
Compare
nitpick: could |
Sure. Done. I do hope we won't forget to squash before merging... |
Data/ByteString/Builder.hs
Outdated
|
||
-- | Char8 encode a 'String'. | ||
{-# INLINE [1] string8 #-} -- phased to allow P.cstring rewrite | ||
{-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chessai , @Bodigrim , @clyring A question for the reviewers:
Why is the phase specified here equal to 1
? When I add tests to see whether string8
and stringUtf8
actually benefit from the RULES, I only get improvement when the phase is set to 0
:
--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -255,6 +255,10 @@ ascBuf, utfBuf :: Ptr Word8
ascBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
utfBuf = Ptr "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+ascStr, utfStr :: String
+ascStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+utfStr = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+
asclit, utflit :: Ptr Word8 -> Builder
asclit str@(Ptr addr) = BI.ascLiteralCopy str (byteCountLiteral addr)
utflit str@(Ptr addr) = BI.modUtf8LitCopy str (byteCountLiteral addr)
@@ -273,6 +277,8 @@ main = do
, benchB' "String" () $ \() -> asclit (Ptr "hello world!"#)
, benchB' "AsciiLit" () $ \() -> asclit ascBuf
, benchB' "Utf8Lit" () $ \() -> utflit utfBuf
+ , benchB' "strLit" () $ \() -> string8 ascStr
+ , benchB' "utfLit" () $ \() -> stringUtf8 utfStr
]
, bgroup "Encoding wrappers"
With the phase set to 1
as a baseline, testing with phase 0
the bench
report is:
All
Data.ByteString.Builder
Small payload
AsciiLit: OK (2.64s)
243 ns ± 8.5 ns, same as baseline
Utf8Lit: OK (1.45s)
286 ns ± 15 ns, same as baseline
strLit: OK (1.23s)
243 ns ± 19 ns, 51% less than baseline
utfLit: OK (1.38s)
279 ns ± 15 ns, 44% less than baseline
This is with GHC 9.4.5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing with GHC 8.10 the phase 1 -> phase 0 delta is:
All
Data.ByteString.Builder
Small payload
AsciiLit: OK (1.00s)
179 ns ± 9.5 ns, 47% less than baseline
Utf8Lit: OK (0.53s)
198 ns ± 15 ns, same as baseline
strLit: OK (0.49s)
187 ns ± 19 ns, 51% less than baseline
utfLit: OK (0.52s)
199 ns ± 20 ns, same as baseline
With GHC 9.2:
All
Data.ByteString.Builder
Small payload
AsciiLit: OK (1.36s)
236 ns ± 22 ns, same as baseline
Utf8Lit: OK (1.36s)
274 ns ± 16 ns, same as baseline
strLit: OK (1.20s)
237 ns ± 14 ns, 69% less than baseline
utfLit: OK (1.36s)
275 ns ± 15 ns, 64% less than baseline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. In principle these rules can fire in phase 2, and I do observe "Rule fired: string8/unpackFoldrCString# (Data.ByteString.Builder)" if I inline ascStr in your example and compile with ghc-9.4.4.
These should probably just be NOINLINE pragmas. primMapList{Fixed,Bounded} are themselves marked inline (to encourage good specialization with the particular BoundedPrim
used) and produce lots of code, but will not actually fuse with good list producers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In HEAD (master), with no changes other than phase [1] -> [0]:
All
Data.ByteString.Builder
Small payload
strLit: OK (1.90s)
688 ns ± 43 ns, 27% less than baseline
utfLit: OK (0.93s)
736 ns ± 62 ns, 31% less than baseline
Which seems to suggest that the original phase control was not helping, indeed simply removing the rules and inlining (GHC 9.2) gives:
All
Data.ByteString.Builder
Small payload
strLit: OK (1.14s)
790 ns ± 52 ns, 16% less than baseline
utfLit: OK (0.99s)
791 ns ± 66 ns, 26% less than baseline
All 2 tests passed (2.18s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the issue be that I'm giving string8
and stringUtf8
named constant strings, rather than inline string constants? That is, first inline the arguments at the call site, and only then inline string8
?
This looks rather fragile. Is there a downside to setting the phase number to 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the unpackCStringUtf8#
gets rewritten in terms of unpackFoldrCStringUtf8#
before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:
{-# INLINE [1] string8 #-} -- phased to allow literal cstring rewrites | |
{-# NOINLINE string8 #-} -- allow literal cstring rewrites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed fragile, and it doesn't even work with a non-ASCII string. (Looking at the core2core output, it looks like the
unpackCStringUtf8#
gets rewritten in terms ofunpackFoldrCStringUtf8#
before our rule fires...) Setting the phase number to 0 might help, too, but my suggestion was more extreme:
I am beginning to agree. And I don't think this impairs the efficiency of the non-literal input case, where the particular BoundedPrim is already available, and that's all that needs to be optimised, the input data does not have be seen for good code generation. And rewrite RULES fire without inlining the result, so this makes sense. I'll push a commit with NOINLINE
and the additional benchmark variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, there is no explicit rewrite rule for the UTF8 build
case, adding one doesn't seem to make a difference, I think that the build (unpackFoldr ...)
form gets rewritten back when no fusion happens, and then the existing rule fires?
+#if __GLASGOW_HASKELL__ >= 811
+{-# RULES
+"stringUtf8/unpackFoldrCStringUtf8#" forall s.
+ stringUtf8 (build (unpackFoldrCStringUtf8# s)) =
+ modUtf8LitCopy (Ptr s) (byteCountLiteral s)
+ #-}
+#endif
The above is harmless, and can be added, but does not appear to be necessary.
If there's anything further I need to do, please let me know... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been a bit sidetracked the last few weeks, sorry.
How is performance affected for strings consisting mostly of null characters? If this patch hurts it some, that's probably OK, but I'd like to know roughly by how much.
Data/ByteString/Builder/Internal.hs
Outdated
!op' = op0 `plusPtr` (nullFree + 1) | ||
nullAt' <- c_strstr ip' modifiedUtf8NUL | ||
modUtf8_step ip' len' nullAt' k (BufferRange op' ope) | ||
| avail > 0 = do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question, but also avail == 0
should be a very rare case.
@vdukhovni please rebase to trigger updated CI jobs. |
44fdcbc
to
0645428
Compare
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM module naming nitpicking!
@vdukhovni could you possibly address @clyring's questions?
Data/ByteString/Builder/Internal.hs
Outdated
-- | GHC represents @NUL@ in string literals via an overlong 2-byte encoding, | ||
-- which is part of "modified UTF-8" (GHC does not also implement CESU-8). | ||
modifiedUtf8NUL :: CString | ||
modifiedUtf8NUL = Ptr "\xc0\x80"# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modifiedUtf8NUL = Ptr "\xc0\x80"# | |
modUtf8NUL = Ptr "\xc0\x80"# |
Let's keep the prefix consistent.
ping @vdukhovni Do you plan to come back to this patch? Would you like to pass this off to a maintainer? |
It's basically ready, right. There were just some cosmetic issues that perhaps a maintainer could tweak to suite their preference and I can review the result? Does that work? |
0645428
to
01b5f36
Compare
Perhaps I can get this over the line. What remains to be done? |
Data/ByteString/Builder/Internal.hs
Outdated
-- available buffer space. If the string is long enough, we may have asked | ||
-- for less than its full length, filling the buffer with the rest will go | ||
-- into the next builder step. | ||
| avail > nullFree = do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please check with hpc
that tests provide sufficient coverage of all cases here? (Sorry, I'm AFK and cannot check myself)
This PR is languishing. Where do we go from here? |
Removing milestone for now. |
111456a
to
0a7d5c8
Compare
I've rebased this PR and significantly improved its performance. Please look again. The only possibly improvement (if worth it) is to rewrite The Haskell version is however likely not too far from the expected C performance. So no sure this warranted new "cbits". |
Sorry for leaving this hanging so long, @vdukhovni! I think that the last time I was working on this I was trying to get the null-encoding-correction work between calls to
We want native Haskell implementations anyway due to I will take a look at the changes you have pushed tomorrow. I'm not sure what's going on with the OpenBSD job. It superficially looks like tests are failing...
...but we pass |
@clyring I'm pretty sure OpenBSD failure is completely unrelated to this PR, it currently fails across all my projects which have a OpenBSD job. (I suspect the root partition is deliberately very small on those runners and we need a hack similar to haskellari/splitmix@1a7118d, but let's leave it for another day) |
Well, the delay gave me an opportunity to tackle it afresh and come up with a cleaner, more performant design. The core idea is to observe that a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to think more about the awkward middle ground list of 'things modUtf8Lit
attempts to translate' chosen by the current version of this patch.
Data/ByteString/Builder/Internal.hs
Outdated
if | ch /= 0xC0 -> do | ||
poke op ch | ||
let !cnt = ipe `minusPtr` ip' | ||
!runend <- S.memchr ip' 0xC0 (fromIntegral cnt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!runend <- S.memchr ip' 0xC0 (fromIntegral cnt) | |
!runend <- S.memchr ip' 0xC0 (fromIntegral @Int @CSize cnt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly CSize
is not in scope in this module, and I don't think making it available is worthwhile. I added the explicit @Int
, perhaps that's "progress".
3d72c68
to
b202ddb
Compare
I've fixed the issues with older GHC compatibility, CI passes. Perhaps close to done now... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
(I checked manually that this branch passes new tests from #714 if rebased) If there are no more comments / suggestions by October 12, I'll merge it as is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ASCII side of things is perfect.
On the UTF-8 side of things I am not 100% sold on the specification yet:
- The current version of this patch changes the observable behavior of
cstringUtf8
on (admittedly very sketchy) inputs with0xC0
not followed by0x80
.- This isn't necessarily a major problem: I'd be surprised if there is any actual breakage, and wouldn't be surprised if literally nobody directly uses the current
cstringUtf8
. But it might make the backporting and deprecation story a bit messier. - How much of a performance penalty is there for matching the old behavior exactly? And if we don't care to match that old behavior exactly, would always ignoring the next input byte after
0xC0
be a further performance improvement?
- This isn't necessarily a major problem: I'd be surprised if there is any actual breakage, and wouldn't be surprised if literally nobody directly uses the current
- The demand that
modUtf8LitCopy
be given a null-terminated buffer (for safety if0xC0
is the last byte) and its length not including that null terminator makes for a very weird interface!
testCStringUtf8 :: Int -> TestTree | ||
testCStringUtf8 sz = testProperty "cstringUtf8" $ | ||
BE.toLazyByteStringWith (BE.untrimmedStrategy sz sz) L.empty | ||
(BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0"#) == |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0"#) == | |
(BP.cstringUtf8 "hello\xc0\x80\xc0\x80\xd0\xbc\xd0\xb8\xd1\x80\xc0\x80\xC0\x80"#) == |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it is best to test that the code does not blow up with input that unexpectedly ends with just \xC0
before the raw (implicit) NUL terminator. So this is more of a robustness test, that then also encodes the implemented handling, than a promise to users that this is how that's handled.
Data/ByteString/Builder/Internal.hs
Outdated
|
||
, byteStringCopy | ||
, asciiLiteralCopy | ||
, modUtf8LitCopy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the earlier modUtf8LitCopy
-> modUtf8LiteralCopy
naming suggestion.
Copy chunks of the input to the output buffer with, up to the shorter of the available buffer space and the "null-free" portion of the remaining string. Actually "null-free" here means not containing any denormalised two-byte encodings starting with 0xC0 (so possibly also other ASCII bytes if the UTF-8 encoding is oddball). This substantially improves performance, with just one "15%" increase that looks like a spurious measurement error (perhaps code layout difference artefact). UTF-8 String (12B): OK 16.7 ns ± 1.3 ns, 60% less than baseline UTF-8 String (64B, one null): OK 22.6 ns ± 1.3 ns, 87% less than baseline UTF-8 String (64B, one null, no shared work): OK 30.1 ns ± 2.6 ns, 83% less than baseline UTF-8 String (64B, half nulls): OK 92.6 ns ± 5.3 ns, 49% less than baseline UTF-8 String (64B, all nulls): OK 76.3 ns ± 4.5 ns, 57% less than baseline UTF-8 String (64B, all nulls, no shared work): OK 82.3 ns ± 5.6 ns, 54% less than baseline ASCII String (12B): OK 6.50 ns ± 326 ps, 76% less than baseline ASCII String (64B): OK 8.03 ns ± 334 ps, 94% less than baseline AsciiLit: OK 8.02 ns ± 648 ps, 94% less than baseline Utf8Lit: OK 21.8 ns ± 1.3 ns, 88% less than baseline strLit: OK 8.90 ns ± 788 ps, 94% less than baseline stringUtf8: OK 22.4 ns ± 1.3 ns, 87% less than baseline strLitInline: OK 8.26 ns ± 676 ps, 94% less than baseline utf8LitInline: OK 23.2 ns ± 1.3 ns, 87% less than baseline foldMap byteStringInsert (10000): OK 46.0 μs ± 4.0 μs, 15% less than baseline --> lazyByteStringHex (10000): OK --> 4.74 μs ± 337 ns, 15% more than baseline foldMap integerDec (small) (10000): OK 205 μs ± 12 μs, 9% less than baseline char8 (10000): OK 2.58 μs ± 234 ns, 30% less than baseline foldMap (left-assoc) (10000): OK 73.2 μs ± 2.9 μs, 54% less than baseline foldMap (right-assoc) (10000): OK 43.0 μs ± 4.2 μs, 65% less than baseline foldMap [manually fused, left-assoc] (10000): OK 81.4 μs ± 5.3 μs, 48% less than baseline foldMap [manually fused, right-assoc] (10000): OK 47.3 μs ± 785 ns, 61% less than baseline
Thanks.
Such overlong encodings, of which What is perhaps a bit more bold is that I ignore the top two bits, because the input is an My take is that after The original code would have retained any
Only invalid UTF8 would result in possibly surprising behaviour, but not more surprising than the current code. This code implements the
It would definitely be noticeably costlier for inputs with many (overlong encoded) NULs. And what would you then do with other overlong inputs?
We can't "ignore" the next byte by producing no output, that's the byte that's encoding the NUL as
The requirement for a final (unencoded) NUL is a result of the input being just a raw The main idea is to no throw errors, this code will produce valid output for valid "modified UTF8", and something closely related to the input otherwise, never throwing any errors (just as before). |
Review feedback
b202ddb
to
833ed24
Compare
Avoid per-byte loop in cstring{,Utf8} builders
Copy chunks of the input to the output buffer with, up to the shorter
of the available buffer space and the "null-free" portion of the remaining
string. Actually "null-free" here means not containing any denormalised
two-byte encodings starting with 0xC0 (so possibly also other ASCII
bytes if the UTF-8 encoding is oddball).
This substantially improves performance, with just one "15%" increase
that looks like a spurious measurement error (perhaps code layout
difference artefact).