-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
copy_misaligned_words: avoid out-of-bounds accesses #799
base: master
Are you sure you want to change the base?
Conversation
ea4768d
to
469b91f
Compare
I've been meaning to ask about this, it sounds like a great idea to me. You can just add a new main.yml CI job, probably with these bits compiler-builtins/.github/workflows/main.yml Lines 107 to 115 in 571ce5f
The float tests can probably be skipped since that module has no unsafe (we might even be able to |
Even |
e4ef842
to
842fa94
Compare
dest_usize = dest_usize.wrapping_add(1); | ||
} | ||
|
||
// There's one more element left to go, and we can't use the loop for that as on the `src` side, | ||
// it is partially out-of-bounds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code previously seemed unaware that there can also be OOB accesses at the end of the range -- but of course that's fundamentally the same problem as at the beginning.
605b6ca
to
d8cf8a2
Compare
Miri is looking good on CI :) |
d8cf8a2
to
cd0efd6
Compare
cd0efd6
to
01c90a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some surface-level notes, I'll take a closer look at perf soonish
run: rustup update nightly --no-self-update && rustup default nightly | ||
shell: bash | ||
- run: rustup component add miri | ||
- run: cargo miri setup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does miri benefit much from caching?
- run: cargo miri setup | |
- run: cargo miri setup | |
- uses: Swatinem/rust-cache@v2 |
Also could you add Miri to the success
job at the bottom, so branch protection blocks on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly gets cached there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It saves the Cargo cache and the target directory. Probably doesn't hurt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caching the target dir seems pointless since we're using the latest nightly so it will be invalidated every night, but sure 🤷
130c8a0
to
6ee9f22
Compare
b3464f6
to
9330e7c
Compare
Unfortunately it looks like this comes close to doubling the total line and label counts of this routine https://godbolt.org/z/7WYa6e83n. I agree that the UB is worth fixing even at a performance hit, but I have to imagine this could be improved with massaging. @nbdd0121 I know it has been a long time since you worked on #405 but do you have any ideas on how to improve the codegen here without OOB access? (I haven't actually tested so it is possible visual asm heuristics don't accurately reflect runtime, but the end blocks are definitely larger) |
Yeah there's prefix and postfix handling now which of course adds some extra code and labels. The original code neglected to treat the last loop iteration differently which makes this not a fully fair comparison (at the very least, we should compare with a version that uses an atomic/volatile load for the last round, as that one can also be OOB). The code size could be reduced by using This compares the "original but with the final loop iteration unrolled" with the |
Technically last iteration of loop doesn't special handling if use use unordered atomic load for each loop iteration. The codegen shouldn't be massively different. Using byte copy doesn't necessarily mean worse performance as it at most (on each end) performs 3 additional byte copies. But it also removes data-dependent branches which is hard to predict. This can also merged with the out-most byte-copy computation. I guess benchmarking would be necessary. |
In my view, since this is almost certainly still faster than the code before #405, and that PR achieved its performance by having UB, this is still a win. But I'd also be curious what the numbers actually look like. If someone has access to an ARM-32 system and could benchmark this, that would be great. :) |
Fixes #559 for
memmove
/memcpy
: load the underaligned prefix and suffix incopy_*_misaligned_words
in up to 3 separate aligned loads (a 1-byte load, a 2-byte load, and for 64bit targets a 4-byte load), while only doing those loads that are actually inbounds. The hope is that the performance loss compared to a single aligned ptr-sized load is negligible.I confirmed that this now passes Miri (the second of these already worked before this PR):
I added a new test since the existing test had some slack space around the memory being copied, making all accesses accidentally inbounds (but Miri was still helpful to confirm everything is aligned). This test found a bug in my code, fixed in the second commit. :D
This also add those above commands to CI so hopefully this crate still stay green for Miri. :)