
copy_misaligned_words: avoid out-of-bounds accesses #799

Open
RalfJung wants to merge 5 commits into master from memmove-inbounds

Conversation

RalfJung (Member) commented Mar 18, 2025

Fixes #559 for memmove/memcpy: load the underaligned prefix and suffix in copy_*_misaligned_words in up to 3 separate aligned loads (a 1-byte load, a 2-byte load, and for 64bit targets a 4-byte load), while only doing those loads that are actually inbounds. The hope is that the performance loss compared to a single aligned ptr-sized load is negligible.
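
For illustration, here is a minimal sketch of what such a partial load can look like (a reconstruction, not the exact PR code; the PR names this helper `load_aligned_partial`, per the discussion below):

```rust
use core::mem::size_of;

const WORD_SIZE: usize = size_of::<usize>();

/// Sketch: read the first `load_sz < WORD_SIZE` in-bounds bytes of the
/// word at the word-aligned pointer `src`, using at most a 4-byte, a
/// 2-byte, and a 1-byte aligned load (the 4-byte branch is dead on
/// 32-bit targets). The remaining bytes of the result are unspecified.
unsafe fn load_aligned_partial(src: *const usize, load_sz: usize) -> usize {
    debug_assert!(load_sz < WORD_SIZE);
    let mut out = 0usize;
    let mut i = 0;
    macro_rules! load_chunk {
        ($ty:ty) => {
            if load_sz & size_of::<$ty>() != 0 {
                // In-bounds and aligned: `src` is word-aligned and `i`
                // stays a multiple of every remaining chunk size.
                let chunk = *src.cast::<u8>().add(i).cast::<$ty>();
                // Store the chunk at the same byte offset within `out`.
                *(&mut out as *mut usize).cast::<u8>().add(i).cast::<$ty>() = chunk;
                i += size_of::<$ty>();
            }
        };
    }
    // Largest chunk first, so the smaller loads that follow stay aligned.
    load_chunk!(u32);
    load_chunk!(u16);
    load_chunk!(u8);
    debug_assert!(i == load_sz);
    out
}
```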

I confirmed that this now passes Miri (the second of these already worked before this PR):

```sh
# target without mem-unaligned
MIRIFLAGS=-Zmiri-tree-borrows cargo miri test --features no-asm --target armv7-unknown-linux-gnueabihf -- align
# target with mem-unaligned
MIRIFLAGS=-Zmiri-tree-borrows cargo miri test --features no-asm --target x86_64-unknown-linux-gnu -- align
```

I added a new test since the existing test had some slack space around the memory being copied, making all accesses accidentally inbounds (but Miri was still helpful to confirm everything is aligned). This test found a bug in my code, fixed in the second commit. :D
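
The shape of such a test might look like this (a hedged sketch, not the actual test added in the PR; it assumes the crate's `memcpy` is callable from the test suite):

```rust
// Copy between exactly-sized allocations, so there is no slack: any
// word-sized access that strays past either end of either buffer is a
// genuine out-of-bounds access that Miri will flag.
#[test]
fn copy_no_slack() {
    const WORD_SIZE: usize = core::mem::size_of::<usize>();
    for n in 0..4 * WORD_SIZE {
        for offset in 0..WORD_SIZE {
            let src: Vec<u8> = (0..(offset + n) as u8).collect();
            let mut dst = vec![0u8; n];
            unsafe {
                compiler_builtins::mem::memcpy(dst.as_mut_ptr(), src.as_ptr().add(offset), n);
            }
            assert_eq!(&dst[..], &src[offset..offset + n]);
        }
    }
}
```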

This also adds the above commands to CI, so hopefully this crate will stay green under Miri. :)

tgross35 (Contributor)

> Is there some good place in the CI config to add this Miri check? Note that I am only running some of the tests (those with align in their name), as otherwise this will take ~forever; some tests have large iteration counts. We need Tree Borrows since the test suite has the as_ptr+as_mut_ptr pattern that is not compatible with Stacked Borrows.

I've been meaning to ask about this; it sounds like a great idea to me. You can just add a new main.yml CI job, probably with these bits:

```yaml
- uses: actions/checkout@v4
  with:
    submodules: true
- name: Install Rust (rustup)
  run: rustup update ${{ matrix.rust }} --no-self-update && rustup default ${{ matrix.rust }}
  shell: bash
- run: rustup target add ${{ matrix.target }}
- run: rustup component add llvm-tools-preview
- uses: Swatinem/rust-cache@v2
```
then put the rest in a script.

The float tests can probably be skipped since that module has no unsafe (we might even be able to forbid it) and it's probably quite slow to run.
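
(For reference, forbidding unsafe in a module is a one-line lint attribute, e.g. as an inner attribute at the top of that module:

```rust
#![forbid(unsafe_code)] // compile error if any unsafe code sneaks in
```
)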

RalfJung (Member, Author)

Even mem has very slow tests like this one. That's why I only ran the align tests. Though maybe it'd be worth reducing those constants in Miri so more tests can run. I don't want to go over the entire test suite though, that sounds like a lot of work. ;)
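
The usual pattern for shrinking those constants is gating on `cfg!(miri)`, which is set when tests run under the interpreter, e.g.:

```rust
// Illustrative only: expensive tests can scale themselves down when
// running under Miri, where every step is interpreted.
const ITERATIONS: usize = if cfg!(miri) { 16 } else { 1_000_000 };
```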

RalfJung force-pushed the memmove-inbounds branch 6 times, most recently from e4ef842 to 842fa94 (March 18, 2025, 21:51)
```rust
        dest_usize = dest_usize.wrapping_add(1);
    }

    // There's one more element left to go, and we can't use the loop for that as on the `src` side,
    // it is partially out-of-bounds.
```
RalfJung (Member, Author) commented on this diff:

The code previously seemed unaware that there can also be OOB accesses at the end of the range -- but of course that's fundamentally the same problem as at the beginning.
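
A sketch of the suffix counterpart (`load_aligned_end_partial` in the PR; again a reconstruction, not the exact code):

```rust
use core::mem::size_of;

const WORD_SIZE: usize = size_of::<usize>();

/// Sketch: act as if we loaded the word at the word-aligned `src`,
/// except that only the *last* `load_sz < WORD_SIZE` bytes are actually
/// read; the leading (out-of-bounds) bytes of the result are unspecified.
unsafe fn load_aligned_end_partial(src: *const usize, load_sz: usize) -> usize {
    debug_assert!(load_sz < WORD_SIZE);
    let mut out = 0usize;
    // Byte offset of the first in-bounds byte within the word.
    let mut i = WORD_SIZE - load_sz;
    macro_rules! load_chunk {
        ($ty:ty) => {
            if load_sz & size_of::<$ty>() != 0 {
                let chunk = *src.cast::<u8>().add(i).cast::<$ty>();
                *(&mut out as *mut usize).cast::<u8>().add(i).cast::<$ty>() = chunk;
                i += size_of::<$ty>();
            }
        };
    }
    // Smallest chunk first: each small load fixes up the alignment of
    // `i` for the next, larger one.
    load_chunk!(u8);
    load_chunk!(u16);
    load_chunk!(u32);
    debug_assert!(i == WORD_SIZE);
    out
}
```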

RalfJung force-pushed the memmove-inbounds branch 2 times, most recently from 605b6ca to d8cf8a2 (March 18, 2025, 22:17)
RalfJung (Member, Author)

Miri is looking good on CI :)

tgross35 (Contributor) left a review:

Some surface-level notes; I'll take a closer look at perf soonish.

```yaml
        run: rustup update nightly --no-self-update && rustup default nightly
        shell: bash
      - run: rustup component add miri
      - run: cargo miri setup
```
tgross35 (Contributor) commented:

Does Miri benefit much from caching?

Suggested change:

```diff
 - run: cargo miri setup
+- uses: Swatinem/rust-cache@v2
```

Also could you add Miri to the success job at the bottom, so branch protection blocks on it?

RalfJung (Member, Author) replied:

What exactly gets cached there?

tgross35 (Contributor) replied:

It saves the Cargo cache and the target directory. Probably doesn't hurt.

RalfJung (Member, Author) replied:

Caching the target dir seems pointless since we're using the latest nightly so it will be invalidated every night, but sure 🤷

RalfJung force-pushed the memmove-inbounds branch 4 times, most recently from 130c8a0 to 6ee9f22 (March 19, 2025, 07:29)
tgross35 (Contributor) commented Mar 19, 2025

Unfortunately, it looks like this comes close to doubling the total line and label counts of this routine: https://godbolt.org/z/7WYa6e83n. I agree that the UB is worth fixing even at a performance hit, but I have to imagine this could be improved with some massaging.

@nbdd0121 I know it has been a long time since you worked on #405 but do you have any ideas on how to improve the codegen here without OOB access?

(I haven't actually tested so it is possible visual asm heuristics don't accurately reflect runtime, but the end blocks are definitely larger)

RalfJung (Member, Author)

Yeah, there's prefix and postfix handling now, which of course adds some extra code and labels. The original code neglected to treat the last loop iteration differently, which makes this not a fully fair comparison (at the very least, we should compare against a version that uses an atomic/volatile load for the last round, as that one can also be OOB).

The code size could be reduced by using copy_forward_bytes instead of load_aligned_partial/load_aligned_end_partial. But I would expect that to be worse for performance...

This compares the "original but with the final loop iteration unrolled" with the copy_forward_bytes variant: https://godbolt.org/z/768YPGqGG. Still an increase, but "only" by 60%.
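
For reference, the byte-wise alternative is essentially a plain byte loop; a sketch of what `copy_forward_bytes` presumably looks like (not necessarily the crate's exact code):

```rust
/// Forward byte-by-byte copy. No word loads means no alignment tricks
/// and no OOB risk, at the cost of a few extra single-byte copies at
/// each end compared to the partial word loads.
unsafe fn copy_forward_bytes(mut dest: *mut u8, mut src: *const u8, n: usize) {
    let dest_end = dest.add(n);
    while dest < dest_end {
        *dest = *src;
        dest = dest.add(1);
        src = src.add(1);
    }
}
```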

nbdd0121 (Contributor)

Technically, the last iteration of the loop doesn't need special handling if you use an unordered atomic load for each loop iteration. The codegen shouldn't be massively different.

Using byte copies doesn't necessarily mean worse performance, as it performs at most 3 additional byte copies on each end. It also removes data-dependent branches, which are hard to predict. This could also be merged with the outermost byte-copy computation. I guess benchmarking would be necessary.
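
For context: rustc only exposes LLVM's `unordered` ordering through unstable intrinsics, so a stable sketch of such a per-word load would use `Relaxed` as the nearest stand-in:

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

// Sketch: load one word per loop iteration via an atomic load, so no
// iteration needs special-casing for tearing. Note this concerns the
// load's atomicity, not the out-of-bounds question this PR fixes.
unsafe fn load_word(src: *const usize) -> usize {
    (*src.cast::<AtomicUsize>()).load(Ordering::Relaxed)
}
```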

RalfJung (Member, Author)

In my view, since this is almost certainly still faster than the code before #405, and that PR achieved its performance by having UB, this is still a win.

But I'd also be curious what the numbers actually look like. If someone has access to an ARM-32 system and could benchmark this, that would be great. :)

Successfully merging this pull request may close this issue: "Several functions perform out-of-bounds memory accesses (which is UB)" (#559).