fix: unaligned memcpy #1004

Merged
greenhat merged 29 commits into next from i1003-fix-byte-memcpy
Mar 27, 2026
Conversation

@greenhat (Contributor) commented Mar 12, 2026

Close #1003

Summary

  • Factor out the counted-loop setup shared by memcpy/memset, and add regression coverage for aligned and unaligned byte copies/sets, zero-length operations, and unaligned u16/i16 memory accesses.
  • Tighten MASM memory emission for the byte-oriented memcpy fast paths so they convert byte pointers to element addresses only when the inputs are word-aligned.
  • Handle unaligned 16-bit loads and stores that cross a 32-bit element boundary by routing offset == 3 through the split-word intrinsics, while preserving the existing within-element path for offset <= 2.

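The offset routing in the last bullet can be modeled in a few lines of plain Rust (an illustrative sketch only — the helper names here are hypothetical; the real change emits MASM instructions):

```rust
/// Hypothetical model of the byte-address decomposition used by the
/// unaligned u16 path: memory is made of 32-bit elements, so a byte
/// address splits into an element address and a byte offset within it.
fn split_byte_addr(byte_addr: u32) -> (u32, u32) {
    (byte_addr / 4, byte_addr % 4)
}

/// A u16 occupies two bytes, so at offsets 0..=2 it fits inside a single
/// 32-bit element; only offset 3 spills into the next element and must
/// take the split-word path.
fn u16_crosses_element(offset: u32) -> bool {
    offset == 3
}

fn main() {
    assert_eq!(split_byte_addr(7), (1, 3)); // last byte of element 1
    assert!(u16_crosses_element(3));
    assert!(!u16_crosses_element(2)); // still within one element
}
```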
I suggest reviewing commit by commit, skipping the uninteresting ones (refactors, etc.).

@greenhat greenhat changed the title from "fix: byte-version of the memcpy (iteration guard)" to "fix: byte-version of the unaligned memcpy" Mar 12, 2026
@greenhat greenhat force-pushed the i1003-fix-byte-memcpy branch from 3f5b5d0 to 30b783a on March 17, 2026 05:02
@greenhat greenhat marked this pull request as ready for review March 18, 2026 14:03
@greenhat greenhat changed the title from "fix: byte-version of the unaligned memcpy" to "fix: unaligned memcpy" Mar 18, 2026
@greenhat greenhat requested a review from bitwalker March 18, 2026 14:04
@bitwalker (Collaborator) left a comment

I think this mostly looks fine, I left comments for my various nits, but nothing major sticks out.

# which is the offset of the first byte, in the 32-bit representation of that element.
#
# Stack transition: [addr, offset] -> [value]
pub proc load_u16
bitwalker (Collaborator):

Suggested change
pub proc load_u16
pub proc load_u16(addr: ptr<felt, addrspace(felt)>, offset: u8) -> u16

Let's make sure to add type signatures to all new procedures

greenhat (Contributor, author):

Thanks! Done in c180217

impl OpEmitter<'_> {
/// Emit the loop header for a counted `while.true` loop.
///
/// The caller provides the `dup` instruction needed to bring `count` to the top of the stack
bitwalker (Collaborator):

I think it would help to clarify a bit why this is up to the caller, and perhaps provide an example. IMO, it would make more sense to provide an offset value or something, rather than the instruction itself - but the rationale might be obvious from better documentation/examples here.

greenhat (Contributor, author):

Done in c180217


/// Emit the loop back-edge condition for a counted `while.true` loop.
///
/// The caller provides the `dup` instruction needed to bring `count` to the top of the stack
bitwalker (Collaborator):

Same here

greenhat (Contributor, author):

Done in c180217

self.emit_all(
    [
        masm::Instruction::U32DivModImm(16.into()),
        masm::Instruction::Assertz,
bitwalker (Collaborator):

We should probably consider using an instruction variant with an error message, so it is clear why this fails when it fails

greenhat (Contributor, author):

Awesome! Thank you! Done in c180217

///
/// This is used for branch bodies which operate on a known stack shape from the enclosing
/// emitter, but which do not need to synchronize typed operand-stack state back to it.
fn build_raw_block(
bitwalker (Collaborator):

Is there a distinction between a "raw" MASM block and a non-"raw" MASM block? I think I might just call this build_masm_block or something

greenhat (Contributor, author):

Done in c180217

[
    // Convert `src` to element address
    masm::Instruction::U32DivModImm(4.into()),
    masm::Instruction::Assertz,
bitwalker (Collaborator):

We should probably try and make these asserts produce useful errors

greenhat (Contributor, author):

Done in c180217

greenhat (Contributor, author):

I also went and added error messages to all emitted asserts at 16a7669

@greenhat (Contributor, author):

Please don't merge this PR before #995. It's a pain to rebase it afterwards.

@bitwalker (Collaborator):

This just needs a rebase and it can be merged

greenhat added 22 commits March 27, 2026 09:54
Fix the byte-addressed memory paths that cross a 32-bit element boundary.
This keeps the `memcpy`/`memset` fallback coverage added in this branch
working for short unaligned copies, including scalarized `u16` loads and
stores at byte offset 3.

Zero-length memory operations must be no-ops, but both loop headers
seeded `while.true` with `count >= 0`, which executes one iteration
when `count == 0`.

Switch the entry condition to a strict unsigned `count > 0` check and
add regressions for zero-count unaligned copy/set paths.
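The entry-condition bug described in this commit can be modeled in a few lines of Rust (an illustrative sketch only; the real fix changes the condition the emitter seeds `while.true` with):

```rust
/// Model of a counted loop whose body always runs while the entry
/// condition holds, then re-checks `count > 0` at the back edge.
/// With a non-strict `count >= 0` entry check, a zero-length operation
/// still executes one iteration; the strict `count > 0` check makes it
/// a no-op.
fn iterations(count: i64, strict_entry: bool) -> i64 {
    let mut n = 0;
    let mut c = count;
    let mut cond = if strict_entry { c > 0 } else { c >= 0 };
    while cond {
        n += 1;
        c -= 1;
        cond = c > 0; // back-edge condition is unchanged by the fix
    }
    n
}

fn main() {
    assert_eq!(iterations(0, false), 1); // bug: spurious iteration at count == 0
    assert_eq!(iterations(0, true), 0);  // fix: zero-length op is a no-op
    assert_eq!(iterations(3, true), 3);
}
```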

The unaligned `u16` regressions are asserting compiler memory layout,
so they should not depend on the host endianness.

Use `to_le_bytes()` in the expected byte construction to keep the tests
portable and aligned with the byte-addressable memory model.

`memset` and fallback `memcpy` were carrying separate copies of the
same counted `while.true` control flow, which makes fixes easy to miss
in one path.

Extract the shared loop header and back-edge emission so the counted
loop protocol is defined once and reused by both sites.

Only offset 3 spans two elements for a `u16` load/store. Route the
other unaligned offsets through the existing single-element logic so we
don't spuriously touch `addr + 1` at the end of memory.

Add regression cases for byte offsets 1 and 2 in the integration suite,
and add emitter-level tests that exercise unaligned `load_imm` and
`store_imm` for `u16` addresses.

Cover the aligned byte-copy fast path and a case where only `count` is
misaligned so the fast-path predicate is regression-tested as well.
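The fast-path predicate exercised by these tests can be sketched as follows (a hypothetical helper name, and the 4-byte element granularity is an assumption based on the element-address conversion described above — not the actual emitter code):

```rust
/// Hypothetical sketch of the byte-memcpy fast-path predicate: byte
/// pointers can be reinterpreted as 32-bit element addresses only when
/// `src`, `dst`, and `count` are all multiples of the element size.
/// (4-byte granularity assumed here for illustration.)
const ELEMENT_SIZE: u32 = 4;

fn can_use_element_copy(src: u32, dst: u32, count: u32) -> bool {
    src % ELEMENT_SIZE == 0 && dst % ELEMENT_SIZE == 0 && count % ELEMENT_SIZE == 0
}

fn main() {
    assert!(can_use_element_copy(8, 16, 12));
    // only `count` misaligned: must fall back to the byte-wise loop
    assert!(!can_use_element_copy(8, 16, 10));
}
```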
@greenhat greenhat force-pushed the i1003-fix-byte-memcpy branch from 16a7669 to 4a529fd on March 27, 2026 09:13
@greenhat greenhat merged commit f7578b1 into next Mar 27, 2026
15 checks passed
@greenhat greenhat deleted the i1003-fix-byte-memcpy branch March 27, 2026 09:42
Successfully merging this pull request may close these issues: Fix byte-version of memcpy