Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PregSet machine env #177

Closed
wants to merge 19 commits into from
Closed

Conversation

KGrewal1
Copy link

As mentioned in bytecodealliance/wasmtime#8489 (comment) , use PRegSet instead of Vecs for storing the physical registers. I'm slightly unsure if I have done this correctly for the reg_traversal iterator however and am unsure how it's performance will change now its calling n iterator steps as opposed to indexing straight into an array

Copy link
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you for tackling this!

The big picture thing to note here is that a PRegSet already covers all register classes, so there's no need to have an array of three of them. That should simplify things in a variety of places in this PR.

In the places where we only want to examine the registers that are within a particular class, it would be handy to add a PRegSet::in_class method (I'm open to suggestions for alternative names) which returns a copy where only the bits from that class are set. I can think of several ways to write that but I don't know which will be easiest.

We should put some thought into whether we can somehow restructure the loop where you're calling nth, to avoid needing to do that. But if we can't, there are bit-twiddling tricks for quickly finding the nth set bit. See the section immediately above https://graphics.stanford.edu/~seander/bithacks.html#ParityNaive titled "Select the bit position (from the most-significant bit) with the given count (rank)". (I can't link to that section directly for some reason.) We could override the default implementation of Iterator::nth with something along those lines.

There are several other iterator operations we can provide faster implementations for here. Your implementation of n_regs would be useful for both Iterator::count and Iterator::size_hint, and with an appropriate implementation of size_hint we could also implement ExactSizeIterator and be able to call len() on these iterators. I would mildly prefer to use those iterator methods instead of introducing an n_regs method, if you don't mind doing that in this PR. We could also provide an efficient implementation for last if we want to.

A change which I've been meaning to make that would make some of this PR easier is that PRegSet could itself implement Iterator, instead of having a separate PRegSetIter type which is basically just a copy of PRegSet. Then you can avoid calling .into_iter() in a bunch of places. Since PRegSet is Copy, it's fine for its iterator to destructively mutate it; in places where the original value is still needed it's easy to just copy it.

I hope this helps with a broad strokes plan for this PR; I may have more detailed feedback once these changes are out of the way. Most of the above is not strictly necessary for this PR, I just think it would help, so feel free to push back if you disagree!

@lpereira
Copy link

lpereira commented May 14, 2024

To find the nth-set bit you can use things like trailing_zeros() and leading_zeros(), which will compile down to a single instruction in most targets. So, for instance, if you want to iterate over all the set bits in an 64-bit integer, you'd do something along the lines of:

let mut bitmap: u64 = ...;

while bitmap != 0 {
    let zeros = bitmap.trailing_zeros();
    // zeros is the index of the least-significant-bit set

    bitmap &= !(1<<zeros);
}

@jameysharp
Copy link
Contributor

jameysharp commented May 14, 2024

That's right, @lpereira, and that's the kind of code we have now:

regalloc2/src/lib.rs

Lines 259 to 271 in b13cda1

fn next(&mut self) -> Option<PReg> {
if self.bits[0] != 0 {
let index = self.bits[0].trailing_zeros();
self.bits[0] &= !(1u128 << index);
Some(PReg::from_index(index as usize))
} else if self.bits[1] != 0 {
let index = self.bits[1].trailing_zeros();
self.bits[1] &= !(1u128 << index);
Some(PReg::from_index(index as usize + 128))
} else {
None
}
}

However, that takes time linear in the number of set bits, and there are faster alternatives if you want the index of the nth set bit.

I'm still hoping to instead find that we can rewrite the loop where this PR is calling Iterator::nth to not need to do that at all, but I haven't thought it through carefully enough to see how to do that yet. Perhaps you'd like to take a look at the changes in this PR to src/ion/reg_traversal.rs and see if you have suggestions?

@KGrewal1
Copy link
Author

KGrewal1 commented May 14, 2024

Thanks for the feedback!

As the count_ones function does not modify the original, wouldn't it be suitable for len in ExactSizeIterator.

Regarding an in_class method, and having only PRegSet, as opposed to [PRegSet, 3], I'm unsure where the class data would then be stored in MachineEnv?

Have updated to implement Iterator and ExactSizeIterator directly on PRegSet. Still need to do more efficient implementations of nth and last

I believe last may be possible with leading_zeros but unsure if I need to zero out the iterator after so it will return None if called again

src/lib.rs Outdated Show resolved Hide resolved
@cfallin
Copy link
Member

cfallin commented May 14, 2024

I think it might be good to take a step back here: why are we considering bitsets to compact the set? What is the goal?

From the linked comment above, we'd like to have a statically-initialized MachineEnv if we can, and dynamically-allocated lists of registers preclude that. But as noted here, having the list of registers to iterate over during the inner allocation loop is fairly performance-critical. Ordinarily one would want to consider a bitset when storage density is more important than simple/fast iteration (EDIT: and random access).

But I think there's a better way: why not take a borrow to a 'static slice for the register list instead of a PRegSet? That allows us to have a MachineEnv that is fully static, but at the same time preserves the existing algorithms (and the simplicity of implementation).

Alternately, there is this suggestion I left on the original thread. I'm not sure if that got overlooked or lost -- but allocating the MachineEnv once when the backend is set up (normally once in the process for wasmtime compile or similar) is another design point that preserves most of the existing benefits but avoids the need to make MachineEnv statically-initializable (which might be a hurdle if we add more complex state in the future).

Either of these could work, and both I think are preferable to complex Hacker's Delight bit twiddling tricks and compromises in iteration speed...

@jameysharp
Copy link
Contributor

As the count_ones function does not modify the original, wouldn't it be suitable for len in ExactSizeIterator.

I'm not sure I understand this question, but I think your implementation of len is correct, if that helps.

Regarding an in_class method, and having only PRegSet, as opposed to [PRegSet, 3], I'm unsure where the class data would then be stored in MachineEnv?

PRegSet has space for three classes of 64 registers each. Class 0 (integer registers) occupies bits 0-63, class 1 (floating-point registers) occupies bits 64-127, and class 2 (vector registers) occupies bits 0-63 of the second u128. (We could also reduce the size of this type by using three u64 instead of two u128, but let's not do that right now.)

Does that help? I'm happy to answer more questions but I'm not yet sure which part is confusing.

I believe last may be possible with leading_zeros but unsure if I need to zero out the iterator after so it will return None if called again

Fortunately, Iterator::last takes ownership of and consumes self, so the compiler will stop anyone trying to look at it again afterward.

@KGrewal1
Copy link
Author

KGrewal1 commented May 14, 2024

Thanks: yes the confusion was regarding how PRegSet encoded the types of register, this clears it up:

I think as it is being used currently (ie where it is only ever iterating over one class at a time), wouldn't merging all three lists into one then filtering out the data we don't need by quite inefficient however?

Also understand @cfallin point about performance: is performance testing solely regalloc2 possible in sightglass or would it also need a forked wasmtime to work with?

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(r-'ing this PR until above points about why PRegSet at all are addressed; I don't think we should use it at all)

@KGrewal1 KGrewal1 marked this pull request as draft May 14, 2024 20:44
@jameysharp
Copy link
Contributor

There are two big-picture questions to examine here:

  1. How do we make accessing an appropriate MachineEnv fast when we need it? As @cfallin pointed out, this is the question which is relevant to refactor(cranelift-codegen): remove OnceLock wasmtime#8489.
  2. What representation of MachineEnv should we use to support the use cases we have for it inside regalloc2?

Regarding the first question, I agree with Chris: if we can construct the MachineEnv once up front, maybe in frontends like Wasmtime or rustc-codegen-clif, that would be ideal. I'm not sure that is feasible if the set of allocatable registers depends on the calling convention of individual functions, though. In fact I don't understand why the current backends don't depend on calling convention: why is it okay to always use SysV's preferred and non-preferred register sets on x64, even on Windows?

If that doesn't work, it's certainly also possible to require that the registers be in a slice with 'static lifetime. If that's too restrictive for some users of regalloc2, a Cow<'static, [PReg]> would allow choosing dynamically between using Vec<PReg> or &'static [PReg].

I think there is still an interesting conversation to have about the second question. I believe RegTraversalIter is the key point of contention here. (The conversion from MachineEnv to PRegSet is obviously easier if we use PRegSet, and was what originally motivated me to consider this, but it's also not performance critical since bytecodealliance/wasmtime#8457 landed.) Since I'm trying to learn how this is used today, I'm going to write down how I think this works first.

When RegTraversalIter::new is called, it constructs an iterator that returns registers in this order:

  • if fixed was provided, then that is the only register that will be returned; otherwise:
  • hint_reg, if provided
  • hint2_reg, if provided
  • all the preferred registers in the class, excluding the hint registers, but the list is rotated by offset
  • all the non-preferred registers in the class, excluding the hint registers, but the list is rotated by offset

Rotating a list by an offset is equivalent to splitting the list at that offset, and iterating over the second part before the first one. We certainly could duplicate that behavior efficiently with PRegSet: splitting a bitset into two parts at a given offset is easy enough, and then we can pull the next element from each part in constant time using only the iterator we already have. We can also remove the hint registers from the sets up front rather than having to check for them on each iteration.

I think this is both simpler and faster than the current implementation. With some optimization, RegTraversalIter shrinks from 64 bytes to something like 40 or 48 bytes, I think we need just a few instructions per call to next(), and LLVM could plausibly keep the whole state in registers. The specifics depend on questions like: "does hint_reg always need to be returned before hint2_reg, or is it okay if they get swapped?" and "how exactly do we want the set of registers to be split according to offset?" but in broad strokes, I think this could be worth doing.

@cfallin
Copy link
Member

cfallin commented May 14, 2024

Thanks for writing all this down, Jamey -- I agree with your assessment that actually a bitset could work here with an O(1) step, with the split-the-bitset-in-half-at-the-rotation-point approach.

I think my main point of confusion through all of this discussion -- prior to the above -- has been trying to understand the motivation; my initial sense was that this would be a loss both in performance (complex bit-twiddling for linear-time access of Nth element) and an increase in complexity, to pay for... constness that can be had other ways too? But given the above at least the linear-time concern is gone.

The question I'd ask of any new approach is twofold: does it increase clarity of implementation, and/or does it make a measurable reduction in runtime? (Either alone is IMHO good justification for a change. Conversely, an implementation that's about the same complexity and has no change in performance isn't worth the effort and risk to undergo.) My guess for the latter (based only on experience profiling this loop 2-3 years ago) is that the traversal iter implementation doesn't matter all that much; there's a ton of logic in the loop (the btree lookup and liverange collision checks) that takes the majority of the time. But it's possible that I'm wrong about that. And we'd want to see an implementation (with the new approach) to evaluate the former.

@KGrewal1
Copy link
Author

KGrewal1 commented May 14, 2024

If PRegSet is used for RegTraversalIter, where the out of order access occurs (and so is likely to be slower in this PR), what is the advantage of the keeping another representation in MachineEnv, as the use in ion/process in which it is iterated in order over, and liveranges in which only last is called should be quite efficient and relatively simple code (especially as these are only done a class at a time and so could be further sped up by using the fact that int registers are first 64 bits, floats second and vector third, which wasn't used in this PR).

Definitely think this has significantly broader implications than I'd intended however.

@jameysharp
Copy link
Contributor

I've had this on my to-do list to investigate since working on bytecodealliance/wasmtime#8457, because at a quick glance it didn't make sense to me to use arrays of PRegs… pretty much anywhere. A PRegSet is basically the same size as an empty Vec, and does no heap allocation. But I hadn't looked into the details.

Now that I better understand how MachineEnv is used in regalloc2, I'm still interested in doing this experiment and will leave it on my to-do list if nobody else tries it first, but there's definitely a bigger obligation than I expected to show a benefit from the change.

Definitely think this has significantly broader implications than I'd intended however.

Yeah, me too! If you would prefer to try out Chris' suggestions (or my variations) about other ways to remove the OnceLock in Cranelift, that would be great too.

If PRegSet is used for RegTraversalIter, where the out of order access occurs (and so is likely to be slower in this PR), what is the advantage of the keeping another representation in MachineEnv, as the use in ion/process in which it is iterated in order over, and liveranges in which only last is called should be quite efficient and relatively simple code (especially as these are only done a class at a time and so could be further sped up by using the fact that int registers are first 64 bits, floats second and vector third, which wasn't used in this PR).

I don't understand this question. I think we can make it fast and simple to use PRegSet (or a variation of the same idea) in RegTraversalIter, and if that's true, then I think it's reasonable to also use PRegSet in MachineEnv. And I think Chris said above that if making this change makes the code either significantly faster or easier to understand, then it's worth doing. Does that answer your question?

@KGrewal1
Copy link
Author

KGrewal1 commented May 15, 2024

if that's true, then I think it's reasonable to also use PRegSet in MachineEnv.

Thanks, yes this does answer my question

@KGrewal1
Copy link
Author

@jameysharp I think this is close to ready, still a bit unsure on the select n-th code from https://graphics.stanford.edu/~seander/bithacks.html , but I think it has somewhat simplified reg_traversal_iter code.

@KGrewal1 KGrewal1 marked this pull request as ready for review May 19, 2024 11:52
src/lib.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is coming together! Thanks for your patience waiting for me to give this a detailed review.

Personally, I think you've demonstrated that we can address @cfallin's concerns, but I won't speak for him. Chris, could you check whether you're still concerned about overall algorithmic complexity with this version?

I have a number of suggestions for simplifying this PR further which I think are important for keeping this code as maintainable as we can, but overall I'm in favor of these changes. Thank you for your efforts!

Comment on lines +21 to 23
hint_regs: u64,
is_fixed: bool,
fixed: Option<PReg>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the fixed-register case is mutually exclusive with all the other cases, we could put that register in the hint_regs, ensure all the other sets are 0, and drop the is_fixed and fixed fields.

Comment on lines -50 to +47
if hint_reg.is_none() {
hint_reg = hint2_reg;
hint2_reg = None;
if hint_reg != PReg::invalid() {
let mask = 1u64 << (hint_reg.bits & 0b0011_1111);
hint_mask |= mask;
}
let hints = [hint_reg, hint2_reg];

if hint2_reg != PReg::invalid() {
let mask = 1u64 << (hint2_reg.bits & 0b0011_1111);
hint_mask |= mask;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this so much better than having two separate Option<PReg> for the hint registers!

I was going to be a little concerned that this could change the order that the hint registers are returned in. However it turns out that at most one hint register is ever used at a time, and that's been true since regalloc2 0.0.1 in 2021. If you want to delete hint2_reg I think that would be worth doing, but it doesn't have to be in this PR.

I would like two changes here though:

  1. debug_assert_eq!(hint_reg.class(), reg_class); (and same for hint2_reg if you don't delete it)
  2. use hint_reg.hw_enc() to get the shift amount instead of accessing bits directly. (I considered suggesting a debug-assert that this is less than 64, but that is automatic, part of https://github.com/rust-lang/rfcs/blob/master/text/0560-integer-overflow.md.)

Comment on lines +82 to +86
// remove the hint registers from the bit vectors
let pref_regs_first = pref_regs_first & !hint_mask;
let pref_regs_second = pref_regs_second & !hint_mask;
let non_pref_regs_first = non_pref_regs_first & !hint_mask;
let non_pref_regs_second = non_pref_regs_second & !hint_mask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove the hint registers from pref_regs_by_class and non_pref_regs_by_class above, before splitting the sets, then you would only have to mask twice instead of four times.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd done it this way in order to be as close to the original split as possible: I may try they way Chris suggested, that doesn't depend on doing a bit trick to find the split point instead as it does seem 'simplest'

Comment on lines +115 to +151
if self.hint_regs != 0 {
let index = self.hint_regs.trailing_zeros() as u8;
self.hint_regs &= !(1u64 << index);
let reg_index = index as u8 | self.class_mask;
return Some(PReg::from(reg_index));
}

// iterate over the preferred register rotated by offset
// iterate over first half
if self.pref_regs_first != 0 {
let index = self.pref_regs_first.trailing_zeros() as u8;
self.pref_regs_first &= !(1u64 << index);
let reg_index = index as u8 | self.class_mask;
return Some(PReg::from(reg_index));
}
if self.hint_idx < 2 && self.hints[self.hint_idx].is_some() {
let h = self.hints[self.hint_idx];
self.hint_idx += 1;
return h;
// iterate over second half
if self.pref_regs_second != 0 {
let index = self.pref_regs_second.trailing_zeros() as u8;
self.pref_regs_second &= !(1u64 << index);
let reg_index = index as u8 | self.class_mask;
return Some(PReg::from(reg_index));
}
while self.pref_idx < self.env.preferred_regs_by_class[self.class].len() {
let arr = &self.env.preferred_regs_by_class[self.class][..];
let r = arr[wrap(self.pref_idx + self.offset_pref, arr.len())];
self.pref_idx += 1;
if Some(r) == self.hints[0] || Some(r) == self.hints[1] {
continue;
}
return Some(r);

// iterate over the nonpreferred register rotated by offset
// iterate over first half
if self.non_pref_regs_first != 0 {
let index = self.non_pref_regs_first.trailing_zeros() as u8;
self.non_pref_regs_first &= !(1u64 << index);
let reg_index = index as u8 | self.class_mask;
return Some(PReg::from(reg_index));
}
while self.non_pref_idx < self.env.non_preferred_regs_by_class[self.class].len() {
let arr = &self.env.non_preferred_regs_by_class[self.class][..];
let r = arr[wrap(self.non_pref_idx + self.offset_non_pref, arr.len())];
self.non_pref_idx += 1;
if Some(r) == self.hints[0] || Some(r) == self.hints[1] {
continue;
}
return Some(r);
// iterate over second half
if self.non_pref_regs_second != 0 {
let index = self.non_pref_regs_second.trailing_zeros() as u8;
self.non_pref_regs_second &= !(1u64 << index);
let reg_index = index as u8 | self.class_mask;
return Some(PReg::from(reg_index));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because all five of these if statements do the same thing, I would suggest defining an array of [u64; 5] in struct RegTraversalIter, and looping over its elements here.

If you do that, there are additional tricks you could use, but they may not be faster at this scale, so I think it would be good to stop there.

Comment on lines +118 to +119
let reg_index = index as u8 | self.class_mask;
return Some(PReg::from(reg_index));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not introduce a From-conversion from u8 to PReg. I would go ahead and store the enum RegClass and use the PReg::new constructor, I think. Conceptually I like the micro-optimization of only doing the left-shift once in RegTraversalIter::new(), but it only saves one instruction and makes the code harder to read and reason about, so I don't think it's worth it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I'd actually done this for the opposite reason: I found it easier to reason/think about what was occurring when I could think of it as a 2 bit number for the class and 6 bit number for the position in the class, though that also requires fully committing to that representation, as opposed to mentally going through the associated constants and casts between types (I'm not actually sure how much leeway there is to change the bitpacked representation as doing it this way definitely would concrete it as it is)

Comment on lines +163 to +167
impl From<u8> for PReg {
fn from(raw_index: u8) -> Self {
PReg { bits: raw_index }
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned it in a different spot but I'll just say it again here: I'd rather not have this From<u8> implementation for PReg.

Comment on lines +271 to +276
fn to_preg_class(&self, class: usize) -> PRegClass {
PRegClass {
class_mask: (class as u8) << 6,
regs: self.bits[class],
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that PRegClass is the minimum size needed to iterate over a single class. But that ends up meaning there's a lot of duplicated code between the single-class cases and the all-registers cases. Also, because of padding, it's 16 bytes while a PRegSet is 24 bytes, so it's not actually saving that much.

To keep this simpler, I would prefer that PRegSet::to_preg_class return another PRegSet, identical to self except that the other classes are 0. Then you can delete PRegClass.

Also, I would prefer to use .to_preg_class().last() or .to_preg_class().len() instead of the above last_in_class and len_class methods. It might be a few more instructions, but I actually suspect rustc/LLVM will inline those methods, constant-fold the zeroes, and generate the same code in the end anyway. So I'm much more interested in keeping the source code simple.

Comment on lines 246 to +248
self.bits[0] |= other.bits[0];
self.bits[1] |= other.bits[1];
self.bits[2] |= other.bits[2];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point there are enough elements in the bits array that I would prefer to loop over it, both here and in all the Iterator methods below. I expect LLVM to unroll the fixed-length loops and end up producing the same code in the end, so I'm more concerned about keeping the source code simple.

Comment on lines +408 to +411
const C: u64 = 0b00000000_11111111_00000000_11111111_00000000_11111111_00000000_11111111; // 0x00FF00FF
const D: u64 = 0b00001111_00001111_00001111_00001111_00001111_00001111_00001111_00001111; // 0xF0F0F0F0
const E: u64 = 0b00110011_00110011_00110011_00110011_00110011_00110011_00110011_00110011; // 0x33333333
const F: u64 = 0b01010101_01010101_01010101_01010101_01010101_01010101_01010101_01010101; // 0x55555555
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All ways of writing these constants are a bit magic, but I think I'd prefer a shorter form than a full 64-bit binary constant. Here's another form that is also magic and maybe not any better:

    const C: u64 = u64::MAX / (1 << 8 + 1);
    const D: u64 = u64::MAX / (1 << 4 + 1);
    const E: u64 = u64::MAX / (1 << 2 + 1);
    const F: u64 = u64::MAX / (1 << 1 + 1);

Probably the hex form is best here though I think.

@@ -1380,7 +1533,7 @@ pub struct MachineEnv {
///
/// If an explicit scratch register is provided in `scratch_by_class` then
/// it must not appear in this list.
pub preferred_regs_by_class: [Vec<PReg>; 3],
pub preferred_regs_by_class: PRegSet,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary, but I'm inclined to remove _by_class from the PRegSet field names in MachineEnv. That makes these fields a little inconsistent with scratch_by_class where I think the name still makes sense, but I think the inconsistency is okay.

@cfallin
Copy link
Member

cfallin commented May 30, 2024

I think this is going in the right direction for sure -- thanks @KGrewal1 for your continued work on this despite my earlier skepticism!

Seeing the full glory of find_nth, I do have to admit I find myself fairly concerned. It's very complex code (even if transcribed from a cited algorithm online), if we ship it we should definitely include in-depth tests for it, and it's quite a bit heavier-weight than I had imagined previously -- I count 79 ALU ops, and we do that twice per traversal setup, which happens once per liverange (potentially many times per original liverange). Allocator efficiency should be something like a few thousand instructions per liverange, so that's really nontrivial: I think it is warranted to do a full benchmarking to make sure it doesn't impact the common case of short functions.

However I think we can actually avoid it altogether, along with the modulo operator in the RegTraversalIterator setup, using one of several approaches.

The thought goes something like: I had originally added the "offset and wraparound" behavior to RegTraversalIter to ensure that various allocations try probing registers in an evenly-distributed way, rather than biasing toward the early elements in the list, to minimize contention in the allocation maps. The effect on allocation speed was pretty dramatic: if memory serves, something like an average of 13 BTree probes down to an average of 4 probes per LiveRange processing step. (This at the cost of "spreading out a bit and enjoying the room" in the register file, rather than neatly packing r0, r1, r2, ... tightly.)

But there's no reason we need to preserve the exact traversal order and distribution scheme; all that's needed is some distribution, and it need not even be perfectly spread.

  • The simplest one I can think of is: take offset mod 128 (i.e. offset & 127); that's the splitpoint. Effectively we're rotating the masks one position per instruction-offset rather than one set element per instruction-offset. If PReg indices in the set are fairly evenly distributed, this will have the same "load-spreading" effect.
  • However if indices are clustered (say, reg indices 0..24 are in the set, and 25..128 are not), then we have a bias in the probing distribution (in this example, 128-24 = 104 out of every 128 instructions will start at p0).
  • But we can push back against that expected clustering with a "segmented" splitting strategy: for example, take N = offset mod 32 or offset mod 16 and, in every 32- or 16-bit segment of the set, split at bit N. I think this would work something like (using u128s for clarity rather than two u64s):
let split = offset & 31;
let mask = (0x00000001_00000001_00000001_00000001_u128 << split) - 1;
let bits1 = bits & mask;
let bits2 = bits & !mask;

This still has a bias but not as bad; it works out well because the segment size of 32 elements is just a bit bigger than the 0..24 register range. In practice I suspect a segment size of 32 bits may work well enough.

We can evaluate this by checking the process_bundle_reg_probes_any stat in mainline vs. this change.

  • Generalizing a bit, we could do two or more such steps: split at some bit position within 32-bit segments, then again within 16-bit segments (then with ...). This is kind of like a bitonic sort network but running backward, to give some arbitrary subset. I can write this out if interested but I'd be curious how well the above does first.

  • Another thought that comes to mind: there's no reason we can't keep some state in the env, such that each time we do a traversal, we start at the next position. This means that we're doing the offsetting by "processing step number" rather than instruction offset but no matter. In effect we can get the starting position for traversal N+1 by taking the starting position for traversal N, taking the lowest-set-bit and removing it from first-half and putting it in second-half, and flipping them when the first-half becomes 0.

Anyway, let me know what you think! If I had to choose I think I'd be interested in seeing the last one implemented as it keeps the even distribution of probes perfectly while doing only a few clz/ctz ops per traversal step and per iter step.

@KGrewal1
Copy link
Author

Thanks for the feedback: just finished my finals, so will try do a PR to address them this weekend

@jameysharp
Copy link
Contributor

Thanks for writing all that up, Chris! I found it very helpful. I think in various places where you said "128" you probably meant "64", right? That's how many PRegs we have in each class right now. The mask expression also probably doesn't do what you intended but we can figure that out.

Thanks for the feedback: just finished my finals, so will try do a PR to address them this weekend

Congratulations, and I'd suggest taking some time to relax! I'll be pretty busy next week so I may not be able to respond that quickly anyway.

@cfallin
Copy link
Member

cfallin commented May 31, 2024

The mask expression also probably doesn't do what you intended but we can figure that out.

Err, yes, sorry, I think I meant ... - 0x00000001_00000001_00000001_00000001 rather than - 1 (basically I want a mask like 0b0000..1111..0000..1111..0000..1111..0000..1111 where the groups are the bottom halves of 32-bit lanes) EDIT: and, yes, 64 bits total rather than 128 per class, thanks!

I do very much think the last idea is the way to go though (keeping state, starting where we left off and rotating one place) both for simplicity and performance, so hopefully the above isn't actually needed :-)

@KGrewal1
Copy link
Author

KGrewal1 commented Jun 3, 2024

Really sorry but started an internship and don't think I'm going to be able to finish this in the next few months😢. All this work was done whilst I was a student, so it's available if anyone else wants to finish it/use it to build off.

@KGrewal1 KGrewal1 closed this Jul 26, 2024
@KGrewal1
Copy link
Author

This seems to be unnecessary given greater refactors in PRegSet now anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants