-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PregSet machine env #177
PregSet machine env #177
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thank you for tackling this!
The big picture thing to note here is that a PRegSet
already covers all register classes, so there's no need to have an array of three of them. That should simplify things in a variety of places in this PR.
In the places where we only want to examine the registers that are within a particular class, it would be handy to add a PRegSet::in_class
method (I'm open to suggestions for alternative names) which returns a copy where only the bits from that class are set. I can think of several ways to write that but I don't know which will be easiest.
We should put some thought into whether we can somehow restructure the loop where you're calling nth
, to avoid needing to do that. But if we can't, there are bit-twiddling tricks for quickly finding the nth set bit. See the section immediately above https://graphics.stanford.edu/~seander/bithacks.html#ParityNaive titled "Select the bit position (from the most-significant bit) with the given count (rank)". (I can't link to that section directly for some reason.) We could override the default implementation of Iterator::nth
with something along those lines.
There are several other iterator operations we can provide faster implementations for here. Your implementation of n_regs
would be useful for both Iterator::count
and Iterator::size_hint
, and with an appropriate implementation of size_hint
we could also implement ExactSizeIterator
and be able to call len()
on these iterators. I would mildly prefer to use those iterator methods instead of introducing an n_regs
method, if you don't mind doing that in this PR. We could also provide an efficient implementation for last
if we want to.
A change which I've been meaning to make that would make some of this PR easier is that PRegSet
could itself implement Iterator
, instead of having a separate PRegSetIter
type which is basically just a copy of PRegSet
. Then you can avoid calling .into_iter()
in a bunch of places. Since PRegSet
is Copy
, it's fine for its iterator to destructively mutate it; in places where the original value is still needed it's easy to just copy it.
I hope this helps with a broad strokes plan for this PR; I may have more detailed feedback once these changes are out of the way. Most of the above is not strictly necessary for this PR, I just think it would help, so feel free to push back if you disagree!
To find the nth-set bit you can use things like let mut bitmap: u64 = ...;
while bitmap != 0 {
let zeros = bitmap.trailing_zeros();
// zeros is the index of the least-significant-bit set
bitmap &= !(1<<zeros);
} |
That's right, @lpereira, and that's the kind of code we have now: Lines 259 to 271 in b13cda1
However, that takes time linear in the number of set bits, and there are faster alternatives if you want the index of the nth set bit. I'm still hoping to instead find that we can rewrite the loop where this PR is calling |
Thanks for the feedback! As the Regarding an Have updated to implement I believe |
I think it might be good to take a step back here: why are we considering bitsets to compact the set? What is the goal? From the linked comment above, we'd like to have a statically-initialized But I think there's a better way: why not take a borrow to a Alternately, there is this suggestion I left on the original thread. I'm not sure if that got overlooked or lost -- but allocating the Either of these could work, and both I think are preferable to complex Hacker's Delight bit twiddling tricks and compromises in iteration speed... |
I'm not sure I understand this question, but I think your implementation of
Does that help? I'm happy to answer more questions but I'm not yet sure which part is confusing.
Fortunately, |
Thanks: yes the confusion was regarding how PRegSet encoded the types of register, this clears it up: I think as it is being used currently (ie where it is only ever iterating over one class at a time), wouldn't merging all three lists into one then filtering out the data we don't need by quite inefficient however? Also understand @cfallin point about performance: is performance testing solely regalloc2 possible in sightglass or would it also need a forked wasmtime to work with? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(r-'ing this PR until above points about why PRegSet
at all are addressed; I don't think we should use it at all)
There are two big-picture questions to examine here:
Regarding the first question, I agree with Chris: if we can construct the If that doesn't work, it's certainly also possible to require that the registers be in a slice with I think there is still an interesting conversation to have about the second question. I believe When
Rotating a list by an offset is equivalent to splitting the list at that offset, and iterating over the second part before the first one. We certainly could duplicate that behavior efficiently with I think this is both simpler and faster than the current implementation. With some optimization, |
Thanks for writing all this down, Jamey -- I agree with your assessment that actually a bitset could work here with an O(1) step, with the split-the-bitset-in-half-at-the-rotation-point approach. I think my main point of confusion through all of this discussion -- prior to the above -- has been trying to understand the motivation; my initial sense was that this would be a loss both in performance (complex bit-twiddling for linear-time access of Nth element) and an increase in complexity, to pay for... constness that can be had other ways too? But given the above at least the linear-time concern is gone. The question I'd ask of any new approach is twofold: does it increase clarity of implementation, and/or does it make a measurable reduction in runtime? (Either alone is IMHO good justification for a change. Conversely, an implementation that's about the same complexity and has no change in performance isn't worth the effort and risk to undergo.) My guess for the latter (based only on experience profiling this loop 2-3 years ago) is that the traversal iter implementation doesn't matter all that much; there's a ton of logic in the loop (the btree lookup and liverange collision checks) that takes the majority of the time. But it's possible that I'm wrong about that. And we'd want to see an implementation (with the new approach) to evaluate the former. |
If Definitely think this has significantly broader implications than I'd intended however. |
I've had this on my to-do list to investigate since working on bytecodealliance/wasmtime#8457, because at a quick glance it didn't make sense to me to use arrays of Now that I better understand how
Yeah, me too! If you would prefer to try out Chris' suggestions (or my variations) about other ways to remove the
I don't understand this question. I think we can make it fast and simple to use |
Thanks, yes this does answer my question |
@jameysharp I think this is close to ready, still a bit unsure on the select n-th code from https://graphics.stanford.edu/~seander/bithacks.html , but I think it has somewhat simplified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this is coming together! Thanks for your patience waiting for me to give this a detailed review.
Personally, I think you've demonstrated that we can address @cfallin's concerns, but I won't speak for him. Chris, could you check whether you're still concerned about overall algorithmic complexity with this version?
I have a number of suggestions for simplifying this PR further which I think are important for keeping this code as maintainable as we can, but overall I'm in favor of these changes. Thank you for your efforts!
hint_regs: u64, | ||
is_fixed: bool, | ||
fixed: Option<PReg>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the fixed-register case is mutually exclusive with all the other cases, we could put that register in the hint_regs
, ensure all the other sets are 0, and drop the is_fixed
and fixed
fields.
if hint_reg.is_none() { | ||
hint_reg = hint2_reg; | ||
hint2_reg = None; | ||
if hint_reg != PReg::invalid() { | ||
let mask = 1u64 << (hint_reg.bits & 0b0011_1111); | ||
hint_mask |= mask; | ||
} | ||
let hints = [hint_reg, hint2_reg]; | ||
|
||
if hint2_reg != PReg::invalid() { | ||
let mask = 1u64 << (hint2_reg.bits & 0b0011_1111); | ||
hint_mask |= mask; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this so much better than having two separate Option<PReg>
for the hint registers!
I was going to be a little concerned that this could change the order that the hint registers are returned in. However it turns out that at most one hint register is ever used at a time, and that's been true since regalloc2 0.0.1 in 2021. If you want to delete hint2_reg
I think that would be worth doing, but it doesn't have to be in this PR.
I would like two changes here though:
debug_assert_eq!(hint_reg.class(), reg_class);
(and same forhint2_reg
if you don't delete it)- use
hint_reg.hw_enc()
to get the shift amount instead of accessingbits
directly. (I considered suggesting a debug-assert that this is less than 64, but that is automatic, part of https://github.com/rust-lang/rfcs/blob/master/text/0560-integer-overflow.md.)
// remove the hint registers from the bit vectors | ||
let pref_regs_first = pref_regs_first & !hint_mask; | ||
let pref_regs_second = pref_regs_second & !hint_mask; | ||
let non_pref_regs_first = non_pref_regs_first & !hint_mask; | ||
let non_pref_regs_second = non_pref_regs_second & !hint_mask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you remove the hint registers from pref_regs_by_class
and non_pref_regs_by_class
above, before splitting the sets, then you would only have to mask twice instead of four times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd done it this way in order to be as close to the original split as possible: I may try they way Chris suggested, that doesn't depend on doing a bit trick to find the split point instead as it does seem 'simplest'
if self.hint_regs != 0 { | ||
let index = self.hint_regs.trailing_zeros() as u8; | ||
self.hint_regs &= !(1u64 << index); | ||
let reg_index = index as u8 | self.class_mask; | ||
return Some(PReg::from(reg_index)); | ||
} | ||
|
||
// iterate over the preferred register rotated by offset | ||
// iterate over first half | ||
if self.pref_regs_first != 0 { | ||
let index = self.pref_regs_first.trailing_zeros() as u8; | ||
self.pref_regs_first &= !(1u64 << index); | ||
let reg_index = index as u8 | self.class_mask; | ||
return Some(PReg::from(reg_index)); | ||
} | ||
if self.hint_idx < 2 && self.hints[self.hint_idx].is_some() { | ||
let h = self.hints[self.hint_idx]; | ||
self.hint_idx += 1; | ||
return h; | ||
// iterate over second half | ||
if self.pref_regs_second != 0 { | ||
let index = self.pref_regs_second.trailing_zeros() as u8; | ||
self.pref_regs_second &= !(1u64 << index); | ||
let reg_index = index as u8 | self.class_mask; | ||
return Some(PReg::from(reg_index)); | ||
} | ||
while self.pref_idx < self.env.preferred_regs_by_class[self.class].len() { | ||
let arr = &self.env.preferred_regs_by_class[self.class][..]; | ||
let r = arr[wrap(self.pref_idx + self.offset_pref, arr.len())]; | ||
self.pref_idx += 1; | ||
if Some(r) == self.hints[0] || Some(r) == self.hints[1] { | ||
continue; | ||
} | ||
return Some(r); | ||
|
||
// iterate over the nonpreferred register rotated by offset | ||
// iterate over first half | ||
if self.non_pref_regs_first != 0 { | ||
let index = self.non_pref_regs_first.trailing_zeros() as u8; | ||
self.non_pref_regs_first &= !(1u64 << index); | ||
let reg_index = index as u8 | self.class_mask; | ||
return Some(PReg::from(reg_index)); | ||
} | ||
while self.non_pref_idx < self.env.non_preferred_regs_by_class[self.class].len() { | ||
let arr = &self.env.non_preferred_regs_by_class[self.class][..]; | ||
let r = arr[wrap(self.non_pref_idx + self.offset_non_pref, arr.len())]; | ||
self.non_pref_idx += 1; | ||
if Some(r) == self.hints[0] || Some(r) == self.hints[1] { | ||
continue; | ||
} | ||
return Some(r); | ||
// iterate over second half | ||
if self.non_pref_regs_second != 0 { | ||
let index = self.non_pref_regs_second.trailing_zeros() as u8; | ||
self.non_pref_regs_second &= !(1u64 << index); | ||
let reg_index = index as u8 | self.class_mask; | ||
return Some(PReg::from(reg_index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because all five of these if statements do the same thing, I would suggest defining an array of [u64; 5]
in struct RegTraversalIter
, and looping over its elements here.
If you do that, there are additional tricks you could use, but they may not be faster at this scale, so I think it would be good to stop there.
let reg_index = index as u8 | self.class_mask; | ||
return Some(PReg::from(reg_index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not introduce a From
-conversion from u8
to PReg
. I would go ahead and store the enum RegClass
and use the PReg::new
constructor, I think. Conceptually I like the micro-optimization of only doing the left-shift once in RegTraversalIter::new()
, but it only saves one instruction and makes the code harder to read and reason about, so I don't think it's worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I'd actually done this for the opposite reason: I found it easier to reason/think about what was occurring when I could think of it as a 2 bit number for the class and 6 bit number for the position in the class, though that also requires fully committing to that representation, as opposed to mentally going through the associated constants and casts between types (I'm not actually sure how much leeway there is to change the bitpacked representation as doing it this way definitely would concrete it as it is)
impl From<u8> for PReg { | ||
fn from(raw_index: u8) -> Self { | ||
PReg { bits: raw_index } | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned it in a different spot but I'll just say it again here: I'd rather not have this From<u8>
implementation for PReg
.
fn to_preg_class(&self, class: usize) -> PRegClass { | ||
PRegClass { | ||
class_mask: (class as u8) << 6, | ||
regs: self.bits[class], | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that PRegClass
is the minimum size needed to iterate over a single class. But that ends up meaning there's a lot of duplicated code between the single-class cases and the all-registers cases. Also, because of padding, it's 16 bytes while a PRegSet
is 24 bytes, so it's not actually saving that much.
To keep this simpler, I would prefer that PRegSet::to_preg_class
return another PRegSet
, identical to self
except that the other classes are 0. Then you can delete PRegClass
.
Also, I would prefer to use .to_preg_class().last()
or .to_preg_class().len()
instead of the above last_in_class
and len_class
methods. It might be a few more instructions, but I actually suspect rustc/LLVM will inline those methods, constant-fold the zeroes, and generate the same code in the end anyway. So I'm much more interested in keeping the source code simple.
self.bits[0] |= other.bits[0]; | ||
self.bits[1] |= other.bits[1]; | ||
self.bits[2] |= other.bits[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point there are enough elements in the bits
array that I would prefer to loop over it, both here and in all the Iterator
methods below. I expect LLVM to unroll the fixed-length loops and end up producing the same code in the end, so I'm more concerned about keeping the source code simple.
const C: u64 = 0b00000000_11111111_00000000_11111111_00000000_11111111_00000000_11111111; // 0x00FF00FF | ||
const D: u64 = 0b00001111_00001111_00001111_00001111_00001111_00001111_00001111_00001111; // 0xF0F0F0F0 | ||
const E: u64 = 0b00110011_00110011_00110011_00110011_00110011_00110011_00110011_00110011; // 0x33333333 | ||
const F: u64 = 0b01010101_01010101_01010101_01010101_01010101_01010101_01010101_01010101; // 0x55555555 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All ways of writing these constants are a bit magic, but I think I'd prefer a shorter form than a full 64-bit binary constant. Here's another form that is also magic and maybe not any better:
const C: u64 = u64::MAX / (1 << 8 + 1);
const D: u64 = u64::MAX / (1 << 4 + 1);
const E: u64 = u64::MAX / (1 << 2 + 1);
const F: u64 = u64::MAX / (1 << 1 + 1);
Probably the hex form is best here though I think.
@@ -1380,7 +1533,7 @@ pub struct MachineEnv { | |||
/// | |||
/// If an explicit scratch register is provided in `scratch_by_class` then | |||
/// it must not appear in this list. | |||
pub preferred_regs_by_class: [Vec<PReg>; 3], | |||
pub preferred_regs_by_class: PRegSet, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not necessary, but I'm inclined to remove _by_class
from the PRegSet
field names in MachineEnv
. That makes these fields a little inconsistent with scratch_by_class
where I think the name still makes sense, but I think the inconsistency is okay.
I think this is going in the right direction for sure -- thanks @KGrewal1 for your continued work on this despite my earlier skepticism! Seeing the full glory of However I think we can actually avoid it altogether, along with the modulo operator in the The thought goes something like: I had originally added the "offset and wraparound" behavior to But there's no reason we need to preserve the exact traversal order and distribution scheme; all that's needed is some distribution, and it need not even be perfectly spread.
This still has a bias but not as bad; it works out well because the segment size of 32 elements is just a bit bigger than the 0..24 register range. In practice I suspect a segment size of 32 bits may work well enough. We can evaluate this by checking the
Anyway, let me know what you think! If I had to choose I think I'd be interested in seeing the last one implemented as it keeps the even distribution of probes perfectly while doing only a few clz/ctz ops per traversal step and per iter step. |
Thanks for the feedback: just finished my finals, so will try do a PR to address them this weekend |
Thanks for writing all that up, Chris! I found it very helpful. I think in various places where you said "128" you probably meant "64", right? That's how many PRegs we have in each class right now. The
Congratulations, and I'd suggest taking some time to relax! I'll be pretty busy next week so I may not be able to respond that quickly anyway. |
Err, yes, sorry, I think I meant I do very much think the last idea is the way to go though (keeping state, starting where we left off and rotating one place) both for simplicity and performance, so hopefully the above isn't actually needed :-) |
Really sorry but started an internship and don't think I'm going to be able to finish this in the next few months😢. All this work was done whilst I was a student, so it's available if anyone else wants to finish it/use it to build off. |
This seems to be unnecessary given greater refactors in PRegSet now anyway |
As mentioned in bytecodealliance/wasmtime#8489 (comment) , use PRegSet instead of Vecs for storing the physical registers. I'm slightly unsure if I have done this correctly for the
reg_traversal
iterator however and am unsure how it's performance will change now its calling n iterator steps as opposed to indexing straight into an array