-
Notifications
You must be signed in to change notification settings - Fork 382
[LLHD] Remove redundant destination operands from wait operation #8870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Very neat! It's definitely a good idea to remove redundant operands from the wait list, and get rid of some of the redundant block arguments. I'm wondering if MLIR's |
You're right; with implemented |
Hmmm, do you need the I was thinking about using What do you think about that? |
Agreed. That was my concern while looking at the
This should work, but i suspect we may also need to propagate changes to the successors of |
// CHECK-LABEL: @SkipIfWaitHasDestinationOperands( | ||
hw.module @SkipIfWaitHasDestinationOperands(in %a: i42) { | ||
hw.module @SkipIfWaitHasDestinationOperands(in %a: i42, in %b: i42) { | ||
// CHECK: llhd.process | ||
llhd.process { | ||
cf.br ^bb1 | ||
^bb1: | ||
llhd.wait ^bb2(%a : i42) | ||
^bb2(%0: i42): | ||
cf.br ^bb1 | ||
// CHECK: llhd.wait yield ({{.*}} : i42), ^bb2({{.*}} : i42) | ||
%0 = llhd.process -> i42 { | ||
cf.br ^bb1(%a : i42) | ||
^bb1(%1: i42): | ||
%false = hw.constant false | ||
%c100 = hw.constant 100 : i42 | ||
cf.cond_br %false, ^bb2(%1 : i42), ^bb3(%c100 : i42) | ||
^bb2(%2: i42): | ||
llhd.wait yield (%2 : i42), ^bb2(%2 : i42) | ||
^bb3(%3: i42): | ||
%4 = comb.add %3, %b : i42 | ||
cf.br ^bb1(%4 : i42) | ||
} | ||
} | ||
|
||
// CHECK-LABEL: @SkipIfEntryAndWaitConvergeInWrongSpot( | ||
hw.module @SkipIfEntryAndWaitConvergeInWrongSpot(in %a: i42) { | ||
// CHECK: llhd.process | ||
%c100 = hw.constant 100 : i42 | ||
llhd.process { | ||
cf.br ^bb2 // skip logic after wait | ||
cf.br ^bb2(%c100 : i42) | ||
^bb1: | ||
%0 = comb.add %a, %a : i42 | ||
cf.br ^bb2 | ||
^bb2: | ||
llhd.wait ^bb1 | ||
cf.br ^bb2(%0 : i42) | ||
^bb2(%1 : i42): | ||
llhd.wait (%1 : i42), ^bb1 | ||
} | ||
} | ||
|
||
// CHECK-LABEL: @SkipIfEntryAndWaitConvergeWithDifferentBlockArgs( | ||
hw.module @SkipIfEntryAndWaitConvergeWithDifferentBlockArgs(in %a: i42, in %b: i42) { | ||
// CHECK: llhd.process | ||
llhd.process { | ||
%0 = llhd.process -> i42 { | ||
cf.br ^bb2(%a : i42) | ||
^bb1: | ||
cf.br ^bb2(%b : i42) | ||
^bb2(%0: i42): | ||
llhd.wait ^bb1 | ||
llhd.wait yield (%0 : i42), ^bb1 | ||
} | ||
} | ||
|
||
// CHECK-LABEL: @SkipIfValueUnobserved( | ||
hw.module @SkipIfValueUnobserved(in %a: i42) { | ||
// CHECK: llhd.process | ||
llhd.process { | ||
%0 = llhd.process -> i42 { | ||
cf.br ^bb1 | ||
^bb1: | ||
%0 = comb.add %a, %a : i42 | ||
llhd.wait ^bb1 | ||
llhd.wait yield (%0 : i42), ^bb1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these changes needed with the new block arg removal? These don't seem related to what the individual test cases try to check. 🤔
One thing worth noting is that in an MLIR blob like this:
llhd.wait ^bb2(%a : i42)
^bb2(%0: i42):
you can't really replace all uses of %0
with %a
. The problem is that the %a
may change its value across the llhd.wait
. The block argument %0
receives the value of %a
before the wait suspends execution, which may be different than if you read %a
again in ^bb2
. This is the mechanism by which llhd.wait
models the weird way in which SystemVerilog and VHDL describe registers 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these changes needed with the new block arg removal? These don't seem related to what the individual test cases try to check. 🤔
I'll check again, but current patch changes this:
// CHECK-LABEL: @SkipIfWaitHasDestinationOperands(
hw.module @SkipIfWaitHasDestinationOperands(in %a: i42) {
// CHECK: llhd.process
llhd.process {
cf.br ^bb1
^bb1:
llhd.wait ^bb2(%a : i42)
^bb2(%0: i42):
cf.br ^bb1
}
}
into this:
hw.module @SkipIfWaitHasDestinationOperands(in %a : i42) {
llhd.combinational {
cf.br ^bb1
^bb1: // pred: ^bb0
llhd.yield
}
hw.output
}
So i made some adjustments to keep the original testcase idea.
One thing worth noting is that in an MLIR blob like this:
llhd.wait ^bb2(%a : i42) ^bb2(%0: i42):you can't really replace all uses of
%0
with%a
. The problem is that the%a
may change its value across thellhd.wait
. The block argument%0
receives the value of%a
before the wait suspends execution, which may be different than if you read%a
again in^bb2
. This is the mechanism by whichllhd.wait
models the weird way in which SystemVerilog and VHDL describe registers 😢
Does it include values defined outside of llhd.process
? We could restrict pruning to only those values (i.e. defined in the parent region?) It seems that in Rocket chip lowered module, all llhd.process
operations contain wait
ops that use top-level module %clock
parameter as the wait destination operand.
If we can only prune constant-like values, then the patch idea is pointless. In that case, i have no idea how something like always @(posedge clock) begin ... end
can be lowered into something executable/simulatable code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that includes values defined outside of llhd.process
. Basically any block operand flowing through llhd.wait
cannot be forwarded through the llhd.wait
by replacing the destination block's argument with that value. The thing we can optimize is something like llhd.wait ^bb1(%a, %a : i42)
to llhd.wait ^bb1(%a : i42)
, but the values must always cross through the wait op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we remove the need for wait
dest operands by using llhd.var + llhd.store + llhd.load
combination to preserve values before it suspends execution? Spill values before wait
and in all other predecessors of the destination block and then load them again at the destination block?
The issues with basic block convergence, different block arguments, unobserved values are still ahead. Trying to tackle them one at a time, in order. And the current one is wait
destination operands. This might be suboptimal, since i don't yet have a full picture, only that PicoRV successfully loweres to LLVM, while Rocket Chip does not (from verilog).
That said, the llhd.process code generated from Rocket Chip doesn’t appear significantly more complex than what we have for PicoRV, so I think it should be feasible to get it working. Do you have suggestions on the best direction to go? Maybe adjusting firrtl -> verilog
or verilog -> moore
path would produce code that better fits current constrains, without requiring changes at the LLHD level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could definitely remove destination operands from wait
and replace them with a variable. That feels like a step backwards though, almost like going from SSA back to alloca/load/store. Having a very clear picture of which values in the process are from the past (aka passed in as dest operand from wait) and which are from the present is important in passes like Deseq, and in LowerProcesses too. If we were to turn them into a side channel through variables, the wait op would become easier, but everything else more complicated 😞
It's also okay for LowerProcesses to not be able to convert all processes. The responsibility of the passes basically are:
- Deseq: convert processes that describe registers and latches. These have the signals in
@(...)
as destination operands of the wait, because they need to detect edges. - LowerProcesses: convert processes that describe combinational circuits. These cannot have destination operands on the wait, because that would imply some edge detection or other stateful behavior.
- All other processes, e.g.
initial begin ... end
containing test stimulus, stay asllhd.process
.
In a nutshell, anything that's synthesizable should be handled by the two passes. Everything else is okay to stay asllhd.process
, which then simply captures the event queue simulation semantics of Verilog/VHDL for those remaining processes.
There are only a few situations where past values of signals are passed across an llhd.wait
into the present as destination operands. They all stem from @(...)
. But if they are present, they are capturing and essential edge detection feature of the input Verilog.
What might be happening in Rocket is that the Deseq pass might erroneously not detect a register or latch, which then causes the llhd.process
to survive with an llhd.wait
with additional destination operands. Do you have a concrete snippet of MLIR as an example of where this happens? Might point at a Deseq bug. Running circt-verilog with --debug-only=llhd-deseq
or --debug-only=llhd-lower-processes
can also provide some information. Most passes print a debug note if a process doesn't match the structure they need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Deseq: convert processes that describe registers and latches. These have the signals in
@(...)
as destination operands of the wait, because they need to detect edges.
What might be happening in Rocket is that the Deseq pass might erroneously not detect a register or latch, which then causes the
llhd.process
to survive with anllhd.wait
with additional destination operands. Do you have a concrete snippet of MLIR as an example of where this happens? Might point at a Deseq bug. Running circt-verilog with--debug-only=llhd-deseq
or--debug-only=llhd-lower-processes
can also provide some information. Most passes print a debug note if a process doesn't match the structure they need.
Ah, i see, thanks for the direction. I'll get back with update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about this being so convoluted. Some of this really needs to go into the rationale document for the LLHD dialect 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, i tried running module through deseq
, and with a few tweaks, it almost fully transformed.
The first obstacle are numerous verif.assert
which are marked as side-effect
operations and originate from assert
statements in the Verilog. I removed them manually, but it might be a good idea to strip them automatically at some point?
if (io_in_a_valid & io_in_a_bits_opcode == 3'h6 & ~reset) begin // src/main/scala/tilelink/Monitor.scala:42:11, :81:{25,54}
assert__assert_2: assert(_GEN_4); // src/main/scala/tilelink/Monitor.scala:42:11, :82:72, src/main/scala/tilelink/Parameters.scala:671:54
assert__assert_3: assert(_GEN_12); // src/main/scala/chisel3/util/Mux.scala:30:73, src/main/scala/tilelink/Monitor.scala:42:11, :83:78, src/main/scala/tilelink/Parameters.scala:671:54
assert__assert_4: assert(source_ok); // src/main/scala/tilelink/Monitor.scala:42:11, src/main/scala/tilelink/Parameters.scala:1126:46
assert__assert_5: assert(_mask_T); // src/main/scala/tilelink/Monitor.scala:42:11, src/main/scala/util/Misc.scala:206:21
assert__assert_6: assert(is_aligned); // src/main/scala/tilelink/Edges.scala:22:{16,24}, src/main/scala/tilelink/Monitor.scala:42:11
assert__assert_7: assert(_GEN_13); // src/main/scala/tilelink/Bundles.scala:111:27, src/main/scala/tilelink/Monitor.scala:42:11
assert__assert_8: assert(&io_in_a_bits_mask); // src/main/scala/tilelink/Monitor.scala:42:11, :88:31
assert__assert_9: assert(~io_in_a_bits_corrupt); // src/main/scala/tilelink/Monitor.scala:42:11, :89:18
end
converted into:
^bb4: // pred: ^bb3
verif.assert %243 label "" : i1
verif.assert %268 label "" : i1
verif.assert %197 label "" : i1
verif.assert %205 label "" : i1
verif.assert %204 label "" : i1
verif.assert %269 label "" : i1
%422 = comb.icmp eq %io_in_a_bits_mask, %c-1_i8 : i8
verif.assert %422 label "" : i1
%423 = comb.xor %io_in_a_bits_corrupt, %true : i1
verif.assert %423 label "" : i1
cf.br ^bb5
In the simulation pipeline, i guess we can't really make real use of those, can we?
The second issue involves enable conditions computed for certain drives that depend on values that come from hw.instance
. Currently these are marked as poisoned
because there is no handle for them in TruthTable Deseq::computeBoolean(OpResult value)
. Intuitively, i think we could treat them as unknown boolean
, this would allow llhd.processes
ops to be lowered succesfully. A more elaborate approach would be to inspect the body of hw.module
referenced by hw.instance
and attempt to build a predicate based on its operations? Would that be the right approach?
minified example of problem:
```mlir
module {
hw.module (...) {
%55:18 = llhd.process -> ... {
...
^bb3:
%97 = comb.and %queue_arw_deq_q.io_enq_ready, %33 : i1
...
}
llhd.drv %count_1, %55#0 after %0 if %55#1 : !hw.inout<i1>
...
%queue_arw_deq_q.io_enq_ready, ... = hw.instance "queue_arw_deq_q" @Queue1_AXI4BundleARW(clock: %clock: i1, reset: %reset: i1, ...) -> (io_enq_ready: i1, io_deq_valid: i1, ...) {sv.namehint = "_queue_arw_deq_q_io_deq_valid"}
...
hw.output ...
}
}
The third issue is llhd.process
operations that contain only a single halt
operation. I believe these can be safely removed, and their operands can be substituted directly at their use sites (example below)?
hw.module private @AsyncResetRegVec_w1_i0(in %clock : i1, in %reset : i1, in %io_d : i1, out io_q : i1) {
%0 = llhd.constant_time <0ns, 0d, 1e>
%false = hw.constant false
%reg_0 = llhd.sig %false : i1
%1 = seq.to_clock %clock
%reg_0_0 = seq.firreg %io_d clock %1 reset async %reset, %false {name = "reg_0"} : i1
llhd.drv %reg_0, %reg_0_0 after %0 : !hw.inout<i1>
%2:2 = llhd.process -> i1, i1 {
llhd.halt %false, %reset : i1, i1
}
llhd.drv %reg_0, %2#0 after %0 if %2#1 : !hw.inout<i1>
%3 = llhd.prb %reg_0 : !hw.inout<i1>
hw.output %3 : i1
}
Subset of Rocket Chip lowered code is attached.
subset.deseq.txt
93806b3
to
42f1c0a
Compare
@mvpant The always_ff @(posedge clock) begin
assert(d > 0);
q <= d;
end we'd still want to lower the Note that it's also totally valid to keep |
@mvpant Regarding the What might be happening here is that there is a conservative check in there that marks any ops that depend on the clock or reset as poisoned, because the pass can't reason about how the value they produce depends on the clock. But I'm wondering now if that is even needed in case of the boolean values. It might actually be totally okay to just treat boolean values as unknown even if they come from unknown ops that depend on a clock or reset. In the worst case, this will lead to a register that has an enable condition derived from a clock, which highly dubious hardware engineering but a valid thing to have in the IR. Might be worth dropping that poisoning step on unhandled ops for the boolean value analysis 🤔 WDYT? |
@mvpant Regarding the initial process that just ends in an %2:2 = llhd.process -> i1, i1 {
llhd.halt %false, %reset : i1, i1
}
llhd.drv %reg_0, %2#0 after %0 if %2#1 : !hw.inout<i1> This should actually be pretty straightforward to handle. I think we're just missing a folder or canonicalizer that replaces It's important to only do this for constant operands, since the process runs only once at very beginning. This means that any non-constant value would basically be "snapshot" at that specific point in time and produced as a result from the process. This result won't change throughout the simulation, even if the original non-constant value does. Doing this only for constants works around this problem. |
I'm very much in favor of separating synthesizable and non-synthesizable IR as much as possible. And I still mostly stand behind what I've written in #7676. It's just that I've got very few free cycles at the moment to put any work into it. For the moment nothing prevents us from just using I briefly glanced over the "edge-detection detection"-logic in |
@fzi-hielscher Yes please do! A second pair of watchful eyes would be much appreciated. A lot of that logic is derived from roughly what SystemVerilog and VHDL expect, but it's also very clunky and easy to break. Would be great to come up with some better foundations for register detection 🙂 |
@fabianschuiki, I've discussed this with some Verilog folks (they focus on static analysis), and they suggested it might be reasonable to treat boolean values as P.S. Disabling SYNTHESIS logic via a macro which turns off initial blocks and thus prevents the generation of |
@mvpant That sounds fantastic! So there are only a few steps towards getting this to work seamlessly. Feel free to remove that poisoning behavior and use It would be great if we could somehow get those Same for |
While lowering Rocket Chip from arc-tests through the
fir
->verilog
->moore
->core
->llhd
pipeline, a template is generated in which theclock
module input is propagated through all control-flow branches. The corresponding basic block argument is then used as an operand in the destination operand list of wait operation. This patch checks whether all incoming values for the block argument from its predecessors are equivalent. If so, the block argument can be removed and the value can be used directly at its use sites.