@mvpant (Contributor) commented Aug 18, 2025:

While lowering Rocket Chip from arc-tests through the fir -> verilog -> moore -> core -> llhd pipeline, IR is generated in which the clock module input is propagated through all control-flow branches. The corresponding basic block argument is then used as an operand in the destination operand list of the wait operation. This patch checks whether all incoming values for a block argument from its predecessors are equivalent; if so, the block argument can be removed and the value used directly at its use sites.
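Roughly, the transformation looks like this (an illustrative sketch, not IR from the patch; the names are hypothetical):

```mlir
// Before: every predecessor passes the same value into ^bb2.
^bb0:
  cf.cond_br %cond, ^bb1, ^bb2(%clock : i1)
^bb1:
  cf.br ^bb2(%clock : i1)
^bb2(%0: i1):
  // ... uses of %0 ...

// After: the block argument is removed and %clock is used directly.
^bb0:
  cf.cond_br %cond, ^bb1, ^bb2
^bb1:
  cf.br ^bb2
^bb2:
  // ... uses of %clock ...
```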

@fabianschuiki (Contributor):

Very neat! It's definitely a good idea to remove redundant operands from the wait list, and get rid of some of the redundant block arguments. I'm wondering if MLIR's simplifyRegions would do something like that already 🤔 Might be worth calling that explicitly before lowering processes?

@mvpant (Contributor, Author) commented Aug 25, 2025:

> Very neat! It's definitely a good idea to remove redundant operands from the wait list, and get rid of some of the redundant block arguments. I'm wondering if MLIR's simplifyRegions would do something like that already 🤔 Might be worth calling that explicitly before lowering processes?

You're right; with BranchOpInterface implemented on the llhd.wait op, simplifyRegions does the job.

@fabianschuiki (Contributor) commented Aug 25, 2025:

Hmmm, do you need the BranchOpInterface on WaitOp? I remember talking about this with @maerhart, and he pointed out that the llhd.wait op has the side effect of suspending execution, which the regular branch ops do not. Implementing the BranchOpInterface could let other optimizations forward branch operands from the wait op to the destination block where execution resumes. That is invalid, though, since the SSA value may change across the wait op.
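To illustrate the hazard (hypothetical IR, not taken from the patch):

```mlir
// With BranchOpInterface on llhd.wait, a generic simplification could
// rewrite this...
llhd.wait (%a : i42), ^bb1(%a : i42)
^bb1(%0: i42):
  // uses of %0 -- %0 holds the value %a had before the wait suspended

// ...into this, which is invalid: %a is re-read after execution resumes
// and may have changed while the process was suspended.
llhd.wait (%a : i42), ^bb1
^bb1:
  // uses of %a
```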

I was thinking about using simplifyRegions to simplify all other branches and block operands in the input, but then do the removal of redundant operands from wait manually, to ensure we don't implement the BranchOpInterface on wait.

What do you think about that?

@mvpant (Contributor, Author) commented Aug 26, 2025:

> Hmmm, do you need the BranchOpInterface on WaitOp? I remember talking about this with @maerhart, and he pointed out that the llhd.wait op has the side effect of suspending execution, which the regular branch ops do not. Implementing the BranchOpInterface could let other optimizations forward branch operands from the wait op to the destination block where execution resumes. That is invalid, though, since the SSA value may change across the wait op.

Agreed. That was my concern while looking at the simplifyRegions code. Something like BranchOpInterface + SideEffect would be nice to have, but we don't have that.

> I was thinking about using simplifyRegions to simplify all other branches and block operands in the input, but then do the removal of redundant operands from wait manually, to ensure we don't implement the BranchOpInterface on wait.

This should work, but I suspect we may also need to propagate changes to the successors of the llhd.wait destination block, or re-run simplifyRegions iteratively until convergence. For now, though, we can keep it like this.

Comment on lines 160 to 211:

```diff
 // CHECK-LABEL: @SkipIfWaitHasDestinationOperands(
-hw.module @SkipIfWaitHasDestinationOperands(in %a: i42) {
+hw.module @SkipIfWaitHasDestinationOperands(in %a: i42, in %b: i42) {
   // CHECK: llhd.process
-  llhd.process {
-    cf.br ^bb1
-  ^bb1:
-    llhd.wait ^bb2(%a : i42)
-  ^bb2(%0: i42):
-    cf.br ^bb1
+  // CHECK: llhd.wait yield ({{.*}} : i42), ^bb2({{.*}} : i42)
+  %0 = llhd.process -> i42 {
+    cf.br ^bb1(%a : i42)
+  ^bb1(%1: i42):
+    %false = hw.constant false
+    %c100 = hw.constant 100 : i42
+    cf.cond_br %false, ^bb2(%1 : i42), ^bb3(%c100 : i42)
+  ^bb2(%2: i42):
+    llhd.wait yield (%2 : i42), ^bb2(%2 : i42)
+  ^bb3(%3: i42):
+    %4 = comb.add %3, %b : i42
+    cf.br ^bb1(%4 : i42)
   }
 }

 // CHECK-LABEL: @SkipIfEntryAndWaitConvergeInWrongSpot(
 hw.module @SkipIfEntryAndWaitConvergeInWrongSpot(in %a: i42) {
   // CHECK: llhd.process
+  %c100 = hw.constant 100 : i42
   llhd.process {
-    cf.br ^bb2 // skip logic after wait
+    cf.br ^bb2(%c100 : i42) // skip logic after wait
   ^bb1:
     %0 = comb.add %a, %a : i42
-    cf.br ^bb2
-  ^bb2:
-    llhd.wait ^bb1
+    cf.br ^bb2(%0 : i42)
+  ^bb2(%1 : i42):
+    llhd.wait (%1 : i42), ^bb1
   }
 }

 // CHECK-LABEL: @SkipIfEntryAndWaitConvergeWithDifferentBlockArgs(
 hw.module @SkipIfEntryAndWaitConvergeWithDifferentBlockArgs(in %a: i42, in %b: i42) {
   // CHECK: llhd.process
-  llhd.process {
+  %0 = llhd.process -> i42 {
     cf.br ^bb2(%a : i42)
   ^bb1:
     cf.br ^bb2(%b : i42)
   ^bb2(%0: i42):
-    llhd.wait ^bb1
+    llhd.wait yield (%0 : i42), ^bb1
   }
 }

 // CHECK-LABEL: @SkipIfValueUnobserved(
 hw.module @SkipIfValueUnobserved(in %a: i42) {
   // CHECK: llhd.process
-  llhd.process {
+  %0 = llhd.process -> i42 {
     cf.br ^bb1
   ^bb1:
     %0 = comb.add %a, %a : i42
-    llhd.wait ^bb1
+    llhd.wait yield (%0 : i42), ^bb1
```
Contributor:

Are these changes needed with the new block arg removal? These don't seem related to what the individual test cases try to check. 🤔

One thing worth noting is that in an MLIR blob like this:

  llhd.wait ^bb2(%a : i42)
^bb2(%0: i42):

you can't really replace all uses of %0 with %a. The problem is that the %a may change its value across the llhd.wait. The block argument %0 receives the value of %a before the wait suspends execution, which may be different than if you read %a again in ^bb2. This is the mechanism by which llhd.wait models the weird way in which SystemVerilog and VHDL describe registers 😢

@mvpant (Contributor, Author) commented Aug 26, 2025:

> Are these changes needed with the new block arg removal? These don't seem related to what the individual test cases try to check. 🤔

I'll check again, but the current patch changes this:

// CHECK-LABEL: @SkipIfWaitHasDestinationOperands(
hw.module @SkipIfWaitHasDestinationOperands(in %a: i42) {
  // CHECK: llhd.process
  llhd.process {
    cf.br ^bb1
  ^bb1:
    llhd.wait ^bb2(%a : i42)
  ^bb2(%0: i42):
    cf.br ^bb1
  }
}

into this:

hw.module @SkipIfWaitHasDestinationOperands(in %a : i42) {
  llhd.combinational {
    cf.br ^bb1
  ^bb1:  // pred: ^bb0
    llhd.yield
  }
  hw.output
}

So I made some adjustments to keep the original test case idea.

> One thing worth noting is that in an MLIR blob like this:
>
>       llhd.wait ^bb2(%a : i42)
>     ^bb2(%0: i42):
>
> you can't really replace all uses of %0 with %a. The problem is that the %a may change its value across the llhd.wait. The block argument %0 receives the value of %a before the wait suspends execution, which may be different than if you read %a again in ^bb2. This is the mechanism by which llhd.wait models the weird way in which SystemVerilog and VHDL describe registers 😢

Does that include values defined outside of the llhd.process? Could we restrict the pruning to only those values (i.e. values defined in the parent region)? It seems that in the lowered Rocket Chip module, all llhd.process operations contain wait ops that use the top-level module's %clock parameter as the wait destination operand.

If we can only prune constant-like values, then the patch idea is pointless. In that case, I have no idea how something like always @(posedge clock) begin ... end can be lowered into executable/simulatable code.

Contributor:

Yep, that includes values defined outside of llhd.process. Basically any block operand flowing through llhd.wait cannot be forwarded through the llhd.wait by replacing the destination block's argument with that value. The thing we can optimize is something like llhd.wait ^bb1(%a, %a : i42) to llhd.wait ^bb1(%a : i42), but the values must always cross through the wait op.
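For example (a minimal sketch of that deduplication):

```mlir
// Duplicate destination operands...
llhd.wait ^bb1(%a, %a : i42)
^bb1(%0: i42, %1: i42):

// ...can be merged into one, with uses of %1 replaced by %0; both block
// arguments hold the same pre-wait snapshot of %a, so this is safe.
llhd.wait ^bb1(%a : i42)
^bb1(%0: i42):
```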

Contributor (Author):

Could we remove the need for wait dest operands by using llhd.var + llhd.store + llhd.load combination to preserve values before it suspends execution? Spill values before wait and in all other predecessors of the destination block and then load them again at the destination block?
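A rough sketch of that idea (hypothetical; the llhd.var/llhd.store/llhd.load syntax is only approximated here):

```mlir
// Spill the value in the wait block and in every other predecessor of
// ^bb2, then reload it at the destination, so the wait needs no
// destination operands.
%slot = llhd.var %init : i42
...
^pre_wait:
  llhd.store %slot, %a : !llhd.ptr<i42>  // snapshot %a before suspending
  llhd.wait (%a : i42), ^bb2
^other_pred:
  llhd.store %slot, %b : !llhd.ptr<i42>
  cf.br ^bb2
^bb2:
  %0 = llhd.load %slot : !llhd.ptr<i42>  // value from before the wait
```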

The issues with basic block convergence, different block arguments, and unobserved values are still ahead. I'm trying to tackle them one at a time, in order, and the current one is wait destination operands. This might be suboptimal, since I don't yet have the full picture; I only know that PicoRV successfully lowers to LLVM (from Verilog), while Rocket Chip does not.

That said, the llhd.process code generated from Rocket Chip doesn't appear significantly more complex than what we have for PicoRV, so I think it should be feasible to get it working. Do you have suggestions on the best direction to go? Maybe adjusting the firrtl -> verilog or verilog -> moore path would produce code that better fits the current constraints, without requiring changes at the LLHD level?

Contributor:

We could definitely remove destination operands from wait and replace them with a variable. That feels like a step backwards though, almost like going from SSA back to alloca/load/store. Having a very clear picture of which values in the process are from the past (aka passed in as dest operand from wait) and which are from the present is important in passes like Deseq, and in LowerProcesses too. If we were to turn them into a side channel through variables, the wait op would become easier, but everything else more complicated 😞

It's also okay for LowerProcesses to not be able to convert all processes. The responsibilities of the passes basically are:

  • Deseq: convert processes that describe registers and latches. These have the signals in @(...) as destination operands of the wait, because they need to detect edges.
  • LowerProcesses: convert processes that describe combinational circuits. These cannot have destination operands on the wait, because that would imply some edge detection or other stateful behavior.
  • All other processes, e.g. initial begin ... end containing test stimulus, stay as llhd.process.

In a nutshell, anything that's synthesizable should be handled by the two passes. Everything else is okay to stay as llhd.process, which then simply captures the event-queue simulation semantics of Verilog/VHDL for those remaining processes.

There are only a few situations where past values of signals are passed across an llhd.wait into the present as destination operands. They all stem from @(...). But when they are present, they capture an essential edge-detection feature of the input Verilog.
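For instance, a register described by always_ff @(posedge clock) typically lowers to a process shaped roughly like this (a hedged sketch; %clock, %d, %q, and %t0 are assumed to be defined in the surrounding module):

```mlir
llhd.process {
  cf.br ^check
^check:
  // Pass the current clock across the wait so ^resume can compare the
  // pre-suspend value against the post-resume one.
  llhd.wait (%clock : i1), ^resume(%clock : i1)
^resume(%prev: i1):
  %posedge = comb.icmp ult %prev, %clock : i1  // 0 -> 1 transition
  cf.cond_br %posedge, ^update, ^check
^update:
  llhd.drv %q, %d after %t0 : !hw.inout<i1>
  cf.br ^check
}
```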

What might be happening in Rocket is that the Deseq pass might erroneously not detect a register or latch, which then causes the llhd.process to survive with an llhd.wait with additional destination operands. Do you have a concrete snippet of MLIR as an example of where this happens? Might point at a Deseq bug. Running circt-verilog with --debug-only=llhd-deseq or --debug-only=llhd-lower-processes can also provide some information. Most passes print a debug note if a process doesn't match the structure they need.

Contributor (Author):

> • Deseq: convert processes that describe registers and latches. These have the signals in @(...) as destination operands of the wait, because they need to detect edges.

> What might be happening in Rocket is that the Deseq pass might erroneously not detect a register or latch, which then causes the llhd.process to survive with an llhd.wait with additional destination operands. Do you have a concrete snippet of MLIR as an example of where this happens? Might point at a Deseq bug. Running circt-verilog with --debug-only=llhd-deseq or --debug-only=llhd-lower-processes can also provide some information. Most passes print a debug note if a process doesn't match the structure they need.

Ah, I see, thanks for the direction. I'll get back with an update.

Contributor:

Sorry about this being so convoluted. Some of this really needs to go into the rationale document for the LLHD dialect 😢

Contributor (Author):

Alright, I tried running the module through Deseq, and with a few tweaks it almost fully transformed.

The first obstacle is the numerous verif.assert ops, which are marked as side-effecting operations and originate from assert statements in the Verilog. I removed them manually, but it might be a good idea to strip them automatically at some point?

    if (io_in_a_valid & io_in_a_bits_opcode == 3'h6 & ~reset) begin	// src/main/scala/tilelink/Monitor.scala:42:11, :81:{25,54}
      assert__assert_2: assert(_GEN_4);	// src/main/scala/tilelink/Monitor.scala:42:11, :82:72, src/main/scala/tilelink/Parameters.scala:671:54
      assert__assert_3: assert(_GEN_12);	// src/main/scala/chisel3/util/Mux.scala:30:73, src/main/scala/tilelink/Monitor.scala:42:11, :83:78, src/main/scala/tilelink/Parameters.scala:671:54
      assert__assert_4: assert(source_ok);	// src/main/scala/tilelink/Monitor.scala:42:11, src/main/scala/tilelink/Parameters.scala:1126:46
      assert__assert_5: assert(_mask_T);	// src/main/scala/tilelink/Monitor.scala:42:11, src/main/scala/util/Misc.scala:206:21
      assert__assert_6: assert(is_aligned);	// src/main/scala/tilelink/Edges.scala:22:{16,24}, src/main/scala/tilelink/Monitor.scala:42:11
      assert__assert_7: assert(_GEN_13);	// src/main/scala/tilelink/Bundles.scala:111:27, src/main/scala/tilelink/Monitor.scala:42:11
      assert__assert_8: assert(&io_in_a_bits_mask);	// src/main/scala/tilelink/Monitor.scala:42:11, :88:31
      assert__assert_9: assert(~io_in_a_bits_corrupt);	// src/main/scala/tilelink/Monitor.scala:42:11, :89:18
    end

converted into:

    ^bb4:  // pred: ^bb3
      verif.assert %243 label "" : i1
      verif.assert %268 label "" : i1
      verif.assert %197 label "" : i1
      verif.assert %205 label "" : i1
      verif.assert %204 label "" : i1
      verif.assert %269 label "" : i1
      %422 = comb.icmp eq %io_in_a_bits_mask, %c-1_i8 : i8
      verif.assert %422 label "" : i1
      %423 = comb.xor %io_in_a_bits_corrupt, %true : i1
      verif.assert %423 label "" : i1
      cf.br ^bb5

In the simulation pipeline, I guess we can't really make real use of those, can we?

The second issue involves enable conditions computed for certain drives that depend on values coming from hw.instance. Currently these are marked as poisoned because there is no handling for them in TruthTable Deseq::computeBoolean(OpResult value). Intuitively, I think we could treat them as an unknown boolean; this would allow the llhd.process ops to be lowered successfully. A more elaborate approach would be to inspect the body of the hw.module referenced by hw.instance and attempt to build a predicate based on its operations? Would that be the right approach?

Minified example of the problem:
```mlir
module {
  hw.module (...) {
    %55:18 = llhd.process -> ... {
      ...
      ^bb3:
        %97 = comb.and %queue_arw_deq_q.io_enq_ready, %33 : i1
        ...
    }
    llhd.drv %count_1, %55#0 after %0 if %55#1 : !hw.inout<i1>
    ...
    %queue_arw_deq_q.io_enq_ready, ... = hw.instance "queue_arw_deq_q" @Queue1_AXI4BundleARW(clock: %clock: i1, reset: %reset: i1, ...) -> (io_enq_ready: i1, io_deq_valid: i1, ...) {sv.namehint = "_queue_arw_deq_q_io_deq_valid"}
    ...
    hw.output ...
  }
}
```

The third issue is llhd.process operations that contain only a single halt operation. I believe these can be safely removed, and their operands can be substituted directly at their use sites (example below)?

  hw.module private @AsyncResetRegVec_w1_i0(in %clock : i1, in %reset : i1, in %io_d : i1, out io_q : i1) {
    %0 = llhd.constant_time <0ns, 0d, 1e>
    %false = hw.constant false
    %reg_0 = llhd.sig %false : i1
    %1 = seq.to_clock %clock
    %reg_0_0 = seq.firreg %io_d clock %1 reset async %reset, %false {name = "reg_0"} : i1
    llhd.drv %reg_0, %reg_0_0 after %0 : !hw.inout<i1>
    %2:2 = llhd.process -> i1, i1 {
      llhd.halt %false, %reset : i1, i1
    }
    llhd.drv %reg_0, %2#0 after %0 if %2#1 : !hw.inout<i1>
    %3 = llhd.prb %reg_0 : !hw.inout<i1>
    hw.output %3 : i1
  }

A subset of the lowered Rocket Chip code is attached.
subset.deseq.txt

@mvpant force-pushed the lower-process-improve branch from 93806b3 to 42f1c0a on August 29, 2025.
@fabianschuiki (Contributor):

@mvpant The verif.assert case is interesting. I think this is the first time asserts came up inside of a process. We definitely want to retain these throughout the lowering pipeline, but ideally they shouldn't block register detection. I'm wondering how we could make that work 🤔. Ideally if a user writes this:

always_ff @(posedge clock) begin
  assert(d > 0);
  q <= d;
end

we'd still want to lower the q <= d part to a register, but also keep the assert(d > 0) around as a clocked assert. I'm wondering if we'd have to try and partition processes somehow, such that the llhd.process with the verif.assert remains, but the register logic gets pulled out of it. The Deseq pass already creates copies of the process to specialize them for the enable condition and clock edge; maybe we should keep the original process around with all unused yield operands removed. That might give us a process with the remaining side-effecting ops that weren't absorbed into registers.

Note that it's also totally valid to keep llhd.processes around for the sake of having these simulation-only constructs like verif.assert that might be embedded in them. We may even want to try and map that llhd.process to something like a sim.triggered_on_clock { ... } op which we could cook up, to lower llhd.processes that just trivially check a clock edge to something easier to transform. @fzi-hielscher had some ideas there.

@fabianschuiki (Contributor):

@mvpant Regarding the hw.instance results being used as boolean conditions: I think that should theoretically work. It's strange that the instance output is considered poisoned, and not unknown. What should happen is that the analysis part in Deseq should explicitly track clock and reset values in the truth table, and gobble up everything else as an unknown value.

What might be happening here is that there is a conservative check in there that marks any ops that depend on the clock or reset as poisoned, because the pass can't reason about how the value they produce depends on the clock. But I'm wondering now if that is even needed in the case of boolean values. It might actually be totally okay to just treat boolean values as unknown even if they come from unknown ops that depend on a clock or reset. In the worst case, this will lead to a register that has an enable condition derived from a clock, which is highly dubious hardware engineering but a valid thing to have in the IR.

Might be worth dropping that poisoning step on unhandled ops for the boolean value analysis 🤔 WDYT?

@fabianschuiki (Contributor):

@mvpant Regarding the initial process that just ends in an llhd.halt:

    %2:2 = llhd.process -> i1, i1 {
      llhd.halt %false, %reset : i1, i1
    }
    llhd.drv %reg_0, %2#0 after %0 if %2#1 : !hw.inout<i1>

This should actually be pretty straightforward to handle. I think we're just missing a folder or canonicalizer that replaces llhd.process results with the corresponding llhd.halt constant operand, if there's only a single llhd.halt in the last block.

It's important to only do this for constant operands, since the process runs only once at the very beginning. This means that any non-constant value would basically be snapshotted at that specific point in time and produced as a result from the process. This result won't change throughout the simulation, even if the original non-constant value does. Doing this only for constants works around that problem.
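Applied to the snippet above, such a fold might produce (a sketch of the described canonicalization, not an existing pattern):

```mlir
// Before: halt-only process with one constant and one non-constant operand.
%2:2 = llhd.process -> i1, i1 {
  llhd.halt %false, %reset : i1, i1
}
llhd.drv %reg_0, %2#0 after %0 if %2#1 : !hw.inout<i1>

// After: only the constant result is folded away; %reset is not forwarded,
// because the process snapshots it once at time zero.
%2 = llhd.process -> i1 {
  llhd.halt %reset : i1
}
llhd.drv %reg_0, %false after %0 if %2 : !hw.inout<i1>
```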

@fzi-hielscher (Contributor):

> Note that it's also totally valid to keep llhd.processes around for the sake of having these simulation-only constructs like verif.assert that might be embedded in them. We may even want to try and map that llhd.process to something like a sim.triggered_on_clock { ... } op which we could cook up, to lower llhd.processes that just trivially check a clock edge to something easier to transform. @fzi-hielscher had some ideas there.

I'm very much in favor of separating synthesizable and non-synthesizable IR as much as possible. And I still mostly stand behind what I've written in #7676. It's just that I've got very few free cycles at the moment to put any work into it. For the moment nothing prevents us from just using hw.triggered, right? 😅

I briefly glanced over the "edge-detection detection" logic in llhd-deseq and now I have many questions. 😁 I might bother you with them at some point, but I guess most of it would be apparent if I had a better intuition of how LLHD works.

@fabianschuiki (Contributor):

> I might bother you with them at some point

@fzi-hielscher Yes please do! A second pair of watchful eyes would be much appreciated. A lot of that logic is derived from roughly what SystemVerilog and VHDL expect, but it's also very clunky and easy to break. Would be great to come up with some better foundations for register detection 🙂

@mvpant (Contributor, Author) commented Sep 4, 2025:

> @mvpant Regarding the hw.instance results being used as boolean conditions: I think that should theoretically work. It's strange that the instance output is considered poisoned, and not unknown. What should happen is that the analysis part in Deseq should explicitly track clock and reset values in the truth table, and gobble up everything else as an unknown value.
>
> What might be happening here is that there is a conservative check in there that marks any ops that depend on the clock or reset as poisoned, because the pass can't reason about how the value they produce depends on the clock. But I'm wondering now if that is even needed in the case of boolean values. It might actually be totally okay to just treat boolean values as unknown even if they come from unknown ops that depend on a clock or reset. In the worst case, this will lead to a register that has an enable condition derived from a clock, which is highly dubious hardware engineering but a valid thing to have in the IR.
>
> Might be worth dropping that poisoning step on unhandled ops for the boolean value analysis 🤔 WDYT?

@fabianschuiki, I've discussed this with some Verilog folks (they focus on static analysis), and they suggested it might be reasonable to treat boolean values as unknown if they come from unknown operations. From my side, lacking the full picture (so take this with a grain of salt), I'd lean towards keeping the current poison behavior. It's mainly used to catch unexpected paths, and we can gradually relax things as we encounter real cases and reason through them one by one (similar to what's happening now with Rocket Chip and the hw.instance operation).

P.S. Disabling the SYNTHESIS logic via a macro, which turns off initial blocks and thus prevents the generation of llhd.process + llhd.halt combinations, together with using an unknown boolean value for hw.instance results and manually removing verif.assert statements, allows successfully lowering Rocket Chip into LLVM. However, I haven't yet compared it to the existing lowering from FIRRTL.

@fabianschuiki (Contributor):

@mvpant That sounds fantastic! So there are only a few steps towards getting this to work seamlessly. Feel free to remove that poisoning behavior and use unknown instead 😀. This would make the pass a bit less complex to reason about, and it would accept more inputs. As long as we can lower it to a valid thing -- which we can -- I think all is well.

It would be great if we could somehow get those initial blocks to work. We do have that seq.initial operation, but I don't think that works as a lowering target if the llhd.process drives any signals. These will probably just have to stay llhd.processes and Arcilator needs to add support for them. We do have an arc.initial op for this, but the Arc dialect currently doesn't handle LLHD ops. It needs a way to translate signals, drives, and probes into ops that allocate an event queue or a signal (or a slot in a global event queue), and ops that insert stuff into that event queue. Together with some coroutine-style lowering for processes this would let Arcilator simulate LLHD ops in the input.

Same for verif.assert. Arcilator should probably just lower them to a condition check and a function call, at least when they are booleans. Not entirely sure how we'd handle property and sequence inputs... that would be cool to think about.
