Skip to content

Fix FFT live-in canonicalization without regressing ReLU II#320

Merged
ShangkunLi merged 4 commits into
coredac:mainfrom
ShangkunLi:fix-live-in
Jun 11, 2026
Merged

Fix FFT live-in canonicalization without regressing ReLU II#320
ShangkunLi merged 4 commits into
coredac:mainfrom
ShangkunLi:fix-live-in

Conversation

@ShangkunLi

Copy link
Copy Markdown
Collaborator

Fix FFT live-in canonicalization without regressing ReLU II

Background

This PR fixes an FFT correctness issue caused by an unsafe live-in canonicalization across nested loop control flow.

The FFT kernel has stage-level variables:

groupsPerStage = 1;
buttersPerGroup = NPOINTS / 2;
coef_base = 0;

for (i = 0; i < LOGN; ++i) {
  for (j = 0; j < groupsPerStage; ++j) {
    for (k = 0; k < buttersPerGroup; ++k) {
      // butterfly body
    }
  }

  groupsPerStage *= 2;
  buttersPerGroup /= 2;
  coef_base = (coef_base << 1) + 1;
}

The important semantic requirement is that groupsPerStage, buttersPerGroup, coef_base, and the stage index are updated once per FFT stage, after both inner loops finish. They must not update once per butterfly iteration.

What went wrong

After canonicalize-live-in, the FFT control IR contained this pattern:

^bb1(%4: i32, %5: i32, %6: i32, %7: i32, %8: i64):
  %9 = "neura.icmp"(%5) <{cmpType = "sgt"}> {rhs_value = 0 : i32} : (i32) -> i1
  neura.cond_br %9 : i1
    then %4, %5, %6, %8 : i32, i32, i32, i64 to ^bb2
    else %8 : i64 to ^bb6

...

^bb6(%79: i64):  // stage update block
  %80 = "neura.shl"(%6) {rhs_value = 1 : i32} : (i32) -> i32
  %81 = "neura.div"(%5) {rhs_value = 2 : i32} : (i32) -> i32
  %82 = "neura.shl"(%4) {rhs_value = 1 : i32} : (i32) -> i32
  %83 = "neura.or"(%82) {rhs_value = 1 : i32} : (i32) -> i32
  %84 = "neura.add"(%7) {rhs_value = 1 : i32} : (i32) -> i32
  %85 = "neura.icmp"(%84) <{cmpType = "eq"}> {rhs_value = 8 : i32} : (i32) -> i1
  neura.cond_br %85 : i1
    then %85 : i1 to ^bb7
    else %83, %81, %80, %84, %79 : i32, i32, i32, i32, i64 to ^bb1

Here:

%4 = coef_base
%5 = buttersPerGroup
%6 = groupsPerStage
%7 = stage index

The bug is that ^bb6 only receives %79 as a block argument, but directly uses %4, %5, %6, and %7 from the outer loop header ^bb1.

This is valid under simple SSA dominance because ^bb1 dominates ^bb6. However, it is not safe for our lowering pipeline. The later transform-ctrl-to-data-flow pass relies on branch/block arguments to preserve control predicates when CFG control flow is flattened into dataflow. Direct dominating live-ins bypass that predicate structure.

As a result, the FFT stage update expressions became available too often after dataflow lowering. This explains the observed behavior where stage-level variables such as groupsPerStage, buttersPerGroup, and coef_base were updated at the inner butterfly rate instead of once per FFT stage.

Compiler bug

The problematic logic was in:

lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp

Specifically, the direct-live-in detection was too permissive:

isSingleSourceSingleSinkPattern(...)
isDirectUnconditionalPattern(...)
identifyDirectDominatingLiveIns(...)

These helpers allowed a value defined in a dominating block to be used directly in a later block, even if the path from the definition to the use passed through nested loop/backedge control flow.

That optimization is safe for simple straight-line or acyclic control regions, but it is unsafe for the FFT pattern because the direct live-in crosses loop control that must become an explicit predicate in dataflow form.

Fix

The fix is to make the direct-live-in optimization more conservative only when it needs to be.

This PR adds a CFG reachability check that detects whether there is a backedge before the target use block:

bool hasBackEdgeBeforeTarget(Block *from, Block *to, DominanceInfo &dom_info) {
  Region *region = from->getParent();
  for (Block &block : region->getBlocks()) {
    // A back edge that leaves the use block is after this live-in is consumed.
    // Rejecting it is too conservative for loop latch / merge blocks.
    if (&block == to) {
      continue;
    }

    if (!canReachBlock(from, &block) || !canReachBlock(&block, to)) {
      continue;
    }

    for (Block *succ : block.getSuccessors()) {
      if (succ == &block || dom_info.dominates(succ, &block)) {
        return true;
      }
    }
  }

  return false;
}

Then both direct-live-in fast paths reject candidates if a backedge exists before the use:

if (hasBackEdgeBeforeTarget(defining_block, using_block, dom_info)) {
  return false;
}

When this direct-live-in optimization is rejected, the existing canonicalization logic promotes the value through explicit block arguments instead.

Conceptually, the FFT stage update should become closer to this shape:

^bb6(%stage_idx: i32,
     %stage_butters: i32,
     %stage_groups: i32,
     %stage_coef_base: i32,
     %other_arg: i64):
  %next_groups = "neura.shl"(%stage_groups) {rhs_value = 1 : i32} : (i32) -> i32
  %next_butters = "neura.div"(%stage_butters) {rhs_value = 2 : i32} : (i32) -> i32
  %coef_shifted = "neura.shl"(%stage_coef_base) {rhs_value = 1 : i32} : (i32) -> i32
  %next_coef = "neura.or"(%coef_shifted) {rhs_value = 1 : i32} : (i32) -> i32
  %next_stage = "neura.add"(%stage_idx) {rhs_value = 1 : i32} : (i32) -> i32

The key difference is that the stage-update values are now explicit block arguments. That gives transform-ctrl-to-data-flow the information it needs to attach the correct stage-boundary predicates.

Why this is not over-conservative

A simpler fix would be to reject every direct live-in whenever there is any backedge between the defining block and the using block. That would fix FFT, but it would also be too conservative for simple loop latch / merge patterns.

ReLU is the important example. Its loop has a latch/merge block where the value is consumed before the loop backedge leaves that same block. In that case, the backedge is after the live-in use, so rejecting the direct-live-in optimization would add unnecessary block arguments and data movement. That can increase the mapped recurrence/resource pressure and risk degrading the final II.

This PR avoids that by skipping the target block itself when looking for backedges:

if (&block == to) {
  continue;
}

So the rule is:

Reject direct live-ins when a backedge appears before the use block.
Do not reject only because the use/latch block itself has the loop backedge after consuming the value.

That is the distinction that fixes FFT without unnecessarily pessimizing ReLU.

ReLU II protection

ReLU is covered by the existing full-pipeline mapping test:

test/neura/for_loop/relu_test.mlir

The test runs the relevant pipeline:

canonicalize-live-in
leverage-predicated-value
transform-ctrl-to-data-flow
insert-data-mov
map-to-accelerator

and checks that the final mapped function still has:

mapping_info = {
  compiled_ii = 5 : i32,
  rec_mii = 5 : i32,
  res_mii = 2 : i32,
  ...
}

This guarantees that the fix does not regress the ReLU mapped II from 5.

@ShangkunLi ShangkunLi linked an issue Jun 8, 2026 that may be closed by this pull request
@ShangkunLi ShangkunLi self-assigned this Jun 8, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens canonicalize-live-in so direct-dominating live-ins are not used when the def→use path crosses loop backedges (except for latch/merge-style use blocks), fixing an FFT correctness issue in later ctrl→dataflow lowering while preserving the ReLU II guardrail.

Changes:

  • Add CFG reachability/backedge detection to reject unsafe direct-live-in fast paths in loop-nested control flow.
  • Update FFT/SPMV end-to-end mapping checks to reflect the new lowered/mapped form (FFT mapping II changes accordingly).
  • Adjust the ReLU pipeline test to write intermediate outputs to files and run FileCheck against those artifacts (still checking mapped II = 5).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp Adds reachability/backedge checks to make direct-live-in canonicalization safe across loop control-flow.
test/neura/for_loop/relu_test.mlir Updates RUN lines to materialize intermediate outputs to files before running FileCheck, retaining the II regression check.
test/e2e/spmv/spmv_kernel.mlir Updates expected mapping/YAML/ASM checks to match new lowered/mapped output.
test/e2e/fft/fft_kernel.mlir Updates expected mapping/YAML/ASM checks to match new lowered/mapped output (including updated mapping II expectations).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp
@ShangkunLi ShangkunLi changed the title Fix over-conservative live-in canonicalization around loop latch blocks Fix FFT live-in canonicalization without regressing ReLU II Jun 9, 2026
Comment thread lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp Outdated
Comment thread lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp
@ShangkunLi ShangkunLi added the bug Something isn't working label Jun 10, 2026
@ShangkunLi ShangkunLi merged commit 5763d85 into coredac:main Jun 11, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[P1] FFT kernel issue

3 participants