Fix FFT live-in canonicalization without regressing ReLU II by ShangkunLi · Pull Request #320 · coredac/neura

ShangkunLi · 2026-06-08T14:05:49Z

Fix FFT live-in canonicalization without regressing ReLU II

Background

This PR fixes an FFT correctness issue caused by an unsafe live-in canonicalization across nested loop control flow.

The FFT kernel has stage-level variables:

groupsPerStage = 1;
buttersPerGroup = NPOINTS / 2;
coef_base = 0;

for (i = 0; i < LOGN; ++i) {
  for (j = 0; j < groupsPerStage; ++j) {
    for (k = 0; k < buttersPerGroup; ++k) {
      // butterfly body
    }
  }

  groupsPerStage *= 2;
  buttersPerGroup /= 2;
  coef_base = (coef_base << 1) + 1;
}

The important semantic requirement is that groupsPerStage, buttersPerGroup, coef_base, and the stage index are updated once per FFT stage, after both inner loops finish. They must not update once per butterfly iteration.

What went wrong

After canonicalize-live-in, the FFT control IR contained this pattern:

^bb1(%4: i32, %5: i32, %6: i32, %7: i32, %8: i64):
  %9 = "neura.icmp"(%5) <{cmpType = "sgt"}> {rhs_value = 0 : i32} : (i32) -> i1
  neura.cond_br %9 : i1
    then %4, %5, %6, %8 : i32, i32, i32, i64 to ^bb2
    else %8 : i64 to ^bb6

...

^bb6(%79: i64):  // stage update block
  %80 = "neura.shl"(%6) {rhs_value = 1 : i32} : (i32) -> i32
  %81 = "neura.div"(%5) {rhs_value = 2 : i32} : (i32) -> i32
  %82 = "neura.shl"(%4) {rhs_value = 1 : i32} : (i32) -> i32
  %83 = "neura.or"(%82) {rhs_value = 1 : i32} : (i32) -> i32
  %84 = "neura.add"(%7) {rhs_value = 1 : i32} : (i32) -> i32
  %85 = "neura.icmp"(%84) <{cmpType = "eq"}> {rhs_value = 8 : i32} : (i32) -> i1
  neura.cond_br %85 : i1
    then %85 : i1 to ^bb7
    else %83, %81, %80, %84, %79 : i32, i32, i32, i32, i64 to ^bb1

Here:

%4 = coef_base
%5 = buttersPerGroup
%6 = groupsPerStage
%7 = stage index

The bug is that ^bb6 only receives %79 as a block argument, but directly uses %4, %5, %6, and %7 from the outer loop header ^bb1.

This is valid under simple SSA dominance because ^bb1 dominates ^bb6. However, it is not safe for our lowering pipeline. The later transform-ctrl-to-data-flow pass relies on branch/block arguments to preserve control predicates when CFG control flow is flattened into dataflow. Direct dominating live-ins bypass that predicate structure.

As a result, the FFT stage update expressions became available too often after dataflow lowering. This explains the observed behavior where stage-level variables such as groupsPerStage, buttersPerGroup, and coef_base were updated at the inner butterfly rate instead of once per FFT stage.

Compiler bug

The problematic logic was in:

lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp

Specifically, the direct-live-in detection was too permissive:

isSingleSourceSingleSinkPattern(...)
isDirectUnconditionalPattern(...)
identifyDirectDominatingLiveIns(...)

These helpers allowed a value defined in a dominating block to be used directly in a later block, even if the path from the definition to the use passed through nested loop/backedge control flow.

That optimization is safe for simple straight-line or acyclic control regions, but it is unsafe for the FFT pattern because the direct live-in crosses loop control that must become an explicit predicate in dataflow form.

Fix

The fix is to make the direct-live-in optimization more conservative only when it needs to be.

This PR adds a CFG reachability check that detects whether there is a backedge before the target use block:

bool hasBackEdgeBeforeTarget(Block *from, Block *to, DominanceInfo &dom_info) {
  Region *region = from->getParent();
  for (Block &block : region->getBlocks()) {
    // A back edge that leaves the use block is after this live-in is consumed.
    // Rejecting it is too conservative for loop latch / merge blocks.
    if (&block == to) {
      continue;
    }

    if (!canReachBlock(from, &block) || !canReachBlock(&block, to)) {
      continue;
    }

    for (Block *succ : block.getSuccessors()) {
      if (succ == &block || dom_info.dominates(succ, &block)) {
        return true;
      }
    }
  }

  return false;
}

Then both direct-live-in fast paths reject candidates if a backedge exists before the use:

if (hasBackEdgeBeforeTarget(defining_block, using_block, dom_info)) {
  return false;
}

When this direct-live-in optimization is rejected, the existing canonicalization logic promotes the value through explicit block arguments instead.

Conceptually, the FFT stage update should become closer to this shape:

^bb6(%stage_idx: i32,
     %stage_butters: i32,
     %stage_groups: i32,
     %stage_coef_base: i32,
     %other_arg: i64):
  %next_groups = "neura.shl"(%stage_groups) {rhs_value = 1 : i32} : (i32) -> i32
  %next_butters = "neura.div"(%stage_butters) {rhs_value = 2 : i32} : (i32) -> i32
  %coef_shifted = "neura.shl"(%stage_coef_base) {rhs_value = 1 : i32} : (i32) -> i32
  %next_coef = "neura.or"(%coef_shifted) {rhs_value = 1 : i32} : (i32) -> i32
  %next_stage = "neura.add"(%stage_idx) {rhs_value = 1 : i32} : (i32) -> i32

The key difference is that the stage-update values are now explicit block arguments. That gives transform-ctrl-to-data-flow the information it needs to attach the correct stage-boundary predicates.

Why this is not over-conservative

A simpler fix would be to reject every direct live-in whenever there is any backedge between the defining block and the using block. That would fix FFT, but it would also be too conservative for simple loop latch / merge patterns.

ReLU is the important example. Its loop has a latch/merge block where the value is consumed before the loop backedge leaves that same block. In that case, the backedge is after the live-in use, so rejecting the direct-live-in optimization would add unnecessary block arguments and data movement. That can increase the mapped recurrence/resource pressure and risk degrading the final II.

This PR avoids that by skipping the target block itself when looking for backedges:

if (&block == to) {
  continue;
}

So the rule is:

Reject direct live-ins when a backedge appears before the use block.
Do not reject only because the use/latch block itself has the loop backedge after consuming the value.

That is the distinction that fixes FFT without unnecessarily pessimizing ReLU.

ReLU II protection

ReLU is covered by the existing full-pipeline mapping test:

test/neura/for_loop/relu_test.mlir

The test runs the relevant pipeline:

canonicalize-live-in
leverage-predicated-value
transform-ctrl-to-data-flow
insert-data-mov
map-to-accelerator

and checks that the final mapped function still has:

mapping_info = {
  compiled_ii = 5 : i32,
  rec_mii = 5 : i32,
  res_mii = 2 : i32,
  ...
}

This guarantees that the fix does not regress the ReLU mapped II from 5.

Copilot

Pull request overview

This PR tightens canonicalize-live-in so direct-dominating live-ins are not used when the def→use path crosses loop backedges (except for latch/merge-style use blocks), fixing an FFT correctness issue in later ctrl→dataflow lowering while preserving the ReLU II guardrail.

Changes:

Add CFG reachability/backedge detection to reject unsafe direct-live-in fast paths in loop-nested control flow.
Update FFT/SPMV end-to-end mapping checks to reflect the new lowered/mapped form (FFT mapping II changes accordingly).
Adjust the ReLU pipeline test to write intermediate outputs to files and run FileCheck against those artifacts (still checking mapped II = 5).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp	Adds reachability/backedge checks to make direct-live-in canonicalization safe across loop control-flow.
test/neura/for_loop/relu_test.mlir	Updates RUN lines to materialize intermediate outputs to files before running FileCheck, retaining the II regression check.
test/e2e/spmv/spmv_kernel.mlir	Updates expected mapping/YAML/ASM checks to match new lowered/mapped output.
test/e2e/fft/fft_kernel.mlir	Updates expected mapping/YAML/ASM checks to match new lowered/mapped output (including updated mapping II expectations).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ShangkunLi added 3 commits June 7, 2026 00:40

fix canonicalize-live-in

1120a3e

update test

be211c7

release the direct dominant live-in pattern

0a1ee62

ShangkunLi requested review from Copilot, n0thingNoob and tancheng June 8, 2026 14:06

Copilot started reviewing on behalf of ShangkunLi June 8, 2026 14:06 View session

ShangkunLi linked an issue Jun 8, 2026 that may be closed by this pull request

[P1] FFT kernel issue #316

Closed

ShangkunLi self-assigned this Jun 8, 2026

Copilot AI reviewed Jun 8, 2026

View reviewed changes

Comment thread lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp

ShangkunLi changed the title ~~Fix over-conservative live-in canonicalization around loop latch blocks~~ Fix FFT live-in canonicalization without regressing ReLU II Jun 9, 2026

tancheng approved these changes Jun 9, 2026

View reviewed changes

tancheng reviewed Jun 9, 2026

View reviewed changes

Comment thread lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp Outdated

tancheng reviewed Jun 9, 2026

View reviewed changes

Comment thread lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp

add comments

ed8ca76

ShangkunLi added the bug Something isn't working label Jun 10, 2026

tancheng approved these changes Jun 11, 2026

View reviewed changes

ShangkunLi merged commit 5763d85 into coredac:main Jun 11, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix FFT live-in canonicalization without regressing ReLU II#320

Fix FFT live-in canonicalization without regressing ReLU II#320
ShangkunLi merged 4 commits into
coredac:mainfrom
ShangkunLi:fix-live-in

ShangkunLi commented Jun 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ShangkunLi commented Jun 8, 2026

Fix FFT live-in canonicalization without regressing ReLU II

Background

What went wrong

Compiler bug

Fix

Why this is not over-conservative

ReLU II protection

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants