
Conversation

hjagasiaAMD
Contributor

Ensure global variables accessed by only one kernel can stay in kernel scope at O0 by switching to the table strategy for AMDGPULowerModuleLDSPass. This is to prevent the LDS limit from being exceeded for the kernel. At higher optimization levels, the additional passes that run can achieve this without switching to the table strategy.

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Sep 22, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (hjagasiaAMD)

Changes

Ensure global variables accessed by only one kernel can stay in kernel scope at O0 by switching to the table strategy for AMDGPULowerModuleLDSPass. This is to prevent the LDS limit from being exceeded for the kernel. At higher optimization levels, the additional passes that run can achieve this without switching to the table strategy.


Full diff: https://github.com/llvm/llvm-project/pull/160181.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp (+10-1)
  • (added) llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll (+92)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
index f01d5f6726822..dae2bd53b6623 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
@@ -588,7 +588,7 @@ class AMDGPULowerModuleLDS {
     return OrderedKernels;
   }
 
-  static void partitionVariablesIntoIndirectStrategies(
+  void partitionVariablesIntoIndirectStrategies(
       Module &M, LDSUsesInfoTy const &LDSUsesInfo,
       VariableFunctionMap &LDSToKernelsThatNeedToAccessItIndirectly,
       DenseSet<GlobalVariable *> &ModuleScopeVariables,
@@ -596,6 +596,9 @@ class AMDGPULowerModuleLDS {
       DenseSet<GlobalVariable *> &KernelAccessVariables,
       DenseSet<GlobalVariable *> &DynamicVariables) {
 
+    if (TM.getOptLevel() == CodeGenOptLevel::None)
+      LoweringKindLoc = LoweringKind::table;
+
     GlobalVariable *HybridModuleRoot =
         LoweringKindLoc != LoweringKind::hybrid
             ? nullptr
@@ -1188,6 +1191,8 @@ class AMDGPULowerModuleLDS {
           // Allocated at zero, recorded once on construction, not once per
           // kernel
           Offset += DL.getTypeAllocSize(MaybeModuleScopeStruct->getValueType());
+          LLVM_DEBUG(dbgs() << "amdgpu-lds-size after ModuleScopeStruct"
+                            << Offset << "\n");
         }
 
         if (AllocateKernelScopeStruct) {
@@ -1195,6 +1200,8 @@ class AMDGPULowerModuleLDS {
           Offset = alignTo(Offset, AMDGPU::getAlign(DL, KernelStruct));
           recordLDSAbsoluteAddress(&M, KernelStruct, Offset);
           Offset += DL.getTypeAllocSize(KernelStruct->getValueType());
+          LLVM_DEBUG(dbgs()
+                     << "amdgpu-lds-size after KernelStruct" << Offset << "\n");
         }
 
         // If there is dynamic allocation, the alignment needed is included in
@@ -1205,6 +1212,8 @@ class AMDGPULowerModuleLDS {
           GlobalVariable *DynamicVariable = KernelToCreatedDynamicLDS[&Func];
           Offset = alignTo(Offset, AMDGPU::getAlign(DL, DynamicVariable));
           recordLDSAbsoluteAddress(&M, DynamicVariable, Offset);
+          LLVM_DEBUG(dbgs() << "amdgpu-lds-size after DynamicVariable" << Offset
+                            << "\n");
         }
 
         if (Offset != 0) {
diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll
new file mode 100644
index 0000000000000..fec5b47198917
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll
@@ -0,0 +1,92 @@
+; RUN: not llc -O0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -filetype=null < %s 2>&1 | FileCheck --check-prefix=CHECK %s
+; CHECK-NOT: error: <unknown>:0:0: local memory (98304) exceeds limit (65536) in function 'k2'
+
+@gA = internal addrspace(3) global [32768 x i8] undef, align 4
+@gB = internal addrspace(3) global [32768 x i8] undef, align 4
+@gC = internal addrspace(3) global [32768 x i8] undef, align 4
+
+; ---- Helpers ----
+
+define internal void @helperA() inlinehint {
+entry:
+  %p = getelementptr [32768 x i8], ptr addrspace(3) @gA, i32 0, i32 0
+  store i8 1, ptr addrspace(3) %p
+  ret void
+}
+
+define internal void @helperB() inlinehint {
+entry:
+  %p = getelementptr [32768 x i8], ptr addrspace(3) @gB, i32 0, i32 0
+  store i8 2, ptr addrspace(3) %p
+  ret void
+}
+
+define internal void @helperC() inlinehint {
+entry:
+  %p = getelementptr [32768 x i8], ptr addrspace(3) @gC, i32 0, i32 0
+  store i8 3, ptr addrspace(3) %p
+  ret void
+}
+
+; ---------------------------------------------------------------------------
+; Dispatch: takes an index and calls the appropriate helper.
+; If dispatch is NOT inlined, a backend lowering pass that conservatively
+; examines call targets may think all helpers (and thus all globals) are
+; potentially referenced by every kernel that calls dispatch.
+; ---------------------------------------------------------------------------
+
+define void @dispatch(i32 %idx) inlinehint {
+entry:
+  %cmp1 = icmp eq i32 %idx, 1
+  br i1 %cmp1, label %case1, label %check2
+
+check2:
+  %cmp2 = icmp eq i32 %idx, 2
+  br i1 %cmp2, label %case2, label %check3
+
+check3:
+  %cmp3 = icmp eq i32 %idx, 3
+  br i1 %cmp3, label %case3, label %default
+
+case1:
+  call void @helperA()
+  br label %done
+
+case2:
+  call void @helperB()
+  br label %done
+
+case3:
+  call void @helperC()
+  br label %done
+
+default:
+  ; fallthrough: call helperA to have a default behaviour
+  call void @helperA()
+  br label %done
+
+done:
+  ret void
+}
+
+; ---- Kernels ----
+
+define amdgpu_kernel void @k0() {
+entry:
+  call void @dispatch(i32 1)
+  call void @dispatch(i32 2)
+  ret void
+}
+
+define amdgpu_kernel void @k1() {
+entry:
+  call void @dispatch(i32 2)
+  call void @dispatch(i32 1)
+  ret void
+}
+
+define amdgpu_kernel void @k2() {
+entry:
+  call void @helperC()
+  ret void
+}

Comment on lines 599 to 600
if (TM.getOptLevel() == CodeGenOptLevel::None)
LoweringKindLoc = LoweringKind::table;
Contributor

This should not override the explicit flag. This also seems like a dubious way to avoid going over the limit; we can't rely on other optimizations when not at -O0 either. Is it possible to compute the size usage with the different strategies before committing to one?
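
One possible shape for such a pre-check, sketched only as an illustration: it assumes the candidate variable set for a kernel is already known, and estimateKernelLDS is a hypothetical helper. Only alignTo, AMDGPU::getAlign, getValueType, getTypeAllocSize, and getAddressableLocalMemorySize are taken from code already in this patch or pass.

// Hypothetical helper (illustration only): estimate the LDS footprint a
// kernel would end up with if the given variables were packed together,
// mirroring the offset/alignment arithmetic the pass already performs.
static uint64_t estimateKernelLDS(const DataLayout &DL,
                                  ArrayRef<GlobalVariable *> Vars) {
  uint64_t Offset = 0;
  for (GlobalVariable *GV : Vars) {
    Offset = alignTo(Offset, AMDGPU::getAlign(DL, GV));
    Offset += DL.getTypeAllocSize(GV->getValueType());
  }
  return Offset;
}

// A strategy pre-check could then compare the estimate against the
// addressable limit before committing, e.g.:
//   if (estimateKernelLDS(DL, CandidateVars) >
//       ST.getAddressableLocalMemorySize())
//     ... fall back to a per-kernel layout instead ...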

Comment on lines 4 to 6
@gA = internal addrspace(3) global [32768 x i8] undef, align 4
@gB = internal addrspace(3) global [32768 x i8] undef, align 4
@gC = internal addrspace(3) global [32768 x i8] undef, align 4
Contributor

Suggested change
@gA = internal addrspace(3) global [32768 x i8] undef, align 4
@gB = internal addrspace(3) global [32768 x i8] undef, align 4
@gC = internal addrspace(3) global [32768 x i8] undef, align 4
@gA = internal addrspace(3) global [32768 x i8] poison, align 4
@gB = internal addrspace(3) global [32768 x i8] poison, align 4
@gC = internal addrspace(3) global [32768 x i8] poison, align 4

@@ -0,0 +1,92 @@
; RUN: not llc -O0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -filetype=null < %s 2>&1 | FileCheck --check-prefix=CHECK %s
; CHECK-NOT: error: <unknown>:0:0: local memory (98304) exceeds limit (65536) in function 'k2'
Contributor

CHECK-NOT should be avoided, this should check the actual output. Not erroring is sufficient


; ---- Helpers ----

define internal void @helperA() inlinehint {
Contributor

Remove all the inlinehint, they aren't doing anything

recordLDSAbsoluteAddress(&M, KernelStruct, Offset);
Offset += DL.getTypeAllocSize(KernelStruct->getValueType());
LLVM_DEBUG(dbgs()
<< "amdgpu-lds-size after KernelStruct" << Offset << "\n");
Contributor

Suggested change
<< "amdgpu-lds-size after KernelStruct" << Offset << "\n");
<< "amdgpu-lds-size after KernelStruct" << Offset << '\n');

} else if (set_is_subset(K.second, HybridModuleRootKernels)) {
ModuleScopeVariables.insert(GV);
uint64_t LocalMemLimit = 0;
for (Function &F : M) {
Contributor

Comment what this is doing

for (Function &F : M) {
if (!F.isDeclaration()) {
const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
LocalMemLimit = ST.getAddressableLocalMemorySize();
Contributor

The limit should really come from the entry point kernel, not just the first function you happen to find
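
As an illustration of that suggestion, a minimal sketch of restricting the scan to entry points is below. It is not the patch's code; it only reuses TM.getSubtarget<GCNSubtarget>(F) and getAddressableLocalMemorySize() from the quoted snippet, and whether a per-kernel value or a single conservative limit is wanted depends on how the limit is used later in the patch.

// Illustration only: derive the limit from amdgpu_kernel entry points rather
// than from the first defined function that happens to be visited.
uint64_t LocalMemLimit = 0;
for (Function &F : M) {
  if (F.isDeclaration() || F.getCallingConv() != CallingConv::AMDGPU_KERNEL)
    continue;
  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
  LocalMemLimit = std::max<uint64_t>(LocalMemLimit,
                                     ST.getAddressableLocalMemorySize());
}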

KernelAccessVariables.insert(GV);
} else if (set_is_subset(K.second, HybridModuleRootKernels)) {
ModuleScopeVariables.insert(GV);
uint64_t LocalMemLimit = 0;
Contributor

The comment at the top of the file claims the hybrid strategy offers precise allocation, so is there just a bug somewhere?

Contributor Author

I don't think there is a bug. Below is the access pattern of the test.

+; This test has the following kernels with following GV access pattern
+; EN32 kernels
+; EN32_compress_wrapperIhm - GV's 1, 2, 3, 4, 5, 6, 7
+; EN32_compress_wrapperItm - GV's 8, 9, 10, 11, 12, 13, 7
+; EN32_compress_wrapperIjm - GV's 15, 16, 17, 18, 19, 20, 7
+; EN32_compress_wrapperImm - GV's 21, 22, 23, 24, 25, 26, 27, 7
+; EN64 kernels
+; EN64_compress_wrapperIhm - GV's 1, 2, 3, 4, 5, 6, 7
+; EN64_compress_wrapperItm - GV's 8, 9, 10, 11, 12, 13, 7
+; EN64_compress_wrapperIjm - GV's 15, 16, 17, 18, 19, 20, 7
+; EN64_compress_wrapperImm - GV's 21, 22, 23, 24, 25, 26, 27, 7

ret void
}

define i32 @_Z17HlifCompressBatchILi1ERN7hipcomp25cascaded_compress_wrapperIhmLi128ELi4096EEERN18cooperative_groups12thread_blockEEvRK12CompressArgsOT0_OT1_() {
Contributor

Fix all of these variable names, replace them with more meaningful testcase names

@@ -0,0 +1,237 @@
; RUN: llc -mtriple=amdgcn-amd-amdhsa < %s
Contributor

Need to check the output, preferably of the IR pass

Contributor

Also should use a real subtarget

@arsenm
Contributor

arsenm commented Sep 25, 2025

Partially reduced: https://godbolt.org/z/aT1sxa765

Title doesn't match anymore. I also think there's just a bug somewhere and there doesn't need to be a strategy change

@JonChesterfield
Collaborator

The different strategies change how variables are accessed but not where they are allocated. This change may improve compile time. It will definitely regress runtime.

If it changes the reported amount of LDS used by any kernel, there is an error in this pass or elsewhere.

@@ -0,0 +1,264 @@
; RUN: opt -S -mtriple=amdgcn-- -mcpu=gfx942 -amdgpu-lower-module-lds < %s 2>&1 | FileCheck %s
; RUN: opt -S -mtriple=amdgcn-- -mcpu=gfx942 -passes=amdgpu-lower-module-lds < %s 2>&1 | FileCheck %s

Contributor

This test isn't reduced enough. I previously posted a godbolt link which is smaller than this, and I'm sure it can be shrunk further

Collaborator

@JonChesterfield JonChesterfield Sep 26, 2025

I see the godbolt IR. In what sense is this believed to change the allocated LDS size? Based on debug prints added to this pass, on IR metadata, on the binary metadata, or something else?

Contributor Author

Please note the godbolt IR does not show the issue. The reproducer in the squashed patch (which Jon has also pasted in his comment) does.

Collaborator

@JonChesterfield JonChesterfield left a comment

The change is noisy enough that I'm not sure what it's trying to do. The goal stated in the commit message can't be met by the change stated in the commit message.

What's the test case that this change decreases the amount of allocated lds for?

Tangentially related, was this report prepared using an LLM?

@JonChesterfield
Collaborator

JonChesterfield commented Sep 26, 2025

Reproduces (on whatever I had in ~/llvm-install)

~/llvm-install/bin/opt -mtriple=amdgcn -mcpu=gfx942 lower-lds.ll -amdgpu-lower-module-lds -S -o lower-lds.baseline.ll

~/llvm-install/bin/opt -mtriple=amdgcn -mcpu=gfx942 lower-lds.ll -amdgpu-lower-module-lds -amdgpu-lower-module-lds-strategy=table -S -o lower-lds.table.ll 

The backend allocates static lds based on the amdgpu-lds-size metadata. These numbers are different for the above two paths but shouldn't be. I note that 'table' lowering only exists at all to isolate part of the pass for testing and is not expected to be used by anyone.

I suspect we're counting it wrong on the table path, as opposed to the default properly allocating 68k and the table one properly allocating between 11k and 26k, but there's definitely something not working as intended here. Thanks for the reproducer

; lower-lds.ll
%RawStorage1 = type { [1056 x i8] }
%RawStorage2 = type { [4 x i8] }
%RawStorage3 = type { [16 x i8] }

@one = addrspace(3) global [1026 x i32] poison
@two = addrspace(3) global [1026 x i32] poison
@three = external addrspace(3) global [2048 x i32]
@four = addrspace(3) global [2050 x i32] poison
@five = addrspace(3) global [16 x i32] poison
@six = external addrspace(3) global %RawStorage1
@seven = addrspace(3) global %RawStorage2 poison
@eight = addrspace(3) global [1026 x i32] poison
@nine = addrspace(3) global [1026 x i32] poison
@ten = external addrspace(3) global [1024 x i32]
@eleven = addrspace(3) global [1026 x i32] poison
@twelve = external addrspace(3) global [16 x i32]
@thirteen = external addrspace(3) global %RawStorage1
@fourteen = external addrspace(3) global [1 x i32]
@fifteen = addrspace(3) global [1026 x i32] poison
@sixteen = addrspace(3) global [1026 x i32] poison
@seventeen = external addrspace(3) global [512 x i32]
@eighteen = addrspace(3) global [514 x i32] poison
@nineteen = external addrspace(3) global [16 x i32]
@twenty = external addrspace(3) global %RawStorage1
@twentyone = external addrspace(3) global [514 x i64]
@twentytwo = external addrspace(3) global [514 x i64]
@twentythree = external addrspace(3) global [256 x i32]
@twentyfour = external addrspace(3) global [258 x i32]
@twentyfive = external addrspace(3) global [16 x i32]
@twentysix = external addrspace(3) global %RawStorage1
@twentyseven = external addrspace(3) global %RawStorage3

define amdgpu_kernel void @EN32_compress_wrapperIhm() {
entry:
  %0 = call i32 @Ihm_one()
  ret void
}

define i32 @Ihm_one() {
entry:
  %0 = call i32 @Ihm_chunk()
  ret i32 %0
}

define i32 @Ihm_chunk() {
entry:
  %0 = call i32 @Ihm_CascadedOpts()
  ret i32 %0
}

define i32 @Ihm_CascadedOpts() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @one to ptr), ptr null, align 8
  store ptr addrspacecast (ptr addrspace(3) @two to ptr), ptr null, align 8
  %add.ptr = getelementptr i32, ptr getelementptr inbounds (i32, ptr addrspacecast (ptr addrspace(3) @five to ptr), i64 1), i64 0
  call void @Ihm_PS1_PT1_PS4_S7()
  %call69 = call i32 @foo(ptr addrspacecast (ptr addrspace(3) @three to ptr), ptr addrspacecast (ptr addrspace(3) @four to ptr))
  ret i32 %call69
}

define void @Ihm_PS1_PT1_PS4_S7() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @six to ptr), ptr null, align 8
  ret void
}

define i32 @foo(ptr %input, ptr %temp_storage) {
entry:
  call void @Itm_PjPS4()
  ret i32 0
}

define void @Itm_PjPS4() {
entry:
  call void @Itm_PS1_Pj()
  ret void
}

define void @Itm_PS1_Pj() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @seven to ptr), ptr null, align 8
  ret void
}

define amdgpu_kernel void @EN32_compress_wrapperItm() {
entry:
  %0 = call i32 @Itm_one()
  ret void
}

define i32 @Itm_one() {
entry:
  %0 = call i32 @Itm_chunk()
  ret i32 %0
}

define i32 @Itm_chunk() {
entry:
  %0 = call i32 @Itm_CascadedOpts()
  ret i32 %0
}

define i32 @Itm_CascadedOpts() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @eight to ptr), ptr null, align 8
  store ptr addrspacecast (ptr addrspace(3) @nine to ptr), ptr null, align 8
  %add.ptr = getelementptr i32, ptr getelementptr inbounds (i32, ptr addrspacecast (ptr addrspace(3) @twelve to ptr), i64 1), i64 0
  call void @Itm_PS1_PT1_PS4_S7()
  %call69 = call i32 @foo(ptr addrspacecast (ptr addrspace(3) @ten to ptr), ptr addrspacecast (ptr addrspace(3) @eleven to ptr))
  ret i32 %call69
}

define void @Itm_PS1_PT1_PS4_S7() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @thirteen to ptr), ptr null, align 8
  ret void
}

define amdgpu_kernel void @EN32_compress_wrapperIjm() {
entry:
  %arrayidx = getelementptr [1 x i32], ptr addrspacecast (ptr addrspace(3) @fourteen to ptr), i64 0, i64 0
  %0 = call i32 @Ijm_one()
  ret void
}

define i32 @Ijm_one() {
entry:
  %0 = call i32 @Ijm_chunk()
  ret i32 %0
}

define i32 @Ijm_chunk() {
entry:
  %0 = call i32 @Ijm_CascadedOpts()
  ret i32 %0
}

define i32 @Ijm_CascadedOpts() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @fifteen to ptr), ptr null, align 8
  store ptr addrspacecast (ptr addrspace(3) @sixteen to ptr), ptr null, align 8
  %add.ptr = getelementptr i32, ptr getelementptr inbounds (i32, ptr addrspacecast (ptr addrspace(3) @nineteen to ptr), i64 1), i64 0
  call void @Ijm_PS1_PT1_PS4_S7()
  %call69 = call i32 @foo(ptr addrspacecast (ptr addrspace(3) @seventeen to ptr), ptr addrspacecast (ptr addrspace(3) @eighteen to ptr))
  ret i32 %call69
}

define void @Ijm_PS1_PT1_PS4_S7() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @twenty to ptr), ptr null, align 8
  ret void
}

define amdgpu_kernel void @EN32_compress_wrapperImm() {
entry:
  %0 = call i32 @Imm_one()
  ret void
}

define i32 @Imm_one() {
entry:
  %0 = call i32 @Imm_chunk()
  ret i32 %0
}

define i32 @Imm_chunk() {
entry:
  %0 = call i32 @Imm_CascadedOpts()
  ret i32 %0
}

define i32 @Imm_CascadedOpts() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @twentyone to ptr), ptr null, align 8
  store ptr addrspacecast (ptr addrspace(3) @twentytwo to ptr), ptr null, align 8
  %add.ptr = getelementptr i32, ptr getelementptr inbounds (i32, ptr addrspacecast (ptr addrspace(3) @twentyfive to ptr), i64 1), i64 0
  br i1 false, label %for.body65, label %for.end102

for.body65:
  call void @Imm_PS1_PT1_PS4_S7()
  %call69 = call i32 @foo(ptr addrspacecast (ptr addrspace(3) @twentythree to ptr), ptr addrspacecast (ptr addrspace(3) @twentyfour to ptr))
  ret i32 %call69

for.end102:
  %call106 = call i32 @Imm_PjPKjPS5_S6_b()
  ret i32 0
}

define void @Imm_PS1_PT1_PS4_S7() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @twentysix to ptr), ptr null, align 8
  ret void
}

define i32 @Imm_PjPKjPS5_S6_b() {
entry:
  call void @Imm_PjPS4()
  ret i32 0
}

define void @Imm_PjPS4() {
entry:
  call void @Imm_PS1_Pj()
  ret void
}

define void @Imm_PS1_Pj() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @twentyseven to ptr), ptr null, align 8
  ret void
}

define amdgpu_kernel void @EN64_compress_wrapperIhm() {
entry:
  %0 = call i32 @Ihm_one()
  ret void
}

define amdgpu_kernel void @EN64_compress_wrapperItm() {
entry:
  %0 = call i32 @Itm_one()
  ret void
}

define amdgpu_kernel void @EN64_compress_wrapperIjm() {
entry:
  %0 = call i32 @Ijm_one()
  ret void
}

define amdgpu_kernel void @EN64_compress_wrapperImm() {
entry:
  %0 = call i32 @Imm_one()
  ret void
}

@JonChesterfield
Collaborator

There is an underlying error here. We are too eager to promote variables to the structure that is allocated at address zero and that can lead to allocating variables in kernels that should not do so.

There's an easy correctness fix; I'm still considering whether there's a reasonable way to get a better result. Forcing table mode for everything would get you correct behaviour, I believe, but you would see performance regressions from it relative to fixing the underlying error.

@JonChesterfield
Collaborator

Github won't let me comment on the right place in the diff.

set_is_subset(K.second, HybridModuleRootKernels)

That^ is too optimistic; it needs to be set equality, not a subset check. That will reduce how often the module path is taken, which will make some kernels slower, so the fix probably needs to change this to equality, with a second patch then working harder to recover the anticipated performance loss. (A rough sketch of such an equality check follows after this comment.)

Essentially currently we have a path that chooses faster instruction execution over minimising allocation. That's not a deliberate design choice, more an oversight from the original implementation that went unnoticed. Thank you for picking up on it!
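
For illustration only, and not the change that eventually landed, an equality-based test could look roughly like this, reusing the set_is_subset call and ModuleScopeVariables.insert(GV) from the quoted code; KernelSetMatchesRoot is a hypothetical name.

// Illustration only: require the kernels reaching this variable to be
// exactly the hybrid-root kernel set, not merely a subset of it.
bool KernelSetMatchesRoot =
    K.second.size() == HybridModuleRootKernels.size() &&
    set_is_subset(K.second, HybridModuleRootKernels);
if (KernelSetMatchesRoot)
  ModuleScopeVariables.insert(GV);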

@JonChesterfield
Collaborator

Alternative fix implemented at #161464

@hjagasiaAMD
Contributor Author

Closing this PR. Follow the alternate fix at #161464.

@hjagasiaAMD hjagasiaAMD closed this Oct 2, 2025