
Conversation

hjagasiaAMD
Contributor

Ensure global variables accessed by only one kernel can stay in kernel scope at O0 by switching to the table strategy for AMDGPULowerModuleLDSPass. This is to prevent the LDS limit from being exceeded for the kernel. At higher optimization levels, the additional passes that run can achieve this without switching to the table strategy.

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Sep 22, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (hjagasiaAMD)

Changes

Ensure global variables accessed by only one kernel can stay in kernel scope at O0 by switching to the table strategy for AMDGPULowerModuleLDSPass. This is to prevent the LDS limit from being exceeded for the kernel. At higher optimization levels, the additional passes that run can achieve this without switching to the table strategy.


Full diff: https://github.com/llvm/llvm-project/pull/160181.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp (+10-1)
  • (added) llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll (+92)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
index f01d5f6726822..dae2bd53b6623 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
@@ -588,7 +588,7 @@ class AMDGPULowerModuleLDS {
     return OrderedKernels;
   }
 
-  static void partitionVariablesIntoIndirectStrategies(
+  void partitionVariablesIntoIndirectStrategies(
       Module &M, LDSUsesInfoTy const &LDSUsesInfo,
       VariableFunctionMap &LDSToKernelsThatNeedToAccessItIndirectly,
       DenseSet<GlobalVariable *> &ModuleScopeVariables,
@@ -596,6 +596,9 @@ class AMDGPULowerModuleLDS {
       DenseSet<GlobalVariable *> &KernelAccessVariables,
       DenseSet<GlobalVariable *> &DynamicVariables) {
 
+    if (TM.getOptLevel() == CodeGenOptLevel::None)
+      LoweringKindLoc = LoweringKind::table;
+
     GlobalVariable *HybridModuleRoot =
         LoweringKindLoc != LoweringKind::hybrid
             ? nullptr
@@ -1188,6 +1191,8 @@ class AMDGPULowerModuleLDS {
           // Allocated at zero, recorded once on construction, not once per
           // kernel
           Offset += DL.getTypeAllocSize(MaybeModuleScopeStruct->getValueType());
+          LLVM_DEBUG(dbgs() << "amdgpu-lds-size after ModuleScopeStruct"
+                            << Offset << "\n");
         }
 
         if (AllocateKernelScopeStruct) {
@@ -1195,6 +1200,8 @@ class AMDGPULowerModuleLDS {
           Offset = alignTo(Offset, AMDGPU::getAlign(DL, KernelStruct));
           recordLDSAbsoluteAddress(&M, KernelStruct, Offset);
           Offset += DL.getTypeAllocSize(KernelStruct->getValueType());
+          LLVM_DEBUG(dbgs()
+                     << "amdgpu-lds-size after KernelStruct" << Offset << "\n");
         }
 
         // If there is dynamic allocation, the alignment needed is included in
@@ -1205,6 +1212,8 @@ class AMDGPULowerModuleLDS {
           GlobalVariable *DynamicVariable = KernelToCreatedDynamicLDS[&Func];
           Offset = alignTo(Offset, AMDGPU::getAlign(DL, DynamicVariable));
           recordLDSAbsoluteAddress(&M, DynamicVariable, Offset);
+          LLVM_DEBUG(dbgs() << "amdgpu-lds-size after DynamicVariable" << Offset
+                            << "\n");
         }
 
         if (Offset != 0) {
diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll
new file mode 100644
index 0000000000000..fec5b47198917
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll
@@ -0,0 +1,92 @@
+; RUN: not llc -O0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -filetype=null < %s 2>&1 | FileCheck --check-prefix=CHECK %s
+; CHECK-NOT: error: <unknown>:0:0: local memory (98304) exceeds limit (65536) in function 'k2'
+
+@gA = internal addrspace(3) global [32768 x i8] undef, align 4
+@gB = internal addrspace(3) global [32768 x i8] undef, align 4
+@gC = internal addrspace(3) global [32768 x i8] undef, align 4
+
+; ---- Helpers ----
+
+define internal void @helperA() inlinehint {
+entry:
+  %p = getelementptr [32768 x i8], ptr addrspace(3) @gA, i32 0, i32 0
+  store i8 1, ptr addrspace(3) %p
+  ret void
+}
+
+define internal void @helperB() inlinehint {
+entry:
+  %p = getelementptr [32768 x i8], ptr addrspace(3) @gB, i32 0, i32 0
+  store i8 2, ptr addrspace(3) %p
+  ret void
+}
+
+define internal void @helperC() inlinehint {
+entry:
+  %p = getelementptr [32768 x i8], ptr addrspace(3) @gC, i32 0, i32 0
+  store i8 3, ptr addrspace(3) %p
+  ret void
+}
+
+; ---------------------------------------------------------------------------
+; Dispatch: takes an index and calls the appropriate helper.
+; If dispatch is NOT inlined, a backend lowering pass that conservatively
+; examines call targets may think all helpers (and thus all globals) are
+; potentially referenced by every kernel that calls dispatch.
+; ---------------------------------------------------------------------------
+
+define void @dispatch(i32 %idx) inlinehint {
+entry:
+  %cmp1 = icmp eq i32 %idx, 1
+  br i1 %cmp1, label %case1, label %check2
+
+check2:
+  %cmp2 = icmp eq i32 %idx, 2
+  br i1 %cmp2, label %case2, label %check3
+
+check3:
+  %cmp3 = icmp eq i32 %idx, 3
+  br i1 %cmp3, label %case3, label %default
+
+case1:
+  call void @helperA()
+  br label %done
+
+case2:
+  call void @helperB()
+  br label %done
+
+case3:
+  call void @helperC()
+  br label %done
+
+default:
+  ; fallthrough: call helperA to have a default behaviour
+  call void @helperA()
+  br label %done
+
+done:
+  ret void
+}
+
+; ---- Kernels ----
+
+define amdgpu_kernel void @k0() {
+entry:
+  call void @dispatch(i32 1)
+  call void @dispatch(i32 2)
+  ret void
+}
+
+define amdgpu_kernel void @k1() {
+entry:
+  call void @dispatch(i32 2)
+  call void @dispatch(i32 1)
+  ret void
+}
+
+define amdgpu_kernel void @k2() {
+entry:
+  call void @helperC()
+  ret void
+}

Comment on lines 599 to 600
if (TM.getOptLevel() == CodeGenOptLevel::None)
LoweringKindLoc = LoweringKind::table;
Contributor

This should not override the explicit flag. This also seems like a dubious way to avoid going over the limit; we can't rely on other optimizations when not at -O0 either. Is it possible to compute the size usage with the different strategies before committing to one?
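
One possible shape for such a pre-check, sketched only as an illustration: it assumes the candidate variable set for a kernel is already known, and estimateKernelLDS is a hypothetical helper. Only alignTo, AMDGPU::getAlign, getValueType, getTypeAllocSize, and getAddressableLocalMemorySize are taken from code already in this patch or pass.

// Hypothetical helper (illustration only): estimate the LDS footprint a
// kernel would end up with if the given variables were packed together,
// mirroring the offset/alignment arithmetic the pass already performs.
static uint64_t estimateKernelLDS(const DataLayout &DL,
                                  ArrayRef<GlobalVariable *> Vars) {
  uint64_t Offset = 0;
  for (GlobalVariable *GV : Vars) {
    Offset = alignTo(Offset, AMDGPU::getAlign(DL, GV));
    Offset += DL.getTypeAllocSize(GV->getValueType());
  }
  return Offset;
}

// A strategy pre-check could then compare the estimate against the
// addressable limit before committing, e.g.:
//   if (estimateKernelLDS(DL, CandidateVars) >
//       ST.getAddressableLocalMemorySize())
//     ... fall back to a per-kernel layout instead ...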

Comment on lines 4 to 6
@gA = internal addrspace(3) global [32768 x i8] undef, align 4
@gB = internal addrspace(3) global [32768 x i8] undef, align 4
@gC = internal addrspace(3) global [32768 x i8] undef, align 4
Contributor

Suggested change
@gA = internal addrspace(3) global [32768 x i8] undef, align 4
@gB = internal addrspace(3) global [32768 x i8] undef, align 4
@gC = internal addrspace(3) global [32768 x i8] undef, align 4
@gA = internal addrspace(3) global [32768 x i8] poison, align 4
@gB = internal addrspace(3) global [32768 x i8] poison, align 4
@gC = internal addrspace(3) global [32768 x i8] poison, align 4

@@ -0,0 +1,92 @@
; RUN: not llc -O0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -filetype=null < %s 2>&1 | FileCheck --check-prefix=CHECK %s
; CHECK-NOT: error: <unknown>:0:0: local memory (98304) exceeds limit (65536) in function 'k2'
Contributor

CHECK-NOT should be avoided, this should check the actual output. Not erroring is sufficient


; ---- Helpers ----

define internal void @helperA() inlinehint {
Contributor

Remove all the inlinehint, they aren't doing anything

recordLDSAbsoluteAddress(&M, KernelStruct, Offset);
Offset += DL.getTypeAllocSize(KernelStruct->getValueType());
LLVM_DEBUG(dbgs()
<< "amdgpu-lds-size after KernelStruct" << Offset << "\n");
Contributor

Suggested change
<< "amdgpu-lds-size after KernelStruct" << Offset << "\n");
<< "amdgpu-lds-size after KernelStruct" << Offset << '\n');

} else if (set_is_subset(K.second, HybridModuleRootKernels)) {
ModuleScopeVariables.insert(GV);
uint64_t LocalMemLimit = 0;
for (Function &F : M) {
Contributor

Comment what this is doing

for (Function &F : M) {
if (!F.isDeclaration()) {
const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
LocalMemLimit = ST.getAddressableLocalMemorySize();
Contributor

The limit should really come from the entry point kernel, not just the first function you happen to find
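
As an illustration of that suggestion, a minimal sketch of restricting the scan to entry points is below. It is not the patch's code; it only reuses TM.getSubtarget<GCNSubtarget>(F) and getAddressableLocalMemorySize() from the quoted snippet, and whether a per-kernel value or a single conservative limit is wanted depends on how the limit is used later in the patch.

// Illustration only: derive the limit from amdgpu_kernel entry points rather
// than from the first defined function that happens to be visited.
uint64_t LocalMemLimit = 0;
for (Function &F : M) {
  if (F.isDeclaration() || F.getCallingConv() != CallingConv::AMDGPU_KERNEL)
    continue;
  const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
  LocalMemLimit = std::max<uint64_t>(LocalMemLimit,
                                     ST.getAddressableLocalMemorySize());
}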

KernelAccessVariables.insert(GV);
} else if (set_is_subset(K.second, HybridModuleRootKernels)) {
ModuleScopeVariables.insert(GV);
uint64_t LocalMemLimit = 0;
Contributor

The comment at the top of the file claims the hybrid strategy offers precise allocation, so is there just a bug somewhere?

Contributor Author

I don't think there is a bug. Below is the access pattern of the test.

+; This test has the following kernels with following GV access pattern
+; EN32 kernels
+; EN32_compress_wrapperIhm - GV's 1, 2, 3, 4, 5, 6, 7
+; EN32_compress_wrapperItm - GV's 8, 9, 10, 11, 12, 13, 7
+; EN32_compress_wrapperIjm - GV's 15, 16, 17, 18, 19, 20, 7
+; EN32_compress_wrapperImm - GV's 21, 22, 23, 24, 25, 26, 27, 7
+; EN64 kernels
+; EN64_compress_wrapperIhm - GV's 1, 2, 3, 4, 5, 6, 7
+; EN64_compress_wrapperItm - GV's 8, 9, 10, 11, 12, 13, 7
+; EN64_compress_wrapperIjm - GV's 15, 16, 17, 18, 19, 20, 7
+; EN64_compress_wrapperImm - GV's 21, 22, 23, 24, 25, 26, 27, 7

ret void
}

define i32 @_Z17HlifCompressBatchILi1ERN7hipcomp25cascaded_compress_wrapperIhmLi128ELi4096EEERN18cooperative_groups12thread_blockEEvRK12CompressArgsOT0_OT1_() {
Contributor

Fix all of these variable names, replace them with more meaningful testcase names

@@ -0,0 +1,237 @@
; RUN: llc -mtriple=amdgcn-amd-amdhsa < %s
Contributor

Need to check the output, preferably of the IR pass

Contributor

Also should use a real subtarget

@arsenm
Contributor

arsenm commented Sep 25, 2025

Partially reduced: https://godbolt.org/z/aT1sxa765

Title doesn't match anymore. I also think there's just a bug somewhere and there doesn't need to be a strategy change

@JonChesterfield
Collaborator

The different strategies change how variables are accessed but not where they are allocated. This change may improve compile time. It will definitely regress runtime.

If it changes the reported amount of LDS used by any kernel, there is an error in this pass or elsewhere.

@@ -0,0 +1,264 @@
; RUN: opt -S -mtriple=amdgcn-- -mcpu=gfx942 -amdgpu-lower-module-lds < %s 2>&1 | FileCheck %s
; RUN: opt -S -mtriple=amdgcn-- -mcpu=gfx942 -passes=amdgpu-lower-module-lds < %s 2>&1 | FileCheck %s

Contributor

This test isn't reduced enough. I previously posted a godbolt link which is smaller than this, and I'm sure it can be shrunk further

Collaborator

@JonChesterfield JonChesterfield Sep 26, 2025

I see the godbolt IR. In what sense is this believed to change the allocated LDS size? Based on debug prints added to this pass, on IR metadata, on the binary metadata, or something else?

Contributor Author

Please note the godbolt IR does not show the issue. The reproducer in the squashed patch (which Jon has also pasted in his comment) does.

Collaborator

@JonChesterfield JonChesterfield left a comment

The change is noisy enough that I'm not sure what it's trying to do. The goal stated in the commit message can't be met by the change stated in the commit message.

What's the test case that this change decreases the amount of allocated lds for?

Tangentially related, was this report prepared using an LLM?

@JonChesterfield
Collaborator

JonChesterfield commented Sep 26, 2025

Reproduces (on whatever I had in ~/llvm-install)

~/llvm-install/bin/opt -mtriple=amdgcn -mcpu=gfx942 lower-lds.ll -amdgpu-lower-module-lds -S -o lower-lds.baseline.ll

~/llvm-install/bin/opt -mtriple=amdgcn -mcpu=gfx942 lower-lds.ll -amdgpu-lower-module-lds -amdgpu-lower-module-lds-strategy=table -S -o lower-lds.table.ll 

The backend allocates static lds based on the amdgpu-lds-size metadata. These numbers are different for the above two paths but shouldn't be. I note that 'table' lowering only exists at all to isolate part of the pass for testing and is not expected to be used by anyone.

I suspect we're counting it wrong on the table path, as opposed to the default properly allocating 68k and the table one properly allocating between 11k and 26k, but there's definitely something not working as intended here. Thanks for the reproducer

; lower-lds.ll
%RawStorage1 = type { [1056 x i8] }
%RawStorage2 = type { [4 x i8] }
%RawStorage3 = type { [16 x i8] }

@one = addrspace(3) global [1026 x i32] poison
@two = addrspace(3) global [1026 x i32] poison
@three = external addrspace(3) global [2048 x i32]
@four = addrspace(3) global [2050 x i32] poison
@five = addrspace(3) global [16 x i32] poison
@six = external addrspace(3) global %RawStorage1
@seven = addrspace(3) global %RawStorage2 poison
@eight = addrspace(3) global [1026 x i32] poison
@nine = addrspace(3) global [1026 x i32] poison
@ten = external addrspace(3) global [1024 x i32]
@eleven = addrspace(3) global [1026 x i32] poison
@twelve = external addrspace(3) global [16 x i32]
@thirteen = external addrspace(3) global %RawStorage1
@fourteen = external addrspace(3) global [1 x i32]
@fifteen = addrspace(3) global [1026 x i32] poison
@sixteen = addrspace(3) global [1026 x i32] poison
@seventeen = external addrspace(3) global [512 x i32]
@eighteen = addrspace(3) global [514 x i32] poison
@nineteen = external addrspace(3) global [16 x i32]
@twenty = external addrspace(3) global %RawStorage1
@twentyone = external addrspace(3) global [514 x i64]
@twentytwo = external addrspace(3) global [514 x i64]
@twentythree = external addrspace(3) global [256 x i32]
@twentyfour = external addrspace(3) global [258 x i32]
@twentyfive = external addrspace(3) global [16 x i32]
@twentysix = external addrspace(3) global %RawStorage1
@twentyseven = external addrspace(3) global %RawStorage3

define amdgpu_kernel void @EN32_compress_wrapperIhm() {
entry:
  %0 = call i32 @Ihm_one()
  ret void
}

define i32 @Ihm_one() {
entry:
  %0 = call i32 @Ihm_chunk()
  ret i32 %0
}

define i32 @Ihm_chunk() {
entry:
  %0 = call i32 @Ihm_CascadedOpts()
  ret i32 %0
}

define i32 @Ihm_CascadedOpts() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @one to ptr), ptr null, align 8
  store ptr addrspacecast (ptr addrspace(3) @two to ptr), ptr null, align 8
  %add.ptr = getelementptr i32, ptr getelementptr inbounds (i32, ptr addrspacecast (ptr addrspace(3) @five to ptr), i64 1), i64 0
  call void @Ihm_PS1_PT1_PS4_S7()
  %call69 = call i32 @foo(ptr addrspacecast (ptr addrspace(3) @three to ptr), ptr addrspacecast (ptr addrspace(3) @four to ptr))
  ret i32 %call69
}

define void @Ihm_PS1_PT1_PS4_S7() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @six to ptr), ptr null, align 8
  ret void
}

define i32 @foo(ptr %input, ptr %temp_storage) {
entry:
  call void @Itm_PjPS4()
  ret i32 0
}

define void @Itm_PjPS4() {
entry:
  call void @Itm_PS1_Pj()
  ret void
}

define void @Itm_PS1_Pj() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @seven to ptr), ptr null, align 8
  ret void
}

define amdgpu_kernel void @EN32_compress_wrapperItm() {
entry:
  %0 = call i32 @Itm_one()
  ret void
}

define i32 @Itm_one() {
entry:
  %0 = call i32 @Itm_chunk()
  ret i32 %0
}

define i32 @Itm_chunk() {
entry:
  %0 = call i32 @Itm_CascadedOpts()
  ret i32 %0
}

define i32 @Itm_CascadedOpts() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @eight to ptr), ptr null, align 8
  store ptr addrspacecast (ptr addrspace(3) @nine to ptr), ptr null, align 8
  %add.ptr = getelementptr i32, ptr getelementptr inbounds (i32, ptr addrspacecast (ptr addrspace(3) @twelve to ptr), i64 1), i64 0
  call void @Itm_PS1_PT1_PS4_S7()
  %call69 = call i32 @foo(ptr addrspacecast (ptr addrspace(3) @ten to ptr), ptr addrspacecast (ptr addrspace(3) @eleven to ptr))
  ret i32 %call69
}

define void @Itm_PS1_PT1_PS4_S7() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @thirteen to ptr), ptr null, align 8
  ret void
}

define amdgpu_kernel void @EN32_compress_wrapperIjm() {
entry:
  %arrayidx = getelementptr [1 x i32], ptr addrspacecast (ptr addrspace(3) @fourteen to ptr), i64 0, i64 0
  %0 = call i32 @Ijm_one()
  ret void
}

define i32 @Ijm_one() {
entry:
  %0 = call i32 @Ijm_chunk()
  ret i32 %0
}

define i32 @Ijm_chunk() {
entry:
  %0 = call i32 @Ijm_CascadedOpts()
  ret i32 %0
}

define i32 @Ijm_CascadedOpts() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @fifteen to ptr), ptr null, align 8
  store ptr addrspacecast (ptr addrspace(3) @sixteen to ptr), ptr null, align 8
  %add.ptr = getelementptr i32, ptr getelementptr inbounds (i32, ptr addrspacecast (ptr addrspace(3) @nineteen to ptr), i64 1), i64 0
  call void @Ijm_PS1_PT1_PS4_S7()
  %call69 = call i32 @foo(ptr addrspacecast (ptr addrspace(3) @seventeen to ptr), ptr addrspacecast (ptr addrspace(3) @eighteen to ptr))
  ret i32 %call69
}

define void @Ijm_PS1_PT1_PS4_S7() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @twenty to ptr), ptr null, align 8
  ret void
}

define amdgpu_kernel void @EN32_compress_wrapperImm() {
entry:
  %0 = call i32 @Imm_one()
  ret void
}

define i32 @Imm_one() {
entry:
  %0 = call i32 @Imm_chunk()
  ret i32 %0
}

define i32 @Imm_chunk() {
entry:
  %0 = call i32 @Imm_CascadedOpts()
  ret i32 %0
}

define i32 @Imm_CascadedOpts() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @twentyone to ptr), ptr null, align 8
  store ptr addrspacecast (ptr addrspace(3) @twentytwo to ptr), ptr null, align 8
  %add.ptr = getelementptr i32, ptr getelementptr inbounds (i32, ptr addrspacecast (ptr addrspace(3) @twentyfive to ptr), i64 1), i64 0
  br i1 false, label %for.body65, label %for.end102

for.body65:
  call void @Imm_PS1_PT1_PS4_S7()
  %call69 = call i32 @foo(ptr addrspacecast (ptr addrspace(3) @twentythree to ptr), ptr addrspacecast (ptr addrspace(3) @twentyfour to ptr))
  ret i32 %call69

for.end102:
  %call106 = call i32 @Imm_PjPKjPS5_S6_b()
  ret i32 0
}

define void @Imm_PS1_PT1_PS4_S7() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @twentysix to ptr), ptr null, align 8
  ret void
}

define i32 @Imm_PjPKjPS5_S6_b() {
entry:
  call void @Imm_PjPS4()
  ret i32 0
}

define void @Imm_PjPS4() {
entry:
  call void @Imm_PS1_Pj()
  ret void
}

define void @Imm_PS1_Pj() {
entry:
  store ptr addrspacecast (ptr addrspace(3) @twentyseven to ptr), ptr null, align 8
  ret void
}

define amdgpu_kernel void @EN64_compress_wrapperIhm() {
entry:
  %0 = call i32 @Ihm_one()
  ret void
}

define amdgpu_kernel void @EN64_compress_wrapperItm() {
entry:
  %0 = call i32 @Itm_one()
  ret void
}

define amdgpu_kernel void @EN64_compress_wrapperIjm() {
entry:
  %0 = call i32 @Ijm_one()
  ret void
}

define amdgpu_kernel void @EN64_compress_wrapperImm() {
entry:
  %0 = call i32 @Imm_one()
  ret void
}

@JonChesterfield
Collaborator

There is an underlying error here. We are too eager to promote variables to the structure that is allocated at address zero and that can lead to allocating variables in kernels that should not do so.

There's an easy correctness fix; I'm still considering whether there's a reasonable way to get a better result. Forcing table mode for everything would get you correct behaviour, I believe, but you would see performance regressions from it relative to fixing the underlying error.

@JonChesterfield
Collaborator

Github won't let me comment on the right place in the diff.

set_is_subset(K.second, HybridModuleRootKernels)

That^ is too optimistic; it needs to be set equality, not a subset check. That will reduce how often the module path is taken, which will make some kernels slower, so the fix probably needs to change this to equality, with a second patch then working harder to recover the anticipated performance loss. (A rough sketch of such an equality check follows after this comment.)

Essentially currently we have a path that chooses faster instruction execution over minimising allocation. That's not a deliberate design choice, more an oversight from the original implementation that went unnoticed. Thank you for picking up on it!
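
For illustration only, and not the change that eventually landed, an equality-based test could look roughly like this, reusing the set_is_subset call and ModuleScopeVariables.insert(GV) from the quoted code; KernelSetMatchesRoot is a hypothetical name.

// Illustration only: require the kernels reaching this variable to be
// exactly the hybrid-root kernel set, not merely a subset of it.
bool KernelSetMatchesRoot =
    K.second.size() == HybridModuleRootKernels.size() &&
    set_is_subset(K.second, HybridModuleRootKernels);
if (KernelSetMatchesRoot)
  ModuleScopeVariables.insert(GV);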

@JonChesterfield
Collaborator

Alternative fix implemented at #161464

@hjagasiaAMD
Contributor Author

Closing this PR. Follow the alternate fix at #161464.

@hjagasiaAMD hjagasiaAMD closed this Oct 2, 2025