[AMDGPU] Use table strategy for LowerModuleLDSPass at O0 #160181
Conversation
Ensure global variables accessed by only one kernel can stay in kernel scope at O0 by switching to the table strategy for AMDGPULowerModuleLDSPass. This is to prevent the LDS limit from being exceeded for the kernel. At higher optimization levels, the additional passes that run can achieve this without switching to the table strategy.
@llvm/pr-subscribers-backend-amdgpu Author: None (hjagasiaAMD). Full diff: https://github.com/llvm/llvm-project/pull/160181.diff (2 Files Affected)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
index f01d5f6726822..dae2bd53b6623 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
@@ -588,7 +588,7 @@ class AMDGPULowerModuleLDS {
return OrderedKernels;
}
- static void partitionVariablesIntoIndirectStrategies(
+ void partitionVariablesIntoIndirectStrategies(
Module &M, LDSUsesInfoTy const &LDSUsesInfo,
VariableFunctionMap &LDSToKernelsThatNeedToAccessItIndirectly,
DenseSet<GlobalVariable *> &ModuleScopeVariables,
@@ -596,6 +596,9 @@ class AMDGPULowerModuleLDS {
DenseSet<GlobalVariable *> &KernelAccessVariables,
DenseSet<GlobalVariable *> &DynamicVariables) {
+ if (TM.getOptLevel() == CodeGenOptLevel::None)
+ LoweringKindLoc = LoweringKind::table;
+
GlobalVariable *HybridModuleRoot =
LoweringKindLoc != LoweringKind::hybrid
? nullptr
@@ -1188,6 +1191,8 @@ class AMDGPULowerModuleLDS {
// Allocated at zero, recorded once on construction, not once per
// kernel
Offset += DL.getTypeAllocSize(MaybeModuleScopeStruct->getValueType());
+ LLVM_DEBUG(dbgs() << "amdgpu-lds-size after ModuleScopeStruct"
+ << Offset << "\n");
}
if (AllocateKernelScopeStruct) {
@@ -1195,6 +1200,8 @@ class AMDGPULowerModuleLDS {
Offset = alignTo(Offset, AMDGPU::getAlign(DL, KernelStruct));
recordLDSAbsoluteAddress(&M, KernelStruct, Offset);
Offset += DL.getTypeAllocSize(KernelStruct->getValueType());
+ LLVM_DEBUG(dbgs()
+ << "amdgpu-lds-size after KernelStruct" << Offset << "\n");
}
// If there is dynamic allocation, the alignment needed is included in
@@ -1205,6 +1212,8 @@ class AMDGPULowerModuleLDS {
GlobalVariable *DynamicVariable = KernelToCreatedDynamicLDS[&Func];
Offset = alignTo(Offset, AMDGPU::getAlign(DL, DynamicVariable));
recordLDSAbsoluteAddress(&M, DynamicVariable, Offset);
+ LLVM_DEBUG(dbgs() << "amdgpu-lds-size after DynamicVariable" << Offset
+ << "\n");
}
if (Offset != 0) {
diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll
new file mode 100644
index 0000000000000..fec5b47198917
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-force-table-O0.ll
@@ -0,0 +1,92 @@
+; RUN: not llc -O0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -filetype=null < %s 2>&1 | FileCheck --check-prefix=CHECK %s
+; CHECK-NOT: error: <unknown>:0:0: local memory (98304) exceeds limit (65536) in function 'k2'
+
+@gA = internal addrspace(3) global [32768 x i8] undef, align 4
+@gB = internal addrspace(3) global [32768 x i8] undef, align 4
+@gC = internal addrspace(3) global [32768 x i8] undef, align 4
+
+; ---- Helpers ----
+
+define internal void @helperA() inlinehint {
+entry:
+ %p = getelementptr [32768 x i8], ptr addrspace(3) @gA, i32 0, i32 0
+ store i8 1, ptr addrspace(3) %p
+ ret void
+}
+
+define internal void @helperB() inlinehint {
+entry:
+ %p = getelementptr [32768 x i8], ptr addrspace(3) @gB, i32 0, i32 0
+ store i8 2, ptr addrspace(3) %p
+ ret void
+}
+
+define internal void @helperC() inlinehint {
+entry:
+ %p = getelementptr [32768 x i8], ptr addrspace(3) @gC, i32 0, i32 0
+ store i8 3, ptr addrspace(3) %p
+ ret void
+}
+
+; ---------------------------------------------------------------------------
+; Dispatch: takes an index and calls the appropriate helper.
+; If dispatch is NOT inlined, a backend lowering pass that conservatively
+; examines call targets may think all helpers (and thus all globals) are
+; potentially referenced by every kernel that calls dispatch.
+; ---------------------------------------------------------------------------
+
+define void @dispatch(i32 %idx) inlinehint {
+entry:
+ %cmp1 = icmp eq i32 %idx, 1
+ br i1 %cmp1, label %case1, label %check2
+
+check2:
+ %cmp2 = icmp eq i32 %idx, 2
+ br i1 %cmp2, label %case2, label %check3
+
+check3:
+ %cmp3 = icmp eq i32 %idx, 3
+ br i1 %cmp3, label %case3, label %default
+
+case1:
+ call void @helperA()
+ br label %done
+
+case2:
+ call void @helperB()
+ br label %done
+
+case3:
+ call void @helperC()
+ br label %done
+
+default:
+ ; fallthrough: call helperA to have a default behaviour
+ call void @helperA()
+ br label %done
+
+done:
+ ret void
+}
+
+; ---- Kernels ----
+
+define amdgpu_kernel void @k0() {
+entry:
+ call void @dispatch(i32 1)
+ call void @dispatch(i32 2)
+ ret void
+}
+
+define amdgpu_kernel void @k1() {
+entry:
+ call void @dispatch(i32 2)
+ call void @dispatch(i32 1)
+ ret void
+}
+
+define amdgpu_kernel void @k2() {
+entry:
+ call void @helperC()
+ ret void
+}
if (TM.getOptLevel() == CodeGenOptLevel::None)
  LoweringKindLoc = LoweringKind::table;
This should not override the explicit flag. This also seems like a dubious way to avoid going over the limit; we can't rely on other optimizations without -O0 either. Is it possible to compute the size usage with the different strategies before committing to one?
@gA = internal addrspace(3) global [32768 x i8] undef, align 4
@gB = internal addrspace(3) global [32768 x i8] undef, align 4
@gC = internal addrspace(3) global [32768 x i8] undef, align 4
Suggested change (undef to poison):
@gA = internal addrspace(3) global [32768 x i8] poison, align 4
@gB = internal addrspace(3) global [32768 x i8] poison, align 4
@gC = internal addrspace(3) global [32768 x i8] poison, align 4
@@ -0,0 +1,92 @@
; RUN: not llc -O0 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx942 -filetype=null < %s 2>&1 | FileCheck --check-prefix=CHECK %s
; CHECK-NOT: error: <unknown>:0:0: local memory (98304) exceeds limit (65536) in function 'k2'
CHECK-NOT should be avoided, this should check the actual output. Not erroring is sufficient
; ---- Helpers ----

define internal void @helperA() inlinehint {
Remove all the inlinehint, they aren't doing anything
recordLDSAbsoluteAddress(&M, KernelStruct, Offset);
Offset += DL.getTypeAllocSize(KernelStruct->getValueType());
LLVM_DEBUG(dbgs()
           << "amdgpu-lds-size after KernelStruct" << Offset << "\n");
Suggested change ("\n" to '\n'):
<< "amdgpu-lds-size after KernelStruct" << Offset << '\n');
} else if (set_is_subset(K.second, HybridModuleRootKernels)) {
  ModuleScopeVariables.insert(GV);
uint64_t LocalMemLimit = 0;
for (Function &F : M) {
Comment what this is doing
for (Function &F : M) {
  if (!F.isDeclaration()) {
    const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
    LocalMemLimit = ST.getAddressableLocalMemorySize();
The limit should really come from the entry point kernel, not just the first function you happen to find
KernelAccessVariables.insert(GV);
} else if (set_is_subset(K.second, HybridModuleRootKernels)) {
  ModuleScopeVariables.insert(GV);
uint64_t LocalMemLimit = 0;
The comment at the top of the file claims the hybrid strategy offers precise allocation, so is there just a bug somewhere?
I don't think there is a bug. Below is the access pattern of the test.
; This test has the following kernels with the following GV access pattern
; EN32 kernels
; EN32_compress_wrapperIhm - GVs 1, 2, 3, 4, 5, 6, 7
; EN32_compress_wrapperItm - GVs 8, 9, 10, 11, 12, 13, 7
; EN32_compress_wrapperIjm - GVs 15, 16, 17, 18, 19, 20, 7
; EN32_compress_wrapperImm - GVs 21, 22, 23, 24, 25, 26, 27, 7
; EN64 kernels
; EN64_compress_wrapperIhm - GVs 1, 2, 3, 4, 5, 6, 7
; EN64_compress_wrapperItm - GVs 8, 9, 10, 11, 12, 13, 7
; EN64_compress_wrapperIjm - GVs 15, 16, 17, 18, 19, 20, 7
; EN64_compress_wrapperImm - GVs 21, 22, 23, 24, 25, 26, 27, 7
ret void
}

define i32 @_Z17HlifCompressBatchILi1ERN7hipcomp25cascaded_compress_wrapperIhmLi128ELi4096EEERN18cooperative_groups12thread_blockEEvRK12CompressArgsOT0_OT1_() {
Fix all of these variable names, replace them with more meaningful testcase names
@@ -0,0 +1,237 @@
; RUN: llc -mtriple=amdgcn-amd-amdhsa < %s
Need to check the output, preferably of the IR pass
Also should use a real subtarget
Partially reduced: https://godbolt.org/z/aT1sxa765. The title doesn't match anymore. I also think there's just a bug somewhere and there doesn't need to be a strategy change.
The different strategies change how variables are accessed but not where they are allocated. This change may improve compile time. It will definitely regress runtime. If it changes the reported amount of LDS used by any kernel, there is an error in this pass or elsewhere.
@@ -0,0 +1,264 @@
; RUN: opt -S -mtriple=amdgcn-- -mcpu=gfx942 -amdgpu-lower-module-lds < %s 2>&1 | FileCheck %s
; RUN: opt -S -mtriple=amdgcn-- -mcpu=gfx942 -passes=amdgpu-lower-module-lds < %s 2>&1 | FileCheck %s
This test isn't reduced enough. I previously posted a godbolt link which is smaller than this, and I'm sure it can be shrunk further
I see the godbolt IR. In what/which sense is this believed to change allocated lds size? Based on debug prints added to this pass, on IR metadata, on the binary metadata, other?
Pls note, the godbolt IR does not show the issue. The reproducer in the squashed patch (which Jon has also pasted in his comment does)
The change is noisy enough that I'm not sure what it's trying to do. The goal stated in the commit message can't be met by the change stated in the commit message.
What's the test case that this change decreases the amount of allocated lds for?
Tangentially related, was this report prepared using an LLM?
Reproduces (on whatever I had in ~/llvm-install)
The backend allocates static LDS based on the amdgpu-lds-size metadata. These numbers are different for the above two paths but shouldn't be. I note that 'table' lowering only exists at all to isolate part of the pass for testing and is not expected to be used by anyone. I suspect we're counting it wrong on the table path, as opposed to the default properly allocating 68k and the table one properly allocating between 11k and 26k, but there's definitely something not working as intended here. Thanks for the reproducer.
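For context, a minimal sketch of the metadata being discussed (the kernel name and size value here are hypothetical; the "amdgpu-lds-size" function attribute is what the backend reads when allocating static LDS):

```
; Hypothetical example: after the lowering pass runs, each kernel carries an
; "amdgpu-lds-size" attribute; the backend allocates static LDS from it.
define amdgpu_kernel void @k2() #0 {
  ret void
}
attributes #0 = { "amdgpu-lds-size"="32768" }
```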
There is an underlying error here. We are too eager to promote variables to the structure that is allocated at address zero and that can lead to allocating variables in kernels that should not do so. There's an easy correctness fix, I'm still considering whether there's a reasonable way to get a better result. Forcing table mode for everything would get you correct behaviour I believe, but you should get performance regressions from it relative to fixing the underlying error.
GitHub won't let me comment on the right place in the diff.

set_is_subset(K.second, HybridModuleRootKernels)

That^ is too optimistic; it needs to be set equality, not subset. That'll reduce how often the module path is taken, which will make some kernels slower, so the fix probably needs to be changing that to equality, and then a second patch working harder to retrieve the anticipated performance loss. Essentially, we currently have a path that chooses faster instruction execution over minimising allocation. That's not a deliberate design choice, more an oversight from the original implementation that went unnoticed. Thank you for picking up on it!
Alternative fix implemented at #161464 |
Closing this PR. Follow the alternate fix at #161464.