PR #33671: fix(Triton/ROCm): Add missing createTritonGPUAllocateWarpGroups pass to pipeline

hugomano · Google-ML-Automation · commit 450be227dd97 · 2025-11-07T09:48:15.000-08:00
Imported from GitHub PR #33671

This PR fixes the Triton compilation pipeline for ROCm by adding the `createTritonGPUAllocateWarpGroups` pass, which was missing. This pass is necessary for the `ExtractThreadDims` function to work correctly during code generation. It adds the `ttg.total-num-warps` attribute to the MLIR module, which is later consumed in `emitter_helpers.cc`. Without this pass, the compilation fails when trying to extract thread dimensions.

c/ @khasanovaa @chsigg @AleksaArsic

Copybara import of the project:

--
3f9c437 by Hugo Mano <hugo@zml.ai>:

fix(Triton/ROCm): Add missing createTritonGPUAllocateWarpGroups pass to pipeline

--
4ec8907 by Hugo Mano <hugo@zml.ai>:

format

Merging this change closes #33671

COPYBARA_INTEGRATE_REVIEW=#33671 from hugomano:hugomano/fix-rocm-triton-compilation-pipeline 4ec8907
PiperOrigin-RevId: 829473885
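
The producer/consumer dependency described in the commit message can be illustrated with a toy model. This is a minimal sketch, not the XLA or Triton implementation: `ToyModule`, `AllocateWarpGroups`, and `ExtractThreadsPerBlock` are hypothetical stand-ins that use no MLIR APIs. It also assumes that, without warp specialization, the total warp count equals the value of `ttg.num-warps`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for an MLIR module: a bag of integer attributes.
struct ToyModule {
  std::map<std::string, int> attrs;
};

// Toy version of the pass's effect: record the total warp count on the
// module. Assumption: with no warp specialization, the total is simply the
// value of "ttg.num-warps".
void AllocateWarpGroups(ToyModule& module) {
  auto it = module.attrs.find("ttg.num-warps");
  if (it != module.attrs.end()) {
    module.attrs["ttg.total-num-warps"] = it->second;
  }
}

// Toy consumer: derives the thread count per block from the attribute and
// returns std::nullopt when the attribute is missing, mirroring the failure
// mode the PR describes.
std::optional<int> ExtractThreadsPerBlock(const ToyModule& module,
                                          int threads_per_warp) {
  assert(threads_per_warp > 0);
  auto it = module.attrs.find("ttg.total-num-warps");
  if (it == module.attrs.end()) {
    return std::nullopt;
  }
  return it->second * threads_per_warp;
}
```

In this model, calling `ExtractThreadsPerBlock` before `AllocateWarpGroups` yields no value, which corresponds to the compilation failure the PR fixes by running the pass earlier in `MakeLLIR`.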
diff --git a/xla/backends/gpu/codegen/triton/compilation_pipeline_rocm.cc b/xla/backends/gpu/codegen/triton/compilation_pipeline_rocm.cc
@@ -123,6 +123,10 @@ static void MakeLLIR(mlir::OpPassManager* pm,
                      const stream_executor::RocmComputeCapability& rocm_cc,
                      int num_stages) {
   const int custom_lds_size = 0;
+  // The `createTritonGPUAllocateWarpGroups` pass is not implemented in the
+  // upstream Triton, but is necessary for `ExtractThreadDims` in emitter
+  // helpers. It adds the `ttg.total-num-warps` attribute.
+  pm->addPass(mt::gpu::createTritonGPUAllocateWarpGroups());
   pm->addPass(mlir::triton::AMD::createOptimizeLDSUsagePass(
       rocm_cc.gfx_version(), custom_lds_size));
   pm->addPass(mlir::createSCFToControlFlowPass());