xe: ocl: sdpa: pull grouped scales out of gemm where possible #2409

petercad · 2025-01-15T00:22:06Z

Hoists the K scaling out of the GEMM microkernel where possible. Here, "where possible" means no zp and group size >= head size (so effectively per-token scaling).
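To see why the hoist is valid under that condition, here is a minimal numerical sketch in plain Python. It is not the kernel code: the function names, the row-per-token layout, and the sample values are all hypothetical. It only assumes what the description states, namely that with no zero points and one scale per token, multiplying each K element by its scale before the dot product gives the same score as multiplying the unscaled dot product once afterwards.

```python
def can_hoist_k_scales(group_size, head_size, has_zero_points):
    # Mirrors the condition in the description: no zp, and the scale group
    # spans the whole head, so there is a single scale per token.
    return (not has_zero_points) and group_size >= head_size


def scores_scaled_inside(k_q, scales, q):
    # Dequantize every K element inside the dot product (pre-hoist behaviour).
    return [sum((s * k) * x for k, x in zip(row, q))
            for row, s in zip(k_q, scales)]


def scores_scaled_outside(k_q, scales, q):
    # Run the dot product on raw quantized K, then scale once per token.
    return [s * sum(k * x for k, x in zip(row, q))
            for row, s in zip(k_q, scales)]


# Hypothetical data: 2 tokens, head size 4, one scale per token.
k_q = [[1, -2, 3, 4], [5, 6, -7, 8]]
scales = [0.5, 0.25]
q = [1.0, 2.0, 3.0, 4.0]

inside = scores_scaled_inside(k_q, scales, q)
outside = scores_scaled_outside(k_q, scales, q)
```

If the group size were smaller than the head size, the scale would vary along the reduction dimension within a token and could not be factored out of the sum, which is why the sketch's `can_hoist_k_scales` rejects that case.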

umar456 · 2025-01-15T17:43:50Z

src/gpu/intel/ocl/micro_sdpa.cl

+#if VAL_SCALES == QUANTIZE_1D
+DECLARE_2D_TILE(v_scales_tile_type, KEY_ATTR_SCALES_DATA_T, SUBGROUP_SIZE,
+        ugemm_kq_sg_tile_m, 1, 1, 1)
+DECLARE_2D_TILE(v_scales_tile_type_float, float, SUBGROUP_SIZE,
+        ugemm_kq_sg_tile_m, 1, 1, 1)
+DECLARE_2D_TILE_BLOCK_OPS(v_scales_tile_type, KEY_ATTR_SCALES_DATA_T,
+        SUBGROUP_SIZE, ugemm_kq_sg_tile_m, 1, 1, 1)
+
+DECLARE_2D_TILE_HREDUCE(s_tile_type, SUBGROUP_SIZE, ugemm_kq_c_type_block0,
+        ugemm_kq_c_type_block1, ugemm_kq_c_type_nblock0,
+        ugemm_kq_c_type_nblock1, v_scales_tile_type_float, SUBGROUP_SIZE,
+        ugemm_kq_sg_tile_m, 1, 1, 1)


Suggested change

-#if VAL_SCALES == QUANTIZE_1D
-DECLARE_2D_TILE(v_scales_tile_type, KEY_ATTR_SCALES_DATA_T, SUBGROUP_SIZE,
-        ugemm_kq_sg_tile_m, 1, 1, 1)
-DECLARE_2D_TILE(v_scales_tile_type_float, float, SUBGROUP_SIZE,
-        ugemm_kq_sg_tile_m, 1, 1, 1)
-DECLARE_2D_TILE_BLOCK_OPS(v_scales_tile_type, KEY_ATTR_SCALES_DATA_T,
-        SUBGROUP_SIZE, ugemm_kq_sg_tile_m, 1, 1, 1)
-DECLARE_2D_TILE_HREDUCE(s_tile_type, SUBGROUP_SIZE, ugemm_kq_c_type_block0,
-        ugemm_kq_c_type_block1, ugemm_kq_c_type_nblock0,
-        ugemm_kq_c_type_nblock1, v_scales_tile_type_float, SUBGROUP_SIZE,
-        ugemm_kq_sg_tile_m, 1, 1, 1)
+#if VAL_SCALES == QUANTIZE_1D
+DECLARE_2D_TILE(v_scales_tile_type, VAL_ATTR_SCALES_DATA_T, SUBGROUP_SIZE,
+        ugemm_kq_sg_tile_m, 1, 1, 1)
+DECLARE_2D_TILE(v_scales_tile_type_float, float, SUBGROUP_SIZE,
+        ugemm_kq_sg_tile_m, 1, 1, 1)
+DECLARE_2D_TILE_BLOCK_OPS(v_scales_tile_type, VAL_ATTR_SCALES_DATA_T,
+        SUBGROUP_SIZE, ugemm_kq_sg_tile_m, 1, 1, 1)
+DECLARE_2D_TILE_HREDUCE(s_tile_type, SUBGROUP_SIZE, ugemm_kq_c_type_block0,
+        ugemm_kq_c_type_block1, ugemm_kq_c_type_nblock0,
+        ugemm_kq_c_type_nblock1, v_scales_tile_type_float, SUBGROUP_SIZE,
+        ugemm_kq_sg_tile_m, 1, 1, 1)

petercad · 2025-01-15

Good catch, thanks.

dzarukin · 2025-01-16T17:59:12Z

src/gpu/intel/ocl/micro_sdpa.cl

@@ -21,11 +21,14 @@
 #include "gemm_kq.h"
 #include "gemm_vs.h"

-/* The quantization parameter may be unique for each token/element */


Random spot: do we have test cases covering both the new optimized per-token scaling path and the old path?

xe: ocl: sdpa: pull grouped scales out of gemm where possible

be7181a

petercad requested a review from a team as a code owner January 15, 2025 00:22

github-actions bot added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Jan 15, 2025

umar456 reviewed Jan 15, 2025

umar456 approved these changes Jan 15, 2025

dzarukin reviewed Jan 16, 2025
