xe: jit: gemm: tile scrambling for improved cold-TLB performance #2607
Addresses MFDNN-12523 -- lower-than-expected performance of memory-bound TN compressed-weights kernels on MTL/ARL when the TLB is cold. Note that a cold TLB can't be benchmarked directly in benchdnn -- even with `--cold-cache=all`, the TLB will be warm. However, the driver knob `DirectSubmissionNewResourceTlbFlush=1` can be used to flush the TLB prior to each kernel launch.

This PR introduces tile "scrambling" -- reordering tiles in the m and/or n dimensions so that adjacent workgroups are not too close together. Specifically, we want to avoid cases where two different workgroups simultaneously access chunks of the weights matrix that belong to the same 64KB page. If that page is not found in the STLB, both workgroups sit idle waiting for the same page walk to complete. If we must pay the penalty of a page walk to fill the STLB, we are better off having each workgroup on a different page -- this fills the STLB faster and reduces the total "dead time" spent waiting on duplicate page walks.
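To make the failure mode concrete, here is a minimal host-side sketch (illustrative only -- the sizes are assumptions, not taken from the actual kernels) of how consecutively numbered workgroups land on the same 64KB page when each workgroup's chunk of the weights matrix is much smaller than a page:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Illustrative numbers, not from this PR.
    const size_t page_bytes = 64 * 1024; // page granularity tracked by the STLB
    const size_t chunk_bytes = 4 * 1024; // weights consumed per workgroup

    // With the natural ordering, workgroup g starts reading at g * chunk_bytes,
    // so consecutive workgroups share a page until the offset crosses a
    // 64KB boundary.
    for (size_t g = 0; g < 4; ++g) {
        size_t offset = g * chunk_bytes;
        printf("group %zu -> page %zu\n", g, offset / page_bytes);
    }
    // All four groups hit page 0: on a cold STLB they all stall behind a
    // single page walk instead of filling the STLB with four distinct
    // translations in parallel.
    return 0;
}
```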
In more detail, the scrambling algorithm replaces the original group ID (in the m or n dimension) with `(ID * stride) mod #groups`, where the stride is chosen to be coprime to `#groups` (so we don't miss any groups) as well as large enough that adjacent groups don't start on the same 64KB page. Currently, the scrambling is only applied to XeHPG/XeLPG compressed-weights TN GEMV kernels, though it may be useful in other cases.
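Below is a host-side sketch of the idea (an illustrative reimplementation, not the PR's actual JIT code): pick the smallest stride that makes the remap a bijection (coprime to the group count) and spaces adjacent IDs at least one 64KB page apart, then remap each ID as `(ID * stride) mod #groups`:

```cpp
#include <cstddef>
#include <cstdio>
#include <numeric> // std::gcd

// Smallest stride that (a) is coprime to ngroups, so the map
// id -> (id * stride) % ngroups hits every group exactly once, and
// (b) is large enough that adjacent IDs are at least one 64KB page
// apart, assuming each group reads chunk_bytes of contiguous weights.
static size_t choose_stride(size_t ngroups, size_t chunk_bytes) {
    const size_t page_bytes = 64 * 1024;
    size_t stride = (page_bytes + chunk_bytes - 1) / chunk_bytes;
    while (std::gcd(stride, ngroups) != 1)
        ++stride;
    return stride;
}

int main() {
    // Illustrative sizes; the PR computes the equivalent inside the
    // GEMM kernel generator.
    const size_t ngroups = 24, chunk_bytes = 4 * 1024;
    const size_t stride = choose_stride(ngroups, chunk_bytes); // 17 here
    for (size_t id = 0; id < ngroups; ++id)
        printf("group %zu -> tile %zu\n", id, id * stride % ngroups);
    return 0;
}
```

With these numbers, adjacent workgroups end up `17 * 4KB = 68KB` apart in the weights matrix, so a cold STLB is filled with distinct translations rather than duplicate page walks.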