
xe: jit: gemm: tile scrambling for improved cold-TLB performance #2607

Closed
wants to merge 2 commits

Conversation


@petercad petercad commented Feb 5, 2025

Addresses MFDNN-12523 -- lower-than-expected performance of memory-bound TN compressed-weights kernels on MTL/ARL when the TLB is cold. Note that a cold TLB can't be benchmarked directly in benchdnn -- even with --cold-cache=all the TLB will be warm. However, the driver knob DirectSubmissionNewResourceTlbFlush=1 can be used to flush the TLB prior to each kernel launch.

This PR introduces tile "scrambling" -- reordering tiles in the m and/or n dimensions so that adjacent workgroups are not too close. Specifically, we want to avoid cases where two different workgroups are simultaneously accessing chunks of the weights matrix that belong to the same 64KB page. If this page is not found in the STLB, then both workgroups will sit idle waiting for the page walk to complete. If we must pay the penalty of a page walk to fill the STLB, we would be better off having each workgroup on a different page -- this will fill the STLB faster and reduce the total amount of "dead time" waiting for duplicate page walks.

In more detail, the scrambling algorithm replaces the original group ID (in m or n dimensions) with (ID * stride) mod #groups, where the stride is chosen to be coprime to #groups (so we don't miss any groups) as well as large enough that adjacent groups don't start on the same 64KB page. Currently, the scrambling is only applied to XeHPG/XeLPG compressed weights TN GEMV kernels, though it may be useful in other cases.
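The remapping described above can be sketched as follows. This is an illustrative model only, not the oneDNN kernel code: the function name, the tile-size parameter, and the stride-selection loop are assumptions for the sketch; the actual stride choice in the kernel generator may differ.

```python
from math import gcd

def scramble_group_ids(n_groups, tile_bytes, page_bytes=64 * 1024):
    """Sketch of tile scrambling: remap group ID g to (g * stride) mod
    n_groups. The stride must be coprime to n_groups so the mapping is a
    permutation (no group is missed), and large enough that groups
    adjacent in launch order start at least one 64KB page apart."""
    # Smallest stride whose step in bytes spans at least one page
    # (ceil division); hypothetical heuristic for illustration.
    min_stride = -(-page_bytes // tile_bytes)
    stride = max(min_stride, 2)
    # Bump the stride until it is coprime to the group count.
    while gcd(stride, n_groups) != 1:
        stride += 1
    return [(g * stride) % n_groups for g in range(n_groups)]

# Example: 16 groups, 8KB of weights per tile -> min stride of 8 tiles,
# bumped to 9 for coprimality with 16.
ids = scramble_group_ids(n_groups=16, tile_bytes=8192)
assert sorted(ids) == list(range(16))  # permutation: every group covered
```

Because the stride is coprime to the group count, the map is a bijection, so the grid is fully covered while consecutive workgroups in launch order touch weight chunks far enough apart to land on distinct 64KB pages (modulo wrap-around at the edges of the range).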

@petercad petercad requested a review from a team as a code owner February 5, 2025 23:36
@github-actions github-actions bot added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Feb 5, 2025

petercad commented Feb 5, 2025

make test
disable test_device_cpu
disable build_cpu_runtime_omp
disable build_cpu_runtime_sycl
disable build_cpu_runtime_tbb
disable arch_gpu_xe-hpc
disable arch_gpu_xe-lp
disable arch_gpu_xe2-hpg-bmg
disable benchdnn_all
enable benchdnn_matmul


petercad commented Feb 5, 2025

make test perf-gpu
set primitive=matmul
disable arch_gpu_xe-hpc
disable arch_gpu_xe-lp
disable arch_gpu_xe2-hpg-bmg
disable arch_gpu_xe2-lpg
disable arch_gpu_xe3-lpg


petercad commented Feb 6, 2025

make test
disable test_device_cpu
disable build_cpu_runtime_omp
disable build_cpu_runtime_sycl
disable build_cpu_runtime_tbb
disable arch_gpu_xe-hpc
disable arch_gpu_xe-lp
disable arch_gpu_xe2-hpg-bmg
disable benchdnn_all
enable benchdnn_matmul


petercad commented Feb 6, 2025

make test perf-gpu
set primitive=matmul
disable arch_gpu_xe-hpc
disable arch_gpu_xe-lp
disable arch_gpu_xe2-hpg-bmg
disable arch_gpu_xe2-lpg
disable arch_gpu_xe3-lpg


petercad commented Feb 7, 2025

Closed in favor of #2631.

@petercad petercad closed this Feb 7, 2025