xe: jit: gemm: tile scrambling for improved cold-TLB performance #2607
Addresses MFDNN-12523 -- lower-than-expected performance of memory-bound TN compressed-weights kernels on MTL/ARL when the TLB is cold. Note that a cold TLB can't be benchmarked directly in benchdnn -- even with `--cold-cache=all`, the TLB will be warm. However, the driver knob `DirectSubmissionNewResourceTlbFlush=1` can be used to flush the TLB prior to each kernel launch.

This PR introduces tile "scrambling" -- reordering tiles in the m and/or n dimensions so that adjacent workgroups are not too close together. Specifically, we want to avoid cases where two different workgroups simultaneously access chunks of the weights matrix that belong to the same 64KB page. If that page is not found in the STLB, both workgroups sit idle waiting for the same page walk to complete. If we must pay the penalty of a page walk to fill the STLB, we are better off having each workgroup on a different page -- this fills the STLB faster and reduces the total "dead time" spent waiting on duplicate page walks.
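To make the failure mode concrete, here is a minimal host-side sketch (illustrative only -- the sizes are assumptions, not taken from the actual kernels) of how consecutively numbered workgroups land on the same 64KB page when each workgroup's chunk of the weights matrix is much smaller than a page:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Illustrative numbers, not from this PR.
    const size_t page_bytes = 64 * 1024; // page granularity tracked by the STLB
    const size_t chunk_bytes = 4 * 1024; // weights consumed per workgroup

    // With the natural ordering, workgroup g starts reading at g * chunk_bytes,
    // so consecutive workgroups share a page until the offset crosses a
    // 64KB boundary.
    for (size_t g = 0; g < 4; ++g) {
        size_t offset = g * chunk_bytes;
        printf("group %zu -> page %zu\n", g, offset / page_bytes);
    }
    // All four groups hit page 0: on a cold STLB they all stall behind a
    // single page walk instead of filling the STLB with four distinct
    // translations in parallel.
    return 0;
}
```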
In more detail, the scrambling algorithm replaces the original group ID (in the m or n dimension) with `(ID * stride) mod #groups`, where the stride is chosen to be coprime to `#groups` (so we don't miss any groups) as well as large enough that adjacent groups don't start on the same 64KB page. Currently, the scrambling is only applied to XeHPG/XeLPG compressed-weights TN GEMV kernels, though it may be useful in other cases.
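Below is a host-side sketch of the idea (an illustrative reimplementation, not the PR's actual JIT code): pick the smallest stride that makes the remap a bijection (coprime to the group count) and spaces adjacent IDs at least one 64KB page apart, then remap each ID as `(ID * stride) mod #groups`:

```cpp
#include <cstddef>
#include <cstdio>
#include <numeric> // std::gcd

// Smallest stride that (a) is coprime to ngroups, so the map
// id -> (id * stride) % ngroups hits every group exactly once, and
// (b) is large enough that adjacent IDs are at least one 64KB page
// apart, assuming each group reads chunk_bytes of contiguous weights.
static size_t choose_stride(size_t ngroups, size_t chunk_bytes) {
    const size_t page_bytes = 64 * 1024;
    size_t stride = (page_bytes + chunk_bytes - 1) / chunk_bytes;
    while (std::gcd(stride, ngroups) != 1)
        ++stride;
    return stride;
}

int main() {
    // Illustrative sizes; the PR computes the equivalent inside the
    // GEMM kernel generator.
    const size_t ngroups = 24, chunk_bytes = 4 * 1024;
    const size_t stride = choose_stride(ngroups, chunk_bytes); // 17 here
    for (size_t id = 0; id < ngroups; ++id)
        printf("group %zu -> tile %zu\n", id, id * stride % ngroups);
    return 0;
}
```

With these numbers, adjacent workgroups end up `17 * 4KB = 68KB` apart in the weights matrix, so a cold STLB is filled with distinct translations rather than duplicate page walks.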