ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel #13053
base: master
Conversation
Signed-off-by: Dan Johansson <[email protected]>
    m_to_process = m - m_start;
}

if(m_start < m) {
Suggested change:
-    if(m_start < m) {
+    if (m_start < m) {
    variant_call<void>(kernel->run_kernel, m_to_process, n_to_process, k, lhs_ptr, rhs_ptr, dst_ptr, dst_stride, sizeof(float), -FLT_MAX, FLT_MAX);
}

ggml_barrier(params->threadpool);
Shouldn't be necessary to add a barrier at the end of an operation, there is always a barrier between every op.
},
/* .required_cpu = */ CPU_FEATURE_SME,
/* .lhs_type     = */ GGML_TYPE_F32,
/* .rhs_type     = */ GGML_TYPE_F16,
GGML_TYPE_F16 is not BF16, it is standard float16. For BF16 you should use GGML_TYPE_BF16.
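For reference, a minimal standalone sketch (not code from this PR) of how the two 16-bit formats differ and why F32 -> BF16 is so cheap; the conversion used by the actual packing routines may round to nearest rather than truncate.

    #include <cstdint>
    #include <cstring>

    // GGML_TYPE_F16 (IEEE half):  1 sign bit, 5 exponent bits, 10 mantissa bits
    // GGML_TYPE_BF16 (bfloat16):  1 sign bit, 8 exponent bits,  7 mantissa bits
    // BF16 keeps the full F32 exponent range, so an F32 value can be converted
    // by simply taking the upper 16 bits of its bit pattern.

    static inline uint16_t f32_to_bf16_trunc(float x) {   // truncating variant
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        return (uint16_t) (bits >> 16);
    }

    static inline float bf16_to_f32(uint16_t h) {
        uint32_t bits = (uint32_t) h << 16;
        float x;
        std::memcpy(&x, &bits, sizeof(x));
        return x;
    }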
This is defining the input types for this SME operation. In this case the RHS input type is GGML_TYPE_F16.
Just wondering, wouldn't it be more straightforward to use BF16 for the KV cache type and avoid the conversion from F16 -> BF16 for the RHS?
GGML_TENSOR_BINARY_OP_LOCALS

bool compute_forward_kv_cache(ggml_compute_params * params, struct ggml_tensor * dst) {
You probably would want to call these functions something more generic, like compute_forward_f16, and add comments that they are typically used during KV cache ops. Is there something that makes this work only for the KV cache matrix multiplications?
Agreed on the function naming and comment.

This function will be called for any MUL_MAT for which the RHS is GGML_TYPE_F16 - and this is the default case for the KV-cache matrix multiplications.

Edit: Correction - This currently only handles KV-cache ops since others are filtered out in

    ggml::cpu::tensor_traits * get_tensor_traits(const struct ggml_tensor * op) override {
        ...
        else if (ggml_kleidiai_select_kernels(ctx.features, op) &&
                 op->src[0]->op == GGML_OP_VIEW &&
                 (op->src[1]->op == GGML_OP_PERMUTE || op->src[1]->op == GGML_OP_SOFT_MAX) &&
                 op->src[1]->ne[1] > 1) {

So this function will not handle generic FP32 x FP16 ops. It will only handle KV-cache FP32 x FP16.
Regarding BF16 and KV-cache: We have currently only enabled the GEMM part of the KleidiAI kernel (m > 1) and let the default CPU implementation handle GEMV. As long as we don't accelerate GEMV I suspect switching to BF16 will harm performance. I'm currently out of office but will look into this when I get back at the end of next week!
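To make that split concrete, a hypothetical sketch (names are illustrative, not code from this PR); the real check appears to correspond to the op->src[1]->ne[1] > 1 condition in the filter quoted above:

    #include <cstdint>

    // m is the number of output rows of the MUL_MAT.
    bool use_kleidiai_sme_path(int64_t m) {
        // GEMM (m > 1): handled by the KleidiAI fp32 = bf16 x bf16 SME kernel.
        // GEMV (m == 1): not accelerated yet, left to the default ggml CPU path.
        return m > 1;
    }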
    const size_t nth = params->nth;
    const size_t ith = params->ith;
nit: for consistency

Suggested change:
-    const size_t nth = params->nth;
-    const size_t ith = params->ith;
+    const int nth = params->nth;
+    const int ith = params->ith;
Generally when counting items, enumerating elements, dimensions, etc. we tend to use int or int64_t. size_t is used only for computing memory sizes, strides and offsets. It's OK as it is and no change is needed, but for consistency with the rest of the ggml code, you can consider following this convention.
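A small, hypothetical illustration of that convention (not code from this PR): counts and indices as int/int64_t, byte offsets and strides as size_t.

    #include <cstddef>
    #include <cstdint>

    // Returns the byte offset of the first row this thread should process.
    size_t thread_chunk_offset(int ith, int nth, int64_t n_rows, size_t row_stride_bytes) {
        const int64_t rows_per_thread = (n_rows + nth - 1) / nth;         // a count  -> int64_t
        const int64_t first_row       = (int64_t) ith * rows_per_thread;  // an index -> int64_t
        return (size_t) first_row * row_stride_bytes;                     // a byte offset -> size_t
    }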
    variant_call<void>(kernels->rhs_info.pack_func, 1, n, k, nr, kr, sr, n * sizeof(float),
                       rhs_kxn, bias, nullptr, rhs_packed, 0, nullptr);
Just want to make sure that I'm following the logic:
- First we take the F16 RHS data and cast+transpose it to F32
- Next the F32 data is converted back to BF16?
Yes, the RHS packing expects kxn F32 input, hence the cast+transpose. Following this, the data is packed as BF16 for use in the SME kernel. The SME kernel evaluates BF16 x BF16 and returns F32.
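A rough, self-contained sketch of that RHS flow using ggml's public conversion helpers (ggml_fp16_to_fp32, ggml_fp32_to_bf16). The layout here is an assumption for illustration only: the real code derives strides from the tensor, and the BF16 conversion happens inside the KleidiAI rhs pack_func in its tiled layout.

    #include "ggml.h"
    #include <cstddef>
    #include <vector>

    // Assumes the F16 source is a plain n x k matrix with k contiguous.
    void stage_rhs_f16_as_bf16(const ggml_fp16_t * rhs_f16, size_t n, size_t k,
                               std::vector<float> & rhs_kxn,
                               std::vector<ggml_bf16_t> & rhs_bf16) {
        rhs_kxn.resize(k * n);
        rhs_bf16.resize(k * n);

        // 1) cast F16 -> F32 and transpose n x k -> k x n
        //    (the staging buffer handed to the packing routine)
        for (size_t row = 0; row < n; ++row) {
            for (size_t col = 0; col < k; ++col) {
                rhs_kxn[col * n + row] = ggml_fp16_to_fp32(rhs_f16[row * k + col]);
            }
        }

        // 2) the packing step then stores the F32 values as BF16 for the
        //    fp32 = bf16 x bf16 SME kernel
        for (size_t i = 0; i < k * n; ++i) {
            rhs_bf16[i] = ggml_fp32_to_bf16(rhs_kxn[i]);
        }
    }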
This commit adds an fp32 = bf16 x bf16 GEMM SME kernel to the KleidiAI integration and enables the kernel for KV-cache matmul operations.
The integration currently requires that the LHS is F32 and the RHS is F16. Both LHS and RHS are converted at runtime into the format expected by the KleidiAI kernel (BF16) before the kernel is executed.
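As a closing illustration, a scalar reference of the type flow described above (a sketch only, not the SME implementation; rounding and accumulation order will differ from the real kernel): both operands end up as BF16 and the products are accumulated in F32.

    #include "ggml.h"
    #include <cstdint>

    float dot_fp32_from_bf16xbf16(const float * lhs_f32, const ggml_fp16_t * rhs_f16, int64_t k) {
        float acc = 0.0f;  // F32 destination/accumulator
        for (int64_t i = 0; i < k; ++i) {
            const ggml_bf16_t a = ggml_fp32_to_bf16(lhs_f32[i]);                     // LHS: F32 -> BF16
            const ggml_bf16_t b = ggml_fp32_to_bf16(ggml_fp16_to_fp32(rhs_f16[i]));  // RHS: F16 -> F32 -> BF16
            acc += ggml_bf16_to_fp32(a) * ggml_bf16_to_fp32(b);
        }
        return acc;
    }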