
Conversation

@GushanFall

Completed the implementation code for all operators of the T1-3-1 challenge, along with the PyTorch unit-test code and the GGUF test code.
The operator design documents are in the design-document PR: InfiniTensor/InfiniCore-Documentation#48

Test status

NVIDIA platform

FlashAttention

FlashAttention-nvidia (test screenshot)

Implements FlashAttention-2, with support for GQA, arbitrary masks, and varying sequence lengths.

FlashAttentionBackward

FlashAttentionBackward-nvidia (test screenshot)

Only the basic computation is implemented, and there are unresolved issues.

METAX platform

FlashAttention

FlashAttention-metax (test screenshot)

Implements FlashAttention-2, with support for GQA, arbitrary masks, and varying sequence lengths.

FlashAttentionBackward

FlashAttentionBackward-metax (test screenshot)

Only the basic computation is implemented, and there are unresolved issues.

ILUVATAR platform

FlashAttention

FlashAttention-iluvatar (test screenshot)

Implements FlashAttention-2, with support for GQA, arbitrary masks, and varying sequence lengths.

FlashAttentionBackward

FlashAttentionBackward-iluvatar (test screenshot)

Only the basic computation is implemented, and there are unresolved issues.

@GushanFall
Author

GGUF test results

FlashAttention

Nvidia
f32: acdb9c716bc62e113b828ef550c46c3a
f16: 4e320a27a00f676cd8fc0a3288e5d816
bf16: f39db3368357c9a1e6bb051d0cfa05d3

Metax
f32: (screenshot)
f16: (screenshot)
bf16: (screenshot)

Iluvatar
f32: (screenshot)
f16: (screenshot)
bf16: (screenshot)

FlashAttentionBackward

Nvidia
f32: 2570e3bbe8c0eced22ffd2e06f9fbceb
f16: 547adb45657c224b10cb7aa48e776cb5
bf16: 66c51628112b045bf7e31da4cbd2a10c

Metax
f32: (screenshot)
f16: (screenshot)
bf16: (screenshot)

Iluvatar
f32: (screenshot)
f16: (screenshot)
bf16: (screenshot)

Comment on lines +153 to +154
hcMalloc(&mask_temp, seq_len_q * seq_len_kv * sizeof(float));
hcMemcpy(mask_temp, _info.mask, seq_len_q * seq_len_kv * sizeof(float), hcMemcpyHostToDevice);

Collaborator

Why does this calculate still need to malloc memory? In principle the calculate stage should be pure computation; the space should already have been allocated when the workspace was set up.
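
For reference, a minimal sketch of the pattern being suggested, written with CUDA runtime calls (the hcMalloc/hcMemcpy version on METAX is analogous). The function names, the `info` fields, and the status-code identifiers are illustrative assumptions, not the PR's actual API: the mask staging buffer is counted once when the workspace size is reported, and calculate only carves it out of the caller-provided workspace instead of allocating.

```cpp
// Sketch only: assumes calculate() receives a pre-allocated workspace pointer,
// as is typical for InfiniCore-style operators.
size_t getWorkspaceSize(const FlashAttentionInfo &info) {
    size_t size = 0;
    if (info.mask != nullptr) {
        // room to stage the dense [seq_len_q, seq_len_kv] float mask on device
        size += info.seq_len_q * info.seq_len_kv * sizeof(float);
    }
    return size;
}

infiniStatus_t calculate(const FlashAttentionInfo &info,
                         void *workspace, size_t workspace_size,
                         cudaStream_t stream /* , q, k, v, out, ... */) {
    const void *mask_input = nullptr;
    if (info.mask != nullptr) {
        size_t mask_bytes = info.seq_len_q * info.seq_len_kv * sizeof(float);
        if (workspace_size < mask_bytes) {
            return INFINI_STATUS_INSUFFICIENT_WORKSPACE; // name assumed
        }
        // stage the host-side mask into the workspace slice; no cudaMalloc here
        cudaMemcpyAsync(workspace, info.mask, mask_bytes,
                        cudaMemcpyHostToDevice, stream);
        mask_input = workspace;
    }
    // ... launch the forward kernel with mask_input ...
    return INFINI_STATUS_SUCCESS; // name assumed
}
```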

Comment on lines +149 to +150
cudaMalloc(&mask_temp, seq_len_q * seq_len_kv * sizeof(float));
cudaMemcpy(mask_temp, _info.mask, seq_len_q * seq_len_kv * sizeof(float), cudaMemcpyHostToDevice);

Collaborator

(Same issue here) memory is being malloc'd in the calculate stage.

Author

I originally wrote mask_input = _info.mask;, but then the tests reported an error:
(error screenshot)
Similar errors also occur in the backward operator implementation; I'm not sure exactly where the problem is.

Collaborator

Then check whether the offsets used when writing and when reading are both correct.
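
To make that concrete, an illustrative statement of the invariant to check (the helper name is hypothetical, not from the PR): the staged mask is copied as a dense row-major [seq_len_q, seq_len_kv] float matrix, so the copy size and every kernel-side read have to agree on exactly that layout, with no batch or head stride mixed in.

```cpp
// Illustrative only: a single helper shared by the staging copy and the kernel
// keeps the write-side and read-side offsets from drifting apart.
__host__ __device__ inline size_t maskOffset(size_t q_idx, size_t kv_idx,
                                             size_t seq_len_kv) {
    return q_idx * seq_len_kv + kv_idx; // dense row-major, row = query position
}

// Kernel-side read (sketch): guard the indices first, since an out-of-range
// q_idx/kv_idx in the last partial tile is a common source of illegal accesses.
// if (q_idx < seq_len_q && kv_idx < seq_len_kv) {
//     float m = static_cast<const float *>(mask_input)[
//         maskOffset(q_idx, kv_idx, seq_len_kv)];
// }
```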

Comment on lines +276 to +325
hcMalloc(&mask_temp, seq_len_q * seq_len_kv * sizeof(float));
hcMemcpy(mask_temp, _info.mask, seq_len_q * seq_len_kv * sizeof(float), hcMemcpyHostToDevice);
mask_input = mask_temp;
} else {
mask_input = mask;
}
}

size_t T_r = ceil(float(seq_len_q) / B_r);
size_t T_c = ceil(float(seq_len_kv) / B_c);

auto hc_stream = reinterpret_cast<hcStream_t>(stream);

void *out, *l;
if (_info.dtype == INFINI_DTYPE_F16) {
hcMalloc(&out, batch_size * seq_len_kv * nums_head_q * head_dim * sizeof(half));
hcMalloc(&l, batch_size * seq_len_kv * nums_head_q * sizeof(half));
} else if (_info.dtype == INFINI_DTYPE_F32) {
hcMalloc(&out, batch_size * seq_len_kv * nums_head_q * head_dim * sizeof(float));
hcMalloc(&l, batch_size * seq_len_kv * nums_head_q * sizeof(float));
} else if (_info.dtype == INFINI_DTYPE_BF16) {
hcMalloc(&out, batch_size * seq_len_kv * nums_head_q * head_dim * sizeof(__hpcc_bfloat16));
hcMalloc(&l, batch_size * seq_len_kv * nums_head_q * sizeof(__hpcc_bfloat16));
} else {
return INFINI_STATUS_BAD_TENSOR_DTYPE;
}

CHECK_STATUS(launchForwardKernel(
out, l, q, k, v, mask_input,
batch_size,
nums_head_q, nums_head_kv,
seq_len_q, seq_len_kv,
head_dim, group,
B_r, B_c, T_r, T_c,
_info.qo_stride_b, _info.qo_stride_s, _info.qo_stride_n,
_info.kv_stride_b, _info.kv_stride_s, _info.kv_stride_n,
_info.l_stride_b, _info.l_stride_s, _info.l_stride_n,
_info.dtype,
hc_stream));

void *grad_k_expanded, *grad_v_expanded;
if (_info.dtype == INFINI_DTYPE_F16) {
hcMalloc(&grad_k_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(half));
hcMalloc(&grad_v_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(half));
} else if (_info.dtype == INFINI_DTYPE_F32) {
hcMalloc(&grad_k_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(float));
hcMalloc(&grad_v_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(float));
} else if (_info.dtype == INFINI_DTYPE_BF16) {
hcMalloc(&grad_k_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(__hpcc_bfloat16));
hcMalloc(&grad_v_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(__hpcc_bfloat16));

Collaborator

(Same issue here) memory is being malloc'd in the calculate stage.

Author

My implementation here is rather crude: grad_k and grad_v are accumulated iteratively across the tile computations. I originally wanted to produce the final values inside the kernel itself, but different threads overwrite each other's results, so the output came out wrong. I couldn't think of a good fix, so I record each per-group partial result and use a separate kernel to do the accumulation.
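
For reference, a rough sketch of the separate accumulation kernel described above, assuming the expanded buffers are contiguous in a [batch, num_head_kv, group, seq_len_kv, head_dim] layout; the actual layout and type handling in the PR may differ.

```cpp
// Sketch only: folds the per-group partial gradients written by the main
// backward kernel into the final grad_k / grad_v tensors.
// inner = seq_len_kv * head_dim, total = batch * num_head_kv * inner.
template <typename T>
__global__ void reduceGroupDim(T *grad_out, const T *grad_expanded,
                               size_t group, size_t inner, size_t total) {
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx >= total) {
        return;
    }
    size_t outer = idx / inner;   // which (batch, kv head) pair
    size_t offset = idx % inner;  // position inside that head's [seq, dim] slab
    float acc = 0.f;
    for (size_t g = 0; g < group; ++g) {
        acc += static_cast<float>(grad_expanded[(outer * group + g) * inner + offset]);
    }
    grad_out[idx] = static_cast<T>(acc); // assumes T <-> float conversions exist
}
```

An alternative that avoids the group-times-larger temporaries (and their allocation inside calculate) is to have the backward kernel accumulate grad_k/grad_v directly with atomicAdd into a zero-initialized float buffer, at the cost of a non-deterministic summation order.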

Comment on lines +181 to +182
((1, 10, 2, 4), (1, 10, 2, 4), 0),
((4, 10, 8, 4), (4, 10, 2, 4), 1),

Collaborator

Please add two somewhat larger test cases (at sizes that actually appear in real models).

Comment on lines +35 to +36
((1, 10, 2, 4), (1, 10, 2, 4), 0),
((4, 10, 8, 4), (4, 10, 2, 4), 1),

Collaborator

Please add two somewhat larger test cases (at sizes that actually appear in real models).

Author

(screenshot)

from .utils import *
from .datatypes import *
from .structs import *
from .masktypes import *

Collaborator

Add a blank line at the end of the file.

Author

Fixed.

Comment on lines 1 to 11
class infiniopAttentionMaskType:
NONE = 0
FULL = 1
CAUSAL = 2


InfiniopAttentionMaskTypeNames = {
infiniopAttentionMaskType.NONE: "NONE",
infiniopAttentionMaskType.FULL: "FULL",
infiniopAttentionMaskType.CAUSAL: "CAUSAL",
}

Collaborator

This file is probably unnecessary; it can just go directly into flash_attention.py, following the latest rope implementation.

Author

Fixed.
