[T1-3-1] GushanFall #387
Conversation
```cpp
hcMalloc(&mask_temp, seq_len_q * seq_len_kv * sizeof(float));
hcMemcpy(mask_temp, _info.mask, seq_len_q * seq_len_kv * sizeof(float), hcMemcpyHostToDevice);
```
Why does this `calculate` still need to malloc memory? In principle, the calculate stage should be pure computation; the space should already have been allocated when the workspace was set up.
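For illustration, here is a minimal sketch of the workspace pattern this comment is asking for: the scratch size is reported up front, and `calculate` only carves buffers out of a caller-provided allocation. The names (`getWorkspaceSize`, `calculate`) and the host-side `std::memcpy` standing in for `hcMemcpy` are assumptions for the sketch, not the repository's actual API.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical descriptor, sketching the workspace pattern only.
struct AttentionDescriptor {
    size_t seq_len_q;
    size_t seq_len_kv;

    // Report the scratch bytes calculate() will need, so the caller
    // allocates once (e.g. with hcMalloc) outside the hot path.
    size_t getWorkspaceSize() const {
        return seq_len_q * seq_len_kv * sizeof(float); // mask staging buffer
    }

    // calculate() only slices the preallocated workspace; it never
    // allocates device memory itself.
    int calculate(void *workspace, size_t workspace_size, const float *mask) {
        if (workspace_size < getWorkspaceSize()) {
            return -1; // caller did not provide enough scratch space
        }
        float *mask_temp = static_cast<float *>(workspace);
        // The real code would use hcMemcpy(..., hcMemcpyHostToDevice);
        // std::memcpy keeps this sketch self-contained on the host.
        std::memcpy(mask_temp, mask, getWorkspaceSize());
        // ... launch kernels that read mask_temp ...
        return 0;
    }
};
```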
```cpp
cudaMalloc(&mask_temp, seq_len_q * seq_len_kv * sizeof(float));
cudaMemcpy(mask_temp, _info.mask, seq_len_q * seq_len_kv * sizeof(float), cudaMemcpyHostToDevice);
```
(Same as above) memory is malloc'd in the calculate stage.
Then please double-check that the offsets used for both writes and reads are correct.
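To make the check concrete, here is a small self-contained sketch of the stride-based indexing in question. The stride names mirror the `launchForwardKernel` arguments (`qo_stride_b` / `qo_stride_s` / `qo_stride_n`); the helper itself is hypothetical, and the point is simply that the write side and the read side must agree on one mapping.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (b, s, n, d) in a [batch, seq, head, dim] tensor,
// assuming head_dim is contiguous (innermost stride of 1).
inline size_t elemOffset(size_t b, size_t s, size_t n, size_t d,
                         size_t stride_b, size_t stride_s, size_t stride_n) {
    return b * stride_b + s * stride_s + n * stride_n + d;
}

int main() {
    // Strides for a contiguous [2, 10, 8, 4] tensor:
    const size_t stride_n = 4;          // head_dim
    const size_t stride_s = 8 * 4;      // nums_head * head_dim
    const size_t stride_b = 10 * 8 * 4; // seq_len * nums_head * head_dim
    // If writes and reads disagree on this mapping, elements land in the
    // wrong slots even though each side's arithmetic looks consistent.
    assert(elemOffset(1, 3, 2, 1, stride_b, stride_s, stride_n) == 425);
    return 0;
}
```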
```cpp
        hcMalloc(&mask_temp, seq_len_q * seq_len_kv * sizeof(float));
        hcMemcpy(mask_temp, _info.mask, seq_len_q * seq_len_kv * sizeof(float), hcMemcpyHostToDevice);
        mask_input = mask_temp;
    } else {
        mask_input = mask;
    }
}

// ...

size_t T_r = ceil(float(seq_len_q) / B_r);
size_t T_c = ceil(float(seq_len_kv) / B_c);

// ...

auto hc_stream = reinterpret_cast<hcStream_t>(stream);

// ...

void *out, *l;
if (_info.dtype == INFINI_DTYPE_F16) {
    hcMalloc(&out, batch_size * seq_len_kv * nums_head_q * head_dim * sizeof(half));
    hcMalloc(&l, batch_size * seq_len_kv * nums_head_q * sizeof(half));
} else if (_info.dtype == INFINI_DTYPE_F32) {
    hcMalloc(&out, batch_size * seq_len_kv * nums_head_q * head_dim * sizeof(float));
    hcMalloc(&l, batch_size * seq_len_kv * nums_head_q * sizeof(float));
} else if (_info.dtype == INFINI_DTYPE_BF16) {
    hcMalloc(&out, batch_size * seq_len_kv * nums_head_q * head_dim * sizeof(__hpcc_bfloat16));
    hcMalloc(&l, batch_size * seq_len_kv * nums_head_q * sizeof(__hpcc_bfloat16));
} else {
    return INFINI_STATUS_BAD_TENSOR_DTYPE;
}

// ...

CHECK_STATUS(launchForwardKernel(
    out, l, q, k, v, mask_input,
    batch_size,
    nums_head_q, nums_head_kv,
    seq_len_q, seq_len_kv,
    head_dim, group,
    B_r, B_c, T_r, T_c,
    _info.qo_stride_b, _info.qo_stride_s, _info.qo_stride_n,
    _info.kv_stride_b, _info.kv_stride_s, _info.kv_stride_n,
    _info.l_stride_b, _info.l_stride_s, _info.l_stride_n,
    _info.dtype,
    hc_stream));

// ...

void *grad_k_expanded, *grad_v_expanded;
if (_info.dtype == INFINI_DTYPE_F16) {
    hcMalloc(&grad_k_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(half));
    hcMalloc(&grad_v_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(half));
} else if (_info.dtype == INFINI_DTYPE_F32) {
    hcMalloc(&grad_k_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(float));
    hcMalloc(&grad_v_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(float));
} else if (_info.dtype == INFINI_DTYPE_BF16) {
    hcMalloc(&grad_k_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(__hpcc_bfloat16));
    hcMalloc(&grad_v_expanded, batch_size * nums_head_kv * seq_len_kv * head_dim * group * sizeof(__hpcc_bfloat16));
```
(Same as above) memory is malloc'd in the calculate stage.
My implementation here is quite crude. grad_k and grad_v are accumulated iteratively, block by block. Originally I wanted to compute the final values inside the kernel itself, but different threads overwrite each other's results, so the answer comes out wrong, and I couldn't find a good solution. So I record every partial result separately and use a dedicated kernel to do the accumulation.
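For context, here is a minimal CUDA sketch of the two standard ways to resolve this write conflict: atomic accumulation directly into the shared gradient, and the expanded-buffer-plus-reduction approach this reply describes. The layout assumed for `grad_k_expanded` (the group index as the outer, contiguous reduction axis) is an assumption for the sketch, not necessarily this PR's layout.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Option 1: each thread adds its partial gradient atomically, so no
// expanded buffer (and no second kernel) is needed.
__global__ void accumulateAtomic(float *grad_k, const float *partial, size_t n) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&grad_k[i], partial[i]);
    }
}

// Option 2: what the reply describes. Each of the `group` query heads
// sharing a KV head writes its own slice of grad_k_expanded, and a
// separate kernel reduces over the group dimension afterwards.
// Assumed layout: grad_k_expanded[g * n + i] for group index g.
__global__ void reduceExpanded(float *grad_k, const float *grad_k_expanded,
                               size_t n, int group) {
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int g = 0; g < group; ++g) {
            acc += grad_k_expanded[static_cast<size_t>(g) * n + i];
        }
        grad_k[i] = acc;
    }
}
```

The trade-off between the two: atomics need no extra memory but make the floating-point summation order nondeterministic across runs, while the expanded buffer costs `group`× extra memory and a second kernel launch but gives reproducible sums.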
```python
((1, 10, 2, 4), (1, 10, 2, 4), 0),
((4, 10, 8, 4), (4, 10, 2, 4), 1),
```
Please add two somewhat larger test cases (at sizes that actually occur in real models).
```python
from .utils import *
from .datatypes import *
from .structs import *
from .masktypes import *
```
Add a trailing blank line at the end of the file.
Fixed.
```python
class infiniopAttentionMaskType:
    NONE = 0
    FULL = 1
    CAUSAL = 2

# ...

InfiniopAttentionMaskTypeNames = {
    infiniopAttentionMaskType.NONE: "NONE",
    infiniopAttentionMaskType.FULL: "FULL",
    infiniopAttentionMaskType.CAUSAL: "CAUSAL",
}
```
This probably isn't necessary as a separate file; just put it directly in flash_attention.py, following the latest rope implementation.
Fixed.
Completed
Implementation code for all operators in challenge T1-3-1, together with the PyTorch unit tests and GGUF test code. For the operator design documents, see the design-document PR: InfiniTensor/InfiniCore-Documentation#48
Test status
NVIDIA platform
FlashAttention
Implements FlashAttention-2, with support for GQA (see the head-mapping sketch at the end), arbitrary masks, and differing sequence lengths.
FlashAttentionBackward
Only the basic computation is implemented, and some issues remain unresolved.
METAX platform
FlashAttention
Implements FlashAttention-2, with support for GQA, arbitrary masks, and differing sequence lengths.
FlashAttentionBackward
Only the basic computation is implemented, and some issues remain unresolved.
ILUVATAR platform
FlashAttention
Implements FlashAttention-2, with support for GQA, arbitrary masks, and differing sequence lengths.
FlashAttentionBackward
Only the basic computation is implemented, and some issues remain unresolved.
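As a footnote on the GQA support mentioned above, here is a tiny sketch of the usual query-to-KV head mapping, using the head counts from the (4, 10, 8, 4) / (4, 10, 2, 4) test case. The helper is illustrative, not code from this PR.

```cpp
#include <cassert>

// With grouped-query attention, each group of
// group = nums_head_q / nums_head_kv query heads shares one KV head.
inline int kvHeadForQueryHead(int q_head, int group) {
    return q_head / group;
}

int main() {
    // 8 query heads and 2 KV heads, as in the test case above: group = 4.
    const int group = 8 / 2;
    assert(kvHeadForQueryHead(0, group) == 0); // heads 0-3 -> KV head 0
    assert(kvHeadForQueryHead(3, group) == 0);
    assert(kvHeadForQueryHead(4, group) == 1); // heads 4-7 -> KV head 1
    assert(kvHeadForQueryHead(7, group) == 1);
    return 0;
}
```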