add dequant_gemm and gemm_streamk examples #907
Open
pbbb205 wants to merge 1 commit into tile-ai:ascendc_pto from pbbb205:dev_0422
# FP16×INT4 Dequantize GEMV Operator Design Document

## 1. Overview

### 1.1 Operator Name
dequant_gemv_fp16xint4

### 1.2 Functional Description
A matrix-vector multiplication operator with INT4-quantized weights. The input vector A is FP16, the weight matrix B is INT4-quantized (packed into INT8 storage), and the operator computes C = A × B^T.

### 1.3 Mathematical Formulas
$$
C = A \times B^T
$$

INT4 unpack formula (each INT8 byte stores two INT4 values):
$$
B_{dequant}[j] = \left(B_{packed}[\lfloor j/2 \rfloor] \gg (4 \cdot (j \bmod 2))\right) \,\&\, \text{0xF}
$$
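As a quick sanity check, here is the formula applied to a single packed byte (the byte value 0x5A is chosen purely for illustration):

```python
# Unpack one byte per the formula above: even j takes the low nibble,
# odd j takes the high nibble.
packed = 0x5A
low = (packed >> 0) & 0xF   # j even → low nibble  → 0xA = 10
high = (packed >> 4) & 0xF  # j odd  → high nibble → 0x5 = 5
assert (low, high) == (10, 5)
```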
## 2. Ascend Hardware Constraint Analysis

### 2.1 Key Constraints

| Operation | Ascend support | Notes |
|-----|--------------|------|
| `_tir_packed_int_to_int_convert` | Not supported | GPU-only TIR intrinsic; no Ascend codegen |
| `T.tile.cast(int8→int16)` | Not supported | Only int8→half, half→int16, etc. are supported |
| `T.Parallel + bitwise + cast` | Causes errors | Generates a v_thread variable that Ascend cannot handle |

### 2.2 Supported Operations

| Operation | Ascend support | Example |
|-----|-----------|-----|
| `T.gemm_v0(fp16×fp16→fp32)` | Supported | Standard FP16 matmul |
| Host-side PyTorch bitwise ops | Supported | INT4 unpack runs on the CPU |

## 3. Design

### 3.1 Chosen Approach: Host-Side Preprocessing + Standard GEMV on the NPU

**Rationale**:
1. Completely avoids the INT4 unpack operations that Ascend does not support
2. The NPU side reuses the already-validated standard FP16 GEMV (modeled on gemv_c)
3. Simple, reliable, and easy to debug

### 3.2 Data Flow

```
Host (CPU):
  B_packed (N, K//2, int8) → unpack → B_fp16 (N, K, fp16)
                                          ↓
                                     Send to NPU

NPU:
  A (1, K, fp16) + B_fp16 (N, K, fp16) → GEMV → C (1, N, fp16)
```
### 3.3 Core Code Structure

```python
# Host-side unpack (PyTorch)
def unpack_int4_to_fp16(B_packed):
    N, K_compressed = B_packed.shape
    K = K_compressed * 2
    B = torch.zeros(N, K, dtype=torch.float16)
    for j in range(K):
        shift = 4 * (j % 2)
        B[:, j] = ((B_packed[:, j // 2].int() >> shift) & 0xF).half()
    return B

# NPU-side GEMV (TileLang). This is a skeleton: buffer allocations
# (A_L1, B_L1, C_L0) and tile counts (n_num, k_num) are elided; see the
# full implementation in the example file below.
@tl.jit(out_idx=[-1], pass_configs={...})
def gemv_fp16(N, K, block_N, block_K):
    @T.prim_func
    def main(A, B, C):
        with T.Kernel(n_num, is_npu=True) as (bn_idx, _):
            for bk in T.serial(k_num):
                T.copy(A[0, bk * block_K], A_L1)
                T.copy(B[bn_idx * block_N, bk * block_K], B_L1)
                T.gemm_v0(A_L1, B_L1, C_L0, transpose_B=True, init=(bk == 0))
            T.copy(C_L0, C[0, bn_idx * block_N])
    return main
```
## 4. File Layout

```
examples/dequant_gemv/
├── example_dequant_gemv_fp16xint4.py   # operator implementation
├── design_dequant_gemv_fp16xint4.md    # this design document
├── example_dequant_gemv_int8xint4.py   # INT8 variant
└── README.md                           # usage notes
```
## 5. Performance Considerations

### 5.1 Host-Side Overhead
- The INT4 unpack runs on the CPU and adds extra overhead
- Data movement: B_packed → unpack → B_fp16 → NPU
- Possible optimization: preprocess the weights once so that unpacking is not repeated on every inference call (see the sketch below)
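A minimal sketch of that optimization, assuming the packed weights are static across calls; the cache dict and `get_unpacked_weight` helper are hypothetical, not part of this PR:

```python
import torch

# Hypothetical one-time weight preprocessing: unpack once, reuse afterwards.
# Assumes unpack_int4_to_fp16 from the example file is in scope.
_unpacked_cache = {}

def get_unpacked_weight(B_packed: torch.Tensor) -> torch.Tensor:
    key = B_packed.data_ptr()  # identity of the packed weight buffer
    if key not in _unpacked_cache:
        _unpacked_cache[key] = unpack_int4_to_fp16(B_packed).npu()
    return _unpacked_cache[key]
```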
### 5.2 NPU-Side Performance
- Uses a standard FP16 GEMV, so performance is predictable
- Block sizes can be tuned further

## 6. Validation Criteria

| dtype | atol | rtol |
|-------|------|------|
| float16 | 1e-3 | 1e-3 |
## 7. References

- [tilelang/examples/dequantize_gemm/example_dequant_gemv_fp16xint4.py](../../tilelang/examples/dequantize_gemm/) - GPU version
- [examples/gemv/example_gemv_c.py](../gemv/example_gemv_c.py) - Ascend GEMV pattern
# INT8×INT4 Dequantize GEMM Operator Design Document

## 1. Overview

### 1.1 Operator Name
dequant_gemm_int8xint4

### 1.2 Functional Description
A matrix multiplication operator with INT4-quantized weights. The input matrix A is INT8, the weight matrix B is INT4-quantized (packed into INT8 storage), and the operator computes C = A × B^T.

### 1.3 Mathematical Formulas
$$
C = A \times B^T
$$

INT4 unpack formula (with sign extension):
$$
B_{dequant}[j] = \text{sign\_extend}\left(\left(B_{packed}[\lfloor j/2 \rfloor] \gg (4 \cdot (j \bmod 2))\right) \,\&\, \text{0xF}\right)
$$
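A quick check of the shift-based sign extension used in the host code of section 3.3, on illustrative nibble values:

```python
import torch

# Nibbles 0..15 held in int32, as produced by the unpack shift/mask.
i4 = torch.tensor([0x0, 0x7, 0x8, 0xF], dtype=torch.int32)

# Shift the nibble into the top 4 bits of the int32, then arithmetic-shift
# back down; int32 wraparound turns this into a sign extension.
i4_signed = (i4 << 28) >> 28
print(i4_signed.tolist())  # [0, 7, -8, -1]
```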
## 2. Ascend Hardware Constraint Analysis

### 2.1 Key Constraints

| Operation | Ascend support | Notes |
|-----|--------------|------|
| `_tir_packed_int_to_int_convert` | Not supported | GPU-only TIR intrinsic |
| `_tir_u8_to_i4_to_i8` | Not supported | GPU-only conversion helper |
| `T.gemm_v0(int8×int8→int32)` | Supported | Native Ascend INT8 matmul |

## 3. Design

### 3.1 Chosen Approach: Host-Side Preprocessing + INT8 Matmul on the NPU

**Rationale**:
1. Completely avoids the INT4 unpack operations that Ascend does not support
2. The NPU side reuses the already-validated standard INT8×INT8→INT32 matmul (modeled on quant_matmul)
3. INT8 matmul is a natively supported, efficient operation on Ascend

### 3.2 Data Flow

```
Host (CPU):
  A (M, K, int8) ─────────────────────┐
                                      │
  B_packed (N, K//2, int8) → unpack → B_int8 (N, K, int8)
                                      │
                                 Send to NPU

NPU:
  A (M, K, int8) × B_int8^T (K, N, int8) → C (M, N, int32)
```
### 3.3 Core Code Structure

```python
# Host-side unpack (PyTorch)
def unpack_int4_to_int8(B_packed):
    N, K_compressed = B_packed.shape
    K = K_compressed * 2
    B = torch.zeros(N, K, dtype=torch.int8)
    for j in range(K):
        shift = 4 * (j % 2)
        i4 = (B_packed[:, j // 2].to(torch.int32) >> shift) & 0xF
        i4_signed = ((i4 << 28) >> 28)  # sign extension
        B[:, j] = i4_signed.to(torch.int8)
    return B

# NPU-side GEMM (TileLang). This is a skeleton: buffer allocations, tile
# counts (m_num, n_num, k_num), and the mapping of cid to the block
# indices (bx, by) are elided.
@tl.jit(out_idx=[-1])
def gemm_int8(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(A, B, C):
        with T.Kernel(m_num * n_num, is_npu=True) as (cid, _):
            with T.Scope("C"):
                for k in T.serial(k_num):
                    T.copy(A[bx * block_M, k * block_K], A_L1)
                    T.copy(B[k * block_K, by * block_N], B_L1)
                    T.barrier_all()
                    T.gemm_v0(A_L1, B_L1, C_L0, init=(k == 0))
                    T.barrier_all()
                T.copy(C_L0, C[bx * block_M, by * block_N])
    return main
```
## 4. File Layout

```
examples/dequant_gemv/
├── example_dequant_gemv_int8xint4.py   # operator implementation
├── design_dequant_gemv_int8xint4.md    # this design document
├── example_dequant_gemv_fp16xint4.py   # FP16 variant
└── README.md                           # usage notes
```
## 5. Performance Considerations

### 5.1 Block Parameters

| Parameter | Recommended value | Notes |
|-----|-------|------|
| block_M | 128 | Tiling along M |
| block_N | 256 | Tiling along N |
| block_K | 64 | Tiling along K |

### 5.2 Dimension Requirements
- M, N, and K should be divisible by their respective block sizes so that no tail handling is needed (see the check sketched below)
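A minimal guard for that requirement, assuming the recommended block sizes above; the `check_divisible` helper is illustrative and not part of the example code:

```python
def check_divisible(M: int, N: int, K: int,
                    block_M: int = 128, block_N: int = 256, block_K: int = 64):
    # Reject shapes that would require tail-block handling on the NPU.
    for dim, block, name in ((M, block_M, "M"), (N, block_N, "N"), (K, block_K, "K")):
        if dim % block != 0:
            raise ValueError(f"{name}={dim} is not divisible by block_{name}={block}")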
## 6. Validation Criteria

| dtype | atol | rtol |
|-------|------|------|
| int32 | 0 | 0 (exact match) |
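Since INT8×INT8→INT32 accumulation is exact, the NPU result can be compared bit-for-bit against the PyTorch reference; a sketch of such a check (the `verify_exact` helper is illustrative):

```python
import torch

def verify_exact(C_npu: torch.Tensor, C_ref: torch.Tensor) -> None:
    # INT32 accumulation is exact, so require a bit-for-bit match.
    torch.testing.assert_close(C_npu.cpu(), C_ref, atol=0, rtol=0)
```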
## 7. References

- [tilelang/examples/dequantize_gemm/example_dequant_gemm_w4a8.py](../../tilelang/examples/dequantize_gemm/) - GPU W4A8 version
- [examples/quant_batch_matmul/example_quant_matmul.py](../quant_batch_matmul/example_quant_matmul.py) - Ascend INT8 matmul pattern
- [examples/gemm/example_gemm.py](../gemm/example_gemm.py) - Ascend GEMM pattern
examples/dequant_gemm/example_dequant_gemm_fp16xint4.py (150 additions, 0 deletions)
| """ | ||
| INT4 Dequantize GEMV on TileLang-Ascend | ||
|
|
||
| 设计思路: | ||
| - Host端:使用PyTorch完成INT4 → FP16 unpack(避开Ascend不支持的操作) | ||
| - NPU端:运行标准FP16×FP16 GEMV(使用已验证的gemv_c模式) | ||
|
|
||
| 参考: | ||
| - tilelang/examples/gemv/example_gemv_c.py(Ascend GEMV基础模式) | ||
| - examples/quant_batch_matmul/example_quant_batch_matmul.py(Ascend量化matmul模式) | ||
| """ | ||
|
|
||
| import argparse | ||
| import torch | ||
| import tilelang as tl | ||
| import tilelang.language as T | ||
|
|
||
| tl.cache.clear_cache() | ||
|
|
||
|
|
||
| def unpack_int4_to_fp16(B_packed: torch.Tensor) -> torch.Tensor: | ||
| """ | ||
| 将INT4 packed权重unpack为FP16。 | ||
|
|
||
| Args: | ||
| B_packed: (N, K//2) int8 tensor,每个byte存储2个INT4 | ||
|
|
||
| Returns: | ||
| B: (N, K) float16 tensor | ||
| """ | ||
| N, K_compressed = B_packed.shape | ||
| K = K_compressed * 2 | ||
|
|
||
| B = torch.zeros(N, K, dtype=torch.float16, device=B_packed.device) | ||
| for j in range(K): | ||
| shift = 4 * (j % 2) | ||
| B[:, j] = ((B_packed[:, j // 2].int() >> shift) & 0xF).half() | ||
|
|
||
| return B | ||
|
|
||
|
|
||
| @tl.jit( | ||
| out_idx=[-1], | ||
| pass_configs={ | ||
| tl.PassConfigKey.TL_ASCEND_AUTO_SYNC: True, | ||
| tl.PassConfigKey.TL_ASCEND_AUTO_CV_COMBINE: True, | ||
| } | ||
| ) | ||
| def gemv_fp16(N: int, K: int, block_N: int, block_K: int, dtype="float16", accum_dtype="float32"): | ||
| """ | ||
| Ascend标准FP16 GEMV kernel。 | ||
| 参考 examples/gemv/example_gemv_c.py | ||
| """ | ||
| FRACTAL_SIZE = 16 | ||
|
|
||
| n_num = T.ceildiv(N, block_N) | ||
| k_num = T.ceildiv(K, block_K) | ||
|
|
||
| @T.prim_func | ||
| def main( | ||
| A: T.Tensor((1, K), dtype), # type: ignore | ||
| B: T.Tensor((N, K), dtype), # type: ignore | ||
| C: T.Tensor((1, N), dtype), # type: ignore | ||
| ): | ||
| with T.Kernel(n_num, is_npu=True) as (bn_idx, _): | ||
| A_L1 = T.alloc_L1((FRACTAL_SIZE, block_K), dtype) | ||
| B_L1 = T.alloc_L1((block_N, block_K), dtype) | ||
| C_L0 = T.alloc_L0C((FRACTAL_SIZE, block_N), accum_dtype) | ||
|
|
||
| for bk in T.serial(k_num): | ||
| T.copy(A[0, bk * block_K], A_L1) | ||
| T.copy(B[bn_idx * block_N, bk * block_K], B_L1) | ||
| T.gemm_v0(A_L1, B_L1, C_L0, transpose_B=True, init=(bk == 0)) | ||
|
|
||
| T.copy(C_L0, C[0, bn_idx * block_N]) | ||
|
|
||
| return main | ||
|
|
||
|
|
||
| def dequant_gemv_fp16xint4(A: torch.Tensor, B_packed: torch.Tensor) -> torch.Tensor: | ||
| """ | ||
| INT4 Dequantize GEMV完整流程。 | ||
|
|
||
| Args: | ||
| A: (1, K) float16 输入向量 | ||
| B_packed: (N, K//2) int8 packed权重 | ||
|
|
||
| Returns: | ||
| C: (1, N) float16 输出向量 | ||
| """ | ||
| N, K_compressed = B_packed.shape | ||
| K = K_compressed * 2 | ||
|
|
||
| # Step 1: Host端INT4 → FP16 unpack | ||
| B_fp16 = unpack_int4_to_fp16(B_packed).npu() | ||
|
|
||
| # Step 2: NPU端标准GEMV | ||
| block_N = 128 | ||
| block_K = 128 | ||
| kernel = gemv_fp16(N, K, block_N, block_K) | ||
|
|
||
| C = kernel(A.npu(), B_fp16) | ||
|
|
||
| return C | ||
|
|
||
|
|
||
| def ref_dequant_gemv(A: torch.Tensor, B_packed: torch.Tensor) -> torch.Tensor: | ||
| """PyTorch参考实现""" | ||
| B_fp16 = unpack_int4_to_fp16(B_packed) | ||
| return torch.matmul(A, B_fp16.T).half() | ||
|
|
||
|
|
||
| def check_case(N: int, K: int): | ||
| """验证测试""" | ||
| K_compressed = K // 2 | ||
|
|
||
| torch.manual_seed(42) | ||
|
|
||
| A = torch.randn(1, K, dtype=torch.float16) | ||
| B_packed = torch.randint(0, 127, (N, K_compressed), dtype=torch.int8) | ||
|
|
||
| C_npu = dequant_gemv_fp16xint4(A, B_packed).cpu() | ||
| C_ref = ref_dequant_gemv(A, B_packed) | ||
|
|
||
| torch.testing.assert_close(C_npu, C_ref, atol=1e-3, rtol=1e-3) | ||
|
|
||
|
|
||
| def main(custom_args=None): | ||
| parser = argparse.ArgumentParser(description="FP16×INT4 Dequantize GEMV Example") | ||
| parser.add_argument("--n", type=int, default=1024, help="Output dimension N") | ||
| parser.add_argument("--k", type=int, default=1024, help="Input dimension K") | ||
| args, remains = parser.parse_known_args(custom_args) | ||
| if remains: | ||
| print(f"[{parser.description}]", "Unknown args:", remains) | ||
|
|
||
| torch.manual_seed(0) | ||
| tl.cache.clear_cache() | ||
|
|
||
| check_case(args.n, args.k) | ||
| check_case(512, 512) | ||
| check_case(4096, 4096) | ||
|
|
||
| print("FP16×INT4 Dequantize GEMV example passed!") | ||
| print("Kernel Output Match!") | ||
|
|
||
| return True | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() | ||
Review comment:

The CPU-based unpack implementation uses a Python loop over the K dimension, which is extremely inefficient for large tensors. Using vectorized PyTorch operations will significantly speed up the host-side preprocessing.
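For example, a vectorized rewrite of `unpack_int4_to_fp16` might look like this (a sketch only; the `_vectorized` name is illustrative and not part of the PR):

```python
import torch

def unpack_int4_to_fp16_vectorized(B_packed: torch.Tensor) -> torch.Tensor:
    # Extract both nibbles of every byte at once, then interleave them so
    # that column order matches the loop version: even j → low nibble,
    # odd j → high nibble.
    b = B_packed.int()
    low = b & 0xF                          # (N, K//2)
    high = (b >> 4) & 0xF                  # (N, K//2)
    B = torch.stack((low, high), dim=-1)   # (N, K//2, 2)
    return B.reshape(B_packed.shape[0], -1).half()
```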