diff --git a/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/SKILL.md b/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/SKILL.md
index e1fa44907..2dc7c551e 100644
--- a/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/SKILL.md
+++ b/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/SKILL.md
@@ -25,6 +25,7 @@ description: TileLang Ascend API 使用最佳实践。提供内存分配、数
 | **Softmax/LayerNorm** | [api-compute](references/api-compute.md) | T.reduce_max/sum、T.tile.exp/sub/div |
 | **逐元素计算** | [api-compute](references/api-compute.md) | T.Parallel + 符号 API 或 T.tile.xxx 两种范式 |
 | **多 block/core 累加到 GM** | [api-compute](references/api-compute.md) | T.tile.atomic_add(dst_gm, src_local)，调用前显式清零 GM |
+| **CV 融合算子** | [api-kernel-memory](references/api-kernel-memory.md), [api-schedule-sync](references/api-schedule-sync.md) | workspace 索引一致性、AUTO_CV_COMBINE、vid 并行化 |
 | **流水线优化** | [api-schedule-sync](references/api-schedule-sync.md) | T.Pipelined num_stages、核间/核内流水线 |
 | **多核负载均衡** | [api-schedule-sync](references/api-schedule-sync.md) | T.Persistent 缓存友好调度 |
 | **排序** | [api-compute](references/api-compute.md) | T.tile.sort → T.tile.merge_sort → T.tile.topk |
@@ -90,4 +91,4 @@ description: TileLang Ascend API 使用最佳实践。提供内存分配、数
 | `TL_ASCEND_AUTO_SYNC: True` | 自动同步插入 |
 | `TL_ASCEND_MEMORY_PLANNING: True` | 自动内存规划 |
 | `TL_ASCEND_AUTO_CV_COMBINE: True` | 自动 CV 分离（核间流水线） |
-| `tl.ascend_auto_cross_core_sync: True` | 自动核间同步（核间流水线） |
+| `TL_ASCEND_AUTO_CV_SYNC: True` | 自动核间同步（核间流水线） |
diff --git a/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-compute.md b/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-compute.md
index 0965a7569..cd9a25e56 100644
--- a/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-compute.md
+++ b/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-compute.md
@@ -39,6 +39,34 @@ T.copy(C_L0, C[bx * block_M, by * block_N])
 T.gemm_v0(q_l1, k_l1, acc_s_l0c, transpose_B=True, init=True)
 ```
 
+**⚠️ 重要：矩阵 Buffer 分形限制**
+
+使用 `T.gemm_v0` 时，矩阵 Buffer 必须满足最小分形限制。分形大小固定为 512 Byte（L0A/L0B）或 256 元素（L0C），shape 与 dtype 相关：
+
+**分形公式**：
+- **L0A**：`16 × (32B / sizeof(AType))`，固定 512 Byte
+- **L0B**：`(32B / sizeof(BType)) × 16`，固定 512 Byte
+- **L0C**：`16 × 16`，固定 256 元素（不随 dtype 变化）
+
+**不同 dtype 的最小维度限制**：
+
+| dtype | sizeof | L0A 分形 | L0B 分形 | 最小限制 |
+|-------|--------|----------|----------|---------|
+| int8 / uint8 | 1 Byte | 16 × 32 | 32 × 16 | M ≥ 16, K ≥ 32, N ≥ 16 |
+| float16 / bfloat16 | 2 Byte | 16 × 16 | 16 × 16 | M ≥ 16, K ≥ 16, N ≥ 16 |
+| int32 / float32 | 4 Byte | 16 × 8 | 8 × 16 | M ≥ 16, K ≥ 8, N ≥ 16 |
+
+**L0C 分形固定为 16 × 16，不随 dtype 变化**，因此 M 和 N 的最小值始终为 16。
+
+**常见错误**：`block_N = 8` 不满足 L0C 分形限制（N ≥ 16），会导致计算结果错误。
+
+**示例**：int8 GEMM 的正确 block size 选择
+```python
+block_M = 64   # ≥ 16 ✓
+block_N = 16   # ≥ 16 ✓（满足 L0B/L0C 分形限制）
+block_K = 256  # ≥ 32 ✓（int8 的 L0A/L0B K 维度限制）
+```
+
 ### T.mma(A, B, C, init=False)
 
 NPU 级别的矩阵乘累加指令，比 `gemm_v0` 更底层。不支持 `transpose_A`/`transpose_B`。通常配合 `T.alloc_L0A`/`T.alloc_L0B` 和 `T.annotate_layout` 使用。
@@ -161,9 +189,76 @@ for i in range(block_M // VEC_NUM):  # 行顺序
         c_ub[i, j] = a_ub[i, j] * b_ub[i, j]
 ```
 
+### 3.1 T.Parallel 在 TileLang-Ascend 上的限制
+
+> **核心原理**：`T.Parallel` 在 TileLang-Ascend 上会被编译器 lowering 为 `T.tile.xxx` Buffer 级 SIMD 指令。因此，T.Parallel 的能力边界受限于 AscendC Vector 指令的能力。
+
+#### 支持的循环维度
+
+- ✅ **1D 并行**：`for j in T.Parallel(N)`
+- ✅ **2D 并行**：`for i, j in T.Parallel(M, N)`
+- ✅ **serial + parallel 组合**：`for i in range(M): for j in T.Parallel(N)`
+- ❌ **3D 或更高维并行**：不支持，会触发编译错误
+
+#### 支持的表达式类型
+
+`T.Parallel` 内的表达式会被自动分解并翻译为 Vector 指令。**仅支持以下模式**：
+
+| 类型 | 支持的表达式 | 备注 |
+|------|-------------|------|
+| 简单赋值 | `a[i] = b[i]` | 等价于 `T.copy` |
+| 简单运算 | `c[i] = a[i] + b[i]` | 等价于 `T.tile.add` |
+| 标量运算 | `c[i] = a[i] + scalar` | 等价于 `T.tile.add` |
+| 广播运算 | `c[i,j] = a[i,j] * b[j]` | 自动广播处理（仅支持 1D→2D，索引必须是简单变量） |
+| 复合表达式 | `c[i] = a[i] * b[i] + d[i]` | 自动分解为多步操作 |
+| 离散索引  | 非简单变量索引，如 `a[idx[i]]` | 编译器退回到 `T.serial` 循环 |
+
+#### 不支持的表达式
+
+以下表达式**无法在 T.Parallel 中使用**，需要改用其他方案：
+
+| 不支持的表达式 | 错误类型 | 替代方案 |
+|---------------|---------|---------|
+| `if-else` 条件分支 | 编译错误（SIMD 架构不支持元素级条件判断） | 使用 `T.tile.compare` + `T.tile.select` |
+| `T.if_then_else(...)` | 编译错误 ("undefined Variable v_thread") | 使用 `T.tile.compare` + `T.tile.select` |
+| `tir.reinterpret("int8", ...)` | 运行时错误 | 使用 `T.reinterpretcast`（整个 buffer） |
+| `T.int8(expr)` 或 `.astype("int8")` | 编译错误或数据异常 | 使用 `T.tile.cast`（整个 buffer） |
+| 非线性索引 `a[i*i]` | 未实现 | 使用 `T.tile.xxx` + 手动索引计算 |
+| 动态 shift `a[i] >> shift[i]` | 不支持（shift 必须是 scalar） | 使用固定 scalar shift |
+
+#### 循环范围要求
+
+`T.Parallel` 的循环范围必须是编译期可确定的常量值（IntImm），不支持动态变量作为循环边界。
+
+#### 从 CUDA TileLang 迁移注意事项
+
+TileLang-Ascend 的 T.Parallel 语法与 CUDA 版本对齐，但底层执行模型不同：
+
+- **CUDA (SIMT)**：每个元素独立执行，支持复杂控制流
+- **Ascend (SIMD)**：所有元素并行执行相同指令，不支持条件分支
+
+CUDA 代码中的以下模式在 Ascend 上需要改写：
+
+```python
+# CUDA 版本（SIMT，逐元素条件判断）
+for i in T.Parallel(N):
+    if a[i] > threshold:      # ❌ Ascend 不支持
+        b[i] = a[i] * scale
+    else:
+        b[i] = a[i]
+
+# Ascend 版本（SIMD，用 compare + select 替代）
+T.tile.compare(mask_ub, a_ub, threshold, "GT")
+T.tile.select(b_ub, mask_ub, a_scaled_ub, a_ub, "VSEL_CMPMASK_SPR")
+```
+
+详细用法参考 `docs/tutorials/t_parallel.md`。
+
+Pass 设计详见 `.agents/skills/tilelang-pass-analyzer/references/pass-designs/ascend_lower_parallel_to_vector_design.md`。
+
 ---
 
-## 4. Tile 扩展原语（Expert / 混合模式 T.tile.xxx）
+## 4. Tile 扩展原语（T.tile.xxx Buffer 级 SIMD 操作）
 
 `T.tile.xxx` 系列接口直接触发 Tile 级的 Ascend 操作。它们既可用于全手动 Expert 模式，也可在 Developer pass_configs 下作为混合模式原语使用。
 
@@ -406,7 +501,7 @@ T.tile.topk(topk_global, sort_result, K, actual_num)
 ### 4.12 两种编程范式对比
 
 ```python
-# 方式一：T.Parallel + 符号 API（推荐，跨平台兼容）
+# 方式一：T.Parallel + 符号 API（Developer 模式，跨平台兼容）
 for i, j in T.Parallel(block_M // VEC_NUM, block_N):
     b_ub[i, j] = T.exp(a_ub[i, j])
 
diff --git a/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-kernel-memory.md b/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-kernel-memory.md
index a5fe562a8..b44597b7d 100644
--- a/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-kernel-memory.md
+++ b/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-kernel-memory.md
@@ -6,7 +6,7 @@
 
 ### @T.prim_func
 
-定义一个 TileLang kernel 函数。参数类型为 `T.Tensor` 或 `T.Buffer`。
+定义一个 TileLang kernel 函数。参数类型为 `T.Tensor`。
 
 ```python
 @T.prim_func
@@ -18,7 +18,10 @@ def add_kernel(
     ...
 ```
 
-**支持的 dtype**：`float16, float32, bfloat16, int8, int16, int32, int64, uint8, uint16, uint32, uint64`
+**支持的 dtype**：
+- `float16`, `float32`, `bfloat16`
+- `int8`, `int16`, `int32`, `int64`
+- `uint8`, `uint16`, `uint32`, `uint64`
 
 ### 动态 shape 符号
 
@@ -54,8 +57,7 @@ with T.Kernel(m_num * n_num, is_npu=True) as (cid, vid):
 ```
 
 - **cid**：计算任务 ID，范围 [0, block_num)
-- **vid**：Vector 单元索引（0 或 1），A2/A3 架构 CV 核配比可为 1:2 或 1:1
-- **VEC_NUM**：通常设为 2，表示每个 AI Core 有 2 个 Vector 计算单元
+- **vid**：Vector 单元索引（0 或 1），C、V 核配比为 1:2
 
 ### @jit 装饰器
 
@@ -72,23 +74,61 @@ def tile_add(M, N, block_M, block_N, dtype='float'):
 
 **参数**：
 - `out_idx`：指定输出参数索引，如 `[-1]` 表示最后一个参数为输出
-- `workspace_idx`：工作空间参数索引（如 Flash Attention 中 `workspace_idx=[4,5,6]`）
+- `workspace_idx`：工作空间参数索引（详见下方 workspace 机制）
 - `pass_configs`：编译配置选项
 
 **常用 pass_configs**：
 ```python
 pass_configs = {
-    tilelang.PassConfigKey.TL_ASCEND_AUTO_SYNC: True,         # 自动同步插入
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_SYNC: True,         # 自动同步插入（核内）
     tilelang.PassConfigKey.TL_ASCEND_MEMORY_PLANNING: True,   # 自动内存规划
     tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_COMBINE: True,   # 自动CV分离（核间流水线需要）
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_SYNC: True,      # 自动同步插入（CV核间）
 }
 ```
 
+#### workspace 机制
+
+**作用**：workspace buffer 用于 Cube 核（L1）和 Vector 核（UB）之间的数据中转。
+
+由于 Ascend 硬件限制，UB 和 L1 不能直接互通，必须通过 Global Memory 中转：
+
+```
+L0C → workspace(GM) → UB   # Cube 输出到 Vector 处理
+UB → workspace(GM) → L1    # Vector 输出到 Cube
+```
+
+**使用方式**：
+
+1. 在 `@jit` 中指定 `workspace_idx`：
+```python
+@jit(out_idx=[-1], workspace_idx=[3])  # workspace 是第 3 个参数
+def kernel(M, N, K, ...):
+    @T.prim_func
+    def main(
+        A: T.Tensor((M, K), dtype),
+        B: T.Tensor((K, N), dtype),
+        C: T.Tensor((M, N), dtype),
+        workspace: T.Tensor((M, N), accum_dtype),  # workspace_idx=[3]
+    ):
+        ...
+```
+> 注意：定义 workspace buffer 时，名称也应包含 "workspace"
+
+2. 数据流示例（来自 `examples/quant_batch_matmul`）：
+```python
+# GEMM 输出 (L0C) → workspace → Vector 核处理
+T.copy(C_L0, workspace[bm * block_M, bn * block_N])
+T.copy(workspace[bm * block_M + vid * block_M_2, bn * block_N], c_ub)
+
+# Vector 核处理后输出
+T.copy(c_out, C[bm * block_M + vid * block_M_2, bn * block_N])
+```
 ### 查看生成的 AscendC 代码
 
 ```python
-func = tile_add(M, N, block_M, block_N)
-print(f"{func.get_kernel_source()}")
+kernel = tile_add(M, N, block_M, block_N)
+print(kernel.get_kernel_source())
 ```
 
 ---
@@ -97,7 +137,7 @@ print(f"{func.get_kernel_source()}")
 
 ### Developer 模式
 
-TileLang 对存储层级进行了抽象，分为 Global、shared 和 fragment 三个级别。在 Ascend 平台中，shared 层级对应 L1 Buffer 和 Unified Buffer，fragment 层级对应 L0A/L0B/L0C Buffer。用户无需指定具体硬件存储，TileLang 编译器会根据程序上下文自动识别。
+TileLang 对存储层级进行了抽象，分为 global、shared 和 fragment 三个级别。在 Ascend 平台中，shared 层级对应 L1 Buffer 和 Unified Buffer (UB)，fragment 层级对应 L0A/L0B/L0C Buffer。用户无需指定具体硬件存储，TileLang 编译器会根据程序上下文自动识别。
 
 #### T.alloc_shared(shape, dtype)
 
@@ -129,13 +169,13 @@ b = T.alloc_var("int32", init=a)  # 用另一个变量的值初始化
 
 显式指定存储位置，适用于需要精确控制内存分配的场景。
 
-| API | 存储层级 | 说明 |
-|-----|---------|------|
-| `T.alloc_ub(shape, dtype)` | Unified Buffer (UB) | Vector 计算 |
-| `T.alloc_L1(shape, dtype)` | L1 Buffer | 片上缓存 |
-| `T.alloc_L0A(shape, dtype)` | L0A Buffer | Cube 左矩阵 |
-| `T.alloc_L0B(shape, dtype)` | L0B Buffer | Cube 右矩阵 |
-| `T.alloc_L0C(shape, dtype)` | L0C Buffer | Cube 输出/累加 |
+| API | 存储层级 | 抽象层级 | 说明 |
+|-----|---------|---------|-----|
+| `T.alloc_ub(shape, dtype)` | Unified Buffer | shared | Vector 存储单元 |
+| `T.alloc_L1(shape, dtype)` | L1 Buffer | shared | Cube 存储单元 |
+| `T.alloc_L0A(shape, dtype)` | L0A Buffer | fragment | Cube 左矩阵 |
+| `T.alloc_L0B(shape, dtype)` | L0B Buffer | fragment | Cube 右矩阵 |
+| `T.alloc_L0C(shape, dtype)` | L0C Buffer | fragment | Cube 输出/累加 |
 
 **实际使用示例**（来自 `examples/gemm/example_gemm.py`）：
 
@@ -145,6 +185,35 @@ B_L1 = T.alloc_L1([block_K, block_N], dtype)
 C_L0 = T.alloc_L0C([block_M, block_N], accum_dtype)
 ```
 
+**⚠️ 重要：存储单元对齐要求**
+
+Ascend NPU 不同存储单元有不同的对齐要求：
+
+| 存储单元 | 对齐要求 |
+|---------|---------|
+| Global Memory (GM) | 无对齐要求 |
+| Unified Buffer (UB) | 32 Byte |
+| L1 Buffer | 32 Byte |
+| L0A Buffer | 512 Byte |
+| L0B Buffer | 512 Byte |
+| L0C Buffer | 64 Byte |
+
+**⚠️ 重要：存储单元大小限制**
+
+根据 Ascend910B3 平台配置：
+
+| 存储单元 | 大小上限 |
+|---------|---------|
+| L0A | 65536 Byte |
+| L0B | 65536 Byte |
+| L0C | 131072 Byte |
+| L1 | 524288 Byte |
+| UB | 196608 Byte |
+
+> 更多参数参见：`$ASCEND_HOME_PATH/$(uname -m)-linux/data/platform_config/Ascend910B3.ini`
+
+分配 Buffer 时需确保不超出上限，并满足对齐要求。
+
 ---
 
 ## 3. 数据搬运原语
@@ -164,7 +233,8 @@ C_L0 = T.alloc_L0C([block_M, block_N], accum_dtype)
 | GM | UB | Global Memory → Unified Buffer |
 | UB | GM | Unified Buffer → Global Memory |
 | UB | UB | Unified Buffer → Unified Buffer |
-| UB | L1 | Unified Buffer → L1 Buffer |
+
+> **注意**：UB 和 L1 之间**不能直接搬运**。Cube 核（L1）和 Vector 核（UB）的数据传递需要通过 workspace buffer（GM）中转。
 
 **使用示例**：
 
@@ -187,7 +257,104 @@ T.copy(K[bz, by, k * block_N:(k + 1) * block_N, :], k_l1)
 
 ---
 
-## 4. 完整示例
+## 4. V 核并行化
+
+### 4.1 基本原理
+
+Ascend NPU 每个 AI Core 有 1 个 Cube 核和 2 个 Vector 核（C:V = 1:2）：
+
+- `cid`：计算任务 ID，范围 `[0, block_num)`
+- `vid`：Vector 核索引，取值 0 或 1
+- `VEC_NUM`：常量，通常设为 2
+
+**默认行为**：两个 V 核（vid=0 和 vid=1）执行完全相同的代码，算力浪费。
+
+**正确做法**：利用 `vid` 让两个 V 核分担任务。
+
+### 4.2 V 核分担任务的模式
+
+#### 模式一：按行切分（最常见）
+
+每个 V 核处理 `block_dim // VEC_NUM` 行：
+
+```python
+VEC_NUM = 2
+block_M_2 = block_M // VEC_NUM  # 每个 V 核处理一半行数
+
+with T.Kernel(grid_size, is_npu=True) as (cid, vid):
+    # 计算 V 核负责的起始行
+    row_start = cid * block_M + vid * block_M_2
+    
+    # 分配 buffer（只需分配 V 核负责的行数）
+    data_ub = T.alloc_shared((block_M_2, block_N), dtype)
+    
+    # 读入数据
+    T.copy(A[row_start, by * block_N], data_ub)
+    
+    # 计算
+    ...
+    
+    # 写出数据
+    T.copy(data_ub, B[row_start, by * block_N])
+```
+
+**关键点**：读写索引必须一致，都使用 `row_start` 或基于 `vid` 计算的索引。
+
+#### 模式二：按任务切分
+
+每个 V 核处理不同的计算任务：
+
+```python
+VEC_NUM = 2
+
+with T.Kernel(num_tasks, is_npu=True) as (cid, vid):
+    # 每个 V 核处理不同的任务
+    task_id = cid * VEC_NUM + vid
+    
+    if task_id < total_tasks:
+        # 处理 task_id
+        ...
+```
+
+### 4.3 workspace 索引一致性（易错点）
+
+当 V 核读写 workspace（或任何中间 buffer）时，**必须保持索引逻辑一致**：
+
+```python
+# 错误：读写索引不一致
+for row in T.serial(block_N_2):
+    actual_row = bn * block_N + vid * block_N_2 + row
+    T.copy(src[actual_row, ...], temp_ub)   # 读用 actual_row ✓
+    # ... 处理 ...
+    T.copy(temp_ub, dst[bn * block_N + row, ...])  # ❌ 写没用 actual_row
+
+# 正确：读写索引一致
+for row in T.serial(block_N_2):
+    actual_row = bn * block_N + vid * block_N_2 + row
+    T.copy(src[actual_row, ...], temp_ub)   # 读用 actual_row ✓
+    # ... 处理 ...
+    T.copy(temp_ub, dst[actual_row, ...])   # 写也用 actual_row ✓
+```
+
+**原则**：同一数据在不同阶段的索引必须基于相同的计算逻辑。
+
+### 4.4 Cube 核不涉及 vid
+
+Cube 核做 GEMM 时，不使用 vid 切分，读取完整的 block：
+
+```python
+# Cube 核部分（不涉及 vid）
+A_L1 = T.alloc_shared((block_M, block_K), dtype)  # 完整 block_M
+B_L1 = T.alloc_shared((block_N, block_K), dtype)  # 完整 block_N
+
+T.copy(A[bm * block_M, k_offset], A_L1)   # 完整 block_M
+T.copy(B[bn * block_N, k_offset], B_L1)   # 完整 block_N
+T.gemm_v0(A_L1, B_L1, C_L0, ...)
+```
+
+---
+
+## 5. 完整示例
 
 来自 `docs/TileLang-Ascend Programming Guide.md` §2.2：
 
diff --git a/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-schedule-sync.md b/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-schedule-sync.md
index 7fc125f83..eea97f338 100644
--- a/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-schedule-sync.md
+++ b/.agents/skills/tilelang-custom-skill/tilelang-api-best-practices/references/api-schedule-sync.md
@@ -165,11 +165,23 @@ T.wait_flag("mte2", "v", 0)
 | `T.wait_cross_flag(flag)` | 等待核间同步标志 |
 
 ```python
-# Cube 核完成后通知 Vector 核
-T.set_cross_flag("MTE3", 0)
+# Vector 核完成后通知 Cube 核（从 UB → GM 的数据通路）
+T.copy(output_ub, workspace[...])  # UB → GM
+T.set_cross_flag("MTE3", 0)        # Vector 核使用 MTE3 通路
+
+# Cube 核等待 Vector 核完成
 T.wait_cross_flag(0)
 ```
 
+**⚠️ 重要：数据通路选择**
+
+| 核类型 | 数据通路 | 说明 |
+|-------|---------|------|
+| Vector (V) | `"MTE3"` | UB → GM 的写入通路 |
+| Cube (C) | `"FIX"` | L0C → GM 的写入通路 |
+
+**常见错误**：在 Vector 核执行逻辑中使用 `"FIX"` 会导致 `Illegal instruction (unaligned UUB addresses)` 错误。
+
 > `set_cross_flag` 源码（`ascend.py:114`）还支持第三个参数 `mode`（默认 2），控制同步范围：0=所有 AIC/AIV 之间，1=同组 AIV 之间，2=同组 AIC 和 AIV 之间。
 
 ---
@@ -196,7 +208,7 @@ with T.Scope("V"):   # Vector 域
 **格式说明符**：`%d`/`%i`（整数）, `%f`（浮点）, `%x`（十六进制）, `%s`（字符串）, `%p`（指针，建议使用 `%x`）
 
 ```python
-T.printf("fmt %s %d\n", "string", 0x123)
+T.printf("fmt %s %d\n", "string", cid)
 ```
 
 ### T.dump_tensor(tensor, desc, dump_size, shape_info=())
diff --git a/.agents/skills/tilelang-custom-skill/tilelang-expert-to-developer/SKILL.md b/.agents/skills/tilelang-custom-skill/tilelang-expert-to-developer/SKILL.md
index 4e7dce8ec..2c5f58a94 100644
--- a/.agents/skills/tilelang-custom-skill/tilelang-expert-to-developer/SKILL.md
+++ b/.agents/skills/tilelang-custom-skill/tilelang-expert-to-developer/SKILL.md
@@ -38,7 +38,7 @@ description: TileLang Ascend Developer/Expert 模式选择与 pass_configs 配
 import tilelang
 
 pass_configs = {
-    tilelang.PassConfigKey.TL_ASCEND_AUTO_SYNC: True,        # ① 自动同步
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_SYNC: True,        # ① 自动核内同步
     tilelang.PassConfigKey.TL_ASCEND_MEMORY_PLANNING: True,   # ② 自动内存规划
     tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_COMBINE: True,   # ③ 自动CV分离
     tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_SYNC: True,      # ④ 自动核间同步
@@ -64,9 +64,11 @@ pass_configs = {
 
 - **底层 key**：`"tl.ascend_auto_cv_combine"`，默认 False
 - **功能**：自动将 kernel 中的 Cube 操作和 Vector 操作分离到不同的执行核
-- **开启时**：无需手写 `with T.Scope("C")` / `with T.Scope("V")`
+- **开启时**：无需手写 `with T.Scope("C")` / `with T.Scope("V")`，编译器根据 buffer 类型和所用原语自动识别
 - **关闭时**：必须手动用 `T.Scope` 标注每段代码的执行域
 
+> 注意：避免在开启 AUTO_CV_COMBINE 同时手写 `T.Scope`，可能会导致编译器无法正确处理代码
+
 #### ④ TL_ASCEND_AUTO_CV_SYNC（自动核间同步）
 
 - **底层 key**：`"tl.ascend_auto_cross_core_sync"`，默认 False
@@ -76,14 +78,14 @@ pass_configs = {
 
 ### 2.2 按场景选择 pass_configs
 
-
-| 场景 | AUTO_SYNC | MEMORY_PLANNING | AUTO_CV_COMBINE | AUTO_CV_SYNC |
-|------|-----------|-----------------|-----------------|--------------|
-| **纯 Vector 算子**（elementwise, softmax） | ✅ | ✅ | 不需要 | 不需要 |
-| **Developer GEMM** | ✅ | ✅ | ✅ | ✅ |
-| **Developer Flash Attention（核间流水线）** | ✅ | 视情况 | ✅ | ✅ |
-| **Expert 极致性能** | ❌ | ❌ | ❌ | ❌ |
-| **混合模式** | ✅ | ✅ | ✅ | ✅ |
+| 场景 | AUTO_SYNC | MEMORY_PLANNING | AUTO_CV_COMBINE | AUTO_CV_SYNC | 手动 Scope |
+|------|-----------|-----------------|-----------------|--------------|------------|
+| **纯 Vector 算子**（elementwise, softmax） | ✅ | ✅ | ❌ | ❌ | ❌ |
+| **Developer GEMM**（完全自动） | ✅ | ✅ | ✅ | ✅ | ❌ |
+| **Developer Flash Attention**（核间流水线） | ✅ | ✅ | ✅ | ✅ | ❌ |
+| **Developer CV 融合**（Vector计算+Cube GEMM） | ✅ | ✅ | ✅ | ✅ | ❌ |
+| **混合模式 CV 融合** | ✅ | ✅ | ❌ | ❌ | ✅ |
+| **Expert 极致性能** | ❌ | ❌ | ❌ | ❌ | ✅ |
 
 **纯 Vector 算子**（来自 Programming Guide §2.2）：
 ```python
@@ -93,13 +95,13 @@ pass_configs = {
 }
 ```
 
-**Developer GEMM**：
+**Developer GEMM / Developer CV 融合**（推荐配置）：
 ```python
 pass_configs = {
-    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_COMBINE: True,
-    tilelang.PassConfigKey.TL_ASCEND_AUTO_SYNC: True,
-    tilelang.PassConfigKey.TL_ASCEND_MEMORY_PLANNING: True,
-    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_SYNC: True,
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_COMBINE: True,  # 自动分离 Cube/Vector
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_SYNC: True,        # 自动核内同步
+    tilelang.PassConfigKey.TL_ASCEND_MEMORY_PLANNING: True,  # 自动内存规划
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_SYNC: True,     # 自动核间同步
 }
 ```
 
@@ -144,19 +146,16 @@ pass_configs = {
 
 ---
 
-## 4. 实际代码对比
-
-完整的 Developer vs Expert 代码对比，请参考：
-
-→ [mode-examples.md](references/mode-examples.md)
-
----
+## 4. 示例代码与代码对比
 
-## 5. 示例代码位置
+| 模式 | 目录 | 说明 |
+|------|------|------|
+| Developer | `examples/developer_mode/` | GEMM、elementwise 等 |
+| Expert | `examples/gemm/example_gemm_intrinsic.py`、`examples/flash_attention/fa_opt/flash_attn_bhsd_expert_*.py` | 极致性能优化 |
+| 混合（核间流水线） | `examples/flash_attention/flash_attn_bhsd_cc_sync.py`、`examples/flash_attention/fa_opt/flash_attn_bhsd_auto_pipeline_*.py` | FA 核间流水线 |
+| 纯 Vector | `examples/elementwise/`、`examples/softmax/` | 无 Cube 操作 |
+| CV 融合 | `examples/dequantize_gemm/`、`examples/quant_batch_matmul/` | Vector 计算 + Cube GEMM |
 
-| 模式 | 目录 |
-|------|------|
-| Developer | `examples/developer_mode/` |
-| Expert | `examples/gemm/example_gemm_intrinsic.py`、`examples/flash_attention/fa_opt/flash_attn_bhsd_expert_*.py` |
-| 混合（核间流水线） | `examples/flash_attention/flash_attn_bhsd_cc_sync.py`、`examples/flash_attention/fa_opt/flash_attn_bhsd_auto_pipeline_*.py` |
-| 纯 Vector | `examples/elementwise/`、`examples/softmax/` |
+**完整代码对比**（Developer vs Expert）：
+- → [mode-examples.md](references/mode-examples.md)
+- 包含 GEMM、Flash Attention、Softmax、CV 融合（W4A8 GEMM） 等示例
diff --git a/.agents/skills/tilelang-custom-skill/tilelang-expert-to-developer/references/mode-examples.md b/.agents/skills/tilelang-custom-skill/tilelang-expert-to-developer/references/mode-examples.md
index 8e957f34b..18afc207a 100644
--- a/.agents/skills/tilelang-custom-skill/tilelang-expert-to-developer/references/mode-examples.md
+++ b/.agents/skills/tilelang-custom-skill/tilelang-expert-to-developer/references/mode-examples.md
@@ -161,3 +161,135 @@ with T.Kernel(m_num, is_npu=True) as (cid, vid):
 ```
 
 **关键点**：`T.tile.xxx` 和 `T.reduce_*` 可以在 Developer pass_configs 下正常工作，无需手写同步。
+
+---
+
+## 6. CV 融合 — Developer 模式（W4A8 GEMM）
+
+CV 融合典型场景：Vector 核解量化 + Cube 核 GEMM。
+
+```python
+import tilelang
+import tilelang.language as T
+
+PASS_CONFIGS = {
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_SYNC: True,
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_SYNC: True,
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_COMBINE: True,
+    tilelang.PassConfigKey.TL_ASCEND_MEMORY_PLANNING: True,
+}
+
+VEC_NUM = 2
+BLOCK_K_HALF = 128
+
+@tilelang.jit(out_idx=[-1], pass_configs=PASS_CONFIGS)
+def w4a8_gemm_cv(M, N, K):
+    K_half = K // 2
+    block_M = 64
+    block_N = 16  # 满足 L0B/L0C 分形限制（必须 ≥ 16）
+    block_N_2 = block_N // VEC_NUM  # 每个 V 核处理 8 行
+    block_K_chunk = BLOCK_K_HALF * 2
+
+    k_num = T.ceildiv(K_half, BLOCK_K_HALF)
+    m_num = T.ceildiv(M, block_M)
+    n_num = T.ceildiv(N, block_N)
+
+    @T.prim_func
+    def main(
+        A: T.Tensor((M, K), "int8"),
+        B_packed: T.Tensor((N, K_half), "uint8"),
+        workspace: T.Tensor((N, K), "int8"),
+        C: T.Tensor((M, N), "int32"),
+    ):
+        with T.Kernel(m_num * n_num, is_npu=True) as (cid, vid):
+            bm = cid // n_num
+            bn = cid % n_num
+
+            # ===== Vector 核部分：W4 解量化 =====
+            # 使用 alloc_shared，编译器自动映射到 UB
+            packed_ub = T.alloc_shared((BLOCK_K_HALF,), "uint8")
+            output_ub = T.alloc_shared((BLOCK_K_HALF * 2,), "int8")
+            # ... 其他临时 buffer ...
+
+            # 每个 V 核处理 block_N_2 行
+            for row in T.serial(block_N_2):
+                actual_row = bn * block_N + vid * block_N_2 + row  # 关键索引
+
+                for k_chunk in T.serial(k_num):
+                    chunk_offset = k_chunk * BLOCK_K_HALF
+                    
+                    # 读数据（用 actual_row）
+                    T.copy(B_packed[actual_row, chunk_offset], packed_ub)
+                    
+                    # ... W4 解量化逻辑（T.tile.bitwise_and/rshift/cast/add）...
+                    
+                    # 写 workspace（必须用 actual_row！）
+                    T.copy(output_ub, workspace[actual_row, chunk_offset * 2])
+
+            # ===== Cube 核部分：GEMM =====
+            # 使用 alloc_shared/fragment，编译器自动映射到 L1/L0
+            A_L1 = T.alloc_shared((block_M, block_K_chunk), "int8")
+            B_L1 = T.alloc_shared((block_N, block_K_chunk), "int8")
+            C_L0 = T.alloc_fragment((block_M, block_N), "int32")
+
+            for k_chunk in T.serial(k_num):
+                k_offset = k_chunk * BLOCK_K_HALF * 2
+                
+                # Cube 核读取完整 block_N（不涉及 vid）
+                T.copy(A[bm * block_M, k_offset], A_L1)
+                T.copy(workspace[bn * block_N, k_offset], B_L1)  # 完整 16 行
+                
+                # init=(k_chunk == 0)：第一次调用清零 C_L0
+                T.gemm_v0(A_L1, B_L1, C_L0, transpose_B=True, init=(k_chunk == 0))
+
+            T.copy(C_L0, C[bm * block_M, bn * block_N])
+
+    return main
+```
+
+**特点**：
+- **无 `T.Scope`、无手动同步**：AUTO_CV_COMBINE 和 AUTO_CV_SYNC 自动处理
+- **V 核并行化**：`vid` 分配任务，每个 V 核处理 8 行
+- **workspace 索引一致性**：读写都使用 `actual_row`
+- **Cube 核读取完整 block_N**：GEMM 不涉及 vid
+- **满足分形限制**：`block_N = 16`（≥ L0B/L0C 最小要求）
+
+**关键 pass_configs**：
+- `AUTO_CV_COMBINE`：编译器识别 Vector 解量化 + Cube GEMM 并自动分离
+- `AUTO_CV_SYNC`：编译器自动在 Vector 写完 workspace 后通知 Cube 读取
+
+### 6.1 CV 融合算子特征
+
+**CV 融合算子** = Vector 核预处理/后处理 + Cube 核 GEMM
+
+典型场景：
+- **W4A8 GEMM**：Vector 核解量化（W4 → int8），Cube 核做 GEMM
+- **Flash Attention**：Vector 核 Softmax，Cube 核做两次 GEMM
+- **量化 GEMM**：Vector 核反量化/量化，Cube 核做 GEMM
+
+### 6.2 Developer 模式下 CV 融合的关键点
+
+**必须开启 4 个 pass_configs**：
+- `AUTO_CV_COMBINE`：编译器自动识别 Cube/Vector 操作并分离到不同核
+- `AUTO_CV_SYNC`：编译器自动在 Cube/Vector 写入 workspace 后插入核间同步
+- **不要手写 `T.Scope("C")` / `T.Scope("V")`**（会与 AUTO_CV_COMBINE 冲突）
+
+### 6.3 V 核并行化（避免算力浪费）
+
+Ascend NPU C:V = 1:2，两个 V 核默认执行相同工作。正确使用 `vid` 可让两个 V 核分担任务。
+
+**易错点**：
+- workspace 写入时忘记使用 `actual_row`（导致数据错乱）
+- Cube 核读取时使用 vid 切分（Cube 不涉及 vid）
+
+### 6.4 编译器警告解读
+
+Developer 模式下可能出现：
+```
+Warning: Cube loop times (= X) is not enough to catch up vec loop times (= Y)
+```
+
+**解读**：
+- Vector 循环次数 = `block_N_2 × k_num`
+- Cube 循环次数 = `k_num`
+- 此警告可忽略，AUTO_CV_SYNC 会确保同步正确
diff --git a/.agents/skills/tilelang-op-design/SKILL.md b/.agents/skills/tilelang-op-design/SKILL.md
index 69e5cc99a..929b4d644 100644
--- a/.agents/skills/tilelang-op-design/SKILL.md
+++ b/.agents/skills/tilelang-op-design/SKILL.md
@@ -32,6 +32,10 @@ description: "根据算子需求生成 TileLang-Ascend 算子设计文档（desi
 | 输入张量规格 | shape、dtype |
 | 输出张量规格 | shape、dtype |
 | 编程模式偏好 | Developer / Expert / 混合 |
+| **迁移算子路径** ⭐ | 原算子文件路径（迁移时必需），用于获取 golden 实现 |
+| **输出形状** ⭐ | 原算子输出 shape（迁移时必需），如 `(N, M)` 或 `(M, N)` |
+
+**迁移算子时必须提供原算子路径和输出形状**，否则无法证明迁移正确性。详见 [tilelang-op-generate/references/pr-ready-guide.md §1](../tilelang-op-generate/references/pr-ready-guide.md)。
 
 **提问规则（必须严格遵守）**：
 1. **每次只询问一个字段**：使用 `question` 工具时，`questions` 数组中只包含一个元素
@@ -214,23 +218,26 @@ grep "T.Scope\|T.barrier" examples/{同类实现}  # 同步方式
 算子数学公式
 ├─ 含 matmul / @ / 矩阵乘
 │   ├─ 仅 matmul → 纯 Cube
-│   │   参考: examples/gemm/example_gemm.py
-│   │   模式: Expert（手动管理 L0）
-│   │   API（Ascend 专用）: T.gemm_v0(A_L1, B_L1, C_L0C, transpose_A, init)
-│   │   API（通用版，本项目不推荐）: T.gemm
-│   │   内存（Expert）: T.alloc_L1 → T.alloc_L0C
-│   │   内存（Developer）: T.alloc_shared → T.alloc_fragment
-│   │   同步: T.barrier_all() + T.Scope("C")
-│   │   Kernel: T.Kernel(一维, is_npu=True) as (cid, _)
+│   │   模式: Developer (推荐) 或 Expert
+│   │   API: T.gemm_v0 / T.mma
+│   │   内存: GM→L1→L0A/L0B→L0C→GM
+│   │   pass_configs: 全开启（Developer）
+│   │   Kernel: T.Kernel(任务数, is_npu=True) as (cid, _)
 │   │
-│   └─ matmul + element-wise 后处理 → 混合（融合算子）
-│       模式: Developer + 自动同步（推荐）或 Expert + 手动同步
-│       API: T.gemm_v0 + T.tile.* / T.Parallel + workspace
-│       内存: GM→L1→L0A/L0B→L0C→workspace→UB→GM
-│       workspace: 数量/shape/dtype 自动推断，位于 GM
-│       pass_configs: AUTO_CV_COMBINE:True + AUTO_CV_SYNC:True + AUTO_SYNC:True
-│       同步: 自动（AUTO_CV_SYNC）或手动（T.set_cross_flag / T.wait_cross_flag）
-│       参考示例: examples/flash_attention/flash_attn_bhsd_cc_sync.py
+│   └─ matmul + element-wise 前处理/后处理 → CV 融合算子
+│       ├─ Developer 模式（推荐）
+│       │   模式: Developer + AUTO_CV_COMBINE
+│       │   API: T.tile.* (Vector) + T.gemm_v0 (Cube)
+│       │   内存: GM→L1→L0C→workspace→UB→GM
+│       │   pass_configs: AUTO_SYNC + AUTO_CV_COMBINE + AUTO_CV_SYNC
+│       │   同步: AUTO_SYNC + AUTO_CV_SYNC 自动处理
+│       │   V 核: 可用 vid 并行化（每个 V 核处理 block_N // VEC_NUM 行）
+│       │
+│       ├─ Expert 模式（极致性能）
+│       │   模式: Expert + T.Scope("C"/"V") + T.set_cross_flag
+│       │   同步: 手动核间同步（T.set_cross_flag / T.wait_cross_flag）
+│       │
+│    典型算子: W4A8 GEMM, Flash Attention, 量化 GEMM
 │
 ├─ 纯 element-wise（逐元素运算）
 │   参考: examples/elementwise/*.py, examples/activation/*.py
@@ -259,6 +266,25 @@ grep "T.Scope\|T.barrier" examples/{同类实现}  # 同步方式
     强制步骤: 先搜索本项目 examples/
 ```
 
+**⚠️ NPU 硬件约束（必查）**：
+
+设计 Tiling 策略时，必须考虑：
+1. **分形限制**（Fractal Limits）：
+   - L0A: M ≥ 16, K ≥ 32
+   - L0B: K ≥ 32, N ≥ 16
+   - L0C: M ≥ 16, N ≥ 16
+2. **对齐要求**：
+   - UB/L1: 32 Byte
+   - L0A/L0B: 512 Byte
+   - L0C: 64 Byte
+3. **存储大小上限**：
+   - L0A/L0B: 64KB
+   - L0C: 128KB
+   - L1: 512KB
+   - UB: 192KB
+
+违反约束会导致编译错误或运行时错误。详见 [tilelang-api-best-practices](../tilelang-custom-skill/tilelang-api-best-practices/references/api-kernel-memory.md)。
+
 ### 4.2 API 映射规则
 
 | 类别 | Ascend 专用 API（推荐） | 通用 API（本项目不推荐/不支持） |
@@ -286,11 +312,14 @@ grep "T.Scope\|T.barrier" examples/{同类实现}  # 同步方式
 | 6 | **验证方案覆盖典型配置**：不是「待补充」 | ⭕ 推荐 |
 | 7 | **无占位符或模糊描述**：无 `{placeholder}`、TODO、「待补充」（已确认的除外） | ✅ 必须 |
 | 8 | **技术约束已确认**：三维 Kernel、threads、动态边界等问题已处理 | ✅ 必须 |
-| 9 | **本项目同类实现已列出**：有具体的 examples/ 文件路径参考 | ✅ 必须 |
-| 10 | **参考实现差异已说明**：如果有外部参考，列出 API/结构差异 | ⭕ 推荐 |
-| 11 | **参考实现分析完整**（如有参考实现）：记录了内存层级 API、同步策略、pass_configs 等技术决策 | ⭕ 推荐 |
-| 12 | **CV 融合设计完整**（如需）：workspace 规格、数据流、pass_configs | ⭕ 推荐 |
-| 13 | **workspace_idx 配置正确**（如需 CV 融合）：与 workspace 参数位置一致 | ✅ 必须 |
+| 9 | **含 GEMM 场景**：Tiling 策略满足 NPU 分形限制（block_M ≥ 16, block_N ≥ 16） | ✅ 必须 |
+| 13 | **含 CV 融合场景**：workspace 规格、数据流、pass_configs 设计完整| ✅ 必须 |
+| 14 | **含 CV 融合场景 workspace_idx 配置正确**：与 workspace 参数位置一致 | ✅ 必须 |
+| 10 | **本项目同类实现已列出**：有具体的 examples/ 文件路径参考 | ✅ 必须 |
+| 11 | **参考实现差异已说明**：如有外部参考，列出 API/结构差异 | ⭕ 推荐 |
+| 12 | **参考实现分析完整**：如有外部参考，记录内存层级 API、同步策略、pass_configs 等技术决策 | ⭕ 推荐 |
+| 15 | **参考实现标注原算子路径**：如有外部参考，标注文件路径，用于获取 golden 实现 | ⭕ 推荐 |
+| 16 | **参考实现标注输出形状**：如有外部参考，说明输出形状是否需要 transpose | ⭕ 推荐 |
 
 **通过条件**：必须项（1, 2, 3, 7, 8, 9）全部通过，推荐项（4, 5, 6, 10）至少通过 3/4。
 
diff --git a/.agents/skills/tilelang-op-generate/SKILL.md b/.agents/skills/tilelang-op-generate/SKILL.md
index 64c4ed8e1..d28716231 100644
--- a/.agents/skills/tilelang-op-generate/SKILL.md
+++ b/.agents/skills/tilelang-op-generate/SKILL.md
@@ -110,8 +110,6 @@ import tilelang
 from tilelang import DataType, language as T
 import torch
 
-tilelang.cache.clear_cache()
-
 # ========== 算子实现 ==========
 @tilelang.jit(out_idx=[...], pass_configs={...})
 def op_name(M, N, block_M, block_N, dtype="float"):
@@ -133,7 +131,8 @@ def op_name(M, N, block_M, block_N, dtype="float"):
 
 # ========== 测试 ==========
 if __name__ == "__main__":
-    torch.manual_seed(0)
+    tilelang.disable_cache()  # 在 __main__ 中禁用编译缓存
+    torch.manual_seed(...)
     test_configs = [...]  # 来自 design.md §8
 
     for config in test_configs:
@@ -144,7 +143,7 @@ if __name__ == "__main__":
         # 5. 精度检查
         pass
 
-    print("All tests passed!")
+    print("Test Passed!")
 ```
 
 **融合算子注意事项**：
@@ -157,10 +156,13 @@ if __name__ == "__main__":
 python examples/{op}/example_{op}.py
 ```
 
-如果报错，按以下顺序排查：
-1. **编译错误** → 检查 buffer 大小、API 参数、对齐
-2. **运行错误** → 检查索引越界、同步缺失
-3. **精度错误** → 检查计算公式、数据类型、容差设置
+如果报错，查阅 [troubleshooting.md](references/troubleshooting.md) 进行排查：
+
+| 错误类型 | 排查方向 | 详细参考 |
+|---------|---------|---------|
+| 编译错误 | buffer 大小、API 参数、对齐 | troubleshooting.md §编译时错误 |
+| 运行错误 | 索引越界、同步缺失 | troubleshooting.md §运行时错误 |
+| 精度错误 | Golden 实现、输出形状 | troubleshooting.md §精度问题 |
 
 ### 步骤 5：校验原有实现正确性
 
@@ -196,7 +198,20 @@ def conv_im2col_gemm(...):
     C = globals()['C']  # 多测试场景会互相污染
 ```
 
----
+### 步骤 8：上库前检查清单
+
+运行通过后，必须按 §8 Checklist 检查所有项目。重点注意：
+
+| # | 关键项 | 说明 |
+|---|--------|------|
+| 1 | **Golden 实现一致** | 迁移算子必须使用原算子的 golden 实现 |
+| 2 | **tilelang.disable_cache()** | 放在 `__main__` 下方或 `main()` 内部 |
+| 3 | **最后一行输出** | `"Test Passed!"` 或 `"Kernel Output Match!"` |
+| 4 | **代码格式** | `ruff check` + `ruff format --check` |
+
+详见：
+- [pr-ready-guide.md](references/pr-ready-guide.md) - 上库前收尾工作完整指南
+- §8 Checklist - 完整检查清单
 
 ## 4. 关键编码规范
 
@@ -233,19 +248,49 @@ output = output[:M, :N]
 a_ub = T.alloc_ub([block_M // VEC_NUM, block_N], dtype)
 ```
 
+Developer 模式下：
+```python
+# Vector 核 buffer（编译器映射到 UB）
+packed_ub = T.alloc_shared([block_M // VEC_NUM, block_N], dtype)
+
+# Cube 核 buffer（编译器映射到 L1/L0）
+A_L1 = T.alloc_shared([block_M, block_K], dtype)
+B_L1 = T.alloc_shared([block_N, block_K], dtype)
+C_L0 = T.alloc_fragment([block_M, block_N], accum_dtype)
+```
+
 ### 数据搬运索引
 
 ```python
-# 标准索引模式
+# 标准索引模式（纯 Vector 算子）
 row_start = bx * block_M + vid * block_M // VEC_NUM
 T.copy(A[row_start, by * block_N], a_ub)
 T.copy(a_ub, B[row_start, by * block_N])
 ```
 
+**⚠️ CV 融合场景（workspace 索引一致性）**：
+```python
+VEC_NUM = 2
+block_N_2 = block_N // VEC_NUM
+
+for row in T.serial(block_N_2):
+    actual_row = bn * block_N + vid * block_N_2 + row  # 关键索引
+    
+    # 读数据和写 workspace 都必须用 actual_row
+    T.copy(B_packed[actual_row, chunk_offset], packed_ub)  # ✓
+    # ... 处理 ...
+    T.copy(output_ub, workspace[actual_row, chunk_offset * 2])  # ✓（必须一致）
+
+# Cube 核读取完整 block_N（不涉及 vid）
+T.copy(workspace[bn * block_N, k_offset], B_L1)  # 完整 block_N
+```
+
+**易错点**：workspace 写入时忘记使用 `actual_row`，导致数据错乱。
+
 ### 同步
 
 ```python
-# Expert 模式：手��同步
+# Expert 模式：手动同步
 with T.Scope("V"):
     T.copy(A[...], a_ub)
     T.barrier_all()
@@ -279,30 +324,151 @@ torch.testing.assert_close(output.cpu(), ref_output.cpu(), rtol=rtol, atol=atol)
 
 ---
 
-## 5. Checklist
+## 5. V 核并行化编码规范
+
+Ascend NPU C:V = 1:2，默认两个 V 核执行相同工作。正确使用 `vid` 可让两个 V 核分担任务。
+
+### 按行切分
+
+```python
+VEC_NUM = 2
+block_M_2 = block_M // VEC_NUM
+
+with T.Kernel(grid_size, is_npu=True) as (cid, vid):
+    row_start = cid * block_M + vid * block_M_2
+    
+    # Buffer 分配：只需分配 V 核负责的行数
+    data_ub = T.alloc_shared((block_M_2, block_N), dtype)
+    
+    # 读入数据
+    T.copy(A[row_start, by * block_N], data_ub)
+    
+    # 计算
+    ...
+    
+    # 写出数据（索引必须与读一致）
+    T.copy(data_ub, B[row_start, by * block_N])
+```
+
+### 中间 buffer 索引一致性
+
+当 V 核读写中间 buffer（workspace、临时 buffer）时，必须保持索引一致：
+
+```python
+# 错误：读写索引不一致
+for row in T.serial(block_N_2):
+    actual_row = bn * block_N + vid * block_N_2 + row
+    T.copy(src[actual_row, ...], temp_ub)
+    T.copy(temp_ub, dst[bn * block_N + row, ...])  # ❌ 索引不一致
+
+# 正确：读写索引一致
+for row in T.serial(block_N_2):
+    actual_row = bn * block_N + vid * block_N_2 + row
+    T.copy(src[actual_row, ...], temp_ub)
+    T.copy(temp_ub, dst[actual_row, ...])  # ✓ 索引一致
+```
+
+### 模式三：CV 融合中的 V 核并行化
+
+CV 融合算子中，V 核负责预处理，Cube 核负责 GEMM：
+
+```python
+VEC_NUM = 2
+block_N_2 = block_N // VEC_NUM
+
+# Vector 核部分：使用 vid 分配任务
+for row in T.serial(block_N_2):
+    actual_row = bn * block_N + vid * block_N_2 + row
+    T.copy(B_packed[actual_row, ...], ...)
+    T.copy(..., workspace[actual_row, ...])
+
+# Cube 核部分：读取完整 block_N（不涉及 vid）
+T.copy(workspace[bn * block_N, ...], B_L1)
+T.gemm_v0(A_L1, B_L1, C_L0, ...)
+```
+
+---
+
+## 6. GEMM 编码规范
+
+### gemm_v0 初始化
+
+第一次调用必须清零 C_L0：
+
+```python
+for k_chunk in T.serial(k_num):
+    T.gemm_v0(A_L1, B_L1, C_L0, transpose_B=True, init=(k_chunk == 0))
+```
+
+### NPU 分形限制
+
+GEMM 的 block size 必须满足 L0A/L0B/L0C 分形限制（详见 [api-compute.md](../tilelang-custom-skill/tilelang-api-best-practices/references/api-compute.md)）：
+
+- int8 GEMM：`block_M ≥ 16`, `block_N ≥ 16`, `block_K ≥ 32`
+- float16 GEMM：`block_M ≥ 16`, `block_N ≥ 16`, `block_K ≥ 16`
+
+---
+
+## 7. CV 融合 pass_configs
+
+CV 融合算子必须开启全部 4 个 pass_configs：
+
+```python
+PASS_CONFIGS = {
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_SYNC: True,
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_SYNC: True,
+    tilelang.PassConfigKey.TL_ASCEND_AUTO_CV_COMBINE: True,  # 自动分离 Cube/Vector
+    tilelang.PassConfigKey.TL_ASCEND_MEMORY_PLANNING: True,
+}
+```
+
+---
+
+## 8. Checklist
 
 生成代码后逐项检查：
 
-### 基础检查
+### 功能验证
 
 | # | 检查项 |
 |---|--------|
 | 1 | `out_idx` 与函数签名中的输出参数位置一致 |
-| 2 | `block_M // VEC_NUM` 在 buffer 分配和索引中一致使用 |
+| 2 | V 核并行化：`block_M // VEC_NUM` 在 buffer 分配和索引中一致使用（详见 §5） |
 | 3 | 所有 `T.alloc_ub` 的 shape 乘积不超 UB 容量 |
 | 4 | Expert 模式有 `T.Scope("V")` 和 `T.barrier_all()` |
 | 5 | Developer 模式有对应的 `pass_configs` |
 | 6 | 测试包含至少 2 个配置（小规模 + 典型规模） |
-| 7 | golden 函数使用 PyTorch 标准实现 |
+| 7 | 含 GEMM：`gemm_v0` 第一次调用有 `init=True`（详见 §6） |
+| 8 | 含 GEMM：block size 满足分形限制（详见 §6） |
+
+### Golden 与精度验证
+
+| # | 检查项 | 说明 |
+|---|--------|------|
+| 9 | **Golden 实现一致** | 迁移算子必须使用原算子的 golden 实现（详见 [pr-ready-guide.md](references/pr-ready-guide.md) §1） |
+| 10 | **输出形状匹配** | 检查是否需要 transpose 来匹配原算子输出 shape |
+
+### 上库前收尾检查（详见 [pr-ready-guide.md](references/pr-ready-guide.md)）
+
+| # | 检查项 | 方法 |
+|---|--------|------|
+| 11 | **tilelang.disable_cache()** | 放在 `__main__` 下方或 `main()` 内部 |
+| 12 | **注释转英文** | 人工检查所有注释 |
+| 13 | **`# type: ignore`** | 添加到所有 `T.Tensor` 参数定义 |
+| 14 | **移除 try-catch** | 测试代码中不应有异常捕获 |
+| 15 | **每组测试提示** | `print(f"Test passed: M={M}, N={N}, K={K}")` |
+| 16 | **最终输出格式** | `"Test Passed!"` 或 `"Kernel Output Match!"` |
+| 17 | **参数处理灵活** | 支持自定义参数 + 默认多组测试 |
+| 18 | **代码格式检查** | `ruff check` + `ruff format --check` 通过 |
 
-### 融合算子检查
+### 融合算子专项检查
 
 | # | 检查项 | 说明 |
 |---|--------|------|
-| 8 | **workspace_idx 与函数签名一致** | workspace 参数位置正确 |
-| 9 | **AUTO_CV_COMBINE / AUTO_CV_SYNC 配置** | Developer 模式需开启 |
-| 10 | **Cube → workspace → Vector 数据流正确** | T.copy 搬运路径完整 |
-| 11 | **核分离方式与 pass_configs 匹配** | Developer 模式无需显式 T.Scope |
+| 19 | **workspace_idx 与函数签名一致** | workspace 参数位置正确 |
+| 20 | **AUTO_CV_COMBINE / AUTO_CV_SYNC 配置** | Developer 模式需开启 |
+| 21 | **Cube → workspace → Vector 数据流正确** | T.copy 搬运路径完整 |
+| 22 | **核分离方式与 pass_configs 匹配** | Developer 模式无需显式 T.Scope |
 
 ### 融合算子常见错误排查
 
diff --git a/.agents/skills/tilelang-op-generate/references/pr-ready-guide.md b/.agents/skills/tilelang-op-generate/references/pr-ready-guide.md
new file mode 100644
index 000000000..82239198b
--- /dev/null
+++ b/.agents/skills/tilelang-op-generate/references/pr-ready-guide.md
@@ -0,0 +1,336 @@
+# 算子上库前收尾工作指南
+
+本文档描述算子代码生成后，在提交 PR 上库前必须完成的收尾工作。
+
+---
+
+## 1. Golden 实现的重要性 ⭐⭐⭐
+
+### 1.1 为什么 Golden 实现至关重要
+
+Golden 实现是证明算子迁移正确性的**唯一证据**。如果 golden 实现不一致，即使测试通过也无法证明迁移正确。
+
+### 1.2 Golden 实现来源优先级
+
+| 优先级 | 来源 | 适用场景 |
+|--------|------|----------|
+| **最高** | 原算子的 golden 实现 | 迁移已有算子（必须使用） |
+| **次高** | PyTorch 标准实现 | 新算子开发 |
+| **最低** | 手写实现 | 无标准实现时 |
+
+### 1.3 迁移算子时必须检查
+
+**必须回答的问题**：如何证明我的实现与原算子一致？
+
+| 检查项 | 方法 |
+|--------|------|
+| golden 函数是否一致 | 对比原算子代码，确保使用相同的 golden 实现 |
+| 输出形状是否一致 | 检查原算子输出 shape，可能需要 transpose |
+| 数据类型是否一致 | 确保输入输出 dtype 与原算子匹配 |
+
+### 1.4 输出形状匹配示例
+
+原算子可能输出 `(N, M)`，而你的 kernel 输出 `(M, N)`：
+
+```python
+# 原算子输出 (N, M)
+def ref_program(A, qB):
+    B = torch_convert(qB)
+    C = torch.matmul(A.to(torch.float), B.T.to(torch.float))
+    return C.transpose(0, 1)  # (M, N) → (N, M)
+
+# 你的 kernel 输出 (M, N)
+# 测试时需要 transpose 匹配
+result = kernel(A_int8, qB, workspace, C_output)  # (M, N)
+expected = ref_program(A_int8, qB)  # (N, M)
+
+torch.testing.assert_close(result.cpu().transpose(0, 1), expected)
+```
+
+---
+
+## 2. 参数处理灵活性
+
+### 2.1 推荐模式
+
+支持用户自定义参数 + 默认测试：
+
+```python
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--M", type=int, default=0)
+    parser.add_argument("--N", type=int, default=0)
+    parser.add_argument("--K", type=int, default=0)
+    args = parser.parse_args()
+
+    torch.manual_seed(42)
+
+    # 用户自定义参数时单测
+    if args.M > 0 and args.N > 0 and args.K > 0:
+        test(args.M, args.N, args.K)
+    else:
+        # 默认：多组测试
+        test(64, 64, 256)
+        test(128, 128, 512)
+        test(256, 256, 1024)
+
+    print("Test Passed!")
+```
+
+### 2.2 测试配置建议
+
+| 场景 | 测试配置 |
+|------|----------|
+| 简单算子 | 2 组：小规模 + 典型规模 |
+| 复杂算子 | 3 组：小规模 + 典型规模 + 大规模 |
+| GEMM 类 | 3 组：K=256, 512, 1024 |
+
+---
+
+## 3. 测试输出规范
+
+### 3.0 缓存禁用 ⭐
+
+测试算子时，必须禁用缓存以确保每次重新编译：
+
+```python
+def main():
+    tilelang.disable_cache()  # Disable cache for testing correctness
+    torch.manual_seed(...)
+    
+    # ... test code ...
+```
+
+**⚠️ 重要规则**：
+
+| 规则 | 说明 |
+|------|------|
+| ✅ 推荐 | `tilelang.disable_cache()` 放在 `__main__` 下方或 `main()` 内部，只在测试时禁用 |
+| ❌ 禁止 | `tilelang.disable_cache()` 放在文件开头（全局声明），影响其他人 import kernel |
+| ❌ 禁止 | `tilelang.cache.clear_cache()`，会清理全部编译缓存，影响其它算子 |
+
+**原因**：
+- 全局声明会影响其他人 import 这个算子的 kernel 函数
+- `clear_cache()` 会清理全部编译缓存，影响其它算子的缓存
+- 只在测试时禁用缓存，可以验证算子实现正确性
+
+### 3.1 每组测试通过的提示
+
+每组测试通过时应输出提示，避免让人觉得卡住：
+
+```python
+def test(M, N, K):
+    kernel = op_name(M, N, K)
+    
+    # ... 执行 kernel ...
+    
+    torch.testing.assert_close(result.cpu(), expected, rtol=1e-2, atol=1e-2)
+    print(f"Test passed: M={M}, N={N}, K={K}")  # ✓ 关键输出
+```
+
+### 3.2 最终输出格式
+
+最后一行必须输出 `"Test Passed!"` 或 `"Kernel Output Match!"`，以符合 bench_test.sh 的判定逻辑：
+
+```bash
+# bench_test.sh 判定条件
+if [[ "$output" =~ [Kk][Ee][Rr][Nn][Ee][Ll][[:space:]][Oo][Uu][Tt][Pp][Uu][Tt][[:space:]][Mm][Aa][Tt][Cc][Hh] ]] || \
+   [[ "$output" =~ [Tt][Ee][Ss][Tt][[:space:]][Pp][Aa][Ss][Ss][Ee][Dd][!] ]]; then
+    echo "[PASSED]"
+fi
+```
+
+### 3.3 完整输出示例
+
+```
+Test passed: M=64, N=64, K=256
+Test passed: M=128, N=128, K=512
+Test passed: M=256, N=256, K=1024
+Test Passed!
+```
+
+---
+
+## 4. 代码风格规范
+
+### 4.1 注释规范
+
+| 规则 | 说明 |
+|------|------|
+| 中文注释转英文 | 所有注释使用英文 |
+| 移除调试注释 | 上库前删除临时调试代码和注释 |
+| 保留关键注释 | 保留算法说明、参数说明等关键注释 |
+
+**示例**：
+
+```python
+# ❌ 中文注释（上库前修改）
+# Vector 核部分 - 每个 V 核处理 block_N_2 行
+
+# ✓ 英文注释
+# Vector core: each V core processes block_N_2 rows
+```
+
+### 4.2 `# type: ignore` 使用
+
+`T.Tensor` 参数定义会导致 Pylance 报错，需要添加 `# type: ignore`：
+
+```python
+@T.prim_func
+def main(
+    A: T.Tensor((M, K), "int8"),               # type: ignore
+    B_packed: T.Tensor((N, K_half), "uint8"),  # type: ignore
+    workspace: T.Tensor((N, K), "int8"),       # type: ignore
+    C: T.Tensor((M, N), "int32"),              # type: ignore
+):
+    ...
+```
+
+**原因**：TileLang DSL 的 `T.Tensor` 是特殊类型定义，Pylance 无法正确识别。
+
+### 4.3 `torch.set_default_device` 使用
+
+**不推荐**：全局设置默认设备
+
+```python
+# ❌ 不推荐
+torch.set_default_device("npu")
+```
+
+**推荐**：按需指定设备
+
+```python
+# ✓ 推荐
+A_int8 = torch.randint(-8, 8, (M, K), dtype=torch.int8).npu()
+result = kernel(A_int8, qB, workspace, C_output)
+torch.npu.synchronize()
+expected = ref_program(A_int8.cpu(), qB.cpu())  # golden 在 CPU 上计算
+```
+
+**原因**：
+- 全局设置可能影响其他代码
+- 按需指定更清晰、更可控
+
+### 4.4 移除 try-catch
+
+上库前移除测试代码中的 try-catch，fail fast 更利于问题暴露：
+
+```python
+# ❌ 不推荐（上库前修改）
+try:
+    result = kernel(A_int8, qB, workspace, C_output)
+    torch.testing.assert_close(result.cpu(), expected)
+except Exception as e:
+    traceback.print_exc()
+    logging.error(f"✗ Error: {e}")
+    return False
+
+# ✓ 推荐
+result = kernel(A_int8, qB, workspace, C_output)
+torch.npu.synchronize()
+expected = ref_program(A_int8.cpu(), qB.cpu())
+torch.testing.assert_close(result.cpu(), expected, rtol=0, atol=0)
+```
+
+---
+
+## 5. 代码格式检查
+
+### 5.1 检查工具
+
+使用 `ruff` 进行 Python 代码检查：
+
+```bash
+# Lint 检查
+ruff check examples/{op}/example_{op}.py
+
+# Format 检查
+ruff format --check examples/{op}/example_{op}.py
+```
+
+详细请参考技能：[tilelang-review-skill](../../tilelang-custom-skill/tilelang-review-skill/SKILL.md)
+
+### 5.2 自动修复
+
+```bash
+# 自动修复 lint 问题
+ruff check --fix examples/{op}/example_{op}.py
+
+# 自动格式化
+ruff format examples/{op}/example_{op}.py
+```
+
+### 5.3 常见问题修复
+
+| 问题 | 修复方法 |
+|------|----------|
+| 未使用的变量 | 删除变量或添加 `_` 前缀 |
+| 行宽超限 | 手动调整或 `ruff format` |
+| 导入未使用 | 删除导入 |
+
+---
+
+## 6. 检查清单索引
+
+本文档详细说明了上库前各项检查的要求和示例，完整检查清单请参考：
+
+→ **[SKILL.md §8 Checklist](../SKILL.md#8-checklist)** - 唯一的完整检查清单（22项）
+
+### 检查项与文档章节对应
+
+| 检查项 | SKILL.md 编号 | 本文档章节 |
+|--------|--------------|-----------|
+| **Golden 实现一致** | #9 | §1 |
+| **输出形状匹配** | #10 | §1.4 |
+| **tilelang.disable_cache()** | #11 | §3.0 |
+| **注释转英文** | #12 | §4.1 |
+| **`# type: ignore`** | #13 | §4.2 |
+| **移除 try-catch** | #14 | §4.4 |
+| **每组测试提示** | #15 | §3.1 |
+| **最终输出格式** | #16 | §3.2 |
+| **参数处理灵活** | #17 | §2 |
+| **代码格式检查** | #18 | §5 |
+
+**使用方法**：
+1. 先阅读本文档 §1-5 理解各项检查的详细要求
+2. 按照 SKILL.md §8 Checklist逐项检查
+
+---
+
+## 7. 参考示例
+
+完整的上库算子示例：
+
+- `examples/dequantize_gemm/example_dequant_gemm_w4a8.py` - W4A8 GEMM CV 融合
+- `examples/quant_batch_matmul/example_quant_batch_matmul.py` - 量化 Batch Matmul
+- `examples/flash_attention/` - Flash Attention 系列
+
+---
+
+## 8. 常见错误排查
+
+### 8.1 Golden 不一致导致测试失败
+
+**症状**：测试失败，精度误差很大
+
+**排查**：
+1. 检查是否使用原算子的 golden 实现
+2. 检查输出形状是否需要 transpose
+3. 检查数据类型是否一致
+
+### 8.2 bench_test.sh 未通过
+
+**症状**：脚本判定失败
+
+**排查**：
+1. 检查最后一行输出是否为 `"Test Passed!"` 或 `"Kernel Output Match!"`
+2. 检查是否有异常退出
+3. 检查测试是否全部通过
+
+### 8.3 Pylance 报错
+
+**症状**：VSCode 显示类型错误
+
+**排查**：
+1. 添加 `# type: ignore` 到 T.Tensor 参数
+2. 确保其他代码符合类型注解规范
\ No newline at end of file
diff --git a/.agents/skills/tilelang-op-generate/references/troubleshooting.md b/.agents/skills/tilelang-op-generate/references/troubleshooting.md
index 32bdaf36b..939346831 100644
--- a/.agents/skills/tilelang-op-generate/references/troubleshooting.md
+++ b/.agents/skills/tilelang-op-generate/references/troubleshooting.md
@@ -163,16 +163,46 @@ block_M = [bs for bs in [64, 128] if bs <= M]  # 排除 256
 
 ### 2. 精度问题
 
-**现象**: 输出与参考实现有微小差异
+**现象**: 输出与参考实现有差异
 
-**原因**: float16精度较低，累积误差
+**可能原因**:
+1. **Golden 实现不一致** ⭐（最常见）
+2. float16 精度较低，累积误差
+3. 输出形状不匹配
 
 **解决方案**:
-1. 使用float32进行计算
-2. 调整测试容差：
-   ```python
-   torch.testing.assert_close(b.cpu(), ref_b.cpu(), rtol=1e-2, atol=1e-2)
-   ```
+
+#### 1. Golden 实现不一致（迁移算子时最常见）
+
+**症状**：测试失败，精度误差很大（如 98% 的元素不匹配）
+
+**排查步骤**：
+1. 检查是否使用了原算子的 golden 实现
+2. 对比原算子代码，确保算法逻辑一致
+3. 检查输出形状是否需要 transpose
+
+**示例**：
+```python
+# 原算子输出 (N, M)，你的 kernel 输出 (M, N)
+# 需要 transpose 来匹配
+result = kernel(A, B, workspace, C)  # (M, N)
+expected = ref_program(A, B)  # (N, M)
+
+torch.testing.assert_close(result.cpu().transpose(0, 1), expected)
+```
+
+**详细参考**：[pr-ready-guide.md](pr-ready-guide.md) §1
+
+#### 2. float16 精度问题
+
+使用 float32 进行计算或调整容差：
+```python
+torch.testing.assert_close(b.cpu(), ref_b.cpu(), rtol=1e-2, atol=1e-2)
+```
+
+#### 3. 输出形状不匹配
+
+检查原算子输出 shape，可能需要 transpose 或 reshape。
 
 ### 3. 性能问题
 
diff --git a/.gitignore b/.gitignore
index f00b8f036..158dadae4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -29,6 +29,7 @@ tmp/
 venv/
 .vscode/
 .vs/
+.arts/
 
 # VisualGDB files
 VisualGDB/
@@ -101,3 +102,5 @@ examples/torch_tl_ascend/demo_libtorch/lib/*.so
 docs/notebook/*.so
 docs/notebook/*.cpp
 
+.agents/reports
+