tile-ai · mmyy1217 · Apr 15, 2026 · Apr 15, 2026 · Apr 16, 2026
diff --git a/.agents/evals/minimal_skill_trigger_eval.jsonl b/.agents/evals/minimal_skill_trigger_eval.jsonl
@@ -15,3 +15,5 @@
 {"id":"review-02","prompt":"提交前帮我跑一遍格式和 lint 检查，看看 CI 风险","expected_skill":"tilelang-review-skill"}
 {"id":"github-01","prompt":"我准备把分支提 PR 到 npuir，给我完整 commit push PR 流程","expected_skill":"tilelang-github-operations"}
 {"id":"github-02","prompt":"帮我同步 upstream/npuir，rebase 后再发起 PR","expected_skill":"tilelang-github-operations"}
+{"id":"precision-01","prompt":"算子输出有 nan，并且和 Torch 的结果有细微差异，如何看精度报告","expected_skill":"tilelang-precision-debug-skill"}
+{"id":"precision-02","prompt":"请用 assert_close 给跑一下测试，输出 ascii diff map 帮我看看误差分布","expected_skill":"tilelang-precision-debug-skill"}
diff --git a/.agents/skills/tilelang-debug-helper/SKILL.md b/.agents/skills/tilelang-debug-helper/SKILL.md
@@ -35,3 +35,4 @@ Before answering, follow AGENTS.md section "Docs Auto Routing Rules (Mandatory)"
 
 - tilelang-mlir-skill
 - tilelang-error-fixer
+- tilelang-precision-debug-skill
diff --git a/.agents/skills/tilelang-precision-debug-skill/SKILL.md b/.agents/skills/tilelang-precision-debug-skill/SKILL.md
@@ -0,0 +1,56 @@
+---
+name: tilelang-precision-debug-skill
+description: TileLang 算子精度比对与误差分析技能。当用户提及“精度错误”、“结果不一致”、“比对失败”、“ASCII diff map”或需要深入定位数值差异时必须使用本技能。
+---
+
+# TileLang Precision Debug Skill
+
+## Mandatory routing rule
+
+Before answering, follow AGENTS.md section "Docs Auto Routing Rules (Mandatory)".
+
+## Trigger Guidance
+
+- When users face accuracy issues, mismatch with reference implementation (e.g., PyTorch), or precision errors.
+- When users mention `assert_close` failure, Top-10 errors, relative error too large, or need to see the ASCII diff map.
+- When a `.precision_debug` directory is mentioned.
+
+## Instructions SOP
+
+1. **Replace Comparison API:**
+   Ensure `assert_close` is imported from `testcommon` (standard in `testing/npuir`) or `tilelang.utils.prec_assert_close`.
+
+2. **Activate Debug Reporting:**
+   The detailed reporting is **disabled by default** to avoid clutter. You can activate it in two ways:
+   - **Surgical (recommended):** Add `@pytest.mark.precision_debug` to the test function.
+     ```python
+     import pytest
+     from testcommon import assert_close
+
+     @pytest.mark.precision_debug
+     def test_my_op():
+         ...
+           assert_close(actual, expected, dtype=dtype)
+       ```
+   - **Global:** Set the environment variable `TL_PREC_DEBUG=1` before running `pytest`.
+       ```bash
+       TL_PREC_DEBUG=1 pytest testing/npuir/test_xxx.py
+       ```
+
+2. **Run and Collect:**
+   Execute the test script. If there is a mismatch, the tool will automatically generate a `.precision_debug/<test_name>_<timestamp>/` directory containing `report.txt`, `diff_map.txt`, and the serialized tensors.
+
+3. **Analyze the Precision Debug Report:**
+   - **`report.txt`:** Check the mismatch ratio, `Max abs/rel diff`, and the `Top-10 largest differences`. Determine if the error is widespread (logic error) or isolated (boundary, overflow).
+   - **`diff_map.txt` (ASCII Map):** Identify spatial distribution patterns:
+     - **Blocky distribution:** Incorrect tiling or block_M/N parameters.
+     - **Periodic stripes:** Incorrect memory layout, strides, or vectorization broadcast issues.
+     - **Edge/Boundary spikes:** Padding, masking, or loop boundary conditions not handled correctly.
+
+4. **Root Cause Localization:**
+   Use the pattern analysis to trace back to Load (memory alignment), Compute (accumulation precision, type casting), or Store (write-back boundary) stages in the TVM IR or MLIR.
+
+## Related skills
+
+- tilelang-debug-helper
+- tilelang-error-fixer
diff --git a/.gitignore b/.gitignore
@@ -96,3 +96,9 @@ tilelang/jit/adapter/cython/.cycache
 
 # unit test output files
 unittest/npuir/output/
+
+# precision debug tool output
+.precision_debug/
+
+# agents
+.worktree/
diff --git a/AGENTS.md b/AGENTS.md
@@ -55,6 +55,7 @@ Use the matching skill whenever the user asks for:
 - compile/runtime error analysis on npuir branch
 - code review, lint/format checks, PR readiness
 - commit/push/rebase/upstream sync, PR or issue workflow
+- accuracy issues, assert_close failures, ASCII diff map, or precision errors
 
 Developer-mode MixCV trigger rule:
 - If one kernel contains Cube-side T.gemm and Vector-side at least one v-prefix op (such as T.vadd/T.vmul/T.vexp/T.vcast/T.vbrc), treat it as MixCV and use tilelang-mixcv-skill.

diff --git a/docs/Tilelang算子调试指南.md b/docs/Tilelang算子调试指南.md
@@ -15,9 +15,29 @@
 
 | 问题类型  | 具体表现  | 推荐调试方法  |
 | ------------ | ------------ |------------ |
-| **1.精度问题** | 算子成功编译并生成.o文件，但NPU运行结果和标杆参考（如torch）存在差异 |  `T.print`打印调试 |
-| **2.编译失败**| 算子编译失败，未生成预期的TVM IR和MLIR | 编译调试 |
-| **3.运行时失败** | 算子编译成功，但未生成.o文件，进程终止 | 运行时调试 |
+| **1.精度问题** | 算子成功编译并生成.o文件，但NPU运行结果和标杆参考（如torch）存在差异 |  `T.print`打印调试 |
+| **2.编译失败**| 算子编译失败，未生成预期的TVM IR和MLIR | 编译调试 |
+| **3.运行时失败** | 算子编译成功，但未生成.o文件，进程终止 | 运行时调试 |
+
+### 1.2 精度问题快速定位（precision debug）
+
+对于 `assert_close` 失败、结果不一致或需要观察误差分布的场景，推荐优先使用
+`testing/npuir/testcommon.py` 中的 `assert_close` 包装器，或直接使用
+`tilelang.utils.prec_assert_close`。
+
+开启方式有两种：
+
+- 在单个 pytest 用例上增加 `@pytest.mark.precision_debug`
+- 运行测试前设置环境变量 `TL_PREC_DEBUG=1`
+
+当比对失败时，工具会在 `.precision_debug/<test_name>_<timestamp>/` 下生成：
+
+- `report.txt`：误差统计、Top-10 差异点、误差分布
+- `diff_map.txt`：ASCII diff map，便于观察块状、条纹、边界等模式
+- `actual.pt` / `expected.pt`：失败时的张量快照
+
+建议优先在最小复现用例上使用该工具，先判断误差是全局扩散、局部边界异常，还是
+特定布局/广播模式下的周期性偏差，再回到 TVM IR 或 MLIR 阶段做进一步定位。
 
 # 2 Tilelang-AscendNPUIR 编译流程概览
 
@@ -408,4 +428,4 @@ _compile_option_list = [
 - **运行时失败**：xx。
 - **性能优化**：通过控制jit_npu中的编译选项来配置流水优化等功能。
 
-掌握这些调试技巧后，您就可以在TileLang中高效地调试NPU算子，充分利用昇腾硬件的计算能力。
+掌握这些调试技巧后，您就可以在TileLang中高效地调试NPU算子，充分利用昇腾硬件的计算能力。
diff --git a/testing/npuir/README.md b/testing/npuir/README.md
@@ -6,24 +6,27 @@ single source of truth.
 
 ## Design
 
-The test model is split into three layers:
+The test model is split into four layers:
 
 - Marker metadata: `op` and `mode`
+- Optional debug marker: `precision_debug`
 - Runtime controls: `--op`, `--mode`, `--npu-device`
 - Test matrix: `dtype`, shapes, and other case dimensions via `@pytest.mark.parametrize(...)`
 
 That split is intentional:
 
-- `op` and `mode` are the only custom markers. They define what a test is.
+- `op` and `mode` define what a test is.
+- `precision_debug` is an opt-in debug switch for mismatch reporting.
 - CLI options only choose which tests to run and which NPU device to use.
 - Data-oriented coverage stays inside the test matrix instead of growing more CLI flags.
 
 ## Marker Rules
 
-Only two custom markers are valid:
+Three custom markers are valid:
 
 - `@pytest.mark.op("<real-op>")`
 - `@pytest.mark.mode("<mode>")`
+- `@pytest.mark.precision_debug`
 
 Use file-level `pytestmark` when a whole file shares the same metadata:
 
@@ -55,6 +58,9 @@ def test_copy_release():
 The closest marker wins. In practice that means a test-level marker overrides a
 file-level marker of the same kind.
 
+Use `precision_debug` only on focused repro tests. It is not part of the test
+identity and should not be used as a broad file-level marker.
+
 ## Runtime Rules
 
 `--npu-device` is the only supported device selector in this pytest layer. The
@@ -69,6 +75,10 @@ warning so the remapping is visible in the test output.
 `with ascend_mode(...)`. The pytest runtime reads the closest `mode` marker and
 applies `ascend_mode(mode)` automatically around the test body.
 
+`precision_debug` is also marker-driven. When present, pytest sets
+`TL_PREC_DEBUG=1` around that test so `testcommon.assert_close(...)` emits a
+report directory under `.precision_debug/` on mismatch.
+
 ## Test Matrix Rules
 
 Use `@pytest.mark.parametrize(...)` for case dimensions such as:
@@ -104,12 +114,13 @@ def test_copy_shape(M, N, block_M, block_N, in_dtype, out_dtype):
 
 ## Contributor Rules
 
-- Do not add custom markers beyond `op` and `mode`.
+- Do not add custom markers beyond `op`, `mode`, and `precision_debug`.
 - Do not add `@pytest.mark.dtype(...)`.
 - Do not add folder-category markers such as `@pytest.mark.memory`.
 - Do not call `torch.npu.set_device(...)` inside tests.
 - Prefer file-level `pytestmark` for shared `op` / `mode`.
 - Use test-level markers only when a file intentionally needs overrides.
+- Use `precision_debug` sparingly for mismatch-focused repros.
 - Keep compile and execution work inside test functions, not at module import time.
 
 The directory name remains the category signal for humans. For example,
@@ -136,6 +147,9 @@ pytest testing/npuir --mode=Developer
 
 # combined selection
 pytest testing/npuir --op=copy --mode=Developer --npu-device=0
+
+# focused precision report for one repro
+TL_PREC_DEBUG=1 pytest testing/npuir/memory_ops/test_copy_shape_dev.py
 ```
 
 `--op` and `--mode` accept comma-separated values.
@@ -146,6 +160,7 @@ pytest testing/npuir --op=copy --mode=Developer --npu-device=0
 - `--mode` matches the closest `@pytest.mark.mode(...)`
 - `--npu-device` sets the session device before tests execute
 - out-of-range `--npu-device` values are remapped with modulo and reported as warnings
+- `@pytest.mark.precision_debug` enables detailed mismatch reports for that test
 
 Tests without a matching marker are excluded when that selector is provided.
 
@@ -169,3 +184,26 @@ CASES = [
 def test_copy_shape_dev(M, N, block_M, block_N, dtype):
     ...
 ```
+
+## Precision Debug
+
+Use `from testcommon import assert_close` for NPUIR tests. On mismatch, the
+precision debug tool can emit:
+
+- `report.txt`
+- `diff_map.txt`
+- serialized `actual.pt` and `expected.pt`
+
+Enable it in one of two ways:
+
+```python
+import pytest
+
+@pytest.mark.precision_debug
+def test_copy_debug():
+    ...
+```
+
+```bash
+TL_PREC_DEBUG=1 pytest testing/npuir/test_xxx.py
+```
diff --git a/testing/npuir/conftest.py b/testing/npuir/conftest.py
@@ -21,6 +21,13 @@ def _get_npu_device_id(config: pytest.Config) -> tuple[int, Optional[str]]:
     return resolve_npu_device_id(config.getoption("--npu-device"))
 
 
+def pytest_configure(config):
+    config.addinivalue_line(
+        "markers",
+        "precision_debug: enable tilelang advanced precision debugging for this test",
+    )
+
+
 def pytest_addoption(parser):
     parser.addoption(
         "--op",
@@ -116,3 +123,12 @@ def _apply_mode_marker(request: pytest.FixtureRequest):
 
     with ascend_mode(mode):
         yield
+
+
+@pytest.fixture(autouse=True)
+def _apply_precision_debug_marker(
+    request: pytest.FixtureRequest, monkeypatch: pytest.MonkeyPatch
+):
+    if request.node.get_closest_marker("precision_debug") is not None:
+        monkeypatch.setenv("TL_PREC_DEBUG", "1")
+    yield
diff --git a/testing/npuir/pytest.ini b/testing/npuir/pytest.ini
@@ -4,3 +4,4 @@ addopts = -ra --strict-markers
 markers =
     op(name): Real operation identity, used by --op.
     mode(name): Runtime Ascend mode, used by --mode and applied automatically during test execution.
+    precision_debug: Enable detailed precision mismatch reporting for a focused test.
diff --git a/testing/npuir/test_precision_debug_smoke.py b/testing/npuir/test_precision_debug_smoke.py
@@ -0,0 +1,79 @@
+import pathlib
+
+import pytest
+import torch
+
+from testcommon import assert_close
+from tilelang.utils import prec_assert_close
+
+
+pytestmark = [
+    pytest.mark.op("precision_debug"),
+    pytest.mark.mode("Developer"),
+]
+
+
+def _single_run_dir(root: pathlib.Path) -> pathlib.Path:
+    runs = [path for path in root.iterdir() if path.is_dir()]
+    assert len(runs) == 1
+    return runs[0]
+
+
+def test_prec_assert_close_exported_from_tilelang_utils():
+    from tilelang.utils.precision_debug import prec_assert_close as direct_import
+
+    assert prec_assert_close is direct_import
+
+
+def test_assert_close_without_debug_does_not_write_reports(
+    tmp_path: pathlib.Path, monkeypatch: pytest.MonkeyPatch
+):
+    monkeypatch.chdir(tmp_path)
+    actual = torch.tensor([0.0, 1.0], dtype=torch.float32)
+    expected = torch.tensor([0.0, 0.0], dtype=torch.float32)
+
+    with pytest.raises(AssertionError):
+        assert_close(actual, expected, dtype="float32")
+
+    assert not (tmp_path / ".precision_debug").exists()
+
+
+@pytest.mark.precision_debug
+def test_precision_debug_marker_enables_report_output(
+    tmp_path: pathlib.Path, monkeypatch: pytest.MonkeyPatch
+):
+    monkeypatch.chdir(tmp_path)
+    actual = torch.tensor([0.0, 1.0], dtype=torch.float32)
+    expected = torch.tensor([0.0, 0.0], dtype=torch.float32)
+
+    with pytest.raises(AssertionError):
+        assert_close(actual, expected, dtype="float32")
+
+    run_dir = _single_run_dir(tmp_path / ".precision_debug")
+    assert (run_dir / "report.txt").is_file()
+    assert (run_dir / "diff_map.txt").is_file()
+    assert (run_dir / "actual.pt").is_file()
+    assert (run_dir / "expected.pt").is_file()
+
+
+def test_prec_assert_close_respects_equal_nan_in_report(
+    tmp_path: pathlib.Path,
+):
+    output_dir = tmp_path / "precision_debug_manual"
+    actual = torch.tensor([float("nan")], dtype=torch.float32)
+    expected = torch.tensor([float("nan")], dtype=torch.float32)
+
+    with pytest.raises(AssertionError):
+        prec_assert_close(
+            actual,
+            expected,
+            output_dir=str(output_dir),
+            save_tensors=False,
+            print_map=False,
+            equal_nan=False,
+        )
+
+    report_text = (_single_run_dir(output_dir) / "report.txt").read_text(
+        encoding="utf-8"
+    )
+    assert "Mismatched: 1 (100.0000%)" in report_text
Original file line number	Diff line number	Diff line change
Expand Up		@@ -35,3 +35,4 @@ Before answering, follow AGENTS.md section "Docs Auto Routing Rules (Mandatory)"

		- tilelang-mlir-skill
		- tilelang-error-fixer
		- tilelang-precision-debug-skill