
test: refresh qwen3 decode pypto cases#683

Merged
zhangstevenunity merged 1 commit into hw-native-sys:main from HecreReed:codex/qwen3-decode-refresh on May 20, 2026

@HecreReed
Collaborator

Summary

  • refresh test/samples/Qwen3DecodeA3 and test/samples/Qwen3DecodeA5 from the latest pypto-lib/models/qwen3/32b/qwen3_32b_decode.py output (05127d2)
  • replace the old 17-case qwen3_decode_incore_* layout with the current 14-fragment raw PTO topology (rmsnorm, rope_kv_cache, out_proj_residual, post_rmsnorm, down_proj_residual, plus the remaining qwen3_decode_incore_* kernels)
  • drop the stale per-case custom golden helpers that only matched the old 17-case topology, and remove the now-dead Qwen-specific scalar defaults from generate_testcase.py
  • update the Qwen README notes to document the new source path/topology and the follow-up needed to regenerate board-validation goldens

Validation

  • PTOAS_BIN=/Users/laoda/pto/PTOAS-qwen3-refresh/build/tools/ptoas/ptoas PTOAS_OUT_DIR=/private/tmp/qwen3_refresh_runop_a3 bash test/samples/runop.sh -t Qwen3DecodeA3
  • PTOAS_BIN=/Users/laoda/pto/PTOAS-qwen3-refresh/build/tools/ptoas/ptoas PTOAS_OUT_DIR=/private/tmp/qwen3_refresh_runop_a5 bash test/samples/runop.sh -t Qwen3DecodeA5

Notes

  • This PR intentionally focuses on refreshing the compile-regression PTO inputs.
  • The old board-validation Python goldens were tied to the removed 17-kernel split and need to be regenerated separately for the new 14-fragment topology.

@gemini-code-assist

Note

The number of changes in this pull request is too large for Gemini Code Assist to generate a review.

@reedhecre

reedhecre commented May 19, 2026

Codex Review

This comment is updated automatically by the review bot.

  • PR: test: refresh qwen3 decode pypto cases #683
  • Author: HecreReed
  • Base/Head: main / codex/qwen3-decode-refresh
  • Head SHA: e77168afddd0
  • Trigger: new commits pushed to the PR
  • Generated At: 2026-05-20T06:53:11Z
  • Previous Head SHA: 39cd66e406f6
  • Status: completed

Summary

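The Qwen3 decode goldens refreshed in PR #683 do not match the scalar/launch contract of the existing validation harness, which will cause several samples to fail during golden generation or board-side validation.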

Findings

  1. P1 The refreshed Qwen3 goldens still read scalars only as `int32_t`, so several cases will crash during golden generation test/samples/Qwen3DecodeA3/qwen3_decode_golden_lib.py:177

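run_case() still obtains ints through load_int32_assignments(), and that entry point only extracts literal int32_t assignments from the generated main.cpp. After this refresh, several Qwen3 kernels have changed their runtime scalars to index (which the host side emits as size_t), for example rope_kv_cache, qwen3_decode_incore_4/6, out_proj_residual, and down_proj_residual. These builders then immediately unpack with statements such as batch_index, pos = ints[:2] or out_block = ints[0], so ints will be an empty or too-short list. Golden generation will then raise ValueError/IndexError, and board validation and the sample runtime CI will fail outright. The A5 mirror file has the same problem.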

  2. P1 The golden still treats the selection dimension behind `__pypto_spmd_block_idx` as a host scalar, which does not match the new kernel ABI test/samples/Qwen3DecodeA3/qwen3_decode_golden_lib.py:243

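Several refreshed builders still read pair_index or local_block from ints (e.g. build_qk_matmul, build_softmax, build_sv_matmul, build_online_softmax, _build_group_proj, build_silu), but the new .pto files have turned this kind of selection dimension into the implicit __pypto_spmd_block_idx. For example, qwen3_decode_incore_4.pto uses the block idx to compute gi0/gi1, and qwen3_decode_incore_10.pto uses %arg3 + block_idx to select the MLP block; these values never appear in the generated host argument list. So even if scalar parsing is widened from int32_t to size_t, the golden still cannot obtain pair/block selections consistent with the actual launch. It will either keep crashing on too few arguments or validate against the wrong output slice. The A5 mirror file is affected in the same way.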
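A minimal sketch of the two fixes these findings point toward, under stated assumptions. First, it assumes the generated main.cpp declares host scalars as plain `int32_t name = 123;` or `size_t name = 123;` statements. Second, it assumes a golden that must follow `__pypto_spmd_block_idx` iterates over block indices itself instead of reading the selection from `ints`. `load_scalar_assignments` and `golden_over_spmd_blocks` are hypothetical names, not the harness's API. The real slicing rules would have to come from the .pto files.

```python
import re
from typing import Callable

import numpy as np

# Assumed declaration shape in the generated main.cpp, e.g. "size_t pos = 7;".
# \b keeps "uint32_t" / "ssize_t" from matching as "int32_t" / "size_t".
_SCALAR_DECL = re.compile(r"\b(?:int32_t|int64_t|size_t)\s+(\w+)\s*=\s*(-?\d+)\s*;")


def load_scalar_assignments(cpp_source: str) -> list[int]:
    """Return literal scalar values in declaration order (int32_t and size_t)."""
    return [int(value) for _name, value in _SCALAR_DECL.findall(cpp_source)]


def golden_over_spmd_blocks(num_blocks: int, compute_block: Callable[[int], np.ndarray]) -> np.ndarray:
    """Emulate an SPMD launch: block i contributes its own slab, stacked in block order.

    The selection (pair/group/MLP block) is derived from block_idx inside
    compute_block and is never read from the host scalar list.
    """
    return np.concatenate([compute_block(b) for b in range(num_blocks)], axis=0)


# Usage: one host scalar (standing in for %arg3) plus an implicit block index.
src = "size_t mlp_base = 2;\n"
(base,) = load_scalar_assignments(src)
mlp_tiles = np.arange(6 * 2 * 3, dtype=np.float32).reshape(6, 2, 3)  # 6 tiles of 2x3
golden = golden_over_spmd_blocks(3, lambda block_idx: mlp_tiles[base + block_idx])
```

Widening the scalar scan alone only addresses the first finding. The block loop is what keeps the golden's selection aligned with what each launched block actually computes.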

@HecreReed
Collaborator Author

/run a3 down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache --pto-level=level3

@reedhecre

Received /run a3 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3; the A3 board tester will process this request.

  • Progress page: http://154.9.227.233/ptoas-board-dashboard/#board-a3
  • Current status: the board tester is idle; this request will start in the current polling round.
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3

The page refreshes automatically and shows the current stage, the queue, and the latest results.

@reedhecre

A3 board test passed

  • Trigger: manual
  • Source commit: 1760318854a4
  • Result summary: OK 14 / FAIL 0 / SKIP 0
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260519_140706_manual_pr683.log
  • Result TSV: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260519_140706_manual_pr683.tsv
  • Manual command: /run a3 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3
  • Triggered by: HecreReed
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: test: refresh qwen3 decode pypto cases #683 (comment)

@HecreReed
Collaborator Author

/run a5 down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache --pto-level=level3

@reedhecre

Received /run a5 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3; the A5 board tester will process this request.

  • Progress page: http://154.9.227.233/ptoas-board-dashboard/#board-a5
  • Current status: the board tester is idle; this request will start in the current polling round.
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3

The page refreshes automatically and shows the current stage, the queue, and the latest results.

@reedhecre

A5 board test passed

  • Trigger: manual
  • Source commit: 1760318854a4
  • Result summary: OK 14 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260519_141405_manual_pr683.log
  • Result TSV: /root/ptoas-board-monitor-a5/logs/20260519_141405_manual_pr683.tsv
  • Manual command: /run a5 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3
  • Triggered by: HecreReed
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: test: refresh qwen3 decode pypto cases #683 (comment)

@HecreReed
Collaborator Author

/run a3 down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache --pto-level=level3

@reedhecre

Received /run a3 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3; the A3 board tester will process this request.

  • Progress page: http://154.9.227.233/ptoas-board-dashboard/#board-a3
  • Current status: the board tester is idle; this request will start in the current polling round.
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3

The page refreshes automatically and shows the current stage, the queue, and the latest results.

@reedhecre

A3 board test passed

  • Trigger: manual
  • Source commit: 09cd2d2f2717
  • Result summary: OK 14 / FAIL 0 / SKIP 0
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260519_145005_manual_pr683.log
  • Result TSV: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260519_145005_manual_pr683.tsv
  • Manual command: /run a3 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3
  • Triggered by: HecreReed
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: test: refresh qwen3 decode pypto cases #683 (comment)

@HecreReed
Collaborator Author

/run a5 down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache --pto-level=level3

@reedhecre

Received /run a5 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3; the A5 board tester will process this request.

  • Progress page: http://154.9.227.233/ptoas-board-dashboard/#board-a5
  • Current status: the board tester is idle; this request will start in the current polling round.
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3

The page refreshes automatically and shows the current stage, the queue, and the latest results.

@reedhecre

A5 board test passed

  • Trigger: manual
  • Source commit: 09cd2d2f2717
  • Result summary: OK 14 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260519_145606_manual_pr683.log
  • Result TSV: /root/ptoas-board-monitor-a5/logs/20260519_145606_manual_pr683.tsv
  • Manual command: /run a5 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3
  • Triggered by: HecreReed
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: test: refresh qwen3 decode pypto cases #683 (comment)

@HecreReed
Collaborator Author

/review

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Received /review; running the Codex review.

Findings

Review in progress.

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Qwen3DecodeA5 residual golden helpers still assume an A3-style fifth GM buffer, so both refreshed A5 residual validation cases fail at runtime.

Findings

  1. P1 A5 out_proj_residual golden still expects a removed fifth buffer test/samples/Qwen3DecodeA5/qwen3_decode_golden_lib.py:389

build_out_proj_residual still allocates v5 via meta.elem_counts["v5"], but the refreshed A5 kernel signature in test/samples/Qwen3DecodeA5/out_proj_residual.pto:2 only has four GM pointers (%arg0..%arg3) plus one index scalar. The generated custom golden therefore has no v5 entry in CaseMeta, and golden.py will fail immediately with a KeyError instead of producing inputs/golden data. That blocks A5 board-validation for this new case.

  2. P1 A5 down_proj_residual golden still expects a removed fifth buffer test/samples/Qwen3DecodeA5/qwen3_decode_golden_lib.py:487

build_down_proj_residual has the same stale A3 assumption: it reads meta.elem_counts["v5"], but the refreshed A5 kernel in test/samples/Qwen3DecodeA5/down_proj_residual.pto:2 exposes only four GM pointers and one index scalar. As a result, custom golden generation for down_proj_residual will crash before writing any testcase artifacts, so this A5 validation case cannot run.
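Builders can make this class of mismatch fail with a clearer message by checking the buffers they expect against the case metadata up front. The sketch below rests on several assumptions. The `CaseMeta` shape (a dataclass holding an `elem_counts` dict) is inferred only from the `meta.elem_counts["v5"]` access described above. The `v1..v4` naming for the four GM pointers `%arg0..%arg3` is also an assumption. `require_buffers` is a hypothetical helper, not part of the harness.

```python
from dataclasses import dataclass, field


@dataclass
class CaseMeta:  # hypothetical stand-in for the harness's metadata object
    elem_counts: dict[str, int] = field(default_factory=dict)


def require_buffers(meta: CaseMeta, names: list[str], case: str) -> dict[str, int]:
    """Return element counts for `names`, or raise one descriptive error listing all gaps."""
    missing = [n for n in names if n not in meta.elem_counts]
    if missing:
        raise KeyError(
            f"{case}: kernel signature has no buffer(s) {missing}; "
            f"available: {sorted(meta.elem_counts)}"
        )
    return {n: meta.elem_counts[n] for n in names}


# A refreshed A5 residual kernel with four GM pointers (%arg0..%arg3).
meta = CaseMeta({"v1": 256, "v2": 256, "v3": 256, "v4": 256})
counts = require_buffers(meta, ["v1", "v2", "v3", "v4"], "out_proj_residual")
# Asking for "v5" here would raise a KeyError that names the case and the missing buffer.
```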

@HecreReed
Collaborator Author

/review

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Received /review; running the Codex review.

Findings

Review in progress.

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

AIV goldens miss the second subblock in the new residual cases.

Findings

  1. P1 AIV residual goldens only cover one subblock test/samples/Qwen3DecodeA3/qwen3_decode_golden_lib.py:382

build_out_proj_residual and build_down_proj_residual only compute one 8-row AIV slice per out_block, but the corresponding *_aiv kernels use pto.get_subblock_idx and write two subblocks. That leaves rows 8-15 of the 16-row output zero, so the new A3/A5 validation cases will compare against incomplete goldens.
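Below is a sketch of a golden that covers both AIV subblocks. It assumes each `out_block` produces a 16-row tile split into two 8-row halves, selected by a subblock index that mirrors `pto.get_subblock_idx`. `compute_rows` is a hypothetical callback that stands in for the actual per-subblock residual math.

```python
import numpy as np

ROWS_PER_SUBBLOCK = 8
NUM_SUBBLOCKS = 2  # 2 x 8 = 16 output rows per out_block


def residual_tile_golden(compute_rows, cols: int) -> np.ndarray:
    """Fill every subblock's rows of one out_block tile."""
    tile = np.zeros((ROWS_PER_SUBBLOCK * NUM_SUBBLOCKS, cols), dtype=np.float32)
    for subblock_idx in range(NUM_SUBBLOCKS):
        lo = subblock_idx * ROWS_PER_SUBBLOCK
        hi = lo + ROWS_PER_SUBBLOCK
        # Computing only subblock 0 would leave rows 8-15 at zero,
        # which is the incomplete golden described in the finding.
        tile[lo:hi] = compute_rows(lo, hi)
    return tile


# Usage: each row's value is its row index + 1, broadcast across columns.
tile = residual_tile_golden(lambda lo, hi: np.arange(lo, hi, dtype=np.float32)[:, None] + 1.0, cols=4)
```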

@HecreReed
Collaborator Author

/review

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Received /review; running the Codex review.

Findings

Review in progress.

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Found 2 P1 issues: the A3/A5 rope_kv_cache custom golden input mapping does not match the PTO argument order, so it generates incorrect goldens.

Findings

  1. P1 The `rope_kv_cache` A3 golden wires up the q/k/v and RoPE table inputs incorrectly test/samples/Qwen3DecodeA3/qwen3_decode_golden_lib.py:193

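In test/samples/Qwen3DecodeA3/rope_kv_cache.pto, %arg3..%arg9 are ordered k_proj, cos_lo, sin_lo, cos_hi, sin_hi, v_proj, q_proj, but this code reads v4..v10 as q_proj, cos_lo, cos_hi, sin_lo, sin_hi, k_proj, v_proj. That swaps both the q/k/v projections and the low/high-half sin/cos tables. The generated golden is therefore computed from the wrong inputs, and the rope_kv_cache board-test/CI results can no longer be trusted.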

  2. P1 The `rope_kv_cache` A5 golden wires up the q/k/v and RoPE table inputs incorrectly test/samples/Qwen3DecodeA5/qwen3_decode_golden_lib.py:193

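In test/samples/Qwen3DecodeA5/rope_kv_cache.pto, %arg3..%arg9 are likewise ordered k_proj, cos_lo, sin_lo, cos_hi, sin_hi, v_proj, q_proj, but this code also reads v4..v10 as q_proj, cos_lo, cos_hi, sin_lo, sin_hi, k_proj, v_proj. As a result, the golden computes its output with the wrong projection tensors and the wrong rotary position embedding tables, so the A5 rope_kv_cache validation results will be systematically wrong.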
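The golden's buffer mapping can be derived from the PTO argument order instead of being written by hand. This sketch uses the order quoted in the findings (`%arg3..%arg9` = k_proj, cos_lo, sin_lo, cos_hi, sin_hi, v_proj, q_proj). It also relies on the `%argN` to `v{N+1}` correspondence implied by `%arg3..%arg9` lining up with `v4..v10`. `ROPE_KV_CACHE_INPUTS` and `buffer_keys` are hypothetical names.

```python
# Order of %arg3..%arg9 in rope_kv_cache.pto, as quoted in the findings.
ROPE_KV_CACHE_INPUTS = ["k_proj", "cos_lo", "sin_lo", "cos_hi", "sin_hi", "v_proj", "q_proj"]
FIRST_INPUT_ARG = 3  # %arg3


def buffer_keys(inputs: list[str], first_arg: int) -> dict[str, str]:
    """Map each logical input to its golden buffer key, assuming %argN <-> v(N+1)."""
    return {name: f"v{first_arg + i + 1}" for i, name in enumerate(inputs)}


KEYS = buffer_keys(ROPE_KV_CACHE_INPUTS, FIRST_INPUT_ARG)
# A builder would then read buffers[KEYS["q_proj"]] (v10) instead of assuming v4 is q_proj.
```

Keeping the order in one list means a future reordering of the kernel signature needs a single edit, rather than a hunt through hard-coded `v4..v10` reads in both the A3 and A5 libraries.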

@HecreReed
Collaborator Author

/review

@HecreReed HecreReed marked this pull request as ready for review May 20, 2026 02:05

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3fd62307f6

exp_scores = np.exp(scores - cur_mi)
exp_scores_bf16 = float32_to_bf16(exp_scores)
cur_li = np.sum(bf16_to_float32(exp_scores_bf16), axis=1, keepdims=True)
out_exp = _store_group_scores(out_exp, exp_scores_bf16, group_index=gi, sb=sb, rows_per_group=Q_HEAD_BATCH, cols=SEQ_TILE)

P1 Use padded row stride when storing exp scores

build_softmax stores exp_scores_bf16 into out_exp with rows_per_group=Q_HEAD_BATCH (8), but the kernel layout for all_exp_padded is padded to 16 rows per group (Q_HEAD_PAD) and computes offsets with sb * 16 for each context block. This packs scores too tightly, so values for the second head group (and all later blocks) land at the wrong rows, producing incorrect golden data for qwen3_decode_incore_5 (same issue is mirrored in the A5 golden lib).

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Received /review; running the Codex review.

Findings

Review in progress.

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

PR #683 has a P1 golden-layout bug in the refreshed Qwen3 softmax case.

Findings

  1. P1 `qwen3_decode_incore_5` golden writes the exp buffer with the wrong row stride test/samples/Qwen3DecodeA3/qwen3_decode_golden_lib.py:296

build_softmax() stores out_exp with rows_per_group=Q_HEAD_BATCH (8), but the refreshed kernels lay out all_exp_padded with a 16-row stride per Q group. In qwen3_decode_incore_5.pto, the exp tiles are written at gi * 256 + sb * 16 rows, so the 8 active rows must leave an 8-row padded gap. The new golden compacts them instead, so qwen3_decode_incore_5_golden.py will generate a different v3 buffer than the kernel writes, causing the refreshed softmax case to fail on both A3 and A5. The same bug is duplicated at test/samples/Qwen3DecodeA5/qwen3_decode_golden_lib.py:296.
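Here is a minimal sketch of the padded store the finding asks for. It assumes the layout described above: group `gi` starts at row `gi * 256`, context block `sb` sits at `sb * 16` within the group, and only the first `Q_HEAD_BATCH = 8` of those 16 rows are written. `store_group_scores_padded` is a hypothetical name. It is not the existing `_store_group_scores`.

```python
import numpy as np

Q_HEAD_BATCH = 8        # active rows per Q group
Q_HEAD_PAD = 16         # padded rows per context block
GROUP_ROW_STRIDE = 256  # rows per Q group, from "gi * 256 + sb * 16"


def store_group_scores_padded(out: np.ndarray, scores: np.ndarray, gi: int, sb: int) -> np.ndarray:
    """Write an (8, cols) score tile at its padded row offset; the 8 pad rows stay untouched."""
    row0 = gi * GROUP_ROW_STRIDE + sb * Q_HEAD_PAD
    out = out.copy()
    out[row0:row0 + Q_HEAD_BATCH, :scores.shape[1]] = scores
    return out


# Usage: group 1, context block 2 lands at row 256 + 32 = 288; rows 296..303 remain padding.
result = store_group_scores_padded(
    np.zeros((512, 4), dtype=np.float32), np.ones((8, 4), dtype=np.float32), gi=1, sb=2
)
```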

@HecreReed
Collaborator Author

/review

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Received /review; running the Codex review.

Findings

Review in progress.

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Review failed at stage codex-review: exit=1

Findings

No structured findings were generated because the review process failed early.

Log Tail

 .../Qwen3DecodeA5/qwen3_decode_incore_14.pto       |  41 --
 .../Qwen3DecodeA5/qwen3_decode_incore_14_golden.py |  14 -
 .../Qwen3DecodeA5/qwen3_decode_incore_15.pto       |  47 --
 .../Qwen3DecodeA5/qwen3_decode_incore_15_golden.py |  14 -
 .../Qwen3DecodeA5/qwen3_decode_incore_16.pto       |  28 -
 .../Qwen3DecodeA5/qwen3_decode_incore_16_golden.py |  14 -
 .../Qwen3DecodeA5/qwen3_decode_incore_2.pto        | 204 ++++--
 .../Qwen3DecodeA5/qwen3_decode_incore_3.pto        |  25 -
 .../Qwen3DecodeA5/qwen3_decode_incore_3_golden.py  |  14 -
 .../Qwen3DecodeA5/qwen3_decode_incore_4.pto        | 225 +++---
 .../Qwen3DecodeA5/qwen3_decode_incore_5.pto        | 150 ++--
 .../Qwen3DecodeA5/qwen3_decode_incore_6.pto        | 160 +++--
 .../Qwen3DecodeA5/qwen3_decode_incore_7.pto        | 265 +++++--
 .../Qwen3DecodeA5/qwen3_decode_incore_8.pto        | 112 ---
 .../Qwen3DecodeA5/qwen3_decode_incore_8_golden.py  |  14 -
 .../Qwen3DecodeA5/qwen3_decode_incore_9.pto        |  45 --
 .../Qwen3DecodeA5/qwen3_decode_incore_9_golden.py  |  14 -
 test/samples/Qwen3DecodeA5/rmsnorm.pto             | 181 +++++
 test/samples/Qwen3DecodeA5/rmsnorm_golden.py       |  14 +
 test/samples/Qwen3DecodeA5/rope_kv_cache.pto       | 142 ++++
 test/samples/Qwen3DecodeA5/rope_kv_cache_golden.py |  14 +
 71 files changed, 4148 insertions(+), 2883 deletions(-)
===== END STAGE clone rc=0 @ 2026-05-20 11:10:06 =====

===== STAGE codex-review @ 2026-05-20 11:10:06 =====
set -euo pipefail
cd '/tmp/ptoas-pr-review-monitor/runs/20260520_111002_pr683/repo'
'codex' exec -C '/tmp/ptoas-pr-review-monitor/runs/20260520_111002_pr683/repo' -s read-only -c 'model_provider="codereview"' -c 'model="gpt-5.4"' -c 'model_reasoning_effort="xhigh"' --output-schema '/tmp/ptoas-pr-review-monitor/runs/20260520_111002_pr683/review_schema.json' -o '/tmp/ptoas-pr-review-monitor/runs/20260520_111002_pr683/codex_last_message.json' --color never - < '/tmp/ptoas-pr-review-monitor/runs/20260520_111002_pr683/review_prompt.txt'
OpenAI Codex v0.115.0 (research preview)
--------
workdir: /tmp/ptoas-pr-review-monitor/runs/20260520_111002_pr683/repo
model: gpt-5.4
provider: codereview
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019e435c-aadc-7363-b950-177d42165fdc
--------
user
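You are now reviewing a GitHub PR.

Repository: hw-native-sys/PTOAS
PR: #683 test: refresh qwen3 decode pypto cases
Author: HecreReed
base branch: origin/main
head branch: HEAD (the PR head is currently checked out)

Requirements:
1. Review only this PR's changes relative to origin/main; look at context files when necessary.
2. Focus on real correctness / regression / contract mismatch / CI / runtime / compatibility problems.
3. Do not make purely stylistic suggestions or low-value speculation.
4. Report strictly by priority:
   - P1: highly likely to cause wrong results, compile/run failures, serious regressions, or release blockers
   - P2: significant defects, behavioral regressions, missing checks/tests, major compatibility problems
   - P3: minor but clearly fixable problems
5. If there are no problems, write the summary as: No issues found in PR #683, and return findings=[].
6. If there are problems, summarize them concisely, and give for each finding:
   - severity
   - title
   - body (explain why it is a problem, as specifically as possible)
   - file (a relative path where possible)
   - line (an integer if it can be determined, otherwise null)

Suggested first steps:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

The final output must strictly match the JSON schema.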

mcp startup: no servers
Reconnecting... 1/5 (unexpected status 502 Bad Gateway: Upstream access forbidden, please contact administrator, url: https://codex.0u0o.com/responses, request id: 0203384b-633a-4087-912f-ae39eff44a14)
Reconnecting... 2/5 (unexpected status 502 Bad Gateway: Upstream access forbidden, please contact administrator, url: https://codex.0u0o.com/responses, request id: b80d19a4-f8a0-4992-b348-641aacfec6de)
Reconnecting... 3/5 (unexpected status 502 Bad Gateway: Upstream access forbidden, please contact administrator, url: https://codex.0u0o.com/responses, request id: 2f447b7c-4334-46bd-82ce-29121de2b50a)
Reconnecting... 4/5 (unexpected status 502 Bad Gateway: Upstream access forbidden, please contact administrator, url: https://codex.0u0o.com/responses, request id: 10539d41-7e36-47b0-80a2-735c042cd00b)
Reconnecting... 5/5 (unexpected status 502 Bad Gateway: Upstream access forbidden, please contact administrator, url: https://codex.0u0o.com/responses, request id: dbafac60-ec9b-47e9-92a7-8c68fd22739c)
ERROR: unexpected status 502 Bad Gateway: Upstream access forbidden, please contact administrator, url: https://codex.0u0o.com/responses, request id: ef7712fb-d0f9-4cf5-8963-03cc7fb7aa21
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260520_111002_pr683/codex_last_message.json
===== END STAGE codex-review rc=1 @ 2026-05-20 11:11:23 =====

@HecreReed
Collaborator Author

/review

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

Received /review; running the Codex review.

Findings

Review in progress.

@HecreReed
Collaborator Author

/run a5 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3

@reedhecre

Received /run a5 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3; the A5 board tester will process this request.

  • Progress page: http://154.9.227.233/ptoas-board-dashboard/#board-a5
  • Current status: the board tester is idle; this request will start in the current polling round.
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3

The page refreshes automatically and shows the current stage, the queue, and the latest results.

@reedhecre

A5 board test passed

  • Trigger: manual
  • Source commit: aa347cdfab0d
  • Result summary: OK 14 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260520_133006_manual_pr683.log
  • Result TSV: /root/ptoas-board-monitor-a5/logs/20260520_133006_manual_pr683.tsv
  • Manual command: /run a5 down_proj_residual out_proj_residual post_rmsnorm qwen3_decode_incore_0 qwen3_decode_incore_1 qwen3_decode_incore_2 qwen3_decode_incore_3 qwen3_decode_incore_4 qwen3_decode_incore_5 qwen3_decode_incore_6 qwen3_decode_incore_7 qwen3_decode_incore_8 qwen3_decode_incore_9 qwen3_decode_incore_10 qwen3_decode_incore_11 qwen3_decode_incore_12 qwen3_decode_incore_13 qwen3_decode_incore_14 qwen3_decode_incore_15 qwen3_decode_incore_16 rmsnorm rope_kv_cache --pto-level=level3
  • Triggered by: HecreReed
  • Specified cases: down_proj_residual,out_proj_residual,post_rmsnorm,qwen3_decode_incore_0,qwen3_decode_incore_1,qwen3_decode_incore_2,qwen3_decode_incore_3,qwen3_decode_incore_4,qwen3_decode_incore_5,qwen3_decode_incore_6,qwen3_decode_incore_7,qwen3_decode_incore_8,qwen3_decode_incore_9,qwen3_decode_incore_10,qwen3_decode_incore_11,qwen3_decode_incore_12,qwen3_decode_incore_13,qwen3_decode_incore_14,qwen3_decode_incore_15,qwen3_decode_incore_16,rmsnorm,rope_kv_cache
  • PTOAS arguments: --pto-level=level3
  • Triggering comment: test: refresh qwen3 decode pypto cases #683 (comment)

@reedhecre

Manual Codex Review

This comment was triggered manually by /review.

Summary

No issues found in PR #683

Findings

No issues found.

@HecreReed HecreReed force-pushed the codex/qwen3-decode-refresh branch from 39cd66e to e77168a Compare May 20, 2026 06:17
@zhangstevenunity zhangstevenunity merged commit 16b2041 into hw-native-sys:main May 20, 2026
18 of 19 checks passed
@reedhecre

A5 board test passed

  • Trigger: merged
  • Source commit: 16b2041f842c
  • Result summary: OK 14 / FAIL 0 / SKIP 0
  • Log: /root/ptoas-board-monitor-a5/logs/20260520_170906_merged_pr683.log
  • Result TSV: /root/ptoas-board-monitor-a5/logs/20260520_170906_merged_pr683.tsv

@reedhecre

A3 board test failed

  • Trigger: merged
  • Source commit: 16b2041f842c
  • Result summary: OK 205 / FAIL 4 / SKIP 2
  • Log: /home/zhongxuan/ptoas-board-monitor/runtime/logs/20260520_170904_merged_pr683.log
  • Failed stage: board-validation / exit=1

Failed cases

  • rope_kv_cache (run, exit=2)
  • down_proj_residual (run, exit=1)
  • out_proj_residual (run, exit=1)
  • qwen3_decode_incore_5 (run, exit=1)

@reedhecre

A3 board test failure details: PR #683

rope_kv_cache

stage=run info=exit=2

[ERROR] Mismatch (bf16 golden_v1.bin vs v1.bin): max ulp diff=30463 at idx=26720 (golden_bits=48012, out_bits=15218, golden=-0.0042724609375, out=0.003692626953125)
[ERROR] compare failed
[2026-05-20 17:31:54] ERROR: testcase failed (exit 2): rope_kv_cache
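The reported numbers can be cross-checked by decoding the bf16 bit patterns. The sketch below assumes the comparator converts each pattern with the common ordered-integer mapping before subtracting: invert all bits of negative values and set the sign bit of non-negative ones. That mapping reproduces the logged `max ulp diff=30463`, but the harness's actual comparator is not shown in this thread. The function names are illustrative.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    # bf16 is the upper 16 bits of an IEEE-754 float32.
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def bf16_ordered(bits: int) -> int:
    # Map bit patterns to integers that increase monotonically with the float value.
    bits &= 0xFFFF
    return (~bits & 0xFFFF) if bits & 0x8000 else (bits | 0x8000)


def bf16_ulp_diff(a_bits: int, b_bits: int) -> int:
    return abs(bf16_ordered(a_bits) - bf16_ordered(b_bits))


golden_bits, out_bits = 48012, 15218
print(bf16_bits_to_float(golden_bits))       # -0.0042724609375
print(bf16_bits_to_float(out_bits))          # 0.003692626953125
print(bf16_ulp_diff(golden_bits, out_bits))  # 30463
```

The golden and the output have opposite signs at this index, so the large ULP distance reflects a wrong value rather than rounding noise. That fits the input-ordering mismatch reported earlier for rope_kv_cache.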
down_proj_residual

stage=run info=exit=1

Traceback (most recent call last):
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/down_proj_residual/./golden.py", line 14, in <module>
    run_case('down_proj_residual')
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/down_proj_residual/qwen3_decode_golden_lib.py", line 568, in run_case
    buffers, golden = BUILDERS[case_name](meta, generator, ints)
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/down_proj_residual/qwen3_decode_golden_lib.py", line 530, in build_down_proj_residual
    resid = load_strided_2d(
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/down_proj_residual/validation_runtime.py", line 118, in load_strided_2d
    raise ValueError(f'strided load out of bounds: [{start}:{stop}] > {flat.size}')
ValueError: strided load out of bounds: [57600:57856] > 57600
[2026-05-20 17:32:08] ERROR: testcase failed (exit 1): down_proj_residual
out_proj_residual

stage=run info=exit=1

Traceback (most recent call last):
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/out_proj_residual/./golden.py", line 14, in <module>
    run_case('out_proj_residual')
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/out_proj_residual/qwen3_decode_golden_lib.py", line 568, in run_case
    buffers, golden = BUILDERS[case_name](meta, generator, ints)
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/out_proj_residual/qwen3_decode_golden_lib.py", line 418, in build_out_proj_residual
    load_strided_2d(
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/out_proj_residual/validation_runtime.py", line 118, in load_strided_2d
    raise ValueError(f'strided load out of bounds: [{start}:{stop}] > {flat.size}')
ValueError: strided load out of bounds: [57600:57856] > 57600
[2026-05-20 17:32:12] ERROR: testcase failed (exit 1): out_proj_residual
qwen3_decode_incore_5

stage=run info=exit=1

Traceback (most recent call last):
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/qwen3_decode_incore_5/./golden.py", line 14, in <module>
    run_case('qwen3_decode_incore_5')
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/qwen3_decode_incore_5/qwen3_decode_golden_lib.py", line 568, in run_case
    buffers, golden = BUILDERS[case_name](meta, generator, ints)
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/qwen3_decode_incore_5/qwen3_decode_golden_lib.py", line 287, in build_softmax
    scores_valid = load_strided_2d(buffers["v4"], offset=in_offset, rows=Q_HEAD_BATCH, cols=SEQ_TILE, row_stride=SEQ_TILE).astype(np.float32)
  File "/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260520_170904_merged_pr683/npu_validation/Qwen3DecodeA3/qwen3_decode_incore_5/validation_runtime.py", line 118, in load_strided_2d
    raise ValueError(f'strided load out of bounds: [{start}:{stop}] > {flat.size}')
ValueError: strided load out of bounds: [198400:198656] > 198401
[2026-05-20 17:32:54] ERROR: testcase failed (exit 1): qwen3_decode_incore_5
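The following is a hypothetical reconstruction of a bounds-checked strided 2D load, included to help read these tracebacks. Only the call shape (`offset`, `rows`, `cols`, `row_stride`) and the error text come from the traceback. The body is an assumption and does not reproduce the real `validation_runtime.py`.

```python
import numpy as np


def load_strided_2d(buf, offset: int, rows: int, cols: int, row_stride: int) -> np.ndarray:
    """Gather `rows` rows of `cols` elements, each `row_stride` apart, from a flat buffer.

    Hypothetical reconstruction: only the parameters and the error text are
    taken from the traceback.
    """
    flat = np.asarray(buf).reshape(-1)
    out = np.empty((rows, cols), dtype=flat.dtype)
    for r in range(rows):
        start = offset + r * row_stride
        stop = start + cols
        if stop > flat.size:
            raise ValueError(f"strided load out of bounds: [{start}:{stop}] > {flat.size}")
        out[r] = flat[start:stop]
    return out


# Reproduces the residual-case message: a 256-element row that starts exactly
# at the end of a 57600-element buffer.
try:
    load_strided_2d(np.zeros(57600, dtype=np.float32), offset=57600, rows=1, cols=256, row_stride=256)
except ValueError as err:
    print(err)  # strided load out of bounds: [57600:57856] > 57600
```

Under this reading, the residual goldens request a row that begins where the buffer ends. The `qwen3_decode_incore_5` golden overruns a 198401-element buffer by 255 elements. Both failures suggest offset or size assumptions in the goldens that no longer match the buffer sizes derived from the kernels.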

Labels

None yet

Projects

None yet

3 participants