Fix tcmps ST packed dst layout #813
Code Review
This pull request updates the tcmps test cases by introducing a helper function GetDstElemCount in main.cpp to dynamically calculate destination element counts, and adjusts destination tensor shapes and strides in tcmps.pto. The review feedback correctly identifies a potential division-by-zero vulnerability in GetDstElemCount if srcElemSize is zero, as well as an issue with hardcoded byte sizes that could cause buffer overflows for 8-bit types.
static size_t GetDstElemCount(const TestCase &tc) {
    if (tc.srcElemSize == sizeof(float) || tc.srcElemSize == sizeof(int32_t)) {
        const size_t logicalElems = tc.validRows * tc.validCols;
        const size_t repeatElm = 256 / tc.srcElemSize;
        const size_t repeatTimes = (logicalElems + repeatElm - 1) / repeatElm + 1;
        return (repeatTimes / 2) * 16;
    }
    const size_t repeatElm = 256 / tc.srcElemSize;
    const size_t repeatTimes = (tc.validCols + repeatElm - 1) / repeatElm;
    return tc.validRows * repeatTimes * 16;
}
There are two issues in `GetDstElemCount`:
- Division by Zero Risk: If `tc.srcElemSize` is `0`, `256 / tc.srcElemSize` will cause a division-by-zero crash. Adding a defensive check is highly recommended.
- 8-bit Type Support: For 8-bit types (where `tc.srcElemSize == 1`), the bytes per iteration is `32` (as defined in `compare.py` and `gen_data.py`), not `16`. Hardcoding `16` will return an incorrect size (half of what is required) if an 8-bit test case is added in the future, potentially leading to buffer overflows or memory corruption.
static size_t GetDstElemCount(const TestCase &tc) {
    if (tc.srcElemSize == 0) {
        return 0;
    }
    if (tc.srcElemSize == sizeof(float) || tc.srcElemSize == sizeof(int32_t)) {
        const size_t logicalElems = tc.validRows * tc.validCols;
        const size_t repeatElm = 256 / tc.srcElemSize;
        const size_t repeatTimes = (logicalElems + repeatElm - 1) / repeatElm + 1;
        return (repeatTimes / 2) * 16;
    }
    const size_t repeatElm = 256 / tc.srcElemSize;
    const size_t repeatTimes = (tc.validCols + repeatElm - 1) / repeatElm;
    const size_t bytesPerIter = (tc.srcElemSize == 1) ? 32 : 16;
    return tc.validRows * repeatTimes * bytesPerIter;
}
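To make the sizing arithmetic concrete, here is a minimal sketch that evaluates the suggested function at compile time. The `TestCase` struct below is a hypothetical stand-in that models only the three fields the function reads, and the `constexpr` qualifier is added purely so that `static_assert` can check the results (it needs C++14 or later). Only the `f32_256x16` shape comes from this PR; the 16-bit and 8-bit shapes are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the harness's TestCase: only the fields that
// GetDstElemCount reads are modeled here.
struct TestCase {
    std::size_t srcElemSize; // bytes per source element
    std::size_t validRows;
    std::size_t validCols;
};

// The suggested version from the review, made constexpr for compile-time checks.
constexpr std::size_t GetDstElemCount(const TestCase &tc) {
    if (tc.srcElemSize == 0) {
        return 0; // guard: avoids dividing by zero below
    }
    // Number of source elements covered by one 256-byte repeat.
    const std::size_t repeatElm = 256 / tc.srcElemSize;
    if (tc.srcElemSize == sizeof(float) || tc.srcElemSize == sizeof(std::int32_t)) {
        // 32-bit path: repeats are counted over the whole tile.
        const std::size_t logicalElems = tc.validRows * tc.validCols;
        const std::size_t repeatTimes = (logicalElems + repeatElm - 1) / repeatElm + 1;
        return (repeatTimes / 2) * 16;
    }
    // 16-bit and 8-bit path: repeats are counted per row.
    const std::size_t repeatTimes = (tc.validCols + repeatElm - 1) / repeatElm;
    const std::size_t bytesPerIter = (tc.srcElemSize == 1) ? 32 : 16;
    return tc.validRows * repeatTimes * bytesPerIter;
}

// f32_256x16: 4096 elements, 64 per repeat, repeatTimes = 65, (65 / 2) * 16 = 512.
static_assert(GetDstElemCount(TestCase{4, 256, 16}) == 512, "f32 256x16");
// Hypothetical 16-bit 32x128 case: one repeat per row, 32 * 1 * 16 = 512.
static_assert(GetDstElemCount(TestCase{2, 32, 128}) == 512, "16-bit 32x128");
// Hypothetical 8-bit 32x256 case: one repeat per row, 32 * 1 * 32 = 1024.
static_assert(GetDstElemCount(TestCase{1, 32, 256}) == 1024, "8-bit 32x256");
// Zero element size returns 0 instead of crashing.
static_assert(GetDstElemCount(TestCase{0, 1, 1}) == 0, "zero element size");
```

For each of these three shapes the result equals `validRows * validCols / 8`, that is, one output bit per source element. With a hardcoded `16`, the hypothetical 8-bit case would yield 512 instead of 1024, which is the halving the review warns about.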
Codex Review: This comment is updated automatically by the review bot.
Summary: Review failed at stage Findings. No structured findings were generated because the review process failed early.
732f0f1 to 395097e
A3 board test failed
Failed cases
A3 board test failure details: PR #813
rowexpand
cmps
colmin
rowexpandexpdif
colexpandmul
print_scalar
fmod
mrgsort_format2
mrgsort
or
not
sort32
relu
log
shrs
gatherb
colexpand
orchestration_example_kernel_add
vector_example_dag_kernel_add_scalar
paged_attention_example_kernel_pv_matmul
paged_attention_example_kernel_init_inplace
vector_example_dag_kernel_add
paged_attention_example_kernel_online_update
paged_attention_example_kernel_softmax_prepare
orchestration_example_kernel_add_scalar
paged_attention_example_kernel_qk_matmul
orchestration_example_kernel_mul
vector_example_dag_kernel_mul
tinsert_fp
tinsert
partmax
max
dequant
dequant_i8
vadd_pto_pingpong
subview_boxed_dynamic
subview_tsubs
subview
rowexpandmin
addc
muls
subsc
tileSetGetValue
recip
neg
colexpandmax
partadd
fmods
colexpandexpdif
colprod
ors
partarg
rsqrt
sel
and
comm_p2p
tscatter_root_binding
comm_collective
tgather_root_binding
treduce_root_binding
tnotify_atomic_add_binding
comm_p2p_binding_variants
comm_collective_binding_variants
tbroadcast_root_binding
twait_atomic_binding
ttri
shr
rowexpanddiv
rowsum
prelu
rowmax
bitcast_inplace_cvt
reshape
bitcast_dtype_alias
sub
lrelu
rowprod
trap
layout_inference
tpows
plan_memory_bind_tile_alias_liveness
plan_memory_peak_exact_capacity
plan_memory_loop_no_reuse_outer_live
plan_memory_if_yield
plan_memory_loop_in_if
plan_memory_peak_8_overlapping
plan_memory_if_in_loop
plan_memory_fragmentation_hole_fit
plan_memory_for_iter_args_yield
plan_memory_no_reuse_overlap
plan_memory_reuse_sequential
plan_memory_nested_loops
plan_memory_fragmentation_two_holes
rems
min
gather
gather_legacy
rowexpandadd
tget_async_kernel_impl_like
tput_async_kernel_impl_like
xor
vectorAddition
vadd_pto_ir
vadd_validshape_hyper
adds
colexpandsub
matMul
tmatmulk
scalar_ptr
colmax
xors
rowexpandmax
fillpad
fillpad_expand
fillpad_inplace
set_validshape
tprefetch
colexpandmin
dynamic_tail_matmul
shls
partition_view_verify_rank_mismatch_valid
partition_view_verify_valid
extract
extract_fp
addptr
addptr_f16
addptr_chain
addptr_dynamic
ci
maxs
div
shl
cv_region
tpow
divs2
colsum
Matmul_transpose
rowexpandmul
expands
expand
quant_asym
quant
partition5d_dynamic
partition5d
mins
trans
exp
scatter
cmp
gemvbias
gemvacc
gemv
subs
colexpanddiv
partmul
addsc
sparse_attn_test_incore_7
decode_hca_test_incore_54
attention_swa_test_incore_40
decode_swa_test_incore_40
decode_csa_test_incore_81
attention_hca_test_incore_54
attention_csa_test_refresh_incore_81
tensor_view_layout_dn
tensor_view_infer_layout_dn
subc
abs
rowmin
mul
rowexpandsub
rope_kv_cache
qwen3_decode_incore_4
post_rmsnorm
down_proj_residual
out_proj_residual
qwen3_decode_incore_1
qwen3_decode_incore_10
qwen3_decode_incore_11
rmsnorm
qwen3_decode_incore_6
qwen3_decode_incore_2
qwen3_decode_incore_7
qwen3_decode_incore_5
qwen3_decode_incore_12
sels
syncall_binding
movfp
ands
divs
cvt_f32_f32
cvt_f32_f16
tcvt
tprefetch_async_binding
concat
test_inject_sync_loop
tmatmulk_autosync
test_barrier_sync
test_inject_sync_intra_pipe_barrier
test_set_wait_unified_api
test_mem_inject_sync_basic
test_intercore_sync_a3
syncHigh
matmul
add_double_dynamic
nested_loop_confliect
rar_optimization_test
test_intercore_sync_a3_modes
test_dynamic_valid_shape
test_inject_sync_if_else
test_auto_sync_tail_hint
sync
test_intercore_sync_a3_dyn
test_inject_sync_loop_nest
compensation_test
test_inject_sync_if
test_inject_sync_two_event_id
rem
partmin
colexpandadd
sqrt
bf16_tile
Summary
- `tcmps` host-side dst buffer sizing for packed predicate outputs
- `f32_256x16` ST kernel dst tensor view/`tstore` layout to match the allocated `256x32xui8` tile
- `cases.py` metadata experiment so the testcase stays aligned with the actual harness behavior
Root cause
`f32_256x16` allocated a `256x32xui8` dst tile, but its GM dst view and `tstore` path still used `256x16`. That truncated the packed predicate output during store, so the compare failed from byte 256 onward.
Validation
python3 test/tilelang_st/script/run_st.py -r sim -v a5 -t tcmps -p "$PWD/build/tools/ptoas/ptoas"
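The failure described under Root cause (a compare that diverges from a fixed byte offset onward) is the typical signature of a truncated store. The following is a minimal host-side sketch of how such an offset can be located. `FirstMismatch` and `SimulateTruncatedStore` are hypothetical helpers, not part of the test harness, and the buffer length and fill value in the usage note are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the offset of the first differing byte, or the
// common length if the two buffers agree over that length.
std::size_t FirstMismatch(const std::vector<std::uint8_t> &golden,
                          const std::vector<std::uint8_t> &actual) {
    const std::size_t n = std::min(golden.size(), actual.size());
    const auto last = golden.begin() + static_cast<std::ptrdiff_t>(n);
    const auto its = std::mismatch(golden.begin(), last, actual.begin());
    return static_cast<std::size_t>(its.first - golden.begin());
}

// Hypothetical model of a truncated store: only the first `storedBytes` bytes
// of the expected output reach the destination; the rest stays zero.
std::vector<std::uint8_t> SimulateTruncatedStore(const std::vector<std::uint8_t> &golden,
                                                 std::size_t storedBytes) {
    std::vector<std::uint8_t> dst(golden.size(), 0);
    const std::size_t n = std::min(storedBytes, golden.size());
    std::copy(golden.begin(), golden.begin() + static_cast<std::ptrdiff_t>(n), dst.begin());
    return dst;
}
```

As a usage example, with a 512-byte golden buffer filled with `0xFF` and only 256 bytes stored, `FirstMismatch(golden, SimulateTruncatedStore(golden, 256))` returns 256; with all 512 bytes stored it returns 512, meaning no mismatch within the compared range.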