
Fix: a5 AICore SIMT launch — set localMemorySize + inject SIMT TLVs#764

Open
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from ChaoZheng109:fix-a5-aicore-simt-tlv

Conversation


ChaoZheng109 commented May 13, 2026

Summary

Two coupled fixes for the legacy rtKernelLaunchWithHandleV2 path on a5 AICore. Without both, the launch either allocates no local memory for the kernel or is rejected with ACL_ERROR_RT_PARAM_INVALID (107000).

1. Set cfg.localMemorySize

cfg.localMemorySize was left at 0 in launch_aicore_kernel, so the runtime reserved no AICore local memory and SIMT execution failed. Introduce PLATFORM_AICORE_LOCAL_MEMORY_SIZE = 216 KB and pass it through rtTaskCfgInfo_t. The 216 KB ceiling pairs with the 8 KB advertised by the TLV record (fix #2) to land exactly on RT_SIMT_REMAIN_UB_SIZE (224 KB = 256 KB UB − 32 KB dcache); the runtime's check uses strict >, so equality is accepted.

2. Inject SIMT TLVs into the AICore ELF

Runtime reads two TLV records from .ascend.meta.<funcname> at register time:

  • RT_FUNCTION_TYPE_COMPILER_ALLOC_UB_SIZE (type=7) → Kernel::shareMemSize_
  • RT_FUNCTION_TYPE_AIV_TYPE_FLAG (type=12) → Kernel::kernelVfType_

bisheng only emits these records when it can statically infer that the kernel uses SIMT intrinsics. Our SU-dispatcher entry doesn't satisfy that — the vector ops live in task .o files invoked through aicore_execute. The fix has two layers:

  • Hand-write a meta record for the AIV variant: ub_size = PLATFORM_AICORE_SHARE_MEM_SIZE (8 KB), aiv_type = SIMD_SIMT_MIX_VF. The dispatcher routes task .o files containing both SIMD and SIMT vector kernels, so MIX_VF avoids runtime's per-type restrictions.
  • Add -mllvm -cce-dyn-kernel-stack-size=false to the AICore build flags. Without it, bisheng auto-emits a sibling section with the same name; since the runtime's parser keys kernelInfoMap by section name and overwrites rather than merges, the auto-emitted NO_VF / shareMemSize=0 record would shadow our values.

TLV type IDs 7 / 12 mirror rtFunctionMetaType in runtime/runtime/elf_base.h; AIVType values are not exposed in any CANN C/C++ header (only in ascendc_identify_meta_section_info.py). Both are documented inline in simt_meta.h for traceability.

Files

  • src/a5/platform/include/common/platform_config.h — new PLATFORM_AICORE_LOCAL_MEMORY_SIZE and PLATFORM_AICORE_SHARE_MEM_SIZE constants
  • src/a5/platform/onboard/host/device_runner.cpp — set cfg.localMemorySize in launch_aicore_kernel
  • src/a5/platform/onboard/aicore/CMakeLists.txt — add -mllvm -cce-dyn-kernel-stack-size=false
  • src/a5/platform/onboard/aicore/simt_meta.h — TLV struct/enum definitions
  • src/a5/platform/onboard/aicore/kernel.cpp — hand-written SIMT TLV record using the extracted types

+107 lines across 5 files, no deletions.

Test plan

  • AICore object builds with the new bisheng flag without errors.
  • bisheng-readelf -S build/lib/<…>/aicore_kernel.o shows exactly one .ascend.meta.aicore_kernel_0_mix_aiv section (no shadow).
  • TLV bytes dump to [type=7, len=4, ub_size=8192] and [type=12, len=4, aiv_type=4].
  • An a5 onboard example (e.g. a small example under examples/) launches without ACL_ERROR_RT_PARAM_INVALID (107000) from runtime's CheckAndGetTotalShareMemorySize.


gemini-code-assist (bot) left a comment


Code Review

This pull request implements SIMT metadata TLV injection for AICore kernels to support the legacy launch path. Key changes include defining TLV structures and metadata enums in kernel.cpp, disabling automatic metadata generation via compiler flags in CMakeLists.txt, and setting the local memory size in the host-side device runner. Review feedback recommends renaming internal structures to avoid reserved identifier conflicts and using macros to dynamically generate section names for better maintainability.

Comment thread src/a5/platform/onboard/aicore/kernel.cpp Outdated
Comment thread src/a5/platform/onboard/aicore/kernel.cpp Outdated
ChaoZheng109 force-pushed the fix-a5-aicore-simt-tlv branch 2 times, most recently from e7a9644 to fb7b893 (May 14, 2026 01:44)
Three coupled changes that together unblock the
rtKernelLaunchWithHandleV2 path on a5 and add CI coverage for it:

1. cfg.localMemorySize was left at 0 in launch_aicore_kernel, so runtime
   allocated no AICore local memory and SIMT execution failed. Add a
   pair of constants in platform_config.h —
   PLATFORM_AICORE_SHARE_MEM_SIZE (8 KB) and
   PLATFORM_AICORE_LOCAL_MEMORY_SIZE (216 KB) — and pass the latter
   through rtTaskCfgInfo_t::localMemorySize. The pair sums to exactly
   RT_SIMT_REMAIN_UB_SIZE (224 KB = 256 KB UB − 32 KB dcache); runtime's
   check is strict > so equality is accepted. Section 2 below consumes
   PLATFORM_AICORE_SHARE_MEM_SIZE as the advertised TLV value.

2. Runtime reads two TLV records (COMPILER_ALLOC_UB_SIZE / type=7 and
   AIV_TYPE_FLAG / type=12) from the kernel ELF's `.ascend.meta.<func>`
   section to populate Kernel::shareMemSize_ and Kernel::kernelVfType_.
   bisheng only emits these when it can statically infer SIMT use; our
   SU-dispatcher entry can't be tagged automatically. Inject a
   hand-written meta record for the AIV variant (ub_size=8 KB via
   PLATFORM_AICORE_SHARE_MEM_SIZE, aiv_type=SIMD_SIMT_MIX_VF — the
   dispatcher routes task .o files containing both SIMD and SIMT vector
   kernels, so MIX_VF avoids runtime's per-type restrictions) and
   disable bisheng's auto-emission with
   `-mllvm -cce-dyn-kernel-stack-size=false` so the runtime parser,
   which keys kernelInfoMap by section name and overwrites instead of
   merging, doesn't shadow our values with NO_VF / shareMemSize=0.

3. Add tests/st/a5/tensormap_and_ringbuffer/simt_basic/ — a minimal
   element-scatter ST that exercises the SIMT launch path end-to-end.
   The kernel is distilled from the ptoas-generated mscatter reference
   and keeps the pieces real hardware actually requires:

   - per-data 3-tile alias pattern (TLOAD binds one tile, MSCATTER
     reads from another aliased to the same UB address; the single-tile
     form silently dropped the scatter on a5 hw)
   - set_mask_norm / set_vector_mask SIMT mask init at entry
   - MTE2 → V flag/wait before MSCATTER (the ptoas default MTE2 → MTE3
     also silently dropped the scatter on hw)

   Indices use torch.arange() so the golden reduces to `out == src`,
   keeping the test a strict bring-up signal: a regression in any of
   the launch-path layers fixed above (TLV injection, localMemorySize
   budget, sync) flips it red while keeping false-positives from the
   scatter algorithm itself out of the way. A follow-up case with
   torch.randperm indices can be added once the ptoas dispatcher's
   per-element vs row-mode behaviour on hw is confirmed.

   The orchestration wraps rt_submit_aiv_task in PTO2_SCOPE() so the
   submit flushes through the task ringbuffer before the entry returns.
ChaoZheng109 force-pushed the fix-a5-aicore-simt-tlv branch from fb7b893 to 3d1adbf (May 14, 2026 03:27)