
Fix: a5 AICore SIMT launch — set localMemorySize + inject SIMT TLVs#764

Open
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from ChaoZheng109:fix-a5-aicore-simt-tlv

Conversation


ChaoZheng109 commented May 13, 2026

Summary

Two coupled fixes for the legacy rtKernelLaunchWithHandleV2 path on a5 AICore. Without both, the launch either allocates no local memory for the kernel or is rejected with ACL_ERROR_RT_PARAM_INVALID (107000).

1. Set cfg.localMemorySize

cfg.localMemorySize was left at 0 in launch_aicore_kernel, so the runtime reserved no AICore local memory and SIMT execution failed. Introduce PLATFORM_AICORE_LOCAL_MEMORY_SIZE = 216 KB and pass it through rtTaskCfgInfo_t. The 216 KB ceiling pairs with the 8 KB advertised by the TLV record (fix #2) to land exactly on RT_SIMT_REMAIN_UB_SIZE (224 KB = 256 KB UB − 32 KB dcache); the runtime's check uses strict >, so equality is accepted.

2. Inject SIMT TLVs into the AICore ELF

Runtime reads two TLV records from .ascend.meta.<funcname> at register time:

  • RT_FUNCTION_TYPE_COMPILER_ALLOC_UB_SIZE (type=7) → Kernel::shareMemSize_
  • RT_FUNCTION_TYPE_AIV_TYPE_FLAG (type=12) → Kernel::kernelVfType_

bisheng only emits these records when it can statically infer that the kernel uses SIMT intrinsics. Our SU-dispatcher entry doesn't satisfy that — the vector ops live in task .o files invoked through aicore_execute. The fix has two layers:

  • Hand-write a meta record for the AIV variant: ub_size = PLATFORM_AICORE_SHARE_MEM_SIZE (8 KB), aiv_type = SIMD_SIMT_MIX_VF. The dispatcher routes task .o files containing both SIMD and SIMT vector kernels, so MIX_VF avoids runtime's per-type restrictions.
  • Add -mllvm -cce-dyn-kernel-stack-size=false to the AICore build flags. Without it, bisheng auto-emits a sibling section with the same name; since the runtime's parser keys kernelInfoMap by section name and overwrites rather than merges, the auto-emitted NO_VF / shareMemSize=0 record would shadow our values.

TLV type IDs 7 / 12 mirror rtFunctionMetaType in runtime/runtime/elf_base.h; AIVType values are not exposed in any CANN C/C++ header (only in ascendc_identify_meta_section_info.py). Both are documented inline in simt_meta.h for traceability.

Files

  • src/a5/platform/include/common/platform_config.h — new PLATFORM_AICORE_LOCAL_MEMORY_SIZE and PLATFORM_AICORE_SHARE_MEM_SIZE constants
  • src/a5/platform/onboard/host/device_runner.cpp — set cfg.localMemorySize in launch_aicore_kernel
  • src/a5/platform/onboard/aicore/CMakeLists.txt — add -mllvm -cce-dyn-kernel-stack-size=false
  • src/a5/platform/onboard/aicore/simt_meta.h — TLV struct/enum definitions
  • src/a5/platform/onboard/aicore/kernel.cpp — hand-written SIMT TLV record using the extracted types

+107 lines across 5 files, no deletions.

Test plan

  • AICore object builds with the new bisheng flag without errors.
  • bisheng-readelf -S build/lib/<…>/aicore_kernel.o shows exactly one .ascend.meta.aicore_kernel_0_mix_aiv section (no shadow).
  • TLV bytes dump to [type=7, len=4, ub_size=8192] and [type=12, len=4, aiv_type=4].
  • An a5 onboard example (e.g. a small example under examples/) launches without ACL_ERROR_RT_PARAM_INVALID (107000) from runtime's CheckAndGetTotalShareMemorySize.


gemini-code-assist (bot) left a comment


Code Review

This pull request implements SIMT metadata TLV injection for AICore kernels to support the legacy launch path. Key changes include defining TLV structures and metadata enums in kernel.cpp, disabling automatic metadata generation via compiler flags in CMakeLists.txt, and setting the local memory size in the host-side device runner. Review feedback recommends renaming internal structures to avoid reserved identifier conflicts and using macros to dynamically generate section names for better maintainability.

Comment thread src/a5/platform/onboard/aicore/kernel.cpp Outdated
Comment thread src/a5/platform/onboard/aicore/kernel.cpp Outdated
ChaoZheng109 force-pushed the fix-a5-aicore-simt-tlv branch 2 times, most recently from e7a9644 to fb7b893 (May 14, 2026 01:44)
Three coupled changes that together unblock the
rtKernelLaunchWithHandleV2 path on a5 and add CI coverage for it:

1. cfg.localMemorySize was left at 0 in launch_aicore_kernel, so runtime
   allocated no AICore local memory and SIMT execution failed. Add a
   pair of constants in platform_config.h —
   PLATFORM_AICORE_SHARE_MEM_SIZE (8 KB) and
   PLATFORM_AICORE_LOCAL_MEMORY_SIZE (216 KB) — and pass the latter
   through rtTaskCfgInfo_t::localMemorySize. The pair sums to exactly
   RT_SIMT_REMAIN_UB_SIZE (224 KB = 256 KB UB − 32 KB dcache); runtime's
   check is strict > so equality is accepted. Section 2 below consumes
   PLATFORM_AICORE_SHARE_MEM_SIZE as the advertised TLV value.

2. Runtime reads two TLV records (COMPILER_ALLOC_UB_SIZE / type=7 and
   AIV_TYPE_FLAG / type=12) from the kernel ELF's `.ascend.meta.<func>`
   section to populate Kernel::shareMemSize_ and Kernel::kernelVfType_.
   bisheng only emits these when it can statically infer SIMT use; our
   SU-dispatcher entry can't be tagged automatically. Inject a
   hand-written meta record for the AIV variant (ub_size=8 KB via
   PLATFORM_AICORE_SHARE_MEM_SIZE, aiv_type=SIMD_SIMT_MIX_VF — the
   dispatcher routes task .o files containing both SIMD and SIMT vector
   kernels, so MIX_VF avoids runtime's per-type restrictions) and
   disable bisheng's auto-emission with
   `-mllvm -cce-dyn-kernel-stack-size=false` so the runtime parser,
   which keys kernelInfoMap by section name and overwrites instead of
   merging, doesn't shadow our values with NO_VF / shareMemSize=0.

3. Add tests/st/a5/tensormap_and_ringbuffer/simt_basic/ — a minimal
   element-scatter ST that exercises the SIMT launch path end-to-end.
   The kernel is distilled from the ptoas-generated mscatter reference
   and keeps the pieces real hardware actually requires:

   - per-data 3-tile alias pattern (TLOAD binds one tile, MSCATTER
     reads from another aliased to the same UB address; the single-tile
     form silently dropped the scatter on a5 hw)
   - set_mask_norm / set_vector_mask SIMT mask init at entry
   - MTE2 → V flag/wait before MSCATTER (the ptoas default MTE2 → MTE3
     also silently dropped the scatter on hw)

   Indices use torch.arange() so the golden reduces to `out == src`,
   keeping the test a strict bring-up signal: a regression in any of
   the launch-path layers fixed above (TLV injection, localMemorySize
   budget, sync) flips it red while keeping false-positives from the
   scatter algorithm itself out of the way. A follow-up case with
   torch.randperm indices can be added once the ptoas dispatcher's
   per-element vs row-mode behaviour on hw is confirmed.

   The orchestration wraps rt_submit_aiv_task in PTO2_SCOPE() so the
   submit flushes through the task ringbuffer before the entry returns.
ChaoZheng109 force-pushed the fix-a5-aicore-simt-tlv branch from fb7b893 to 3d1adbf (May 14, 2026 03:27)