Fix: a5 AICore SIMT launch — set localMemorySize + inject SIMT TLVs#764
Open
ChaoZheng109 wants to merge 1 commit into
Code Review
This pull request implements SIMT metadata TLV injection for AICore kernels to support the legacy launch path. Key changes include defining TLV structures and metadata enums in kernel.cpp, disabling automatic metadata generation via compiler flags in CMakeLists.txt, and setting the local memory size in the host-side device runner. Review feedback recommends renaming internal structures to avoid reserved identifier conflicts and using macros to dynamically generate section names for better maintainability.
Three coupled changes that together unblock the
rtKernelLaunchWithHandleV2 path on a5 and add CI coverage for it:
1. cfg.localMemorySize was left at 0 in launch_aicore_kernel, so runtime
allocated no AICore local memory and SIMT execution failed. Add a
pair of constants in platform_config.h —
PLATFORM_AICORE_SHARE_MEM_SIZE (8 KB) and
PLATFORM_AICORE_LOCAL_MEMORY_SIZE (216 KB) — and pass the latter
through rtTaskCfgInfo_t::localMemorySize. The pair sums to exactly
RT_SIMT_REMAIN_UB_SIZE (224 KB = 256 KB UB − 32 KB dcache); runtime's
check is strict > so equality is accepted. Section 2 below consumes
PLATFORM_AICORE_SHARE_MEM_SIZE as the advertised TLV value.
2. Runtime reads two TLV records (COMPILER_ALLOC_UB_SIZE / type=7 and
AIV_TYPE_FLAG / type=12) from the kernel ELF's `.ascend.meta.<func>`
section to populate Kernel::shareMemSize_ and Kernel::kernelVfType_.
bisheng only emits these when it can statically infer SIMT use; our
SU-dispatcher entry can't be tagged automatically. Inject a
hand-written meta record for the AIV variant (ub_size=8 KB via
PLATFORM_AICORE_SHARE_MEM_SIZE, aiv_type=SIMD_SIMT_MIX_VF — the
dispatcher routes task .o files containing both SIMD and SIMT vector
kernels, so MIX_VF avoids runtime's per-type restrictions) and
disable bisheng's auto-emission with
`-mllvm -cce-dyn-kernel-stack-size=false` so the runtime parser,
which keys kernelInfoMap by section name and overwrites instead of
merging, doesn't shadow our values with NO_VF / shareMemSize=0.
3. Add tests/st/a5/tensormap_and_ringbuffer/simt_basic/ — a minimal
element-scatter ST that exercises the SIMT launch path end-to-end.
The kernel is distilled from the ptoas-generated mscatter reference
and keeps the pieces real hardware actually requires:
- per-data 3-tile alias pattern (TLOAD binds one tile, MSCATTER
reads from another aliased to the same UB address; the single-tile
form silently dropped the scatter on a5 hw)
- set_mask_norm / set_vector_mask SIMT mask init at entry
- MTE2 → V flag/wait before MSCATTER (the ptoas default MTE2 → MTE3
also silently dropped the scatter on hw)
Indices use torch.arange() so the golden reduces to `out == src`,
keeping the test a strict bring-up signal: a regression in any of
the launch-path layers fixed above (TLV injection, localMemorySize
budget, sync) flips it red while keeping false-positives from the
scatter algorithm itself out of the way. A follow-up case with
torch.randperm indices can be added once the ptoas dispatcher's
per-element vs row-mode behaviour on hw is confirmed.
The orchestration wraps rt_submit_aiv_task in PTO2_SCOPE() so the
submit flushes through the task ringbuffer before the entry returns.
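
The golden argument in item 3 can be illustrated with a host-side reference scatter. This is a minimal sketch in plain C++, not the on-device MSCATTER path; `scatter_reference` and `arange_indices` are hypothetical names introduced here for illustration. With identity indices (the counterpart of torch.arange()), the scatter writes each element back to its own position, so the expected output equals the input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Host-side reference scatter: out[idx[i]] = src[i].
// Hypothetical helper; it models the semantics only, not the a5 kernel.
std::vector<float> scatter_reference(const std::vector<float>& src,
                                     const std::vector<int32_t>& idx,
                                     std::size_t out_size) {
    assert(src.size() == idx.size());
    std::vector<float> out(out_size, 0.0f);
    for (std::size_t i = 0; i < src.size(); ++i) {
        // Guard against out-of-range indices before writing.
        assert(idx[i] >= 0 && static_cast<std::size_t>(idx[i]) < out_size);
        out[static_cast<std::size_t>(idx[i])] = src[i];
    }
    return out;
}

// Identity indices 0..n-1, analogous to torch.arange(n).
std::vector<int32_t> arange_indices(std::size_t n) {
    std::vector<int32_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    return idx;
}
```

Because the indices are the identity permutation, any mismatch between the output and `src` points at the launch path rather than at the scatter logic, which is the property the test relies on.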
Summary
Two coupled fixes for the legacy `rtKernelLaunchWithHandleV2` path on a5 AICore. Without both, `rtKernelLaunchWithHandleV2` either allocates no local memory for the kernel or refuses the launch with `ACL_ERROR_RT_PARAM_INVALID (107000)`.

1. Set `cfg.localMemorySize`
`cfg.localMemorySize` was left at `0` in `launch_aicore_kernel`, so the runtime reserved no AICore local memory and SIMT execution failed. Introduce `PLATFORM_AICORE_LOCAL_MEMORY_SIZE = 216 KB` and pass it through `rtTaskCfgInfo_t`. The 216 KB ceiling pairs with the 8 KB advertised by the TLV record (fix #2) to land exactly on `RT_SIMT_REMAIN_UB_SIZE` (224 KB = 256 KB UB − 32 KB dcache); runtime's check is strict `>`, so equality is accepted.

2. Inject SIMT TLVs into the AICore ELF
Runtime reads two TLV records from `.ascend.meta.<funcname>` at register time:
- `RT_FUNCTION_TYPE_COMPILER_ALLOC_UB_SIZE` (type=7) → `Kernel::shareMemSize_`
- `RT_FUNCTION_TYPE_AIV_TYPE_FLAG` (type=12) → `Kernel::kernelVfType_`

bisheng only emits these when it can statically infer the kernel uses SIMT intrinsics. Our SU-dispatcher entry doesn't satisfy that: vector ops live in task `.o` files invoked through `aicore_execute`. Fix in two layers:
- Inject a hand-written meta record with `ub_size = PLATFORM_AICORE_SHARE_MEM_SIZE` (8 KB), `aiv_type = SIMD_SIMT_MIX_VF`. The dispatcher routes task `.o` files containing both SIMD and SIMT vector kernels, so MIX_VF avoids runtime's per-type restrictions.
- Add `-mllvm -cce-dyn-kernel-stack-size=false` to the AICore build flags. Without it, bisheng auto-emits a sibling section with the same name. Runtime's parser keys `kernelInfoMap` by section name and overwrites instead of merging, so the auto-emitted `NO_VF` / `shareMemSize=0` would shadow our values.

TLV type IDs `7` / `12` mirror `rtFunctionMetaType` in `runtime/runtime/elf_base.h`; `AIVType` values are not exposed in any CANN C/C++ header (only in `ascendc_identify_meta_section_info.py`). Both are documented inline in `simt_meta.h` for traceability.

Files
- `src/a5/platform/include/common/platform_config.h`: new `PLATFORM_AICORE_LOCAL_MEMORY_SIZE` and `PLATFORM_AICORE_SHARE_MEM_SIZE` constants
- `src/a5/platform/onboard/host/device_runner.cpp`: set `cfg.localMemorySize` in `launch_aicore_kernel`
- `src/a5/platform/onboard/aicore/CMakeLists.txt`: add `-mllvm -cce-dyn-kernel-stack-size=false`
- `src/a5/platform/onboard/aicore/simt_meta.h`: TLV struct/enum definitions
- `src/a5/platform/onboard/aicore/kernel.cpp`: hand-written SIMT TLV record using the extracted types

+107 lines across 5 files, no deletions.
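
For orientation, the two records could be laid out and read back roughly as follows. This is a sketch under stated assumptions, not the contents of `simt_meta.h`: the 16-bit type / 16-bit length header, the struct name `TlvU32`, and the helpers `build_meta_section` and `find_u32` are all hypothetical. Only the type IDs (7, 12), the 4-byte value length, and the values 8192 and 4 come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Type IDs and values quoted in the PR description.
constexpr uint16_t kTypeCompilerAllocUbSize = 7;   // -> Kernel::shareMemSize_
constexpr uint16_t kTypeAivTypeFlag = 12;          // -> Kernel::kernelVfType_
constexpr uint32_t kShareMemSize = 8 * 1024;       // 8 KB
constexpr uint32_t kAivSimdSimtMixVf = 4;          // aiv_type expected by the test plan
constexpr uint16_t kValueLen = 4;                  // len=4 in both records

// ASSUMPTION: header is a 16-bit type followed by a 16-bit length.
struct TlvU32 {
    uint16_t type;
    uint16_t len;
    uint32_t value;
};
static_assert(sizeof(TlvU32) == 8, "record must have no padding");

// Serializes both records back to back, as a meta section might hold them.
std::vector<uint8_t> build_meta_section() {
    const TlvU32 records[] = {
        {kTypeCompilerAllocUbSize, kValueLen, kShareMemSize},
        {kTypeAivTypeFlag, kValueLen, kAivSimdSimtMixVf},
    };
    std::vector<uint8_t> bytes(sizeof(records));
    std::memcpy(bytes.data(), records, sizeof(records));
    assert(bytes.size() == 2 * sizeof(TlvU32));
    return bytes;
}

// Walks the buffer and returns the value of the first 4-byte record of the
// wanted type, or `fallback` if none is found.
uint32_t find_u32(const std::vector<uint8_t>& bytes, uint16_t wanted,
                  uint32_t fallback) {
    std::size_t off = 0;
    while (off + 4 <= bytes.size()) {
        uint16_t type = 0;
        uint16_t len = 0;
        std::memcpy(&type, &bytes[off], sizeof(type));
        std::memcpy(&len, &bytes[off + 2], sizeof(len));
        off += 4;
        if (off + len > bytes.size()) break;  // truncated record
        if (type == wanted && len == sizeof(uint32_t)) {
            uint32_t value = 0;
            std::memcpy(&value, &bytes[off], sizeof(value));
            return value;
        }
        off += len;  // skip records of other types
    }
    return fallback;
}
```

A reader written this way returns whatever a single section contains, which is why a second section with the same name carrying `NO_VF` / `shareMemSize=0` would be harmful if it replaced the hand-written one.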
Test plan
- `bisheng-readelf -S build/lib/<…>/aicore_kernel.o` shows exactly one `.ascend.meta.aicore_kernel_0_mix_aiv` section (no shadow).
- The section contains the records `[type=7, len=4, ub_size=8192]` and `[type=12, len=4, aiv_type=4]`.
- (… `examples/`) launches without `ACL_ERROR_RT_PARAM_INVALID (107000)` from runtime's `CheckAndGetTotalShareMemorySize`.
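
The budget behind the last check can be written down as compile-time arithmetic. The sketch below is a hypothetical model of the rule as the description states it (reject only when the total strictly exceeds the limit); it is not the code of `CheckAndGetTotalShareMemorySize`, and `fits_simt_budget` plus the `k...` constant names are invented for illustration. The sizes are the ones quoted above.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kKiB = 1024;
constexpr uint32_t kUbSize = 256 * kKiB;                       // unified buffer
constexpr uint32_t kDcacheSize = 32 * kKiB;                    // reserved for dcache
constexpr uint32_t kSimtRemainUbSize = kUbSize - kDcacheSize;  // 224 KB
constexpr uint32_t kShareMemSize = 8 * kKiB;                   // advertised via TLV
constexpr uint32_t kLocalMemorySize = 216 * kKiB;              // cfg.localMemorySize

// Hypothetical model: the launch is refused only when the total is strictly
// greater than the remaining UB size, so equality passes.
constexpr bool fits_simt_budget(uint32_t share, uint32_t local) {
    return !(static_cast<uint64_t>(share) + local > kSimtRemainUbSize);
}

static_assert(kShareMemSize + kLocalMemorySize == kSimtRemainUbSize,
              "8 KB + 216 KB must land exactly on 224 KB");
static_assert(fits_simt_budget(kShareMemSize, kLocalMemorySize),
              "equality is accepted");
static_assert(!fits_simt_budget(kShareMemSize, kLocalMemorySize + 1),
              "one byte over is rejected");
```

Keeping the sum as a `static_assert` next to the constants would make a later change to either size fail at build time instead of at launch.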