feat(tensilelite): Add gfx1250 support for AdaptiveGemmNTAB#8421
feat(tensilelite): Add gfx1250 support for AdaptiveGemmNTAB#8421AndySu12 wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends TensileLite/Tensile + StinkyTofu/rocisa plumbing to support gfx1250-specific cache-hint semantics needed by AdaptiveGemm / AdaptiveGemmNTAB, including temporal-hint (th:TH_*) and non-volatile (nv) modifiers, and adds a gfx1250 coverage YAML for AGNTAB.
Changes:
- Add
NonVolatilesupport to StinkyTofu MUBUF modifiers and ensure it round-trips through asm emission + (de)serialization. - Update Tensile KernelWriter to avoid label collisions under AdaptiveGemmNTAB and to map NTAB selections to gfx1250 temporal hints when supported.
- Add gfx1250 AGNTAB coverage test YAML and extend rocisa tests for
thandnvemission.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| shared/stinkytofu/src/serialization/asm/StinkyAsmEmitter.cpp | Emit nv token for MUBUF modifiers when present. |
| shared/stinkytofu/src/serialization/asm/ModifierSerializer.cpp | Serialize/deserialize nv and use store/load-specific temporal-hint strings. |
| shared/stinkytofu/src/conversion/rocisa/ToStinkyTofuUtils.cpp | Convert rocisa nv into StinkyTofu and minor loop-body detection tweak. |
| shared/stinkytofu/include/stinkytofu/ir/asm/StinkyModifiers.hpp | Introduce NonVolatile enum + parse/print helpers; extend MUBUFModifiers. |
| projects/hipblaslt/tensilelite/Tensile/Tests/common/gemm/gfx12/agntab_coverage_gfx1250.yaml | New gfx1250 AGNTAB/AG coverage configuration. |
| projects/hipblaslt/tensilelite/Tensile/KernelWriter.py | Make unrolled-loop labels NTAB-specific; adjust TH(A/B) under NTAB when supported. |
| projects/hipblaslt/tensilelite/rocisa/test/test_mubuf.py | Add StinkyTofu emission tests for MUBUF temporal-hint and non-volatile modifiers. |
| projects/hipblaslt/tensilelite/rocisa/rocisa/include/container.hpp | Remove TH/NV asmCaps gating for FLAT/MUBUF/SMEM modifier emission (needs fixes). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
❌ Your project status has failed because the head coverage (76.92%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #8421 +/- ##
===========================================
- Coverage 71.33% 71.33% -0.00%
===========================================
Files 2628 2628
Lines 413045 413056 +11
Branches 61875 61878 +3
===========================================
+ Hits 294615 294616 +1
- Misses 96656 96666 +10
Partials 21774 21774
*This pull request uses carry forward flags. Click here to find out more.
🚀 New features to boost your workflow:
|
94ac370 to
74d72c9
Compare
f6ba794 to
d4d0c00
Compare
d4d0c00 to
a5f8a4b
Compare
a5f8a4b to
f73ca4f
Compare
✅ All Checks Passed — Ready for Review
📖 Need help? See the Policy FAQ for details on every check and how to fix failures. |
|
🎉 All checks passed! This PR is ready for review. |
StinkyTofu's MUBUF modifier model did not carry the gfx1250 TemporalHint (th:) or NonVolatile (nv) fields, so converting a rocisa module via toStinkyTofuModule and rendering it with emitAssembly dropped them while keeping scope:. Because every gfx1250 kernel is emitted through StinkyTofu for any ScheduleIterAlg, this silently disabled the AdaptiveGemmNTAB per-body non-temporal hint (th:TH_LOAD_NT) on all gfx1250 kernels. Add TemporalHint/NonVolatile to StinkyTofu's MUBUFModifiers across the model, the rocisa->StinkyTofu conversion, the asm emitter, and the .stir round-trip, mirroring rocisa's encoding and isStore-aware naming. StinkyTofu emits th:/nv purely from the modifier values; capability gating lives in rocisa (container.hpp) upstream. Add regression tests for the th: and nv paths.
…working copy AdaptiveGemmNTAB emits multiple loop bodies whose A/B global loads differ only by their non-temporal/temporal-hint state. NonTemporal already switches this per body via the transient tensorParameters["NonTemporal"] working copy; TemporalHint, however, was toggled by mutating the kernel solution parameters kernel["TemporalHintA"/"TemporalHintB"] directly, which is an architectural violation (solution parameters are immutable inputs, not per-body scratch). Mirror the NonTemporal pattern: seed tP["TemporalHint"] alongside tP["NonTemporal"] at tensorParameters init, read it on the A/B global-read paths (TemporalHint(tP["TemporalHint"])), and snapshot/set/restore it at the tensorParameters level inside the NTAB body loop. The kernel-level TemporalHint solution parameters are no longer mutated. Behavior is unchanged: the same per-body int values flow through the same TemporalHint enum into decodeNonTemporal; only the storage location moves from the kernel dict to the per-tensor working copy. The kernel-direct load path (non-tP) keeps reading the solution parameter as before.
f73ca4f to
039fe23
Compare
AIHPBLAS-879
JIRA ID : https://amd-hub.atlassian.net/browse/AIHPBLAS-879
Motivation
gfx1250 adds TemporalHint and NonVolatile cache modifiers that AdaptiveGemmNTAB uses for per-body cache behavior. AGNTAB emits three main-loop bodies differing only in their NonTemporal/TemporalHint state, but the hint was set by mutating immutable solution parameters and the loop labels collided across bodies. More importantly, every gfx1250 kernel is lowered through StinkyTofu, whose MUBUF model dropped th/nv, silently disabling the hint on all gfx1250 kernels. This PR makes AGNTAB cache hints generate correctly and survive the full rocisa → StinkyTofu → assembly pipeline.
Technical Details
In Tensilelite, the per-body TemporalHint now uses the transient tensorParameters (tP) working copy instead of mutating kernel object, and unrolled-loop labels are made NTAB-specific. StinkyTofu's MUBUFModifiers is extended to carry th/nv, plumbed through the rocisa conversion, asm emitter, and stir round-trip with isStore-aware naming. StinkyTofu emits purely from the values; capability gating (HasTHModifier/HasNVModifier) stays upstream in rocisa. The stir deserialization recovers isStore from the th token prefix so store hints aren't downgraded on round-trip.
Test Plan
test_mubuf.py, agntab_coverage_gfx1250.yaml, tox, hipblaslt-test
Test Result
PASSED
Submission Checklist