Fix BUG and Optimize performance of mm operator for mthreads backend #2219
Vincent-Xiao wants to merge 11 commits into flagos-ai:master
Conversation
… LibEntry caches the autotuner-selected config.pre_hook together with config.all_kwargs() on a cache miss, and replays these pre_hooks on the direct-launch path on a cache hit, before launching the kernel.
Pull request overview
This PR targets correctness and performance improvements for the mthreads backend matrix multiply path, including better handling of the N=1 case and reducing overhead in SQMMA launches.
Changes:
- Added a dedicated GEMV (N=1) kernel path for `mm`/`mm_out`.
- Added caching for TMA device descriptors to reduce repeated descriptor construction overhead.
- Updated `@libentry()` execution to cache and replay Triton autotuner `pre_hook` calls on cached kernel launches.
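The pre-hook caching described above can be sketched as follows. This is a minimal illustration of the idea, not the actual `LibEntry` code: the class name `LibEntrySketch`, the cache key, and the `launch` callable are all hypothetical stand-ins for the real autotuner/JIT wrapping.

```python
class LibEntrySketch:
    """Minimal sketch: cache the selected config's pre_hook and replay it on hits."""

    def __init__(self, kernel):
        self.kernel = kernel  # underlying autotuned kernel (unused in this sketch)
        self.cache = {}       # key -> (compiled_kernel, pre_hook, all_kwargs)

    def __call__(self, key, launch):
        entry = self.cache.get(key)
        if entry is None:
            # Cache miss: run the full autotuner path once, then remember the
            # selected config's pre_hook together with its kwargs.
            compiled, pre_hook, all_kwargs = launch()
            self.cache[key] = (compiled, pre_hook, all_kwargs)
            return compiled
        compiled, pre_hook, all_kwargs = entry
        # Cache hit: replay the pre_hook before the direct launch so per-launch
        # side effects (e.g. refreshing TMA descriptors) still happen.
        if pre_hook is not None:
            pre_hook(all_kwargs)
        return compiled
```

Without the replay step, any side effect the autotuner's `pre_hook` performs (such as updating device descriptors) would silently be skipped on every cached launch.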
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/flag_gems/utils/libentry.py` | Cache autotuner pre-hooks and replay them for cached compiled-kernel launches. |
| `src/flag_gems/runtime/backend/_mthreads/ops/utils.py` | Introduce an LRU-style cache for TMA device descriptors. |
| `src/flag_gems/runtime/backend/_mthreads/ops/mm.py` | Add GEMV fast path for N=1; refactor SQMMA descriptor setup to use cached descriptors + pre-hook. |
```python
mm_sqmma_kernel[grid](
    A,
    B,
    C,
    desc_a,
```
mm_sqmma() launches mm_sqmma_kernel without with torch_device_fn.device(A.device): (unlike mm_fma(), mm_out(), etc.). This can run the kernel on the wrong current device in multi-device contexts and can also break descriptor creation/usage that depends on the active device. Wrap the kernel launch in the same torch_device_fn.device(A.device) context manager used elsewhere in this backend.
```python
nargs["b_desc_ptr"].copy_(
    get_cached_tma_device_descriptor(b, block_k, block_n, device)
)
nargs["c_desc_ptr"].copy_(create_tma_device_descriptor(c, block_m, block_n, device))
```
In sqmma_descriptor_pre_hook, create_tma_device_descriptor(c, ...) allocates a new device tensor every launch and then copy_() immediately copies it into c_desc_ptr, adding an extra allocation + device-to-device copy on the critical path. Consider changing the descriptor helper to fill an existing descriptor tensor (or to return a CPU tensor and copy directly into c_desc_ptr) so the pre-hook avoids per-launch device allocations.
Suggested change (use the cached helper instead of creating a fresh descriptor):
```python
nargs["c_desc_ptr"].copy_(
    get_cached_tma_device_descriptor(c, block_m, block_n, device)
)
```
```python
def is_sqmma_compatible(a, b, N, K):
    return (
        os.getenv("MUSA_ENABLE_SQMMA", "0") == "1"
        and a.dim() == 2
        and b.dim() == 2
        and a.dtype == b.dtype
        and a.dtype in (torch.float16, torch.bfloat16)
        and is_supported_sqmma_layout(a)
        and is_supported_sqmma_layout(b)
        and N % 8 == 0
        and K % 8 == 0
    )
```
mm() now gates SQMMA via is_sqmma_compatible() instead of the shared should_enable_sqmma() helper in ops/utils.py, which means the explicit shape exclusions in should_enable_sqmma (e.g. (15, 160, 1024)) are no longer applied for mm. If those exclusions are still required to avoid known SQMMA failures, this is a behavioral regression; consider reusing should_enable_sqmma here or moving any required exclusions/alignment checks into a single shared predicate used by mm/addmm/bmm.
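The single shared predicate the review suggests could look roughly like this. This is a sketch only: `should_use_sqmma`, its parameters, and the exclusion set are hypothetical, with the one shape `(15, 160, 1024)` taken from the review comment above; the real `should_enable_sqmma` checks tensor dtype and layout directly rather than taking booleans.

```python
import os

# Known-bad shapes excluded from SQMMA; (15, 160, 1024) is the example cited
# in the review, the full list lives in should_enable_sqmma.
_SQMMA_EXCLUDED_SHAPES = {(15, 160, 1024)}

def should_use_sqmma(M, N, K, dtype_ok, layout_ok):
    """One shared gate for mm/addmm/bmm: env flag, shape exclusions, alignment."""
    if os.getenv("MUSA_ENABLE_SQMMA", "0") != "1":
        return False
    if (M, N, K) in _SQMMA_EXCLUDED_SHAPES:
        return False  # skip shapes with known SQMMA failures
    return dtype_ok and layout_ok and N % 8 == 0 and K % 8 == 0
```

Routing all three operators through one predicate like this would keep the exclusion list from silently diverging between call sites, which is the regression risk the comment describes.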
PR Category
Operator
Type of Change
Bug Fix | Performance Optimization
Description
Issue
Progress
Performance