[MetaXGPU] Add compiler-path C500 hgemm route #9

Open
VitalyAnkh wants to merge 2 commits into MetaX-MACA:dev from VitalyAnkh:clean-hgemm-164

VitalyAnkh commented May 20, 2026

Hi maintainers,

This PR routes the C500 hgemm path in tileops.ops.gemm.GemmOp through the TileLang compiler-generated backend. The handwritten maca_hgemm MACA C implementation remains a performance, layout, and shape reference; it is not used as the optimized execution path from GemmOp.

Summary

  • Adds the compiler-path MetaX C500 hgemm route with packed-B and split-K support.
  • Keeps the helper and wrapper pieces needed to preserve the validated layout contract.
  • Updates auto-dispatch coverage for the new GemmOp path.
  • Requires TileLang 0.1.10, which provides the GEMM annotation surface used by this route.
  • Keeps the route paired with the TileLang lowering and layout support in the companion PR.
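
For readers unfamiliar with split-K, the idea can be sketched in plain Python: the K dimension is cut into equal slices, each slice produces a partial product, and the partials are summed in a final reduction. This is only an illustration of the technique, not the TileOps kernel; the function names `matmul_slice` and `split_k_matmul` are hypothetical, and the equal-slice requirement is an assumption made to keep the sketch short.

```python
# Illustrative split-K GEMM on nested lists. Hypothetical helper names;
# this is not the TileOps or TileLang implementation.

def matmul_slice(a, b, k_begin, k_end):
    """Multiply a (M x K) by b (K x N) using only k in [k_begin, k_end)."""
    m, n = len(a), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for k in range(k_begin, k_end):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


def split_k_matmul(a, b, split_k):
    """Compute a @ b as the sum of split_k partial products over K slices."""
    k_total = len(b)
    # Assumption for this sketch: K must divide evenly into split_k slices.
    if split_k <= 0 or k_total % split_k != 0:
        raise ValueError(f"K={k_total} is not divisible by split_k={split_k}")
    chunk = k_total // split_k
    partials = [
        matmul_slice(a, b, s * chunk, (s + 1) * chunk) for s in range(split_k)
    ]
    m, n = len(a), len(b[0])
    # Final reduction: element-wise sum of the partial results.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]
```

With `a = [[1, 2, 3, 4], [5, 6, 7, 8]]` and `b = [[1, 0], [0, 1], [1, 0], [0, 1]]`, `split_k_matmul(a, b, 2)` matches `matmul_slice(a, b, 0, 4)`. On a GPU the slices would run in parallel, which is why split-K tends to help when M and N are small relative to K.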

Rebase update

  • Rebased onto the latest upstream dev.
  • Updated the direct hgemm block scopes to the current TileLang T.sblock API.
  • Revalidated the paired TileOps/TileLang stack after the TileLang WSM contract fix.

Validation

  • Production-shape hgemm sweep passed correctness on all 8 covered shapes.
  • Current measured throughput range: 172.076209 to 205.316929 TFLOPS.
  • Minimum A100-relative ratio in the sweep: 89.68%.
  • A representative compiler-path WSM fallback guard passed correctness with the paired TileLang PR.
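
As a reading aid for the throughput figures, the sketch below shows how a GEMM TFLOPS number and a relative ratio are commonly derived. It assumes the usual convention of 2·M·N·K floating-point operations per GEMM; the PR does not state which convention or timing method the sweep used, and the helper names `gemm_tflops` and `relative_ratio_percent` are hypothetical.

```python
# Hypothetical helpers; the 2*M*N*K FLOP count is the common GEMM convention
# and is an assumption here, not something stated in this PR.

def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOPS for one M x N x K GEMM that took `seconds`."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    flops = 2.0 * m * n * k  # one multiply and one add per (i, j, k) triple
    return flops / seconds / 1e12


def relative_ratio_percent(measured_tflops: float, reference_tflops: float) -> float:
    """Measured throughput as a percentage of a reference device's throughput."""
    if reference_tflops <= 0:
        raise ValueError("reference throughput must be positive")
    return 100.0 * measured_tflops / reference_tflops
```

Under this convention, the minimum ratio in a sweep would be the smallest `relative_ratio_percent` value across the covered shapes, each computed against the reference measurement for the same shape.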

Notes

gemini-code-assist (bot) left a comment

Code Review

This pull request adds comprehensive support for MetaX C500 GPUs in GEMM operations, introducing specialized kernel paths such as BSM, split-K, and packed-B tile layouts. It includes several new kernel implementations, hardware-specific C++ headers, and auto-dispatch logic. The review feedback recommends defaulting compilation flags to enable MetaX-specific optimizations automatically on supported hardware and moving split-K validation logic to the initialization phase to prevent runtime crashes during tensor preparation.

Two review comment threads on tileops/kernels/gemm/gemm.py (both outdated)
VitalyAnkh force-pushed the clean-hgemm-164 branch 3 times, most recently from dfd55e4 to 351ac49, on May 20, 2026 13:03

VitalyAnkh commented May 21, 2026

Addressed in the current head.

  • _gemm_compile_flags now takes use_maca: Optional[bool] = None and defaults to runtime C500 detection when no explicit value is provided.
  • The TILEOPS_GEMM_SPLIT_K compatibility checks for block_k now run immediately after init_config, before the prepared-B path can hit a less helpful shape error.
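
The two changes above follow a familiar pattern, sketched here under stated assumptions. Everything in the block is hypothetical: `detect_c500`, the placeholder flag string, and the divisibility rule in `check_split_k` are stand-ins chosen for illustration, since the actual `_gemm_compile_flags` body and the exact `block_k` checks are not shown in this conversation.

```python
from typing import Callable, List, Optional


def detect_c500() -> bool:
    """Hypothetical stand-in for runtime C500 detection; always False here."""
    return False


def gemm_compile_flags(
    use_maca: Optional[bool] = None,
    detect: Callable[[], bool] = detect_c500,
) -> List[str]:
    """Tri-state flag: None means 'decide from the hardware at runtime'."""
    if use_maca is None:
        use_maca = detect()
    # Placeholder flag name, not a real compiler option.
    return ["-DHYPOTHETICAL_USE_MACA=1"] if use_maca else []


def check_split_k(k: int, block_k: int, split_k: int) -> None:
    """Fail early with a clear message instead of a later shape error.

    Assumed rule for this sketch: K must be a multiple of split_k * block_k.
    """
    if split_k < 1 or block_k < 1:
        raise ValueError("split_k and block_k must be positive")
    if k % (split_k * block_k) != 0:
        raise ValueError(
            f"K={k} is not a multiple of split_k*block_k={split_k * block_k}"
        )
```

An explicit `use_maca=False` still wins over detection, which keeps the override useful for debugging on C500 hardware. Running a check like `check_split_k` right after configuration means an incompatible setting is reported in terms of the values the user chose rather than a downstream tensor shape.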

I will continue monitoring this PR together with the companion TileLang PR.

Keep GemmOp auto/default dispatch on the TileLang GemmKernel and reject direct maca_hgemm/maca_auto backend overrides for new hgemm work.

Add the MACA BSM compiler path used by the packed-B split-K result, including prepared-B packing, prepared-B caching, split-K reduction, and C500 defaults.

Validation: git diff --cached --check; ./.venv/bin/python -m py_compile tileops/ops/gemm.py tileops/kernels/gemm/gemm.py tileops/kernels/gemm/maca_auto.py tests/ops/test_gemm_auto_dispatch.py