
[Metal] emit Metal builtins directly instead of CUDA-style threadIdx/blockIdx aliases #37

Open
apstenku123 wants to merge 404 commits into tile-ai:main from apstenku123:cppmega/metal-emit-builtins-directly

Conversation

@apstenku123

Summary

TVM-mirror companion to tile-ai/tilelang#2143.

This PR fixes the same bug in the TVM Metal codegen that lives at src/target/source/codegen_metal.cc in the TileLang/tvm fork. The supermodule TileLang patch fixes src/target/codegen_metal.cc, which is a thin wrapper; the actual codegen lives in this submodule.

The bug

TileLang's metal codegen emits CUDA-style threadIdx/blockIdx aliases:

uint3 blockIdx [[threadgroup_position_in_grid]],
uint3 threadIdx [[thread_position_in_threadgroup]],

and uses ((int)blockIdx.x) / ((int)threadIdx.x) in the body. These aliases are CUDA-compat affordances that don't fit Metal — they're pure overhead. Downstream consumers (e.g., cppmega.mlx) ended up post-processing MSL to canonicalize back to Metal builtins.

The fix

Emit Metal builtin names (threadgroup_position_in_grid, thread_position_in_threadgroup) directly as kernel parameters and body references. No intermediate alias variable, no (int) cast layer.
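The substitution the fix performs can be sketched in a few lines. This is illustrative only — the real change is in the C++ Metal codegen, and the `emit_param` helper and alias table here are hypothetical names, not TVM code:

```python
# Hypothetical sketch of the parameter-emission change (not the actual
# TVM C++ codegen). The fix drops the CUDA-style alias and uses the
# Metal builtin name as the parameter name itself.
CUDA_ALIAS_TO_METAL_BUILTIN = {
    "blockIdx": "threadgroup_position_in_grid",
    "threadIdx": "thread_position_in_threadgroup",
}

def emit_param(cuda_alias: str) -> str:
    builtin = CUDA_ALIAS_TO_METAL_BUILTIN[cuda_alias]
    # Old style:  uint3 blockIdx [[threadgroup_position_in_grid]]
    # New style:  builtin name doubles as the parameter name.
    return f"uint3 {builtin} [[{builtin}]]"

print(emit_param("blockIdx"))
```

Naming the parameter after the `[[...]]` attribute is valid MSL, so no intermediate alias variable is needed in the kernel body.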

Why this is the TVM-mirror half

src/target/source/codegen_metal.cc in TileLang/tvm is what actually emits the MSL. The supermodule TileLang has its own src/target/codegen_metal.cc which inherits from this. Both files need the change for end-to-end correctness — the supermodule PR (tile-ai/tilelang#2143) is incomplete on its own.

cppmega workaround as evidence

cppmega_mlx/nn/_tilelang/_msl_transform.py::_canonicalize_tilelang_builtin_aliases (and 4 helper functions) string-substitute the alias output back to Metal builtins. With this PR and tile-ai/tilelang#2143 landed, those helpers become no-ops and can be deleted.
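The shape of that workaround is roughly as follows. This is an illustrative reconstruction of the kind of string substitution the helper performs, not its actual code; the key point is that on already-canonical output it does nothing:

```python
import re

# Illustrative sketch of an alias-canonicalization pass like the one in
# cppmega_mlx/nn/_tilelang/_msl_transform.py (reconstruction, not its code).
_ALIASES = {
    "blockIdx": "threadgroup_position_in_grid",
    "threadIdx": "thread_position_in_threadgroup",
}

def canonicalize(msl: str) -> str:
    for alias, builtin in _ALIASES.items():
        # Drop the alias parameter declaration, then rewrite body
        # references (e.g. ((int)blockIdx.x)) to the builtin name.
        msl = re.sub(rf"uint3 {alias} \[\[{builtin}\]\],?\s*", "", msl)
        msl = msl.replace(alias, builtin)
    return msl

fixed = "kernel void k(uint3 threadgroup_position_in_grid [[threadgroup_position_in_grid]]) {}"
assert canonicalize(fixed) == fixed  # no-op on canonical output
```

Once the codegen emits builtins directly, every input to this pass is already canonical, so the pass (and its helpers) can be deleted.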

Test plan

cd TileLang/tvm
mkdir build && cd build
cmake .. && make -j tvm_runtime
# emit MSL from a tiny prim_func; assert "blockIdx_x" never appears, "threadgroup_position_in_grid" present

Pairing

LeiWang1999 and others added 30 commits October 22, 2025 21:29
…em raised in the codegen test for cuda (apache#18398)

* fix the 8-bit vector loads/stores so each lane is addressed using reinterpret_cast byte indexing instead of rolled bit packing, which could drop certain bits.

* fix clang format
…ry ops tests (apache#18400)

* finish1

* finish2

* finish3

* finish4
This PR addresses the issue where tvm.tir.exp does not support integer types (e.g., int32, int64), causing an InternalError during LLVM code generation.

The issue arises because the llvm.exp intrinsic expects floating-point inputs, but no type conversion is performed for integer inputs.

This change aligns the behavior of tir.exp with libraries like PyTorch and NumPy, which implicitly convert integer inputs to floating-point types for their exponential functions.

Fix apache#18381
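The convert-then-apply behavior described above can be illustrated outside TVM. A minimal sketch, assuming the fix inserts an integer-to-float cast before the floating-point exp intrinsic (the function name here is hypothetical):

```python
import math

def exp_like_tir(x):
    """Sketch of the fixed tir.exp behavior: integer inputs are cast to
    float before the floating-point exp intrinsic is applied, mirroring
    NumPy/PyTorch semantics. Illustrative only, not TVM's actual code."""
    if isinstance(x, int):
        x = float(x)  # the missing conversion the PR adds before llvm.exp
    return math.exp(x)

assert exp_like_tir(1) == math.exp(1.0)
assert isinstance(exp_like_tir(0), float)
```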
* Fixing database bug

* Fix lit gemini error
This PR bumps tvm-ffi to latest
…es (apache#18412)

* Replace relax.build with tvm.compile in export script

* Remove unnecessary print statement in export script

Remove print statement for skipping model conversion.

* Update output handling for TVM results
… importing ONNX model using Relax frontend (apache#18416)

[apache#18397] Fix bug: Unsupported numpy or ml_dtypes dtype('O') when importing ONNX model using Relax frontend

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
…t cases (apache#18419)

[apache#17640] Refactor: remove the deprecation warning from test cases

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
kurisu6912 and others added 27 commits January 16, 2026 15:08
- Added PrintIndent call in PrintSSAAssign to improve code formatting.
- Removed unnecessary scope management in VisitExpr_ for better clarity and performance.
- Increased the timeout limit in SetRLimit from 10,000 to 100,000 for improved performance.
- Added detailed logging in the CanProve method to trace the evaluation process and results of the Z3 solver.
- Reduced the timeout limit in SetRLimit from 100,000 to 10,000 for improved control over execution time.
- Fixed formatting inconsistencies in comments and code for better readability.
- Improved the handling of nested conditions in the if_then_else construct to prevent out-of-bounds access by combining outer select conditions.
- Added a stack to manage select conditions during code generation, ensuring proper evaluation order and safety.
- Updated comments for clarity and better understanding of the changes made.
…ation

- Introduced CountSatisfyingValues method in both Z3Prover implementations to count distinct integer values satisfying current constraints using Z3's model enumeration.
- Added detailed documentation for the new method, explaining parameters and return values.
- Implemented basic error handling for unsatisfiable conditions and minimum consecutive value requirements.
- Updated the Z3Prover interface to include the new method, ensuring compatibility with existing functionality.
- Updated the logic for handling read and write access in the VisitExpr_ method to treat read and write masks more conservatively.
- This change simplifies the access region detection process, allowing for better handling of common patterns like atomic read-modify-write without requiring manual annotations from users.
- Improved code clarity by restructuring conditional checks for read and write access updates.
- Added logic to eliminate bounded offsets in comparisons involving expressions of the form (base + offset) when offset is known to be within a specific range.
- Implemented helper functions to determine if expressions are multiples of a given factor and to simplify comparisons based on modular analysis.
- Updated tests to cover new simplification cases for aligned values, ensuring correctness of the new logic.
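The bounded-offset rewrite above has a simple arithmetic core: if `base` and the bound are both multiples of `factor` and `0 <= offset < factor`, then `(base + offset) < bound` is equivalent to `base < bound`. A minimal sketch of that equivalence (a model of the idea, not TVM's arith code; the function name is hypothetical):

```python
def simplify_lt(base: int, offset: int, bound: int, factor: int) -> bool:
    """Modular-analysis rewrite sketch: with base % factor == 0,
    bound % factor == 0, and 0 <= offset < factor, the comparison
    (base + offset) < bound reduces to base < bound."""
    assert base % factor == 0 and bound % factor == 0
    assert 0 <= offset < factor
    return base < bound  # offset eliminated

# Exhaustive check of the equivalence on a small range:
for base in range(0, 64, 8):
    for offset in range(8):
        assert simplify_lt(base, offset, 32, 8) == ((base + offset) < 32)
```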
…ctory handling

- Introduced a new helper function `_resolve_artifact_paths` in `nvcc.py` to streamline the management of temporary file paths for CUDA compilation.
- Enhanced the `TempDirectory` class in `utils.py` to ensure thread-safe creation of debug temporary directories, preventing race conditions in multi-process scenarios.
- Updated tests in `test_util.py` to validate the new debug directory handling and ensure robustness in concurrent environments.
- Introduced EnsureCurrentDeviceContext to ensure the correct CUDA context is set for the current thread before executing device-specific operations.
- Updated multiple methods in CUDAModuleNode and CUDAWrappedFunc to call this new function, enhancing thread safety and context management during multi-GPU execution.
* feat: add bfloat16x2 types

* fix: make less diff
- Added a check to validate that grid dimensions are positive before launching CUDA kernels, improving error handling for dynamic shapes that may result in zero dimensions.
- Simplified work size assignment in thread storage scope to remove unnecessary checks for dynamic shapes.
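The launch-time guard described above amounts to rejecting non-positive grid dimensions before the kernel is dispatched. A minimal sketch (illustrative; the real check lives in the CUDA runtime C++ code, and the function name is hypothetical):

```python
def validate_grid_dims(grid):
    """Reject kernel launches whose dynamically computed grid
    dimensions are not positive (sketch of the check, not TVM code)."""
    for axis, dim in zip(("x", "y", "z"), grid):
        if dim <= 0:
            raise ValueError(f"grid.{axis} must be positive, got {dim}")
    return True

assert validate_grid_dims((128, 1, 1))
```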
…nd (apache#30)

cuLaunchKernel is asynchronous and its return value does not capture
runtime errors such as illegal memory access. Add cudaPeekAtLastError()
after the launch to detect these errors, matching the Cython backend's
TILELANG_CHECK_LAST_ERROR behavior.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Add tfloat32 datatype

* fix: change tfloat32 type code to 130

* minor fix
* Fix TVMDerivedObject slots for apache-tvm-ffi compatibility

Add __slots__ = ("_inst", "__weakref__") to the dynamically created
TVMDerivedObject class inside the derived_object decorator.

The class inherits from CObject (apache-tvm-ffi), a C extension type with
__slots__ = () and no instance __dict__. Without explicit __slots__,
setting self._inst in __init__ raises AttributeError, and weakref.ref(self)
fails because __weakref__ is not available.

Root cause: tilelang migrated from a custom TVM fork to apache-tvm-ffi
(October 2025). The old fork's Object type allowed arbitrary instance
attributes; the new CObject does not.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix TVMDerivedObject slots in meta_schedule/utils.py duplicate

Apply the same __slots__ fix to the second copy of derived_object in
meta_schedule/utils.py. Most @derived_object users (LocalRunner,
LocalBuilder, cost models, etc.) import from this copy, not
runtime/support.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
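The `__slots__` failure mode described in that commit can be reproduced in pure Python by mimicking a base type with empty `__slots__` and no instance `__dict__` (the class names here are stand-ins, not tvm-ffi's actual types):

```python
import weakref

class CObjectLike:
    """Stand-in for a base type with no instance __dict__ or __weakref__."""
    __slots__ = ()

class Broken(CObjectLike):
    __slots__ = ()  # no slot for _inst -> assignment fails
    def __init__(self):
        self._inst = object()  # raises AttributeError

class Fixed(CObjectLike):
    __slots__ = ("_inst", "__weakref__")  # the fix from the commit
    def __init__(self):
        self._inst = object()

f = Fixed()
assert weakref.ref(f)() is f  # __weakref__ slot makes weakrefs work
```

Without the explicit `__slots__`, `Broken()` raises AttributeError on the `self._inst` assignment, which is exactly the symptom the commit describes.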
Copilot AI review requested due to automatic review settings May 4, 2026 10:05

Copilot AI left a comment


Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

apstenku123 added a commit to DatasunriseOU/cppmega_mlx that referenced this pull request May 4, 2026
…, #37/#38/#39)

Three parallel agents completed the supermodule/submodule split filing:

1. tilelang_metal_fp8 (storage-only FP8 emulation) split:
   - 0001-tilelang-metal-fp8-storage-only.patch — supermodule half (235 lines)
   - 0002-tvm-metal-fp8-storage-only.patch — TVM-mirror half (260 lines, prefix stripped)
   - PR tile-ai/tilelang#2144 (supermodule, stacks on PR #2130)
   - PR tile-ai/tvm#38 (TVM mirror, base tilelang_main @ 0e15b274)

2. tilelang_metal_fp8_vector (vector cast lanes 2/3/4) split:
   - 0001-tilelang-metal-fp8-vector-cast.patch — supermodule half (148 lines)
   - 0002-tvm-metal-fp8-vector-cast.patch — TVM-mirror half (151 lines)
   - PR tile-ai/tilelang#2145 (supermodule, depends on #2144)
   - PR tile-ai/tvm#39 (TVM mirror, depends on #38)

3. PR #2143 TVM-mirror companion:
   - PR tile-ai/tvm#37 — already filed, README updated to link both halves

Total filed today: 11 PRs across 3 repos
- 1 ml-explore/mlx (#3476)
- 1 apache/tvm (#19504)
- 6 tile-ai/tilelang (#2139, #2140, #2141, #2142, #2143 super, #2144 super, #2145 super)
- 3 tile-ai/tvm (#37, #38, #39 — TVM-mirror companions)

PR #2142 (T.fp8_scaled_matmul) has no TVM-mirror companion needed —
verified the patch only touches supermodule files.

All splits round-trip clean (apply forward + reverse) on their respective
bases. README files in each docs/upstream/<dir>/ updated with PR URLs and
dependency-chain diagrams.

Note: TileLang/tvm redirects to tile-ai/tvm server-side (canonical org
slug). All TVM-mirror PRs land at tile-ai/tvm/pull/N URLs.