[MetaxGPU][quantize] Use portable C++ MXFP4 dequant on Maca #114

Open: ventijing wants to merge 13 commits into tile-ai:dev from ventijing:dev_dequantize
Conversation

@ventijing

@ventijing ventijing commented Jun 22, 2026

Collaborator

Maca mxcc rejects PTX inline asm in decode_fp4_to_bf16_twiddling.
Select the existing portable C++ path for Maca and gfx950, and infer
target from determine_target("auto") when not passed explicitly.

Summary by CodeRabbit

  • Bug Fixes

    • Improved MXFP dequantization compatibility by automatically choosing a portable dequantization path on targets that can’t run CUDA PTX inline assembly, fixing correctness for affected AMD/MACA configurations.
    • Refined target selection so the MXFP implementation matches the active hardware capability set.
  • Chores

    • Updated regression discovery to run benchmark driver scripts from the top-level examples directory instead of only examples/maca.
    • Improved MACA CI setup by fully removing any existing flash_linear_attention/flash-linear-attention installs (including related metadata) before installing the specified MetaX version.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

tilelang/quantize/mxfp.py adds two helpers—_resolve_mxfp_target() and _target_uses_portable_mxfp_dequant()—and uses them to replace the ad-hoc AMD try/except detection in get_mxfp_intrin_group() with a generalized portable-path switch covering both maca and gfx950 targets. Imports are expanded accordingly. In parallel, maint/scripts/regression_all.py broadens regression test discovery to scan the full examples directory instead of only the maca subdirectory. The CI workflow fully resets and updates the MACA-specific toolkit setup to use a build-variant-pinned version of flash_linear_attention.

Changes

Portable MXFP dequant target detection

| Layer / File(s) | Summary |
| --- | --- |
| Target resolution and portable-path helpers (tilelang/quantize/mxfp.py) | Imports expanded to include Target, determine_target, and target_is_gfx950; _resolve_mxfp_target() selects the active target from explicit input, Target.current(), or determine_target("auto"); _target_uses_portable_mxfp_dequant() returns True for maca and gfx950 targets. |
| get_mxfp_intrin_group portable routing (tilelang/quantize/mxfp.py) | Replaces the _is_gfx950 try/except block with _use_portable from the new helpers; portable mode accepts only source_bit=4 + out_dtype=bfloat16, routing to HIP twiddling or simple FP4→BF16 sources; other combinations raise AssertionError. |
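
The routing summarized above can be illustrated with a minimal, self-contained sketch. It does not reproduce the code in tilelang/quantize/mxfp.py: `FakeTarget`, `uses_portable_dequant`, `select_dequant_path`, the `"hip"` kind name, and the returned path labels are all hypothetical stand-ins chosen for illustration. Only the rule itself (maca and gfx950 take the portable path, which accepts only source_bit=4 with bfloat16 output and otherwise raises AssertionError) comes from the walkthrough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTarget:
    """Hypothetical stand-in for a compilation target (not the real Target class)."""
    kind: str       # assumed kind names such as "cuda", "hip", "maca"
    arch: str = ""  # assumed arch strings such as "gfx950"


def uses_portable_dequant(target: FakeTarget) -> bool:
    """Return True for targets that cannot compile CUDA PTX inline asm."""
    if target.kind == "maca":
        return True
    return target.kind == "hip" and target.arch == "gfx950"


def select_dequant_path(target: FakeTarget, source_bit: int, out_dtype: str) -> str:
    """Pick a dequant implementation label (labels are illustrative only)."""
    if not uses_portable_dequant(target):
        return "ptx_inline_asm"
    # The portable path is described as supporting only FP4 -> BF16.
    if not (source_bit == 4 and out_dtype == "bfloat16"):
        raise AssertionError(
            "Portable MXFP dequant only supports source_bit=4 and "
            f"out_dtype=bfloat16, got source_bit={source_bit}, out_dtype={out_dtype}"
        )
    return "portable_cpp"
```

Under these assumptions, `select_dequant_path(FakeTarget("maca"), 4, "bfloat16")` selects the portable path, while requesting `source_bit=8` on the same target raises AssertionError instead of silently falling back to the PTX path.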

Expanded regression testing scope

| Layer / File(s) | Summary |
| --- | --- |
| Examples root directory expansion (maint/scripts/regression_all.py) | _examples_root() now points to the repository-wide examples directory instead of examples/maca, changing which benchmark driver files are discovered and executed by regression_all(). |

CI MACA toolkit dependency update

| Layer / File(s) | Summary |
| --- | --- |
| MACA toolkit dependency setup (.github/workflows/ci.yml) | The CI workflow's MACA-specific install path fully resets any existing flash_linear_attention artifacts, then installs flash_linear_attention pinned to 0.4.0+metax3.7.2.0torch2.8 using the same MetaX index and trusted-host settings. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A rabbit once hopped past a PTX gate,
And found that some targets just couldn't relate.
So helpers were dug — two burrows so neat —
To sniff out the targets and choose the right street.
Now maca and gfx950 skip the asm maze,
While broader examples light regression days! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: enabling portable C++ MXFP4 dequantization on Maca instead of PTX inline assembly, which directly aligns with the core purpose of the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ventijing

Collaborator Author

@regression-perf

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tilelang/quantize/mxfp.py`:
- Around line 174-182: The function _target_uses_portable_mxfp_dequant currently
calls target_is_gfx950() unconditionally on all non-maca targets at the final
return statement. Since target_is_gfx950() is a ROCm-specific predicate and may
fail or reject non-ROCm targets (cuda, llvm, metal, etc.), add a guard check
before the target_is_gfx950() call to verify the target kind is ROCm-related.
Only invoke target_is_gfx950(target) if the target is actually a ROCm target;
otherwise return False to maintain safe failure behavior and prevent errors when
non-ROCm targets are passed from the call site.
- Around line 165-171: In the `_resolve_mxfp_target` function, modify the first
condition that currently checks `if target is not None:` to also exclude the
string "auto" by changing it to `if target is not None and target != "auto":`.
This ensures that when "auto" is passed as the target argument, it is not
returned directly but instead falls through to the `determine_target("auto",
return_object=True)` call on the last line, which properly resolves it to a
Target object rather than passing the unresolved "auto" string downstream.
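
The two findings above can be sketched together in a small, self-contained example. It is not the project's code: `StubTarget`, `stub_is_gfx950`, `resolve_target`, and `uses_portable_dequant` are hypothetical stand-ins, the `"hip"` kind name is an assumption, and the real `Target.current()` and `determine_target("auto", return_object=True)` calls are replaced by plain parameters so the sketch runs without the library.

```python
from typing import Callable, Optional, Union


class StubTarget:
    """Hypothetical stand-in for the project's Target object."""

    def __init__(self, kind: str, arch: str = "") -> None:
        self.kind = kind  # assumed kind names: "cuda", "hip", "maca", ...
        self.arch = arch  # assumed arch strings such as "gfx950"


def stub_is_gfx950(target: StubTarget) -> bool:
    """Stand-in for a ROCm-only predicate that rejects other target kinds."""
    if target.kind != "hip":
        raise ValueError(f"not a ROCm target: {target.kind}")
    return target.arch == "gfx950"


def resolve_target(
    target: Union[StubTarget, str, None],
    current: Optional[StubTarget],
    auto_detect: Callable[[], StubTarget],
) -> Union[StubTarget, str]:
    # Treat "auto" like "not specified" so it is never returned unresolved.
    if target is not None and target != "auto":
        return target
    # Next preference: the target already active in the current scope.
    if current is not None:
        return current
    # Last resort: detect the target from the environment.
    return auto_detect()


def uses_portable_dequant(target: StubTarget) -> bool:
    if target.kind == "maca":
        return True
    # Guard: only ask the ROCm-specific predicate about ROCm targets.
    if target.kind != "hip":
        return False
    return stub_is_gfx950(target)
```

With the guard in place, a CUDA target returns False without ever reaching the ROCm-only predicate, and passing `"auto"` yields a resolved target object rather than the raw string.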
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 29bad3d8-7a97-4419-99cd-3ebbfae39c18

📥 Commits

Reviewing files that changed from the base of the PR and between f5af19d and 1d335ad.

📒 Files selected for processing (1)
  • tilelang/quantize/mxfp.py

Comment thread tilelang/quantize/mxfp.py
Comment thread tilelang/quantize/mxfp.py Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@maint/scripts/regression_all.py`:
- Line 60: The merge logic using `if k not in merged` silently skips duplicate
benchmark names, which can hide regressions when the expanded examples scope
encounters multiple benchmarks with the same name across different drivers.
Modify the merge operation in the regression_all() function to detect when a key
already exists in the merged dictionary and raise an exception or error with
details about the collision, rather than silently ignoring the duplicate entry.
Alternatively, namespace the keys by including the source file path along with
the benchmark name to ensure uniqueness while preserving all results.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a41f2eaf-ff55-4680-b140-e7218c64cc67

📥 Commits

Reviewing files that changed from the base of the PR and between 1d335ad and 1a6a8cb.

📒 Files selected for processing (1)
  • maint/scripts/regression_all.py


 def _examples_root() -> Path:
-    return Path(__file__).resolve().parents[2] / "examples" / "maca"
+    return Path(__file__).resolve().parents[2] / "examples"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Expanded examples scope can silently drop benchmark results due to name collisions.

By switching discovery to the repo-wide examples tree (Line 60), regression_all() now has a much higher chance of encountering duplicate benchmark names across different drivers. The current merge logic (if k not in merged) silently ignores later entries, which can hide regressions and undercount totals.

A safer approach is to either fail on duplicate names or namespace keys by file path when merging.

Suggested fix (fail fast on duplicate benchmark names)
@@
-        for k, v in parsed.items():
-            if k not in merged:
-                merged[k] = v
-                _RESULTS.append(PerfResult(name=k, latency=v))
+        for k, v in parsed.items():
+            if k in merged:
+                failures.append(
+                    f"{rel_path}\nDuplicate benchmark name detected: {k!r}. "
+                    "Benchmark names must be globally unique across discovered drivers."
+                )
+                continue
+            merged[k] = v
+            _RESULTS.append(PerfResult(name=k, latency=v))
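
The other option mentioned in this comment, namespacing keys by file path, could look roughly like the sketch below. It is an illustration only: `merge_namespaced`, the `::` separator, and the input shape (pairs of relative driver path and parsed results) are assumptions and do not reflect how regression_all() is actually structured.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, Tuple


def merge_namespaced(
    per_driver: Iterable[Tuple[str, Dict[str, float]]],
) -> Dict[str, float]:
    """Merge per-driver latency results, prefixing each name with its driver path."""
    merged: Dict[str, float] = {}
    for rel_path, parsed in per_driver:
        # Drop the ".py" suffix so keys read like "examples/gemm/bench::name".
        prefix = PurePosixPath(rel_path).with_suffix("").as_posix()
        for name, latency in parsed.items():
            merged[f"{prefix}::{name}"] = latency
    return merged
```

Compared with failing fast, this keeps every result when two drivers reuse a benchmark name, at the cost of longer keys that downstream reports would need to handle.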
Comment thread tilelang/quantize/mxfp.py
-            f"AMD gfx950 MXFP dequant only supports source_bit=4 and out_dtype=bfloat16, "
+            f"Portable MXFP dequant only supports source_bit=4 and out_dtype=bfloat16, "
             f"got source_bit={source_bit}, out_dtype={out_dtype}"
         )

Collaborator

Add an if branch here, like the one for gfx950, to resolve the same issue for the maca target.

2 participants