[MetaxGPU][testing] Add missing double2 ->e8m0x2 conversion in fp8.h by xxxzh99 · Pull Request #101 · tile-ai/tilelang-metax

xxxzh99 · 2026-06-05T08:42:57Z

Add complete bidirectional conversion kernels between FP8-E8M0 and float/bfloat16 scalar/vector types for MACA device. These TL_DEVICE helper routines support quantization/dequantization used in low-precision grouped-GEMM and MoE inference, covering scalar, dual-element vector and double-precision fallback paths with proper exponent rounding rules to guarantee numerical correctness during type cast.

Summary by CodeRabbit

New Features
- Added FP8 E8M0 conversion support for float, bfloat16, and double, including packed/vectorized x2 variants.
Bug Fixes
- FP8 vectorized-cast test updated to run as a normal pass/fail test (no longer marked expected-to-fail).

github-actions · 2026-06-05T08:43:07Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2026-06-05T08:43:10Z

Warning

Review limit reached

@xxxzh99, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 26 minutes and 14 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5af93583-3d93-40da-9580-7c9e80df1aee

📥 Commits

Reviewing files that changed from the base of the PR and between 85cdc15 and f1ccc05.

📒 Files selected for processing (2)

src/tl_templates/maca/maca_fp8.h
testing/maca/language/test_tilelang_language_vectorized_cast.py

📝 Walkthrough

Walkthrough

Adds exponent-only E8M0 FP8 encode/decode helpers (scalar and packed x2) for float, bfloat16, and double in the MACA template header; removes the xfail marker from the FP8 vectorized-cast test; and advances the 3rdparty/tvm submodule pointer.

Changes

E8M0 Conversion Functions

Layer / File(s)	Summary
E8M0 conversion implementations `src/tl_templates/maca/maca_fp8.h`	Adds scalar decode helpers converting an E8M0 byte to `float` and `bfloat16`, packed `x2` decode for `bfloat16x2`, and scalar/packed `x2` encode helpers for `float`/`double`/`bfloat16` into E8M0 and E8M0x2.
FP8 vectorized-cast test enablement `testing/maca/language/test_tilelang_language_vectorized_cast.py`	Removed the `@tilelang.testing.pytest.mark.xfail` decorator so `test_vectorized_cast_fp8` runs as a normal pass/fail test.

Third-party Submodule

Layer / File(s)	Summary
tvm submodule pointer update `3rdparty/tvm`	Advanced the `3rdparty/tvm` git submodule commit hash to a newer revision.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🐰 E8M0 hops into the stream,
bytes become exponents, a tiny dream.
Tests shed their xfail cloak and leap,
Submodules nudged, the tree won't sleep.
Hop, convert, and quietly keep.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title mentions adding a missing double2 conversion, which is partially accurate but incomplete; the changeset actually adds 9 new conversion functions covering multiple scalar and vector types (float, bfloat16, double across scalar, x2, and paired variants), plus updates a test decorator and submodule. The title focuses on one specific conversion rather than the main scope of changes.	Revise the title to reflect the broader scope, such as 'Add FP8-E8M0 conversion kernels for scalar and vector types' or 'Implement bidirectional FP8-E8M0 conversions and enable vectorized-cast test'.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/tl_templates/maca/maca_fp8.h (1)

669-674: ⚡ Quick win

Redundant cast in double2 conversion.

Lines 670-671 cast src.x and src.y to float before passing to __tl_cvt_double_to_e8m0, but that function already casts double to float internally (line 665). The explicit casts are unnecessary.

♻️ Proposed cleanup

 TL_DEVICE __maca_fp8x2_storage_t __tl_cvt_double2_to_e8m0x2(const double2 src) {
-  unsigned char lo = __tl_cvt_double_to_e8m0(static_cast<float>(src.x));
-  unsigned char hi = __tl_cvt_double_to_e8m0(static_cast<float>(src.y));
+  unsigned char lo = __tl_cvt_double_to_e8m0(src.x);
+  unsigned char hi = __tl_cvt_double_to_e8m0(src.y);
   return static_cast<__maca_fp8x2_storage_t>(lo) |
          (static_cast<__maca_fp8x2_storage_t>(hi) << 8);
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tl_templates/maca/maca_fp8.h` around lines 669 - 674, The two explicit
static_cast<float> calls in __tl_cvt_double2_to_e8m0x2 are redundant because
__tl_cvt_double_to_e8m0 already handles double-to-float conversion; remove the
casts around src.x and src.y and pass them directly to __tl_cvt_double_to_e8m0
so the function reads __tl_cvt_double_to_e8m0(src.x) and
__tl_cvt_double_to_e8m0(src.y), leaving the rest of __tl_cvt_double2_to_e8m0x2
unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tl_templates/maca/maca_fp8.h`:
- Around line 645-653: In __tl_cvt_float_to_e8m0 change the rounding and
overflow guard: replace the current "if (mantissa > 0 && exponent < 0xFF) {
exponent += 1; }" with a round-half-up test using the halfway mantissa threshold
(mantissa >= 0x400000) and only increment when exponent < 0xFE to avoid
promoting large finite exponents to 0xFF (infinity/NaN); i.e., test mantissa >=
0x400000 and exponent < 0xFE before doing exponent += 1.

---

Nitpick comments:
In `@src/tl_templates/maca/maca_fp8.h`:
- Around line 669-674: The two explicit static_cast<float> calls in
__tl_cvt_double2_to_e8m0x2 are redundant because __tl_cvt_double_to_e8m0 already
handles double-to-float conversion; remove the casts around src.x and src.y and
pass them directly to __tl_cvt_double_to_e8m0 so the function reads
__tl_cvt_double_to_e8m0(src.x) and __tl_cvt_double_to_e8m0(src.y), leaving the
rest of __tl_cvt_double2_to_e8m0x2 unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06f0ca5b-f1fc-4ae9-974e-bfe911c3efef

📥 Commits

Reviewing files that changed from the base of the PR and between f6498bf and dd41f5e.

📒 Files selected for processing (2)

src/tl_templates/maca/maca_fp8.h
testing/maca/language/test_tilelang_language_vectorized_cast.py

💤 Files with no reviewable changes (1)

testing/maca/language/test_tilelang_language_vectorized_cast.py

coderabbitai · 2026-06-05T08:47:45Z

+TL_DEVICE unsigned char __tl_cvt_float_to_e8m0(const float src) {
+  unsigned int bits = *reinterpret_cast<const unsigned int *>(&src);
+  unsigned int exponent = (bits >> 23) & 0xFF;
+  unsigned int mantissa = bits & 0x7FFFFF;
+  if (mantissa > 0 && exponent < 0xFF) {
+    exponent += 1;
+  }
+  return static_cast<unsigned char>(exponent);
+}


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Rounding logic rounds up too aggressively and can overflow to infinity.

Lines 649-651 increment the exponent whenever mantissa is non-zero. This causes two issues:

Overly aggressive rounding: The condition mantissa > 0 rounds up even tiny mantissa values. For proper round-half-up in logarithmic (exponent-only) space, use the halfway threshold: mantissa >= 0x400000.

Overflow to infinity: When exponent == 0xFE (254) and mantissa is non-zero, incrementing produces 0xFF (255), which represents infinity or NaN in IEEE-754. Large finite values incorrectly become infinity after quantization.

🛠️ Proposed fix

TL_DEVICE unsigned char __tl_cvt_float_to_e8m0(const float src) { unsigned int bits = *reinterpret_cast<const unsigned int *>(&src); unsigned int exponent = (bits >> 23) & 0xFF; unsigned int mantissa = bits & 0x7FFFFF; - if (mantissa > 0 && exponent < 0xFF) { + // Round half up: increment exponent if mantissa >= 0.5, clamping to avoid overflow to infinity + if (mantissa >= 0x400000 && exponent < 0xFE) { exponent += 1; } return static_cast<unsigned char>(exponent); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

TL_DEVICE unsigned char __tl_cvt_float_to_e8m0(const float src) {

unsigned int bits = *reinterpret_cast<const unsigned int *>(&src);

unsigned int exponent = (bits >> 23) & 0xFF;

unsigned int mantissa = bits & 0x7FFFFF;

if (mantissa > 0 && exponent < 0xFF) {

exponent += 1;

}

return static_cast<unsigned char>(exponent);

}

TL_DEVICE unsigned char __tl_cvt_float_to_e8m0(const float src) {

unsigned int bits = *reinterpret_cast<const unsigned int *>(&src);

unsigned int exponent = (bits >> 23) & 0xFF;

unsigned int mantissa = bits & 0x7FFFFF;

// Round half up: increment exponent if mantissa >= 0.5, clamping to avoid overflow to infinity

if (mantissa >= 0x400000 && exponent < 0xFE) {

exponent += 1;

}

return static_cast<unsigned char>(exponent);

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/tl_templates/maca/maca_fp8.h` around lines 645 - 653, In __tl_cvt_float_to_e8m0 change the rounding and overflow guard: replace the current "if (mantissa > 0 && exponent < 0xFF) { exponent += 1; }" with a round-half-up test using the halfway mantissa threshold (mantissa >= 0x400000) and only increment when exponent < 0xFE to avoid promoting large finite exponents to 0xFF (infinity/NaN); i.e., test mantissa >= 0x400000 and exponent < 0xFE before doing exponent += 1.

…p8.h

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

xxxzh99 force-pushed the fp8e8m0 branch 2 times, most recently from 85cdc15 to f1ccc05 Compare June 5, 2026 10:19

[MetaxGPU][testing] Add missing double2 ->e8m0x2 conversion in maca_f…

970fd22

…p8.h

xxxzh99 force-pushed the fp8e8m0 branch from f1ccc05 to 970fd22 Compare June 12, 2026 07:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MetaxGPU][testing] Add missing double2 ->e8m0x2 conversion in fp8.h#101

[MetaxGPU][testing] Add missing double2 ->e8m0x2 conversion in fp8.h#101
xxxzh99 wants to merge 1 commit into
tile-ai:devfrom
xxxzh99:fp8e8m0

xxxzh99 commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xxxzh99 commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xxxzh99 commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading