Remove the stale MACA fusedmoe xfail and synchronize cython multi-stream JIT coverage #47

Open
VitalyAnkh wants to merge 2 commits into tile-ai:dev from VitalyAnkh:vitaly/fusedmoe-stale-xfail


@VitalyAnkh VitalyAnkh commented Apr 22, 2026

Collaborator

Problem

The MACA fusedmoe test is still marked xfail, even though it now passes on the current dev baseline. That stale marker turns a normal success into an XPASS and makes the MACA test suite harder to read.
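As a rough illustration of why a stale marker is misleading, the sketch below models how pytest reports a test depending on whether it carries an xfail marker. The function classify_outcome and its parameters are hypothetical names invented for this example; only the outcome labels (passed, failed, xfailed, xpassed) and the strict behaviour follow pytest's documented semantics.

```python
def classify_outcome(test_passed: bool, marked_xfail: bool, strict: bool = False) -> str:
    """Return the outcome label pytest would report (illustrative model only)."""
    if not marked_xfail:
        # Unmarked tests report exactly what happened.
        return "passed" if test_passed else "failed"
    if not test_passed:
        # The expected failure occurred.
        return "xfailed"
    # The test passed despite the marker: XPASS, or a failure in strict mode.
    return "failed" if strict else "xpassed"


# A passing test with a stale marker shows up as "xpassed" instead of "passed".
print(classify_outcome(test_passed=True, marked_xfail=True))
print(classify_outcome(test_passed=True, marked_xfail=False))
```

Dropping the marker moves the fusedmoe test from the xpassed row to the passed row, which matches the before and after results listed under Verification.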

While validating this cleanup on the upstream self-hosted MACA runner, the CI also exposed an unrelated instability in testing/python/jit/test_tilelang_jit_gemm_cython.py: the multi-stream coverage launched work on auxiliary streams but did not wait for those streams to finish before the following dynamic-shape test reused GPU state.

What this PR changes

  • removes the obsolete xfail marker from examples/maca/fusedmoe/test_example_fusedmoe.py
  • synchronizes the auxiliary streams at the end of run_cython_kernel_multi_stream() in testing/python/jit/test_tilelang_jit_gemm_cython.py

Solution

The fusedmoe change lets the test report its true outcome again.

The Cython test change keeps the existing multi-stream coverage, but makes the helper wait for the side streams it launches. This is the minimal fix: the kernel launches remain asynchronous, yet the test no longer leaves in-flight work behind for the next case.
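A minimal sketch of the pattern, not the actual body of run_cython_kernel_multi_stream(): the helper names synchronize_all and run_on_side_streams, and the launch callback, are assumptions made for illustration. It relies on torch.cuda.Stream, the torch.cuda.stream context manager, and Stream.synchronize(). To the best of my understanding, leaving the torch.cuda.stream context only restores the previous current stream and does not wait for queued work, which is why the explicit wait matters.

```python
def synchronize_all(streams):
    """Block the host until every stream has finished its queued work."""
    for stream in streams:
        stream.synchronize()


def run_on_side_streams(launch, num_streams=4):
    """Launch work on auxiliary CUDA streams, then wait for all of them.

    `launch` is a hypothetical zero-argument callable that enqueues a kernel
    on whatever stream is current when it is called.
    """
    import torch  # imported lazily so this module loads without PyTorch

    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    try:
        for stream in streams:
            # Work enqueued inside this block goes to `stream` and stays
            # asynchronous with respect to the host.
            with torch.cuda.stream(stream):
                launch()
    finally:
        # Do not return while side-stream work is still in flight.
        synchronize_all(streams)
```

Putting the wait in a finally block is a design choice of this sketch: it keeps a failing launch from leaving work outstanding for the next test.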

Alternatives considered

One option was to keep the stale xfail until a broader cleanup pass. That would have preserved misleading test output for no technical benefit.

Another option was to retry CI and treat the Cython failure as an unrelated flake. That would have left a real ordering bug in the test helper, and the same runner could fail again on the next PR.

A cleaner review boundary would be to submit the JIT test fix separately. I did not do that here because the failure blocked validation of this PR on the upstream MACA runner and the fix is small, local, and directly tied to getting this branch green.

Verification

  • python -m pytest -q -rX examples/maca/fusedmoe/test_example_fusedmoe.py on origin/dev -> 1 xpassed
  • python -m pytest -q examples/maca/fusedmoe/test_example_fusedmoe.py on this branch -> 1 passed
  • pre-commit run --files examples/maca/fusedmoe/test_example_fusedmoe.py
  • pre-commit run --files testing/python/jit/test_tilelang_jit_gemm_cython.py
  • exercised the branch version of testing/python/jit/test_tilelang_jit_gemm_cython.py against the built dev environment, including the CI order test_cython_kernel_multi_stream() -> test_cython_dynamic_shape()
  • repeated test_cython_kernel_multi_stream() -> test_cython_dynamic_shape() 20 times without reproducing the MACA mismatch
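One possible way to script the repeated ordered run, shown only as a sketch: the PR does not say how the 20 repetitions were driven, so build_pytest_command and repeat_until_failure are hypothetical helpers. The sketch assumes pytest runs node IDs in the order they appear on the command line, and each repetition runs both tests inside a single pytest process so the second test sees whatever GPU state the first one left behind.

```python
import subprocess
import sys

TEST_FILE = "testing/python/jit/test_tilelang_jit_gemm_cython.py"
ORDERED_TESTS = ["test_cython_kernel_multi_stream", "test_cython_dynamic_shape"]


def build_pytest_command(test_file, test_names):
    """Build a pytest invocation that selects the tests in the given order."""
    node_ids = [f"{test_file}::{name}" for name in test_names]
    return [sys.executable, "-m", "pytest", "-q", *node_ids]


def repeat_until_failure(command, repeats=20):
    """Run `command` up to `repeats` times.

    Returns the 1-based index of the first failing run, or None if every run
    exits with status 0.
    """
    for attempt in range(1, repeats + 1):
        if subprocess.run(command).returncode != 0:
            return attempt
    return None


if __name__ == "__main__":
    first_failure = repeat_until_failure(build_pytest_command(TEST_FILE, ORDERED_TESTS))
    print("no failures" if first_failure is None else f"failed on run {first_failure}")
```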

@github-actions


👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@VitalyAnkh

Collaborator Author

The previous MACA test failure was unrelated to the fusedmoe xfail removal itself. The failing case was testing/python/jit/test_tilelang_jit_gemm_cython.py::test_cython_dynamic_shape, and the root cause was that the preceding multi-stream coverage launched work on auxiliary streams without waiting for completion before the next test reused GPU state.

This update keeps the coverage intact and adds an explicit stream synchronization at the end of the helper so the later dynamic-shape case no longer inherits outstanding asynchronous work.

@VitalyAnkh VitalyAnkh changed the title from "Remove the stale xfail from the MACA fusedmoe test" to "Remove the stale MACA fusedmoe xfail and synchronize cython multi-stream JIT coverage" Apr 22, 2026
stream = torch.cuda.Stream()
streams = [torch.cuda.Stream() for _ in range(4)]
for stream in streams:
    with torch.cuda.stream(stream):

Collaborator

There is an implicit sync in the with context. Is the explicit sync required?

# side streams to finish before the test releases their tensors or the next
# test allocates new buffers on the default stream.
for stream in streams:
    stream.synchronize()

Collaborator

Why not write this directly after matmul_kernel?
