Adapt DeepSeek V32 examples to MACA-safe barriers, head tiling, and fallback paths #36

Open
VitalyAnkh wants to merge 3 commits into tile-ai:dev from VitalyAnkh:vitaly/deepseek-v32-maca-compat

Conversation

VitalyAnkh (Collaborator) commented Apr 9, 2026

Fixes #30.

Review note

This branch now carries only the shared MACA backend prerequisite from #33 plus the DeepSeek V32 changes below. It no longer includes the unrelated example updates from #34 or #35.

Problem

The DeepSeek V32 examples assumed CUDA-specific execution behaviour in several places, including barrier handling, head partitioning, TMA-oriented forward paths, and vector-atomic usage in backward kernels.

What this PR changes

  • replaces the CUDA-style histogram reset assumptions in topk_selector with a MACA-safe reset strategy
  • adapts sparse MLA forward and pipelined forward paths to use MACA-safe head tiling and fallback execution paths
  • changes the backward path to use a supported atomic formulation on MACA
  • keeps the follow-up correctness fixes for replicated-head stride and the histogram sentinel slot in the same kernel family
  • adds regression coverage for the MACA-specific edge cases identified during review
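
As one hedged illustration of the histogram-reset and sentinel-slot items above, the pure-Python simulation below shows how a reset that clears one slot per thread can leave a trailing sentinel slot stale, and how a strided reset avoids that. It is not the PR's kernel code: `RADIX`, the `RADIX + 1` layout, and both function names are assumptions made for this sketch, and the actual failure mode in topk_selector may differ.

```python
RADIX = 256  # assumed bucket count; hypothetical, not taken from the PR


def reset_one_slot_per_thread(hist, num_threads):
    """Simulate a reset in which thread `tid` clears only hist[tid]."""
    for tid in range(min(num_threads, len(hist))):
        hist[tid] = 0


def reset_strided(hist, num_threads):
    """Simulate a reset in which each thread strides over the whole buffer."""
    for tid in range(num_threads):
        for slot in range(tid, len(hist), num_threads):
            hist[slot] = 0


# RADIX buckets plus one trailing sentinel slot (assumed layout).
stale = [7] * (RADIX + 1)
reset_one_slot_per_thread(stale, num_threads=RADIX)
sentinel_left_stale = stale[RADIX] != 0  # the sentinel was never touched

clean = [7] * (RADIX + 1)
reset_strided(clean, num_threads=RADIX)
fully_cleared = all(v == 0 for v in clean)
```

Under these assumptions, the one-slot-per-thread reset clears the `RADIX` buckets but leaves the sentinel untouched, while the strided reset clears the whole buffer regardless of how the thread count compares to the buffer length.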

Solution

The PR keeps the DeepSeek V32 examples intact at the algorithmic level, but rewrites the execution assumptions that were specific to CUDA. The MACA path now partitions heads according to MACA-safe tile sizes, clears the histogram state fully, and avoids unsupported synchronization or atomic behaviour.
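
To make the head-partitioning idea concrete, here is a minimal sketch of splitting a head dimension into equal tiles that stay under a backend limit. The function name `partition_heads` and the `max_tile` parameter are hypothetical, and the limits used below are placeholders rather than MACA's real tile sizes.

```python
def partition_heads(num_heads, max_tile):
    """Split `num_heads` into equal (start, size) tiles with size <= max_tile."""
    if num_heads <= 0 or max_tile <= 0:
        raise ValueError("num_heads and max_tile must be positive")
    tile = min(num_heads, max_tile)
    # Shrink until the tile divides the head count, so no tile is ragged.
    while num_heads % tile != 0:
        tile -= 1
    return [(start, tile) for start in range(0, num_heads, tile)]


tiles_even = partition_heads(128, 64)    # limit divides the head count
tiles_shrunk = partition_heads(48, 32)   # limit does not divide; tile shrinks
```

With these placeholder limits, 128 heads split into two tiles of 64, and 48 heads fall back to two tiles of 24 because 32 does not divide 48.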

Alternatives considered

One option was to bypass the DeepSeek V32 cases entirely on MACA. That would have been expedient, but it would also have left a substantial portion of the example suite unexercised. Another was to preserve the existing kernels and add narrow guards around the observed failures. That approach would have been fragile because the failures shared a broader root cause: CUDA-specific execution assumptions embedded in the example kernels.

Verification

  • python -m pytest -q examples/maca/deepseek_v32/test_tilelang_example_deepseek_v32.py

VitalyAnkh changed the title from "[MetaxGPU][Examples] Adapt DeepSeek V32 kernels for MACA" to "Adapt DeepSeek V32 examples to MACA-safe barriers, head tiling, and fallback paths" on Apr 9, 2026

github-actions Bot commented Apr 9, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

VitalyAnkh force-pushed the vitaly/deepseek-v32-maca-compat branch from da28d0b to e144b9b on April 9, 2026 20:19
VitalyAnkh force-pushed the vitaly/deepseek-v32-maca-compat branch from e144b9b to 5816c1c on April 12, 2026 20:25

Development

Successfully merging this pull request may close these issues.

DeepSeek V32 MACA examples rely on CUDA-specific barriers, TMA paths, and vector atomics