Enable MI300X ROCm support #484

Open: ehartford wants to merge 3 commits into antirez:main from QuixiAI:main

@ehartford

This PR enables DeepSeek V4 Pro to run on AMD Instinct MI300X by adding CDNA-oriented ROCm kernels and sharding model layers across local ROCm GPUs.

Summary

  • Add direct f16 MFMA wrapper kernels for CDNA3/CDNA4:
    - gfx942 uses mfma_f32_16x16x16_f16
    - gfx950 uses mfma_f32_16x16x32_f16

  • Add a CDNA Q8 batch matmul/MFMA prefill path.

  • Add ROCm MoE kernel fixes for CDNA correctness, including disabling the broken IQ2/Q2 float-down WMMA overlay.

  • Add ROCm attention/activation fixes to avoid fp16 overflow and repeated BOS failures.

  • Add MI300X/CDNA build targets, with CDNA4 gfx950 compile plumbing.

  • Add a local --gpus launcher that uses the existing distributed runtime to shard layers across local GPUs.

  • Support repeated -m model shard arguments in any order.

  • Allocate graph/KV/cache state only for the layer slice owned by each worker.

  • Add model-cache preflight checks so out-of-memory conditions fail early with actionable errors.

  • Use BF16 for 16-bit distributed activation transport.

  • Add MI300X/ROCm smoke scripts and a synthetic Q8 MFMA correctness test.
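
Two of the items above concern 16-bit numeric range: the fp16 overflow fixes and the choice of BF16 for activation transport. The sketch below is not code from this PR. It is a small, standalone Python illustration of why a value that overflows fp16 still survives a round trip through bfloat16, at the cost of fewer mantissa bits. The helper names (`to_bf16_bits`, `from_bf16_bits`, `fits_fp16`) are hypothetical, and NaN inputs are not handled specially.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 binary16 value


def to_bf16_bits(x: float) -> int:
    """Round a float32-representable value to bfloat16 (ties to even).

    Returns the 16-bit pattern. Illustrative only: NaN is not special-cased.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of float32. Add a bias so the
    # truncating shift rounds to nearest, with ties going to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bf16_bits(b: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def fits_fp16(x: float) -> bool:
    """True if x can be packed as a finite IEEE half-precision value."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    activation = 70000.0  # hypothetical activation above FP16_MAX
    print(fits_fp16(activation))                         # False
    print(from_bf16_bits(to_bf16_bits(activation)))      # 70144.0
```

bfloat16 shares float32's 8-bit exponent, so the value stays finite and only loses precision (70000.0 becomes 70144.0), whereas fp16 cannot represent it at all.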

Validation

Validated on MI300X / CDNA3:

make mi300x
git diff --check

Also validated a local sharded Pro Q4 run across MI300X GPUs, including reversed -m shard order.
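
For readers unfamiliar with layer sharding, the following is a minimal sketch of the general idea, not the PR's implementation. It assumes a contiguous, near-even split of layers across GPUs and a per-worker cache sized only for the owned slice. The PR does not state its actual split policy, and every name here (`LayerSlice`, `shard_layers`, `kv_bytes_for_slice`) and every number in the example is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSlice:
    gpu: int
    first: int  # first owned layer (inclusive)
    last: int   # one past the last owned layer (exclusive)


def shard_layers(n_layers: int, n_gpus: int) -> list[LayerSlice]:
    """Split n_layers into contiguous, near-equal slices, one per GPU.

    Assumption: an even contiguous split. Earlier GPUs absorb the remainder.
    """
    if n_gpus < 1 or n_layers < n_gpus:
        raise ValueError("need at least one layer per GPU")
    base, extra = divmod(n_layers, n_gpus)
    slices = []
    start = 0
    for gpu in range(n_gpus):
        count = base + (1 if gpu < extra else 0)
        slices.append(LayerSlice(gpu, start, start + count))
        start += count
    return slices


def kv_bytes_for_slice(s: LayerSlice, ctx_len: int,
                       bytes_per_token_per_layer: int) -> int:
    """KV cache size for one worker, counting only the layers it owns."""
    return (s.last - s.first) * ctx_len * bytes_per_token_per_layer


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to make the arithmetic easy to follow.
    for s in shard_layers(n_layers=10, n_gpus=4):
        print(s, kv_bytes_for_slice(s, ctx_len=1024,
                                    bytes_per_token_per_layer=256))
```

Sizing state per slice rather than per model is what keeps each worker's memory footprint proportional to the layers it actually runs.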

Notes

CDNA4 / gfx950 kernel selection and build plumbing are included, but runtime validation has not been performed because I do not have CDNA4 hardware.
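
As a rough illustration of what the kernel selection amounts to, the sketch below maps each architecture to the MFMA instruction named in the summary and counts how many instructions cover a given K dimension. It is a Python model for exposition only, not the PR's HIP code. The function names are hypothetical, and the K-per-instruction values are read off the instruction names (MxNxK).

```python
# Instruction names are taken from the PR summary; K is the last dimension
# in the MxNxK suffix of each name.
MFMA_F16_BY_ARCH = {
    "gfx942": ("mfma_f32_16x16x16_f16", 16),  # CDNA3 (MI300X)
    "gfx950": ("mfma_f32_16x16x32_f16", 32),  # CDNA4, not runtime-validated
}


def select_mfma(arch: str) -> tuple[str, int]:
    """Return (instruction name, K per instruction) for a ROCm arch string."""
    # Target IDs may carry feature suffixes such as ":sramecc+:xnack-".
    base = arch.split(":")[0]
    if base not in MFMA_F16_BY_ARCH:
        raise ValueError(f"no f16 MFMA mapping for {arch!r}")
    return MFMA_F16_BY_ARCH[base]


def mfma_steps(k: int, arch: str) -> int:
    """Number of MFMA instructions needed to cover a K dimension of size k."""
    _, k_per_instr = select_mfma(arch)
    return -(-k // k_per_instr)  # ceiling division


if __name__ == "__main__":
    for arch in ("gfx942", "gfx950"):
        name, _ = select_mfma(arch)
        print(arch, name, mfma_steps(4096, arch))
```

Under this model the gfx950 path issues half as many MFMA instructions per 16x16 output tile for the same K, which is why its correctness still needs to be checked on real CDNA4 hardware.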

Comment thread README.md Outdated
@beverm2391

+1 waiting on this one!

3 participants