Release Note
Hi all, torch_musa v2.9.0 is now available. Aligned with torch 2.9.0, this release improves the user experience and adds a number of new features, including Context Parallel in FSDP2, sparse-related operators, and the "reduce-overhead" mode for torch.compile. Starting with torch_musa 2.9.0, GEMM kernels compute in FP32 by default; to enable TF32 computation, set the environment variable TORCH_ALLOW_TF32_MUBLAS_OVERRIDE=1 or the Python global setting 'torch.backends.musa.matmul.allow_tf32 = True'.
kineto is now included as a third_party repository of torch_musa. Note that this is a musified version, not the official kineto.
Please build torch_musa v2.9.0 on the MUSA platform with MUSA SDK >= 4.3.2.
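The TF32 opt-in described above can be sketched as follows. This is a minimal illustration, not an official recipe: the helper name `enable_tf32_matmul` is hypothetical, and setting the environment variable before torch_musa is imported is an assumption about when the variable is read. Either switch alone should be enough according to the note; both are shown for completeness.

```python
import os

# Option 1: environment variable named in the release note.
# Assumption: it is read when torch_musa initializes, so set it before the import.
os.environ["TORCH_ALLOW_TF32_MUBLAS_OVERRIDE"] = "1"


def enable_tf32_matmul() -> bool:
    """Option 2: flip the Python global setting; return True if it was applied.

    Hypothetical helper: returns False when torch or torch_musa is not installed,
    so the script still runs on machines without a MUSA stack.
    """
    try:
        import torch
        import torch_musa  # noqa: F401  (importing registers the MUSA backend)
    except ImportError:
        return False
    torch.backends.musa.matmul.allow_tf32 = True
    return True


tf32_enabled = enable_tf32_matmul()
```

Leaving both switches untouched keeps the new FP32 default, which favors accuracy over GEMM throughput.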
Enhancement
Operators
- Support torch.arange with Double dtype
- Fix BatchNorm producing NaN outputs
- Optimize performance of embedding_bag
- Support complex dtypes for index_select, index_put
- Support some Sparse Tensor operators
- Support some special operators
- Fix empty tensor creation error with pin_memory=True
- Add W8A8 matmul kernel
New Features
- Support torch.compile with mode="reduce-overhead"
- Support Context Parallel (Ulysses) in FSDP2
- Support DLPack for torch.Tensor to enable zero-copy data exchange with other libraries
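Two of the features above, "reduce-overhead" compilation and DLPack exchange, can be tried with a short sketch like the one below. The helper names (`pick_device`, `compile_reduce_overhead`, `dlpack_roundtrip`) are hypothetical, and the use of `torch.musa.is_available()` to detect a device is an assumption about the torch_musa API; the script falls back to CPU when torch_musa is not installed. `torch.compile(..., mode="reduce-overhead")` and `torch.from_dlpack` are standard PyTorch calls.

```python
import torch


def pick_device() -> str:
    """Return "musa" if torch_musa is importable and a device is visible, else "cpu"."""
    try:
        import torch_musa  # noqa: F401  (importing registers the "musa" device)
    except ImportError:
        return "cpu"
    # Assumption: torch_musa exposes torch.musa.is_available().
    return "musa" if torch.musa.is_available() else "cpu"


def compile_reduce_overhead(fn):
    """Wrap fn with torch.compile in "reduce-overhead" mode.

    Compilation is lazy: nothing is built until the wrapped function is first called.
    """
    return torch.compile(fn, mode="reduce-overhead")


def dlpack_roundtrip(device: str):
    """Export a tensor through DLPack and re-import it without copying."""
    src = torch.arange(4, dtype=torch.float32, device=device)
    view = torch.from_dlpack(src)  # consumer side of the DLPack protocol
    view[0] = 42.0                 # write through the imported view
    return src, view


device = pick_device()
src, view = dlpack_roundtrip(device)
# Both tensors point at the same storage, so the write above is visible in src.
assert src.data_ptr() == view.data_ptr()
```

To try the compile mode, wrap a function or module with `compile_reduce_overhead(...)` and call it a few times; the first call pays the compilation cost. In upstream PyTorch this mode targets small, launch-bound workloads and may use more memory.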
Known and blocked issues
- Kernels generated by torch.compile perform worse than in torch_musa v2.7.0
Please feel free to contact us with any issues or questions.