Release v0.14.4 Patch release · microsoft/DeepSpeed

What's Changed

Update version.txt after 0.14.3 release by @mrwyattii in #5651
[CPU] SHM based allreduce improvement for small message size by @delock in #5571
_exec_forward_pass: place zeros(1) on the same device as the param by @nelyahu in #5576
[XPU] adapt lazy_call func to different versions by @YizhouZ in #5670
fix IDEX dependence in xpu accelerator by @Liangliang-Ma in #5666
Remove compile wrapper to simplify access to model attributes by @tohtana in #5581
Fix hpZ with zero element by @samadejacobs in #5652
Fix the reshape bug in sequence parallel alltoall that corrupted all QKV data by @YJHMITWEB in #5664
enable yuan autotp & add conv tp by @Yejing-Lai in #5428
Fix '_get_socket_with_port' import error with latest PyTorch by @Yejing-Lai in #5654
Fix BUFSIZE import error after NumPy upgrade to 2.0.0 by @Yejing-Lai in #5680
Update BUFSIZE to come from autotuner's constants.py, not numpy by @loadams in #5686
[XPU] support op builder from intel_extension_for_pytorch kernel path by @YizhouZ in #5425
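
For readers hitting the BUFSIZE import error mentioned in #5680 and #5686, the general idea of sourcing the constant locally rather than from numpy can be sketched as below. This is a minimal illustration and not the code from either PR: `DEFAULT_BUFSIZE`, its value of 8192, and `resolve_bufsize` are hypothetical names and assumptions introduced here.

```python
# Hypothetical stand-in for a project-local constants module.
# The value 8192 is an assumption chosen for illustration.
DEFAULT_BUFSIZE = 8192


def resolve_bufsize():
    """Return a buffer size without requiring numpy to export BUFSIZE."""
    try:
        # Works only on numpy versions that still export BUFSIZE;
        # raises ImportError if numpy is missing or the name is gone.
        from numpy import BUFSIZE
        return int(BUFSIZE)
    except ImportError:
        # Fall back to the locally defined constant.
        return DEFAULT_BUFSIZE


BUFSIZE = resolve_bufsize()
print(BUFSIZE)
```

Defining the constant in the project itself, as the title of #5686 describes, removes the dependency on numpy's namespace entirely; the try/except form above is only a transitional pattern.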

Full Changelog: v0.14.3...v0.14.4