Releases: microsoft/DeepSpeed
v0.12.6: Patch release
What's Changed
- Update version.txt after 0.12.5 release by @mrwyattii in #4826
- Cache metadata for TP activations and grads by @BacharL in #4360
- Inference changes for incorporating meta loading checkpoint by @oelayan7 in #4692
- Update CODEOWNERS by @mrwyattii in #4838
- support baichuan model by @baodii in #4721
- inference engine: check if accelerator supports FP16 by @nelyahu in #4832
- Update zeropp.md by @goodship1 in #4835
- [NPU] load EXPORT_ENV based on different accelerators to support multi-node training on other devices by @minchao-sun in #4830
- Add cuda_accelerator.py to triggers for A6000 test by @mrwyattii in #4848
- Capture short kernel sequences to graph by @inkcherry in #4318
- Checkpointing: Avoid assigning tensor storage with different device by @deepcharm in #4836
- engine.py: remove unused _curr_save_path by @nelyahu in #4844
- Mixtral FastGen Support by @cmikeh2 in #4828
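Among the changes above, #4832 makes the inference engine verify that the accelerator supports half precision before enabling FP16. A minimal sketch of that kind of guard (the function and parameter names here are illustrative, not DeepSpeed's actual API):

```python
def resolve_inference_dtype(requested_dtype: str, fp16_supported: bool) -> str:
    # Illustrative guard: reject an fp16 request up front when the
    # accelerator reports no half-precision support, rather than
    # failing later inside a kernel launch.
    if requested_dtype == "fp16" and not fp16_supported:
        raise ValueError(
            "fp16 was requested but this accelerator does not support FP16"
        )
    return requested_dtype
```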
New Contributors
- @minchao-sun made their first contribution in #4830
Full Changelog: v0.12.5...v0.12.6
v0.12.5: Patch release
What's Changed
- Fix DS Stable Diffusion for latest diffusers version by @lekurile in #4770
- Resolve any '..' in the file paths using os.path.abspath() by @rraminen in #4709
- Update dockerfile with updated versions by @loadams in #4780
- Run workflows when they are edited by @loadams in #4779
- BF16_Optimizer: add support for bf16 grad acc by @nelyahu in #4713
- fix autoTP issue for mpt (trust_remote_code=True) by @sywangyi in #4787
- Fix Hybrid Engine metrics printing by @lekurile in #4789
- [BUG] partition_balanced returns wrong result by @zjjMaiMai in #4312
- improve the way to determine whether a variable is None by @RUAN-ZX in #4782
- [NPU] Add HcclBackend for 1-bit adam, 1-bit lamb, 0/1 adam by @RUAN-ZX in #4733
- Fix for stage3 when setting different communication data type by @BacharL in #4540
- Add support of Falcon models (7b, 40b, 180b) to DeepSpeed-FastGen by @arashb in #4790
- Switch paths-ignore to single quotes, update paths-ignore on nv-pre-compile-ops by @loadams in #4805
- fix for tests using torch<2.1 by @mrwyattii in #4818
- Universal Checkpoint for Sequence Parallelism by @samadejacobs in #4752
- Accelerate CI fix by @mrwyattii in #4819
- fix [BUG] 'DeepSpeedGPTInference' object has no attribute 'dtype' for… by @jxysoft in #4814
- Update broken link in docs by @mrwyattii in #4822
- Update imports from Transformers by @loadams in #4817
- Minor updates to CI workflows by @mrwyattii in #4823
- fix falcon model load from_config meta_data error by @baodii in #4783
- mv DeepSpeedEngine param_names dict init post _configure_distributed_model by @nelyahu in #4803
- Refactor launcher user arg parsing by @mrwyattii in #4824
- Fix 4649 by @Alienfeel in #4650
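One entry above, #4709, resolves any `..` segments in user-supplied file paths via `os.path.abspath()`. The effect can be shown with the standard library alone (the wrapper name here is illustrative):

```python
import os

def resolve_user_path(path: str) -> str:
    # os.path.abspath() collapses '.' and '..' segments and anchors
    # relative paths at the current working directory, so downstream
    # code never sees a traversal like 'a/b/../c'.
    return os.path.abspath(path)
```

For example, on POSIX systems `resolve_user_path("/tmp/x/../y")` yields `/tmp/y`.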
New Contributors
- @zjjMaiMai made their first contribution in #4312
- @jxysoft made their first contribution in #4814
- @baodii made their first contribution in #4783
- @Alienfeel made their first contribution in #4650
Full Changelog: v0.12.4...v0.12.5
v0.12.4: Patch release
What's Changed
- Update version.txt after 0.12.3 release by @mrwyattii in #4673
- [MII] catch error wrt HF version and Mistral by @jeffra in #4634
- [NPU] Add NPU support for unit test by @RUAN-ZX in #4569
- [op-builder] use unique exceptions for cuda issues by @jeffra in #4653
- Add stable diffusion unit test by @mrwyattii in #2496
- [CANN] Support cpu offload optimizer for Ascend NPU by @hipudding in #4568
- Inference Checkpoints in V2 by @cmikeh2 in #4664
- KV Cache Improved Flexibility by @cmikeh2 in #4668
- Fix for when prompt contains an odd num of apostrophes by @oelayan7 in #4660
- universal-ckp: support megatron-deepspeed llama model by @mosheisland in #4666
- Add new MII unit tests by @mrwyattii in #4693
- [Bug fix] WarmupCosineLR issues by @sbwww in #4688
- infV2 fix for OPT size variants by @mrwyattii in #4694
- Add get and set APIs for the ZeRO-3 partitioned parameters by @yiliu30 in #4681
- Remove unneeded dict reinit (fix for #4565) by @eisene in #4702
- Update flops profiler to recurse by @loadams in #4374
- Communication Optimization for Large-Scale Training by @RezaYazdaniAminabadi in #4695
- [docs] Intel inference blog by @jeffra in #4734
- use all_gather_into_tensor instead of all_gather by @taozhiwei in #4705
- Install `deepspeed-kernels` only on Linux by @aphedges in #4739
- Add nv-sd badge to README by @loadams in #4747
- Re-organize `.gitignore` file to be parsed properly by @aphedges in #4740
- fix mics run with offload++ by @GuanhuaWang in #4749
- Fix logger formatting for partitioning flags by @OAfzal in #4728
- fix: to solve #4726 by @RUAN-ZX in #4727
- Add safetensors support by @jihnenglin in #4659
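PR #4681 above adds get and set APIs for ZeRO-3 partitioned parameters. Under ZeRO-3 each rank holds only a slice of every parameter, so a "get" must gather the shards and a "set" must scatter values back into them. A framework-free sketch of that idea (DeepSpeed's real API operates on torch tensors across ranks; the lists here are stand-ins for per-rank shards):

```python
def gather_full_param(shards):
    # 'Get': the full parameter is the concatenation of the per-rank
    # slices, in rank order.
    full = []
    for shard in shards:
        full.extend(shard)
    return full

def set_full_param(shards, new_values):
    # 'Set': scatter a full value back into the per-rank slices,
    # preserving each rank's partition boundaries.
    offset = 0
    for shard in shards:
        n = len(shard)
        shard[:] = new_values[offset:offset + n]
        offset += n
```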
New Contributors
- @RUAN-ZX made their first contribution in #4569
- @oelayan7 made their first contribution in #4660
- @sbwww made their first contribution in #4688
- @yiliu30 made their first contribution in #4681
- @eisene made their first contribution in #4702
- @taozhiwei made their first contribution in #4705
- @OAfzal made their first contribution in #4728
- @jihnenglin made their first contribution in #4659
Full Changelog: v0.12.3...v0.12.4
v0.12.3: Patch release
New Bug Fixes
- Stable Diffusion now supported with latest Torch, diffusers, and Triton versions.
What's Changed
- Update version.txt after 0.12.2 release by @mrwyattii in #4617
- Fix figure in FlexGen blog by @tohtana in #4624
- Fix figure of llama2 13B in DS-FlexGen blog by @tohtana in #4625
- Fix config format by @xu-song in #4594
- Guanhua/partial offload rebase v2 (#590) by @GuanhuaWang in #4636
- offload++ blog (#623) by @GuanhuaWang in #4637
- Update README in offloadpp blog by @GuanhuaWang in #4641
- [docs] update news items by @jeffra in #4640
- DeepSpeed-FastGen Chinese Blog by @HeyangQin in #4642
- Fix issues with torch cpu builds by @loadams in #4639
- Isolate src code and testing for DeepSpeed-FastGen by @cmikeh2 in #4610
- Add Japanese blog for DeepSpeed-FastGen by @tohtana in #4651
- Fix for MII unit tests by @mrwyattii in #4652
- Enhance the robustness of `module_state_dict` by @LZHgrla in #4587
- Enable ZeRO3 allgather for multiple dtypes by @tohtana in #4647
- add option to disable pipeline partitioning by @nelyahu in #4322
- Added HIP_PLATFORM_AMD=1 for non JIT build by @rraminen in #4585
- Fix rope_theta arg for diffusers_attention by @lekurile in #4656
- tl.dot(a,b, trans_b=True) is not supported by triton2.0+ , updating this api by @bmedishe in #4541
- Update ds-chat workflow to work w/ deepspeed-chat install by @lekurile in #4598
- Diffusers attention script update triton2.1 by @bmedishe in #4573
- Fix the openfold training. by @cctry in #4657
- Universal ckp fixes by @mosheisland in #4588
- Update .gitignore [adding comments, improved documentation] by @Nadav23AnT in #4631
- Update lr_schedules.py by @CoinCheung in #4563
- Fix UNET and VAE implementations for new diffusers version by @lekurile in #4663
- fix num_kv_heads sharding in autoTP for the new in-repo Falcon-40B by @dc3671 in #4654
New Contributors
- @xu-song made their first contribution in #4594
- @LZHgrla made their first contribution in #4587
- @mosheisland made their first contribution in #4588
- @Nadav23AnT made their first contribution in #4631
- @CoinCheung made their first contribution in #4563
Full Changelog: v0.12.2...v0.12.3
v0.12.2
What's Changed
- Quick bug fix direct to `master` to ensure mismatched cuda environments are shown to the user (4f7dd72)
- Update version.txt after 0.12.1 release by @mrwyattii in #4615
Full Changelog: v0.12.1...v0.12.2
v0.12.1: Patch release
What's Changed
- Update version.txt after 0.12.0 release by @mrwyattii in #4611
- Add number for latency comparison by @tohtana in #4612
- Update minor CUDA version compatibility. by @cmikeh2 in #4613
Full Changelog: v0.12.0...v0.12.1
DeepSpeed v0.12.0
New features
- DeepSpeed-FastGen
What's Changed
- Update version.txt after 0.11.2 release by @mrwyattii in #4609
- Pin transformers in nv-inference by @loadams in #4606
- DeepSpeed-FastGen by @cmikeh2 in #4604
- DeepSpeed-FastGen blog by @jeffra in #4607
Full Changelog: v0.11.2...v0.12.0
v0.11.2: Patch release
What's Changed
- Update version.txt after 0.11.1 release by @mrwyattii in #4484
- Update DS_BUILD_* references. by @loadams in #4485
- Introduce pydantic_v1 compatibility module for pydantic>=2.0.0 support by @ringohoffman in #4407
- Enable control over timeout with environment variable by @BramVanroy in #4405
- Update ROCm version by @loadams in #4486
- adding 8bit dequantization kernel for asym fine-grained block quantization in zero-inference by @stephen-youn in #4450
- Fix scale factor on flops profiler by @loadams in #4500
- add DeepSpeed4Science white paper by @conglongli in #4502
- [CCLBackend] update API by @Liangliang-Ma in #4378
- Ulysses: add col-ai evaluation by @samadejacobs in #4517
- Ulysses: Update README.md by @samadejacobs in #4518
- add available memory check to accelerators by @jeffra in #4508
- clear redundant parameters in zero3 bwd hook by @inkcherry in #4520
- Add NPU FusedAdam support by @CurryRice233 in #4343
- fix error type issue in deepspeed/comm/ccl.py by @Liangliang-Ma in #4521
- Fixed deepspeed.comm.monitored_barrier call by @Quentin-Anthony in #4496
- [Bug fix] Add rope_theta for llama config by @cupertank in #4480
- [ROCm] Add rocblas header by @rraminen in #4538
- [docs] ZeRO infinity slides and blog by @jeffra in #4542
- Switch from HIP_PLATFORM_HCC to HIP_PLATFORM_AMD by @loadams in #4539
- Turn off I_MPI_PIN for impi launcher by @delock in #4531
- [docs] paper updates by @jeffra in #4543
- ROCm 6.0 prep changes by @loadams in #4537
- Fix RTD builds by @mrwyattii in #4558
- pipe engine _aggregate_total_loss: more efficient loss concatenation by @nelyahu in #4327
- Add missing rocblas include by @loadams in #4557
- Enable universal checkpoint for zero stage 1 by @tjruwase in #4516
- [AutoTP] Make AutoTP work when num_heads not divisible by number of workers by @delock in #4011
- Fix the sequence-parallelism for the dense model architecture by @RezaYazdaniAminabadi in #4530
- engine.py - save_checkpoint: only rank-0 should create the save dir by @nelyahu in #4536
- Remove PP Grad Tail Check by @Quentin-Anthony in #2538
- Added HIP_PLATFORM_AMD=1 by @rraminen in #4570
- fix multiple definition while building evoformer by @fecet in #4556
- Don't check overflow for bf16 data type by @hablb in #4512
- Public update by @yaozhewei in #4583
- [docs] paper updates by @jeffra in #4584
- Disable CPU inference on PRs by @loadams in #4590
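PR #4405 above lets users control the distributed-communication timeout through an environment variable, useful when slow first-time initialization would otherwise trip the default limit. A minimal sketch of the pattern (the variable name `DEEPSPEED_TIMEOUT` and the 30-minute default are assumptions based on that PR's description, not a verified API):

```python
import os
from datetime import timedelta

def get_comm_timeout(default_minutes: int = 30) -> timedelta:
    # Read the timeout (in minutes) from the environment so users can
    # raise it without code changes; fall back to the default when the
    # variable is unset.
    minutes = int(os.environ.get("DEEPSPEED_TIMEOUT", default_minutes))
    return timedelta(minutes=minutes)
```

The resulting `timedelta` is the shape expected by `torch.distributed.init_process_group(..., timeout=...)`.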
New Contributors
- @ringohoffman made their first contribution in #4407
- @BramVanroy made their first contribution in #4405
- @cupertank made their first contribution in #4480
Full Changelog: v0.11.1...v0.11.2
v0.11.1: Patch release
What's Changed
- Fix bug in bfloat16 optimizer related to checkpointing by @okoge-kaz in #4434
- Move tensors to device if mp is not enabled by @deepcharm in #4461
- Fix torch import causing release build failure by @mrwyattii in #4468
- add lm_head and embed_out tensor parallel by @Yejing-Lai in #3962
- Fix release workflow by @mrwyattii in #4483
New Contributors
- @okoge-kaz made their first contribution in #4434
- @deepcharm made their first contribution in #4461
Full Changelog: v0.11.0...v0.11.1
DeepSpeed v0.11.0
New features
- DeepSpeed-VisualChat: Improve Your Chat Experience with Multi-Round Multi-Image Inputs [English] [中文] [日本語]
- Announcing the DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated AI system technologies [DeepSpeed4Science website] [Tutorials] [Blog] [中文] [日本語]
What's Changed
- added a model check for use_triton in deepspeed by @stephen-youn in #4266
- Update release and bump patch versioning flow by @loadams in #4286
- README update by @tjruwase in #4303
- Update README.md by @NinoRisteski in #4316
- Handle empty parameter groups by @tjruwase in #4277
- Clean up modeling code by @loadams in #4320
- Fix Zero3 contiguous grads, reduce scatter false accuracy issue by @nelyahu in #4321
- Add release version checking by @loadams in #4328
- clear redundant timers by @starkhu in #4308
- DS-Chat BLOOM: Fix Attention mask by @lekurile in #4338
- Fix a bug in the implementation of dequantization for inference by @sakogan in #3433
- Suppress noise by @tjruwase in #4310
- Fix skipped inference tests by @mrwyattii in #4336
- Fix autotune to support Triton 2.1 by @stephen-youn in #4340
- Pass base_dir so model files can be loaded for auto-tp/meta-tensor by @awan-10 in #4348
- Support InternLM by @wangruohui in #4137
- DeepSpeed4Science by @conglongli in #4357
- fix deepspeed4science links by @conglongli in #4358
- Add the policy to run llama model from the official repo by @RezaYazdaniAminabadi in #4313
- Check inference input_id tokens length by @mrwyattii in #4349
- add deepspeed4science blog link by @conglongli in #4364
- Update conda env to have max pydantic version by @loadams in #4362
- Enable workflow dispatch on Torch 1.10 CI tests by @loadams in #4361
- deepspeed4science chinese blog by @conglongli in #4366
- deepspeed4science japanese blog by @conglongli in #4369
- Openfold fix by @cctry in #4368
- [BUG] add the missing method to MPS accelerator by @cli99 in #4363
- Fix multinode runner to properly append to PDSH_SSH_ARGS_APPEND by @loadams in #4373
- Fix min torch version by @tjruwase in #4375
- Fix llama meta tensor loading in AutoTP and kernel injected inference by @zeyugao in #3608
- adds triton flash attention2 kernel by @stephen-youn in #4337
- Allow multiple inference engines in single script by @mrwyattii in #4384
- Save/restore step in param groups with zero 1 or 2 by @tohtana in #4396
- Fix incorrect assignment of self.quantized_nontrainable_weights by @VeryLazyBoy in #4399
- update deepspeed4science blog by @conglongli in #4408
- Add torch no grad condition by @ajindal1 in #4391
- Update nv-transformers workflow to use cu11.6 by @loadams in #4412
- Add condition when dimension is greater than 2 by @ajindal1 in #4390
- [CPU] Add CPU AutoTP UT. by @Yejing-Lai in #4263
- fix cpu loading model partition OOM by @Yejing-Lai in #4353
- Update cpu_inference checkout action by @loadams in #4424
- Zero infinity xpu support by @Liangliang-Ma in #4130
- [CCLBackend] Using parallel memcpy for inference_all_reduce by @delock in #4404
- Change default `set_to_none=True` in `zero_grad` methods by @Jackmin801 in #4438
- Small docstring fix by @Jackmin801 in #4431
- fix: check-license by @Jackmin801 in #4432
- Fixup check release version script by @loadams in #4413
- Enable ad-hoc running of cpu_inference by @loadams in #4444
- Fix wrong documentation of `ignore_unused_parameters` by @UniverseFly in #4418
- DeepSpeed-VisualChat Blog by @xiaoxiawu-microsoft in #4446
- Fix a bug in DeepSpeedMLP by @sakogan in #4389
- documenting load_from_fp32_weights config parameter by @clumsy in #4449
- Add Japanese translation of DS-VisualChat blog by @tohtana in #4454
- fix blog format by @conglongli in #4456
- Update README-Japanese.md by @conglongli in #4457
- DeepSpeed-VisualChat Chinese blog by @conglongli in #4458
- CI fix for torch 2.1 release by @mrwyattii in #4452
- fix lm head overridden issue, move it from checkpoint in-loop loading … by @sywangyi in #4206
- feat: add Lion by @enneamer in #4331
- pipe engine eval_batch: add option to disable loss broadcast by @nelyahu in #4326
- Add release flow by @loadams in #4467
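PR #4438 above switches the default to `set_to_none=True` in `zero_grad` methods. The distinction matters for memory: `None` releases the gradient buffers entirely, while `False` keeps them allocated and overwrites them with zeros. A framework-free sketch of the two behaviors (the `Param` class is a stand-in for a real parameter, not DeepSpeed code):

```python
class Param:
    # Minimal stand-in for a framework parameter with a .grad slot.
    def __init__(self, grad=None):
        self.grad = grad

def zero_grad(params, set_to_none=True):
    # set_to_none=True (the new default) drops gradient storage so it
    # can be freed; set_to_none=False overwrites gradients with zeros,
    # keeping the buffers allocated between steps.
    for p in params:
        if p.grad is None:
            continue
        if set_to_none:
            p.grad = None
        else:
            p.grad = [0.0] * len(p.grad)
```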
New Contributors
- @nelyahu made their first contribution in #4321
- @starkhu made their first contribution in #4308
- @sakogan made their first contribution in #3433
- @cctry made their first contribution in #4368
- @zeyugao made their first contribution in #3608
- @VeryLazyBoy made their first contribution in #4399
- @ajindal1 made their first contribution in #4391
- @Liangliang-Ma made their first contribution in #4130
- @Jackmin801 made their first contribution in #4438
- @UniverseFly made their first contribution in #4418
- @enneamer made their first contribution in #4331
Full Changelog: v0.10.3...v0.11.0