Releases · microsoft/DeepSpeed
v0.14.3 Patch release
What's Changed
- Update version.txt after 0.14.2 release by @mrwyattii in #5458
- Add getter and setter methods for compile_backend across accelerators. by @vshekhawat-hlab in #5299
- Fix torch.compile error for PyTorch v2.3 by @tohtana in #5463
- Revert "stage3: efficient compute of scaled_global_grad_norm (#5256)" by @lekurile in #5461
- Update ds-chat CI workflow paths to include zero stage 1-3 files by @lekurile in #5462
- Update with ops not supported on Windows by @loadams in #5468
- fix: swapping order of parameters in create_dir_symlink method. by @alvieirajr in #5465
- Un-pin torch version in nv-torch-latest back to latest and skip test_compile_zero tests on v100 by @loadams in #5459
- re-introduce: stage3: efficient compute of scaled_global_grad_norm by @nelyahu in #5493
- Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() by @harygo2 in #5464
- Fix compile wrapper by @BacharL in #5455
- enable phi3_mini autotp by @Yejing-Lai in #5501
- Fused adam for HPU by @BacharL in #5500
- [manifest] Update manifest to add hpp file in csrc by @ys950902 in #5522
- enable phi2 autotp by @Yejing-Lai in #5436
- Switch pynvml to nvidia-ml-py by @loadams in #5529 (see the sketch after this list)
- Switch from double quotes to match single quotes by @loadams in #5530
- [manifest] Update manifest to add hpp file in deepspeed by @ys950902 in #5533
- New integration - CometMonitor by @alexkuzmik in #5466
- Improve _configure_optimizer() final optimizer log by @nelyahu in #5528
- Enhance testing: Skip fused_optimizer tests if not supported. by @vshekhawat-hlab in #5159
- Skip the UT cases that use unimplemented op builders. by @foin6 in #5372
- rocblas -> hipblas changes for ROCm by @rraminen in #5401
- Rocm warp size fix by @rraminen in #5402
- CPUAdam fp16 and bf16 support by @BacharL in #5409
- Optimize zero3 fetch params using all_reduce by @deepcharm in #5420
- Fix the TypeError for XPU Accelerator by @shiyang-weng in #5531
- Fix RuntimeError for moe on XPU: tensors found at least two devices by @shiyang-weng in #5519
- Remove synchronize calls from allgather params by @BacharL in #5516
- Avoid overwrite of compiled module wrapper attributes by @deepcharm in #5549
- Fix small typos in function set_none_gradients_to_zero by @TravelLeraLone in #5557
- Adapt doc for #4405 by @oraluben in #5552
- Update to HF_HOME from TRANSFORMERS_CACHE by @loadams in #4816
- [INF] DSAttention: allow input_mask to have False as value by @oelayan7 in #5546
- Add throughput timer configuration by @deepcharm in #5363
- Add Ulysses DistributedAttention compatibility by @Kwen-Chen in #5525
- Add hybrid_engine.py as path to trigger the DS-Chat GH workflow by @lekurile in #5562
- Update HPU docker version by @loadams in #5566
- Rename files in fp_quantize op from quantize.* to fp_quantize.* by @loadams in #5577
- [MiCS] Remove the handle print on DeepSpeed side by @ys950902 in #5574
- Update to fix sidebar over text by @loadams in #5567
- DeepSpeedCheckpoint: support custom final ln idx by @nelyahu in #5506
- Update minor CUDA version compatibility by @adk9 in #5591
- Add slide deck for meetup in Japan by @tohtana in #5598
- Fixed the Windows build. by @costin-eseanu in #5596
- estimate_zero2_model_states_mem_needs: fixing memory estimation by @nelyahu in #5099
- Fix cuda hardcode for inference woq by @Liangliang-Ma in #5565
- fix sequence parallel(Ulysses) grad scale for zero0 by @inkcherry in #5555
- Add CompressedBackend for one-bit optimizers by @Liangliang-Ma in #5473
- Updated hpu-gaudi2 tests content. by @vshekhawat-hlab in #5622
- Pin transformers version for MII tests by @loadams in #5629
- Workaround for Torch-compile-Z3-act-apt accuracy issue from the PyTorch repo by @NirSonnenschein in #5590
- stage_1_and_2: optimize clip calculation to use clamp by @nelyahu in #5632 (idea sketched after this list)
- Fix overlap communication of ZeRO stage 1 and 2 by @penn513 in #5606
- fixes in _partition_param_sec function by @mmhab in #5613
- Fix incorrect assumption that torch.initial_seed accepts a seed arg in the DeepSpeedAccelerator abstract class by @polisettyvarma in #5569
- pipe/_exec_backward_pass: fix immediate grad update by @nelyahu in #5605
- Monitor was always enabled causing performance degradation by @deepcharm in #5633
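For the pynvml switch above: the standalone pynvml package is deprecated in favor of NVIDIA's official nvidia-ml-py, which ships the same `pynvml` module name, so existing call sites keep working. A minimal sketch of a typical query, assuming an NVIDIA driver is present:

```python
# nvidia-ml-py still exposes the `pynvml` module, so imports are unchanged.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU 0 memory: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```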
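For the clip-calculation change: a hedged sketch of the idea, with illustrative names rather than the actual stage_1_and_2.py code. Branching on a device tensor (`if clip_coef < 1`) forces a host synchronization; `torch.clamp` keeps the computation on-device.

```python
import torch

def clip_coefficient(total_norm: torch.Tensor, max_norm: float) -> torch.Tensor:
    clip_coef = max_norm / (total_norm + 1e-6)
    # Before: `if clip_coef < 1: grads.mul_(clip_coef)` reads the tensor on the host.
    return torch.clamp(clip_coef, max=1.0)  # branch-free, no host sync
```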
New Contributors
- @alvieirajr made their first contribution in #5465
- @harygo2 made their first contribution in #5464
- @alexkuzmik made their first contribution in #5466
- @foin6 made their first contribution in #5372
- @shiyang-weng made their first contribution in #5531
- @TravelLeraLone made their first contribution in #5557
- @oraluben made their first contribution in #5552
- @Kwen-Chen made their first contribution in #5525
- @adk9 made their first contribution in #5591
- @costin-eseanu made their first contribution in #5596
- @NirSonnenschein made their first contribution in #5590
- @penn513 made their first contribution in #5606
Full Changelog: v0.14.2...v0.14.3
v0.14.2 Patch release
What's Changed
- Update version.txt after 0.14.1 release by @mrwyattii in #5413
- Remove dtype(fp16) condition check for residual_add unit test by @raza-sikander in #5329
- [XPU] Use non_daemonic_proc by default on XPU device by @ys950902 in #5412
- Fix convergence issues in TP topology caused by incorrect grad_norm by @inkcherry in #5411
- Update 'create-pr' action in release workflow to latest by @loadams in #5415
- Update engine.py to avoid torch warning by @etiennebonnafoux in #5408
- Update _sidebar.scss by @fasterinnerlooper in #5293
- Add more tests into XPU CI by @Liangliang-Ma in #5427
- [CPU] Support SHM based inference_all_reduce in TorchBackend by @delock in #5391
- Add required paths to trigger AMD tests on PRs by @loadams in #5406
- Bug fix in `split_index` method by @bm-synth in #5292
- Parallel map step for `DistributedDataAnalyzer` map-reduce by @bm-synth in #5291
- Selective dequantization by @RezaYazdaniAminabadi in #5375
- Fix sorting of shard optimizer states files for universal checkpoint by @tohtana in #5395
- add device config env for the accelerator by @shiyuan680 in #5396
- 64bit indexing fused adam by @garrett4wade in #5187
- Improve parallel process of universal checkpoint conversion by @tohtana in #5343
- Set the default to use set_to_none for clearing gradients in BF16 optimizer by @inkcherry in #5434 (semantics sketched after this list)
- OptimizedLinear implementation by @jeffra in #5355
- Update README.md by @Jhonso7393 in #5453
- Update PyTest torch version to match PyTorch latest official (2.3.0) by @loadams in #5454
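For the `set_to_none` default above: it matches PyTorch's own `Optimizer.zero_grad(set_to_none=True)` semantics, which free the gradient storage instead of writing zeros. A small self-contained illustration:

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.AdamW(model.parameters())

model(torch.randn(3, 4)).sum().backward()
opt.step()

# set_to_none=True drops the .grad tensors entirely; the next backward()
# allocates fresh ones, saving a memset and some memory in the meantime.
opt.zero_grad(set_to_none=True)
assert all(p.grad is None for p in model.parameters())
```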
New Contributors
- @etiennebonnafoux made their first contribution in #5408
- @fasterinnerlooper made their first contribution in #5293
- @shiyuan680 made their first contribution in #5396
- @garrett4wade made their first contribution in #5187
- @Jhonso7393 made their first contribution in #5453
Full Changelog: v0.14.1...v0.14.2
v0.14.1 Patch release
What's Changed
- Update version.txt after 0.14.0 release by @mrwyattii in #5238
- FP6 blog (Chinese) by @xiaoxiawu-microsoft in #5239
- Add contributed HW support into README by @delock in #5240
- Set tp world size to 1 in ckpt load, if MPU is not provided by @samadejacobs in #5243
- Make op builder detection adapt to accelerator change by @delock in #5206
- Replace HIP_PLATFORM_HCC with HIP_PLATFORM_AMD by @rraminen in #5264
- Add CI for Habana Labs HPU/Gaudi2 by @loadams in #5244
- Fix attention mask handling in the Hybrid Engine Bloom flow by @deepcharm in #5101
- Skip 1Bit Compression and sparsegrad tests for HPU. by @vshekhawat-hlab in #5270
- Enabled LMCorrectness inference tests on HPU. by @vshekhawat-hlab in #5271
- Added HPU backend support for torch.compile tests. by @vshekhawat-hlab in #5269
- Average only valid part of the ipg buffer. by @BacharL in #5268
- Add HPU accelerator support in unit tests. by @vshekhawat-hlab in #5162
- Fix loading a universal checkpoint by @tohtana in #5263
- Add Habana Gaudi2 CI badge to the README by @loadams in #5286
- Add intel gaudi to contributed HW in README by @BacharL in #5300
- Fixed Accelerate Link by @wkaisertexas in #5314
- Enable mixtral 8x7b autotp by @Yejing-Lai in #5257
- Support bf16_optimizer MoE expert-parallel training and fix MoE EP grad_scale/grad_norm by @inkcherry in #5259
- fix comms dtype by @mayank31398 in #5297
- Modified regular expression by @igeni in #5306
- Docs typos fix and grammar suggestions by @Gr0g0 in #5322
- Added Gaudi2 CI tests. by @vshekhawat-hlab in #5275
- Improve universal checkpoint by @tohtana in #5289
- Increase coverage for HPU by @loadams in #5324
- Add NFS path check for default deepspeed triton cache directory by @HeyangQin in #5323
- Correct typo in checking on bf16 unit test support by @loadams in #5317
- Make NFS warning print only once by @HeyangQin in #5345
- resolve KeyError: 'PDSH_SSH_ARGS_APPEND' by @Lzhang-hub in #5318
- BF16 optimizer: Clear lp grads after updating hp grads in hook by @YangQun1 in #5328
- Fix sort of zero checkpoint files by @tohtana in #5342
- Add `distributed_port` for `deepspeed.initialize` by @LZHgrla in #5260 (see the sketch after this list)
- [fix] Fix typo s/simultanenously/simultaneously/ by @digger-yu in #5359
- Update container version for Gaudi2 CI by @raza-sikander in #5360
- compute global norm on device by @BacharL in #5125
- logger update with torch master changes by @rogerxfeng8 in #5346
- Ensure capacity does not exceed number of tokens by @jeffra in #5353
- Update workflows that use cu116 to cu117 by @loadams in #5361
- FP [6,8,12] quantizer op by @jeffra in #5336
- CPU SHM based inference_all_reduce improve by @delock in #5320
- Auto convert moe param groups by @jeffra in #5354
- Support MoE for pipeline models by @mosheisland in #5338
- Update pytest and transformers with fixes for pytest>= 8.0.0 by @loadams in #5164
- Increase CI coverage for Gaudi2 accelerator. by @vshekhawat-hlab in #5358
- Add CI for Intel XPU/Max1100 by @Liangliang-Ma in #5376
- Update path name on xpu-max1100.yml, add badge in README by @loadams in #5386
- Update checkout action on workflows on ubuntu 20.04 by @loadams in #5387
- Cleanup required_torch_version code and references. by @loadams in #5370
- Update README.md for intel XPU support by @Liangliang-Ma in #5389
- Optimize the fp-dequantizer to get high memory-BW utilization by @RezaYazdaniAminabadi in #5373
- Removal of cuda hardcoded string with get_device function by @raza-sikander in #5351
- Add custom reshaping for universal checkpoint by @tohtana in #5390
- Fix pageable h2d memcpy by @GuanhuaWang in #5301
- stage3: efficient compute of scaled_global_grad_norm by @nelyahu in #5256
- Fix the FP6 kernels compilation problem on non-Ampere GPUs. by @JamesTheZ in #5333
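For the `distributed_port` addition: a minimal sketch, assuming a toy model and config; the keyword overrides the default rendezvous port 29500, e.g. when two single-node jobs would otherwise collide.

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
    distributed_port=29501,  # added in #5260; default is 29500
)
```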
New Contributors
- @vshekhawat-hlab made their first contribution in #5270
- @wkaisertexas made their first contribution in #5314
- @igeni made their first contribution in #5306
- @Gr0g0 made their first contribution in #5322
- @Lzhang-hub made their first contribution in #5318
- @YangQun1 made their first contribution in #5328
- @raza-sikander made their first contribution in #5360
- @rogerxfeng8 made their first contribution in #5346
- @JamesTheZ made their first contribution in #5333
Full Changelog: v0.14.0...v0.14.1
DeepSpeed v0.14.0
What's Changed
- Update version.txt after 0.13.5 release by @mrwyattii in #5229
- MOE gate fixes and enhancements by @mosheisland in #5156
- FP6 quantization end-to-end. by @loadams in #5234
- FP6 blog by @loadams in #5235
Full Changelog: v0.13.5...v0.14.0
v0.13.5 Patch release
What's Changed
- Update version.txt after 0.13.4 release by @mrwyattii in #5196
- Fix assertion to run pipeline engine with a compiled module by @tohtana in #5197
- Allow specifying MII branch on MII CI by @mrwyattii in #5208
- [zero++] Synchronize at the end of secondary partitioning and simplify the logic by @ByronHsu in #5216
- Add fp16 support of Qwen1.5 models (0.5B to 72B) to DeepSpeed-FastGen by @ZonePG in #5219
- Rename nv-torch-latest-cpu workflow to cpu-torch-latest by @loadams in #5226
- Fix moe cpu offload by @RezaYazdaniAminabadi in #5220
- Use `deepspeed.comm` instead of `torch.distributed` by @jinyouzhi in #5225 (see the sketch after this list)
- Fix fused_qkv model accuracy issue by @Yejing-Lai in #5217
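For the `deepspeed.comm` migration: the module mirrors the `torch.distributed` API, so call sites translate one-to-one. A hedged sketch:

```python
import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed()  # analogous to torch.distributed.init_process_group

t = torch.ones(4)
# before: torch.distributed.all_reduce(t)
dist.all_reduce(t)  # same call shape, routed through DeepSpeed's comm layer
```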
Full Changelog: v0.13.4...v0.13.5
v0.13.4 Patch release
What's Changed
- Update version.txt after v0.13.3 release by @loadams in #5185
- Fixes for `--extra-index-url` by @loadams in #5183
- Allow debug/experimental compiler backends by @tohtana in #5191
- Disable ninja by default by @mrwyattii in #5194
- [CPUAdam] Update full_precision_optimizer_states in docstring by @rohan-varma in #5181
- Add script to check for `--extra-index-url` by @loadams in #5184 (see the sketch after this list)
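Background for the `--extra-index-url` entries: pip treats `--index-url` as a replacement for PyPI, while `--extra-index-url` appends a second index, which can silently resolve the wrong wheel. A hypothetical sketch of such a guard (not the actual script from #5184; the `requirements/` path is an assumption):

```python
# Hypothetical CI guard: fail if any requirements file re-introduces the flag.
import pathlib
import sys

offenders = [
    str(path)
    for path in pathlib.Path("requirements").glob("*.txt")
    if "--extra-index-url" in path.read_text()
]
if offenders:
    sys.exit(f"--extra-index-url found in: {', '.join(offenders)}")
```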
New Contributors
- @rohan-varma made their first contribution in #5181
Full Changelog: v0.13.3...v0.13.4
v0.13.3 Patch release
What's Changed
- Update version.txt after 0.13.2 release by @mrwyattii in #5119
- Stop tracking backward chain of broadcast (ZeRO3) by @tohtana in #5113
- [NPU] ZeRO-Infinity feature compatibility by @misstek in #5077
- BF16 optimizer: Improve device utilization by immediate grad update by @deepcharm in #4975
- Removed redundant `if collate_fn is None` condition by @bm-synth in #5107
- Disable compile tests for torch<2.1 by @mrwyattii in #5121
- Update inference test model names by @mrwyattii in #5127
- Fix issue with zero-sized file after merging files in curriculum `map_reduce` by @bm-synth in #5106
- Update return codes in PyTest to properly error out if tests fail by @loadams in #5122
- add missing methods to MPS_Accelerator by @mrwyattii in #5134
- Solve tensor vs numpy dtype conflicts in data efficiency map-reduce. by @bm-synth in #5108
- Fix broadcast deadlock for incomplete batches in data sample for data analysis by @bm-synth in #5117
- Avoid zero-sized microbatches for incomplete minibatches when doing curriculum learning by @bm-synth in #5118
- Remove mandatory `index` key from output of `metric_function` in `DataAnalysis` map operation by @bm-synth in #5112
- TensorBoard logging: avoid item() outside gradient accumulation steps to improve performance by @nelyahu in #5135
- Check overflow on device without host synchronization for each tensor by @BacharL in #5115 (idea sketched after this list)
- Update nv-inference torch version by @loadams in #5128
- Method `run_map_reduce` to fix errors when running `run_map` followed by `run_reduce` by @bm-synth in #5131
- Added missing `isinstance` check in PR #5112 by @bm-synth in #5142
- Fix UserWarning: The torch.cuda.*DtypeTensor constructors are no long… by @ShukantPal in #5018
- TestEmptyParameterGroup: replace fusedAdam with torch.optim.AdamW by @nelyahu in #5139
- Update deprecated HuggingFace function by @mrwyattii in #5144
- Pin to PyTest 8.0.0 by @loadams in #5163
- get_grad_norm_direct: fix a case of empty norm group by @nelyahu in #5148
- Distributed in-memory map-reduce for data analyzer by @bm-synth in #5129
- DeepSpeedZeroOptimizer_Stage3: remove cuda specific optimizer by @nelyahu in #5138
- MOE: Fix save checkpoint when TP > 1 by @mosheisland in #5157
- Fix gradient clipping by @tohtana in #5150
- Use ninja to speed up build by @jinzhen-lin in #5088
- Update flops profiler to handle attn and matmul by @KimmiShi in #4724
- Fix allreduce for BF16 and ZeRO0 by @tohtana in #5170
- Write multiple items to output file at once, in distributed data analyzer. by @bm-synth in #5169
- Fix typos in blogs/ by @jinyouzhi in #5172
- Inference V2 Human Eval by @lekurile in #4804
- Reduce ds_id name length by @jomayeri in #5176
- Switch cpu-inference workflow from --extra-index-url to --index-url by @loadams in #5182
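For the overflow check above: a hedged sketch of the idea with illustrative names, not the actual DeepSpeed code. Calling `.item()` per tensor blocks the host on the device stream; fusing the checks and keeping the result a device tensor defers that cost to a single transfer.

```python
import torch

def grads_overflowed(grads: list[torch.Tensor]) -> torch.Tensor:
    # One fused check instead of a per-tensor .item() round trip; the
    # boolean result stays on device until the caller actually needs it.
    flags = torch.stack([(~torch.isfinite(g)).any() for g in grads])
    return flags.any()
```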
Full Changelog: v0.13.2...v0.13.3
v0.13.2 Patch release
What's Changed
- Update version.txt after 0.13.1 release by @mrwyattii in #5002
- Support `exclude_frozen_parameters` for `save_16bit_model` by @LZHgrla in #4999
- Allow nightly tests dispatch by @mrwyattii in #5014
- Enable hpz based on secondary tensor presence by @HeyangQin in #4906
- Enable workflow dispatch on all workflows by @loadams in #5016
- [minor] Improve code quality and readability by @ByronHsu in #5011
- Update falcon fused type order by @Yejing-Lai in #5007
- Fix error report of DSElasticAgent._set_master_addr_port() by @RobinDong in #4985
- DS #4993, #662: autotune single-node hostfile bugfix by @oushu1zhangxiangxuan1 in #4996
- [minor] Improve logging for multiprocesses by @ByronHsu in #5004
- deepspeed/launcher: add launcher_helper as each rank's start portal by @YizhouZ in #4699
- Graph capture support on HPU accelerators by @deepcharm in #5013
- launcher/launcher_helper.py: fix PMI name and add EnvironmentError by @YizhouZ in #5025
- Remove MI100 badge from landing page by @mrwyattii in #5036
- Remove coverage reports from workflows and fix for inference CI by @loadams in #5028
- Remove Megatron-DeepSpeed CI workflow by @mrwyattii in #5038
- Fix P40 CI failures by @mrwyattii in #5037
- Fix for nightly torch CI by @mrwyattii in #5039
- Fix nv-accelerate and nv-torch-latest-v100. by @loadams in #5035
- update inference pages to point to FastGen by @mrwyattii in #5029
- launcher_helper: enable fds passing by @YizhouZ in #5042
- Fix nv-torch-latest-cpu CI by @mrwyattii in #5045
- [NPU] Add NPU to support hybrid engine by @CurryRice233 in #4831
- MoE type hints by @ringohoffman in #5043
- [doc] Update inference-related docs from `mp_size` to `tensor_parallel` for TP by @yundai424 in #5048
- Fix broken model names in inference CI by @mrwyattii in #5053
- [NPU] Change log level to debug by @CurryRice233 in #5051
- Delay reduce-scatter for ZeRO3 leaf modules by @tohtana in #5008
- Optimize grad_norm calculations by reducing device/host dependency by @nelyahu in #4974
- load linear layer weight with given dtype by @polisettyvarma in #4044
- Update import for changes to latest diffusers by @mrwyattii in #5065
- adding hccl to init_distributed function description by @nelyahu in #5034
- [ZeRO++ qgZ] Fall back to reduce_scatter if `tensor.numel() % (2 * global_world_size) != 0` by @ByronHsu in #5056 (the check is sketched after this list)
- Make batch size documentation clearer by @segyges in #5072 (the batch-size invariant is sketched after this list)
- [doc/1-line change] default stage3_param_persistence_threshold is wrong in the doc by @ByronHsu in #5073
- Further refactor deepspeed.moe.utils + deepspeed.moe.layer type hints by @ringohoffman in #5060
- Fix verification for ZeRO3 leaf module by @tohtana in #5074
- Stop tracking backward chain of broadcast in initialization by @tohtana in #5075
- Update torch version for nv-torch-latest-cpu by @loadams in #5086
- Add backwards compatibility w/ older versions of diffusers (<0.25.0) by @lekurile in #5083
- Enable torch.compile with ZeRO (Experimental) by @tohtana in #4878
- Update nv-accelerate to latest torch by @loadams in #5040
- HPU Accelerator: fix supported_dtypes API by @nelyahu in #5094
- [NPU] replace 'cuda' with get_accelerator().device_name() by @minchao-sun in #5095
- optimize clip_grad_norm_ function by @mmhab in #4915
- [xs] fix ZEROPP convergence test by @yundai424 in #5061
- Switch hasattr check from compile to compiler by @loadams in #5096
- Split is_synchronized_device api to multiple apis by @BacharL in #5026
- 47% FastGen speedup for low workload - refactor allocator by @HeyangQin in #5090
- Support `exclude_frozen_parameters` for the `zero_to_fp32.py` script by @andstor in #4979 (see the sketch after this list)
- Fix alignment of optimizer states when loading by @tohtana in #5105
- Skip Triton import for AMD by @lekurile in #5110
- Add HIP conversion file outputs to .gitignore by @lekurile in #5111
- Remove optimizer step on initialization by @tohtana in #5104
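For the qgZ fallback above: the condition is easy to state directly. An illustrative sketch, not the actual ZeRO++ code:

```python
import torch

def can_use_qgz(tensor: torch.Tensor, global_world_size: int) -> bool:
    # qgZ's two-stage quantized collective splits the gradient into
    # 2 * world_size shards; uneven sizes fall back to reduce_scatter.
    return tensor.numel() % (2 * global_world_size) == 0
```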
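For the batch-size documentation entry: the invariant those docs describe is fixed by the DeepSpeed config schema.

```python
# train_batch_size = train_micro_batch_size_per_gpu
#                    * gradient_accumulation_steps
#                    * data-parallel world size
ds_config = {
    "train_batch_size": 64,               # 8 * 2 * 4 data-parallel ranks
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 2,
}
```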
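For the `zero_to_fp32.py` entry: a sketch assuming the option mirrors the `save_16bit_model` keyword from #4999 (the exact CLI flag and function signature may differ):

```python
# Hedged sketch: drop frozen (requires_grad=False) parameters from the
# consolidated fp32 state dict. Assumed CLI equivalent:
#   python zero_to_fp32.py checkpoint_dir output.bin --exclude_frozen_parameters
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(
    "checkpoint_dir", exclude_frozen_parameters=True
)
```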
New Contributors
- @ByronHsu made their first contribution in #5011
- @RobinDong made their first contribution in #4985
- @oushu1zhangxiangxuan1 made their first contribution in #4996
- @yundai424 made their first contribution in #5048
- @segyges made their first contribution in #5072
- @andstor made their first contribution in #4979
Full Changelog: v0.13.1...v0.13.2
v0.13.1 Patch release
What's Changed
- Update version.txt after 0.13.0 release by @mrwyattii in #4982
- Update FastGen blog title by @arashb in #4983
- Fix the MoE-params gradient-scaling by @RezaYazdaniAminabadi in #4957
- fix some typo under blogs/ by @digger-yu in #4988
- Fix placeholder value in FastGen Blog by @mrwyattii in #5000
- fix for DS_ENV issue by @jeffra in #4992
- Delete unused --deepspeed_mpi command line argument by @ShukantPal in #4981
- Make installable without torch by @mrwyattii in #5001
- Implement some APIs of HPU accelerator by @mmhab in #4935
- Refactor the Qwen positional embedding config code by @ZonePG in #4955
Full Changelog: v0.13.0...v0.13.1
DeepSpeed v0.13.0
What's Changed
- Update version.txt after 0.12.6 release by @mrwyattii in #4850
- doc corrections by @goodship1 in #4861
- Fix exception handling in get_all_ranks_from_group() function by @HeyangQin in #4862
- deepspeed engine: fp16 support validation on init by @nelyahu in #4843
- Remove hooks on gradient accumulation on engine/optimizer destroy by @chiragjn in #4858
- optimize grad_norm calculation in stage3.py by @mmhab in #4436
- Fix f-string messages by @li-plus in #4865
- [NPU] Fix npu offload bug by @CurryRice233 in #4883
- Partition parameters: Minor refactoring of use_secondary_tensor condition by @deepcharm in #4868
- Pipeline: Add support to eval micro bs configuration by @nelyahu in #4859
- zero_to_fp32.py: Handle a case where shape doesn't have numel attr by @nelyahu in #4842
- Add support of Microsoft Phi-2 model to DeepSpeed-FastGen by @arashb in #4812
- Support cpu tensors without direct device invocation by @abhilash1910 in #3842
- add sharded loading for safetensors in AutoTP by @sywangyi in #4854
- [XPU] XPU accelerator support for Intel GPU device by @delock in #4547
- Enable starcoder (kv_head=1) autotp by @Yejing-Lai in #4896
- Release overlap_comm & contiguous_gradients restrictions for ZeRO 1 by @li-plus in #4887
- [NPU] Add ZeRO-Infinity feature for NPU by @misstek in #4809
- fix num_kv_heads sharding in uneven autoTP for Falcon-40b by @Yejing-Lai in #4712
- Nvme offload checkpoint by @eisene in #4707
- Add WarmupCosineLR to Read the Docs by @dwyatte in #4916
- Add Habana Labs HPU accelerator support by @deepcharm in #4912
- Unit tests for MiCS by @zarzen in #4792
- Fix SD workflow to work with latest diffusers version by @lekurile in #4918
- [Fix] Fix cpu inference UT failure by @delock in #4430
- Add paths to run SD tests by @loadams in #4919
- Change PR/schedule triggers for CPU-inference by @loadams in #4924
- fix falcon-40b accuracy issue by @Yejing-Lai in #4895
- Refactor the positional embedding config code by @arashb in #4920
- Pin to triton 2.1.0 to fix issues with nv-inference by @loadams in #4929
- Add support of Qwen models (7b, 14b, 72b) to DeepSpeed-FastGen by @ZonePG in #4913
- DeepSpeedZeroOptimizer: refactor bit16 flattening to support more accelerators by @nelyahu in #4833
- Fix confusing width in simd_load by @yzhblind in #4714
- Specify permissions for secrets.GITHUB_TOKEN by @mrwyattii in #4927
- Enable quantizer op on ROCm by @rraminen in #4114
- autoTP for Qwen by @inkcherry in #4902
- Allow specifying mii branch for nv-a6000 workflow by @mrwyattii in #4936
- Only run MII CI for inference changes by @mrwyattii in #4939
- InfV2 - remove generation config requirement by @mrwyattii in #4938
- Cache HF model list for inference tests by @mrwyattii in #4940
- Fix docs inconsistency on default value for `ignore_unused_parameters` by @loadams in #4949
- Fix bug in CI model caching by @mrwyattii in #4951
- fix uneven issue & add balance autotp by @Yejing-Lai in #4697
- Optimize preprocess for ragged batching by @tohtana in #4942
- Fix bug where ZeRO2 never uses the reduce method. by @CurryRice233 in #4946
- [docs] Add new autotp supported model in tutorial by @delock in #4960
- Add missing op_builder.hpu component for HPU accelerator by @nelyahu in #4963
- Stage_1_and_2.py: fix assert for reduce_scatter configuration combinations by @nelyahu in #4964
- [MiCS] Add the path to support sequence_data_parallel on MiCS by @ys950902 in #4926
- Update the DeepSpeed Phi-2 impl. to work with the HF latest changes by @arashb in #4950
- Prevent infinite recursion when DS_ACCELERATOR is set to cuda by @ShukantPal in #4962
- Fixes for training models with bf16 + freshly initialized optimizer via `load_module_only` by @haileyschoelkopf in #4141
- Params partition for skip_init by @inkcherry in #4722
- Enhance query APIs for text generation by @tohtana in #4965
- Add API to set a module as a leaf node when recursively setting Z3 hooks by @tohtana in #4966 (see the sketch after this list)
- Fix T5 and mistral model meta data error by @Yejing-Lai in #4958
- FastGen Jan 2024 blog by @mrwyattii in #4980
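For the Z3 leaf-node API: a hedged sketch of `deepspeed.utils.set_z3_leaf_modules` with a toy module standing in for a real MoE block. Marking a module as a ZeRO-3 leaf stops hook recursion into its children, so all of its parameters are gathered as one unit, which helps when only a subset (e.g. routed experts) would otherwise be touched per step.

```python
import torch
from deepspeed.utils import set_z3_leaf_modules

class ToyMoEBlock(torch.nn.Module):  # stand-in for e.g. a Mixtral MoE block
    def __init__(self):
        super().__init__()
        self.experts = torch.nn.ModuleList(torch.nn.Linear(8, 8) for _ in range(4))

    def forward(self, x):
        return sum(expert(x) for expert in self.experts)

model = torch.nn.Sequential(ToyMoEBlock(), torch.nn.Linear(8, 2))
# ZeRO-3 will now fetch ToyMoEBlock's parameters as a single leaf unit.
set_z3_leaf_modules(model, [ToyMoEBlock])
```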
New Contributors
- @chiragjn made their first contribution in #4858
- @li-plus made their first contribution in #4865
- @misstek made their first contribution in #4809
- @dwyatte made their first contribution in #4916
- @ZonePG made their first contribution in #4913
- @yzhblind made their first contribution in #4714
- @ShukantPal made their first contribution in #4962
- @haileyschoelkopf made their first contribution in #4141
Full Changelog: v0.12.6...v0.13.0