Releases · microsoft/DeepSpeed
v0.14.3 Patch release
What's Changed
- Update version.txt after 0.14.2 release by @mrwyattii in #5458
- Add getter and setter methods for compile_backend across accelerators. by @vshekhawat-hlab in #5299
- Fix torch.compile error for PyTorch v2.3 by @tohtana in #5463
- Revert "stage3: efficient compute of scaled_global_grad_norm (#5256)" by @lekurile in #5461
- Update ds-chat CI workflow paths to include zero stage 1-3 files by @lekurile in #5462
- Update with ops not supported on Windows by @loadams in #5468
- fix: swapping order of parameters in create_dir_symlink method. by @alvieirajr in #5465
- Un-pin torch version in nv-torch-latest back to latest and skip test_compile_zero tests on v100 by @loadams in #5459
- re-introduce: stage3: efficient compute of scaled_global_grad_norm by @nelyahu in #5493
- Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() by @harygo2 in #5464
- Fix compile wrapper by @BacharL in #5455
- enable phi3_mini autotp by @Yejing-Lai in #5501
- Fused adam for HPU by @BacharL in #5500
- [manifest] Update manifest to add hpp file in csrc by @ys950902 in #5522
- enable phi2 autotp by @Yejing-Lai in #5436
- Switch pynvml to nvidia-ml-py by @loadams in #5529 (see the sketch after this list)
- Switch from double quotes to match single quotes by @loadams in #5530
- [manifest] Update manifest to add hpp file in deepspeed by @ys950902 in #5533
- New integration - CometMonitor by @alexkuzmik in #5466
- Improve _configure_optimizer() final optimizer log by @nelyahu in #5528
- Enhance testing: Skip fused_optimizer tests if not supported. by @vshekhawat-hlab in #5159
- Skip the UT cases that use unimplemented op builders. by @foin6 in #5372
- rocblas -> hipblas changes for ROCm by @rraminen in #5401
- Rocm warp size fix by @rraminen in #5402
- CPUAdam fp16 and bf16 support by @BacharL in #5409
- Optimize zero3 fetch params using all_reduce by @deepcharm in #5420
- Fix the TypeError for XPU Accelerator by @shiyang-weng in #5531
- Fix RuntimeError for moe on XPU: tensors found at least two devices by @shiyang-weng in #5519
- Remove synchronize calls from allgather params by @BacharL in #5516
- Avoid overwrite of compiled module wrapper attributes by @deepcharm in #5549
- Fix small typos in function set_none_gradients_to_zero by @TravelLeraLone in #5557
- Adapt doc for #4405 by @oraluben in #5552
- Update to HF_HOME from TRANSFORMERS_CACHE by @loadams in #4816
- [INF] DSAttention: allow input_mask to have False as value by @oelayan7 in #5546
- Add throughput timer configuration by @deepcharm in #5363
- Add Ulysses DistributedAttention compatibility by @Kwen-Chen in #5525
- Add hybrid_engine.py as path to trigger the DS-Chat GH workflow by @lekurile in #5562
- Update HPU docker version by @loadams in #5566
- Rename files in fp_quantize op from quantize.* to fp_quantize.* by @loadams in #5577
- [MiCS] Remove the handle print on DeepSpeed side by @ys950902 in #5574
- Update to fix sidebar over text by @loadams in #5567
- DeepSpeedCheckpoint: support custom final ln idx by @nelyahu in #5506
- Update minor CUDA version compatibility by @adk9 in #5591
- Add slide deck for meetup in Japan by @tohtana in #5598
- Fixed the Windows build. by @costin-eseanu in #5596
- estimate_zero2_model_states_mem_needs: fixing memory estimation by @nelyahu in #5099
- Fix cuda hardcode for inference woq by @Liangliang-Ma in #5565
- fix sequence parallel(Ulysses) grad scale for zero0 by @inkcherry in #5555
- Add CompressedBackend for one-bit optimizers by @Liangliang-Ma in #5473
- Updated hpu-gaudi2 tests content. by @vshekhawat-hlab in #5622
- Pin transformers version for MII tests by @loadams in #5629
- Workaround for Torch-compile-Z3-act-apt accuracy issue from the PyTorch repo by @NirSonnenschein in #5590
- stage_1_and_2: optimize clip calculation to use clamp by @nelyahu in #5632 (idea sketched after this list)
- Fix overlap communication of ZeRO stage 1 and 2 by @penn513 in #5606
- fixes in _partition_param_sec function by @mmhab in #5613
- Fix incorrect assumption that torch.initial_seed accepts a seed arg in the DeepSpeedAccelerator abstract class by @polisettyvarma in #5569
- pipe/_exec_backward_pass: fix immediate grad update by @nelyahu in #5605
- Monitor was always enabled causing performance degradation by @deepcharm in #5633
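For the pynvml switch above: the standalone pynvml package is deprecated in favor of NVIDIA's official nvidia-ml-py, which ships the same `pynvml` module name, so existing call sites keep working. A minimal sketch of a typical query, assuming an NVIDIA driver is present:

```python
# nvidia-ml-py still exposes the `pynvml` module, so imports are unchanged.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU 0 memory: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()
```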
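For the clip-calculation change: a hedged sketch of the idea, with illustrative names rather than the actual stage_1_and_2.py code. Branching on a device tensor (`if clip_coef < 1`) forces a host synchronization; `torch.clamp` keeps the computation on-device.

```python
import torch

def clip_coefficient(total_norm: torch.Tensor, max_norm: float) -> torch.Tensor:
    clip_coef = max_norm / (total_norm + 1e-6)
    # Before: `if clip_coef < 1: grads.mul_(clip_coef)` reads the tensor on the host.
    return torch.clamp(clip_coef, max=1.0)  # branch-free, no host sync
```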
New Contributors
- @alvieirajr made their first contribution in #5465
- @harygo2 made their first contribution in #5464
- @alexkuzmik made their first contribution in #5466
- @foin6 made their first contribution in #5372
- @shiyang-weng made their first contribution in #5531
- @TravelLeraLone made their first contribution in #5557
- @oraluben made their first contribution in #5552
- @Kwen-Chen made their first contribution in #5525
- @adk9 made their first contribution in #5591
- @costin-eseanu made their first contribution in #5596
- @NirSonnenschein made their first contribution in #5590
- @penn513 made their first contribution in #5606
Full Changelog: v0.14.2...v0.14.3
v0.14.2 Patch release
What's Changed
- Update version.txt after 0.14.1 release by @mrwyattii in #5413
- Remove dtype(fp16) condition check for residual_add unit test by @raza-sikander in #5329
- [XPU] Use non_daemonic_proc by default on XPU device by @ys950902 in #5412
- Fix convergence issues in TP topology caused by incorrect grad_norm by @inkcherry in #5411
- Update 'create-pr' action in release workflow to latest by @loadams in #5415
- Update engine.py to avoid torch warning by @etiennebonnafoux in #5408
- Update _sidebar.scss by @fasterinnerlooper in #5293
- Add more tests into XPU CI by @Liangliang-Ma in #5427
- [CPU] Support SHM based inference_all_reduce in TorchBackend by @delock in #5391
- Add required paths to trigger AMD tests on PRs by @loadams in #5406
- Bug fix in `split_index` method by @bm-synth in #5292
- Parallel map step for `DistributedDataAnalyzer` map-reduce by @bm-synth in #5291
- Selective dequantization by @RezaYazdaniAminabadi in #5375
- Fix sorting of shard optimizer states files for universal checkpoint by @tohtana in #5395
- add device config env for the accelerator by @shiyuan680 in #5396
- 64bit indexing fused adam by @garrett4wade in #5187
- Improve parallel process of universal checkpoint conversion by @tohtana in #5343
- Set the default to use set_to_none for clearing gradients in BF16 optimizer by @inkcherry in #5434 (semantics sketched after this list)
- OptimizedLinear implementation by @jeffra in #5355
- Update README.md by @Jhonso7393 in #5453
- Update PyTest torch version to match PyTorch latest official (2.3.0) by @loadams in #5454
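For the `set_to_none` default above: it matches PyTorch's own `Optimizer.zero_grad(set_to_none=True)` semantics, which free the gradient storage instead of writing zeros. A small self-contained illustration:

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.AdamW(model.parameters())

model(torch.randn(3, 4)).sum().backward()
opt.step()

# set_to_none=True drops the .grad tensors entirely; the next backward()
# allocates fresh ones, saving a memset and some memory in the meantime.
opt.zero_grad(set_to_none=True)
assert all(p.grad is None for p in model.parameters())
```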
New Contributors
- @etiennebonnafoux made their first contribution in #5408
- @fasterinnerlooper made their first contribution in #5293
- @shiyuan680 made their first contribution in #5396
- @garrett4wade made their first contribution in #5187
- @Jhonso7393 made their first contribution in #5453
Full Changelog: v0.14.1...v0.14.2
v0.14.1 Patch release
What's Changed
- Update version.txt after 0.14.0 release by @mrwyattii in #5238
- FP6 blog (Chinese) by @xiaoxiawu-microsoft in #5239
- Add contributed HW support into README by @delock in #5240
- Set tp world size to 1 in ckpt load, if MPU is not provided by @samadejacobs in #5243
- Make op builder detection adapt to accelerator change by @delock in #5206
- Replace HIP_PLATFORM_HCC with HIP_PLATFORM_AMD by @rraminen in #5264
- Add CI for Habana Labs HPU/Gaudi2 by @loadams in #5244
- Fix attention mask handling in the Hybrid Engine Bloom flow by @deepcharm in #5101
- Skip 1Bit Compression and sparsegrad tests for HPU. by @vshekhawat-hlab in #5270
- Enabled LMCorrectness inference tests on HPU. by @vshekhawat-hlab in #5271
- Added HPU backend support for torch.compile tests. by @vshekhawat-hlab in #5269
- Average only valid part of the ipg buffer. by @BacharL in #5268
- Add HPU accelerator support in unit tests. by @vshekhawat-hlab in #5162
- Fix loading a universal checkpoint by @tohtana in #5263
- Add Habana Gaudi2 CI badge to the README by @loadams in #5286
- Add intel gaudi to contributed HW in README by @BacharL in #5300
- Fixed Accelerate Link by @wkaisertexas in #5314
- Enable mixtral 8x7b autotp by @Yejing-Lai in #5257
- Support bf16_optimizer MoE expert-parallel training and fix MoE EP grad_scale/grad_norm by @inkcherry in #5259
- fix comms dtype by @mayank31398 in #5297
- Modified regular expression by @igeni in #5306
- Docs typos fix and grammar suggestions by @Gr0g0 in #5322
- Added Gaudi2 CI tests. by @vshekhawat-hlab in #5275
- Improve universal checkpoint by @tohtana in #5289
- Increase coverage for HPU by @loadams in #5324
- Add NFS path check for default deepspeed triton cache directory by @HeyangQin in #5323
- Correct typo in checking on bf16 unit test support by @loadams in #5317
- Make NFS warning print only once by @HeyangQin in #5345
- resolve KeyError: 'PDSH_SSH_ARGS_APPEND' by @Lzhang-hub in #5318
- BF16 optimizer: Clear lp grads after updating hp grads in hook by @YangQun1 in #5328
- Fix sort of zero checkpoint files by @tohtana in #5342
- Add `distributed_port` for `deepspeed.initialize` by @LZHgrla in #5260 (see the sketch after this list)
- [fix] Fix typo s/simultanenously/simultaneously/ by @digger-yu in #5359
- Update container version for Gaudi2 CI by @raza-sikander in #5360
- compute global norm on device by @BacharL in #5125
- logger update with torch master changes by @rogerxfeng8 in #5346
- Ensure capacity does not exceed number of tokens by @jeffra in #5353
- Update workflows that use cu116 to cu117 by @loadams in #5361
- FP [6,8,12] quantizer op by @jeffra in #5336
- CPU SHM based inference_all_reduce improve by @delock in #5320
- Auto convert moe param groups by @jeffra in #5354
- Support MoE for pipeline models by @mosheisland in #5338
- Update pytest and transformers with fixes for pytest>= 8.0.0 by @loadams in #5164
- Increase CI coverage for Gaudi2 accelerator. by @vshekhawat-hlab in #5358
- Add CI for Intel XPU/Max1100 by @Liangliang-Ma in #5376
- Update path name on xpu-max1100.yml, add badge in README by @loadams in #5386
- Update checkout action on workflows on ubuntu 20.04 by @loadams in #5387
- Cleanup required_torch_version code and references. by @loadams in #5370
- Update README.md for intel XPU support by @Liangliang-Ma in #5389
- Optimize the fp-dequantizer to get high memory-BW utilization by @RezaYazdaniAminabadi in #5373
- Removal of cuda hardcoded string with get_device function by @raza-sikander in #5351
- Add custom reshaping for universal checkpoint by @tohtana in #5390
- Fix pageable h2d memcpy by @GuanhuaWang in #5301
- stage3: efficient compute of scaled_global_grad_norm by @nelyahu in #5256
- Fix the FP6 kernels compilation problem on non-Ampere GPUs. by @JamesTheZ in #5333
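For the `distributed_port` addition: a minimal sketch, assuming a toy model and config; the keyword overrides the default rendezvous port 29500, e.g. when two single-node jobs would otherwise collide.

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
    distributed_port=29501,  # added in #5260; default is 29500
)
```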
New Contributors
- @vshekhawat-hlab made their first contribution in #5270
- @wkaisertexas made their first contribution in #5314
- @igeni made their first contribution in #5306
- @Gr0g0 made their first contribution in #5322
- @Lzhang-hub made their first contribution in #5318
- @YangQun1 made their first contribution in #5328
- @raza-sikander made their first contribution in #5360
- @rogerxfeng8 made their first contribution in #5346
- @JamesTheZ made their first contribution in #5333
Full Changelog: v0.14.0...v0.14.1
DeepSpeed v0.14.0
What's Changed
- Update version.txt after 0.13.5 release by @mrwyattii in #5229
- MOE gate fixes and enhancements by @mosheisland in #5156
- FP6 quantization end-to-end. by @loadams in #5234
- FP6 blog by @loadams in #5235
Full Changelog: v0.13.5...v0.14.0
v0.13.5 Patch release
What's Changed
- Update version.txt after 0.13.4 release by @mrwyattii in #5196
- Fix assertion to run pipeline engine with a compiled module by @tohtana in #5197
- Allow specifying MII branch on MII CI by @mrwyattii in #5208
- [zero++] Synchronize at the end of secondary partitioning and simplify the logic by @ByronHsu in #5216
- Add fp16 support of Qwen1.5 models (0.5B to 72B) to DeepSpeed-FastGen by @ZonePG in #5219
- Rename nv-torch-latest-cpu workflow to cpu-torch-latest by @loadams in #5226
- Fix moe cpu offload by @RezaYazdaniAminabadi in #5220
- Use `deepspeed.comm` instead of `torch.distributed` by @jinyouzhi in #5225 (see the sketch after this list)
- Fix fused_qkv model accuracy issue by @Yejing-Lai in #5217
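For the `deepspeed.comm` migration: the module mirrors the `torch.distributed` API, so call sites translate one-to-one. A hedged sketch:

```python
import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed()  # analogous to torch.distributed.init_process_group

t = torch.ones(4)
# before: torch.distributed.all_reduce(t)
dist.all_reduce(t)  # same call shape, routed through DeepSpeed's comm layer
```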
Full Changelog: v0.13.4...v0.13.5
v0.13.4 Patch release
What's Changed
- Update version.txt after v0.13.3 release by @loadams in #5185
- Fixes for `--extra-index-url` by @loadams in #5183
- Allow debug/experimental compiler backends by @tohtana in #5191
- Disable ninja by default by @mrwyattii in #5194
- [CPUAdam] Update full_precision_optimizer_states in docstring by @rohan-varma in #5181
- Add script to check for `--extra-index-url` by @loadams in #5184 (see the sketch after this list)
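Background for the `--extra-index-url` entries: pip treats `--index-url` as a replacement for PyPI, while `--extra-index-url` appends a second index, which can silently resolve the wrong wheel. A hypothetical sketch of such a guard (not the actual script from #5184; the `requirements/` path is an assumption):

```python
# Hypothetical CI guard: fail if any requirements file re-introduces the flag.
import pathlib
import sys

offenders = [
    str(path)
    for path in pathlib.Path("requirements").glob("*.txt")
    if "--extra-index-url" in path.read_text()
]
if offenders:
    sys.exit(f"--extra-index-url found in: {', '.join(offenders)}")
```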
New Contributors
- @rohan-varma made their first contribution in #5181
Full Changelog: v0.13.3...v0.13.4
v0.13.3 Patch release
What's Changed
- Update version.txt after 0.13.2 release by @mrwyattii in #5119
- Stop tracking backward chain of broadcast (ZeRO3) by @tohtana in #5113
- [NPU] ZeRO-Infinity feature compatibility by @misstek in #5077
- BF16 optimizer: Improve device utilization by immediate grad update by @deepcharm in #4975
- Removed redundant `if collate_fn is None` condition by @bm-synth in #5107
- Disable compile tests for torch<2.1 by @mrwyattii in #5121
- Update inference test model names by @mrwyattii in #5127
- Fix issue with zero-sized file after merging files in curriculum `map_reduce` by @bm-synth in #5106
- Update return codes in PyTest to properly error out if tests fail by @loadams in #5122
- add missing methods to MPS_Accelerator by @mrwyattii in #5134
- Solve tensor vs numpy dtype conflicts in data efficiency map-reduce. by @bm-synth in #5108
- Fix broadcast deadlock for incomplete batches in data sample for data analysis by @bm-synth in #5117
- Avoid zero-sized microbatches for incomplete minibatches when doing curriculum learning by @bm-synth in #5118
- Remove mandatory `index` key from output of `metric_function` in `DataAnalysis` map operation by @bm-synth in #5112
- TensorBoard logging: avoid item() outside gradient accumulation steps to improve performance by @nelyahu in #5135
- Check overflow on device without host synchronization for each tensor by @BacharL in #5115 (idea sketched after this list)
- Update nv-inference torch version by @loadams in #5128
- Method `run_map_reduce` to fix errors when running `run_map` followed by `run_reduce` by @bm-synth in #5131
- Added missing `isinstance` check in PR #5112 by @bm-synth in #5142
- Fix UserWarning: The torch.cuda.*DtypeTensor constructors are no long… by @ShukantPal in #5018
- TestEmptyParameterGroup: replace fusedAdam with torch.optim.AdamW by @nelyahu in #5139
- Update deprecated HuggingFace function by @mrwyattii in #5144
- Pin to PyTest 8.0.0 by @loadams in #5163
- get_grad_norm_direct: fix a case of empty norm group by @nelyahu in #5148
- Distributed in-memory map-reduce for data analyzer by @bm-synth in #5129
- DeepSpeedZeroOptimizer_Stage3: remove cuda specific optimizer by @nelyahu in #5138
- MOE: Fix save checkpoint when TP > 1 by @mosheisland in #5157
- Fix gradient clipping by @tohtana in #5150
- Use ninja to speed up build by @jinzhen-lin in #5088
- Update flops profiler to handle attn and matmul by @KimmiShi in #4724
- Fix allreduce for BF16 and ZeRO0 by @tohtana in #5170
- Write multiple items to output file at once, in distributed data analyzer. by @bm-synth in #5169
- Fix typos in blogs/ by @jinyouzhi in #5172
- Inference V2 Human Eval by @lekurile in #4804
- Reduce ds_id name length by @jomayeri in #5176
- Switch cpu-inference workflow from --extra-index-url to --index-url by @loadams in #5182
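For the overflow check above: a hedged sketch of the idea with illustrative names, not the actual DeepSpeed code. Calling `.item()` per tensor blocks the host on the device stream; fusing the checks and keeping the result a device tensor defers that cost to a single transfer.

```python
import torch

def grads_overflowed(grads: list[torch.Tensor]) -> torch.Tensor:
    # One fused check instead of a per-tensor .item() round trip; the
    # boolean result stays on device until the caller actually needs it.
    flags = torch.stack([(~torch.isfinite(g)).any() for g in grads])
    return flags.any()
```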
Full Changelog: v0.13.2...v0.13.3
v0.13.2 Patch release
What's Changed
- Update version.txt after 0.13.1 release by @mrwyattii in #5002
- Support `exclude_frozen_parameters` for `save_16bit_model` by @LZHgrla in #4999
- Allow nightly tests dispatch by @mrwyattii in #5014
- Enable hpz based on secondary tensor presence by @HeyangQin in #4906
- Enable workflow dispatch on all workflows by @loadams in #5016
- [minor] Improve code quality and readability by @ByronHsu in #5011
- Update falcon fused type order by @Yejing-Lai in #5007
- Fix error report of DSElasticAgent._set_master_addr_port() by @RobinDong in #4985
- DS #4993, #662: autotune single-node hostfile bugfix by @oushu1zhangxiangxuan1 in #4996
- [minor] Improve logging for multiprocesses by @ByronHsu in #5004
- deepspeed/launcher: add launcher_helper as each rank's start portal by @YizhouZ in #4699
- Graph capture support on HPU accelerators by @deepcharm in #5013
- launcher/launcher_helper.py: fix PMI name and add EnvironmentError by @YizhouZ in #5025
- Remove MI100 badge from landing page by @mrwyattii in #5036
- Remove coverage reports from workflows and fix for inference CI by @loadams in #5028
- Remove Megatron-DeepSpeed CI workflow by @mrwyattii in #5038
- Fix P40 CI failures by @mrwyattii in #5037
- Fix for nightly torch CI by @mrwyattii in #5039
- Fix nv-accelerate and nv-torch-latest-v100. by @loadams in #5035
- update inference pages to point to FastGen by @mrwyattii in #5029
- launcher_helper: enable fds passing by @YizhouZ in #5042
- Fix nv-torch-latest-cpu CI by @mrwyattii in #5045
- [NPU] Add NPU to support hybrid engine by @CurryRice233 in #4831
- MoE type hints by @ringohoffman in #5043
- [doc] Update inference-related docs from `mp_size` to `tensor_parallel` for TP by @yundai424 in #5048
- Fix broken model names in inference CI by @mrwyattii in #5053
- [NPU] Change log level to debug by @CurryRice233 in #5051
- Delay reduce-scatter for ZeRO3 leaf modules by @tohtana in #5008
- Optimize grad_norm calculations by reducing device/host dependency by @nelyahu in #4974
- load linear layer weight with given dtype by @polisettyvarma in #4044
- Update import for changes to latest diffusers by @mrwyattii in #5065
- adding hccl to init_distributed function description by @nelyahu in #5034
- [ZeRO++ qgZ] Fall back to reduce_scatter if `tensor.numel() % (2 * global_world_size) != 0` by @ByronHsu in #5056 (the check is sketched after this list)
- Make batch size documentation clearer by @segyges in #5072 (the batch-size invariant is sketched after this list)
- [doc/1-line change] default stage3_param_persistence_threshold is wrong in the doc by @ByronHsu in #5073
- Further refactor deepspeed.moe.utils + deepspeed.moe.layer type hints by @ringohoffman in #5060
- Fix verification for ZeRO3 leaf module by @tohtana in #5074
- Stop tracking backward chain of broadcast in initialization by @tohtana in #5075
- Update torch version for nv-torch-latest-cpu by @loadams in #5086
- Add backwards compatibility w/ older versions of diffusers (<0.25.0) by @lekurile in #5083
- Enable torch.compile with ZeRO (Experimental) by @tohtana in #4878
- Update nv-accelerate to latest torch by @loadams in #5040
- HPU Accelerator: fix supported_dtypes API by @nelyahu in #5094
- [NPU] replace 'cuda' with get_accelerator().device_name() by @minchao-sun in #5095
- optimize clip_grad_norm_ function by @mmhab in #4915
- [xs] fix ZEROPP convergence test by @yundai424 in #5061
- Switch hasattr check from compile to compiler by @loadams in #5096
- Split is_synchronized_device api to multiple apis by @BacharL in #5026
- 47% FastGen speedup for low workload - refactor allocator by @HeyangQin in #5090
- Support `exclude_frozen_parameters` for the `zero_to_fp32.py` script by @andstor in #4979 (see the sketch after this list)
- Fix alignment of optimizer states when loading by @tohtana in #5105
- Skip Triton import for AMD by @lekurile in #5110
- Add HIP conversion file outputs to .gitignore by @lekurile in #5111
- Remove optimizer step on initialization by @tohtana in #5104
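For the qgZ fallback above: the condition is easy to state directly. An illustrative sketch, not the actual ZeRO++ code:

```python
import torch

def can_use_qgz(tensor: torch.Tensor, global_world_size: int) -> bool:
    # qgZ's two-stage quantized collective splits the gradient into
    # 2 * world_size shards; uneven sizes fall back to reduce_scatter.
    return tensor.numel() % (2 * global_world_size) == 0
```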
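For the batch-size documentation entry: the invariant those docs describe is fixed by the DeepSpeed config schema.

```python
# train_batch_size = train_micro_batch_size_per_gpu
#                    * gradient_accumulation_steps
#                    * data-parallel world size
ds_config = {
    "train_batch_size": 64,               # 8 * 2 * 4 data-parallel ranks
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 2,
}
```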
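For the `zero_to_fp32.py` entry: a sketch assuming the option mirrors the `save_16bit_model` keyword from #4999 (the exact CLI flag and function signature may differ):

```python
# Hedged sketch: drop frozen (requires_grad=False) parameters from the
# consolidated fp32 state dict. Assumed CLI equivalent:
#   python zero_to_fp32.py checkpoint_dir output.bin --exclude_frozen_parameters
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(
    "checkpoint_dir", exclude_frozen_parameters=True
)
```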
New Contributors
- @ByronHsu made their first contribution in #5011
- @RobinDong made their first contribution in #4985
- @oushu1zhangxiangxuan1 made their first contribution in #4996
- @yundai424 made their first contribution in #5048
- @segyges made their first contribution in #5072
- @andstor made their first contribution in #4979
Full Changelog: v0.13.1...v0.13.2
v0.13.1 Patch release
What's Changed
- Update version.txt after 0.13.0 release by @mrwyattii in #4982
- Update FastGen blog title by @arashb in #4983
- Fix the MoE-params gradient-scaling by @RezaYazdaniAminabadi in #4957
- fix some typo under blogs/ by @digger-yu in #4988
- Fix placeholder value in FastGen Blog by @mrwyattii in #5000
- fix for DS_ENV issue by @jeffra in #4992
- Delete unused --deepspeed_mpi command line argument by @ShukantPal in #4981
- Make installable without torch by @mrwyattii in #5001
- Implement some APIs of HPU accelerator by @mmhab in #4935
- Refactor the Qwen positional embedding config code by @ZonePG in #4955
Full Changelog: v0.13.0...v0.13.1
DeepSpeed v0.13.0
What's Changed
- Update version.txt after 0.12.6 release by @mrwyattii in #4850
- doc corrections by @goodship1 in #4861
- Fix exception handling in get_all_ranks_from_group() function by @HeyangQin in #4862
- deepspeed engine: fp16 support validation on init by @nelyahu in #4843
- Remove hooks on gradient accumulation on engine/optimizer destroy by @chiragjn in #4858
- optimize grad_norm calculation in stage3.py by @mmhab in #4436
- Fix f-string messages by @li-plus in #4865
- [NPU] Fix npu offload bug by @CurryRice233 in #4883
- Partition parameters: Minor refactoring of use_secondary_tensor condition by @deepcharm in #4868
- Pipeline: Add support to eval micro bs configuration by @nelyahu in #4859
- zero_to_fp32.py: Handle a case where shape doesn't have numel attr by @nelyahu in #4842
- Add support of Microsoft Phi-2 model to DeepSpeed-FastGen by @arashb in #4812
- Support cpu tensors without direct device invocation by @abhilash1910 in #3842
- add sharded loading for safetensors in AutoTP by @sywangyi in #4854
- [XPU] XPU accelerator support for Intel GPU device by @delock in #4547
- Enable starcoder (kv_head=1) autotp by @Yejing-Lai in #4896
- Release overlap_comm & contiguous_gradients restrictions for ZeRO 1 by @li-plus in #4887
- [NPU] Add ZeRO-Infinity feature for NPU by @misstek in #4809
- fix num_kv_heads sharding in uneven autoTP for Falcon-40b by @Yejing-Lai in #4712
- Nvme offload checkpoint by @eisene in #4707
- Add WarmupCosineLR to Read the Docs by @dwyatte in #4916
- Add Habana Labs HPU accelerator support by @deepcharm in #4912
- Unit tests for MiCS by @zarzen in #4792
- Fix SD workflow to work with latest diffusers version by @lekurile in #4918
- [Fix] Fix cpu inference UT failure by @delock in #4430
- Add paths to run SD tests by @loadams in #4919
- Change PR/schedule triggers for CPU-inference by @loadams in #4924
- fix falcon-40b accuracy issue by @Yejing-Lai in #4895
- Refactor the positional embedding config code by @arashb in #4920
- Pin to triton 2.1.0 to fix issues with nv-inference by @loadams in #4929
- Add support of Qwen models (7b, 14b, 72b) to DeepSpeed-FastGen by @ZonePG in #4913
- DeepSpeedZeroOptimizer: refactor bit16 flattening to support more accelerators by @nelyahu in #4833
- Fix confusing width in simd_load by @yzhblind in #4714
- Specify permissions for secrets.GITHUB_TOKEN by @mrwyattii in #4927
- Enable quantizer op on ROCm by @rraminen in #4114
- autoTP for Qwen by @inkcherry in #4902
- Allow specifying mii branch for nv-a6000 workflow by @mrwyattii in #4936
- Only run MII CI for inference changes by @mrwyattii in #4939
- InfV2 - remove generation config requirement by @mrwyattii in #4938
- Cache HF model list for inference tests by @mrwyattii in #4940
- Fix docs inconsistency on default value for `ignore_unused_parameters` by @loadams in #4949
- Fix bug in CI model caching by @mrwyattii in #4951
- fix uneven issue & add balance autotp by @Yejing-Lai in #4697
- Optimize preprocess for ragged batching by @tohtana in #4942
- Fix bug where ZeRO2 never uses the reduce method. by @CurryRice233 in #4946
- [docs] Add new autotp supported model in tutorial by @delock in #4960
- Add missing op_builder.hpu component for HPU accelerator by @nelyahu in #4963
- Stage_1_and_2.py: fix assert for reduce_scatter configuration combinations by @nelyahu in #4964
- [MiCS] Add the path to support sequence_data_parallel on MiCS by @ys950902 in #4926
- Update the DeepSpeed Phi-2 impl. to work with the HF latest changes by @arashb in #4950
- Prevent infinite recursion when DS_ACCELERATOR is set to cuda by @ShukantPal in #4962
- Fixes for training models with bf16 + freshly initialized optimizer via `load_module_only` by @haileyschoelkopf in #4141
- Params partition for skip_init by @inkcherry in #4722
- Enhance query APIs for text generation by @tohtana in #4965
- Add API to set a module as a leaf node when recursively setting Z3 hooks by @tohtana in #4966 (see the sketch after this list)
- Fix T5 and mistral model meta data error by @Yejing-Lai in #4958
- FastGen Jan 2024 blog by @mrwyattii in #4980
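For the Z3 leaf-node API: a hedged sketch of `deepspeed.utils.set_z3_leaf_modules` with a toy module standing in for a real MoE block. Marking a module as a ZeRO-3 leaf stops hook recursion into its children, so all of its parameters are gathered as one unit, which helps when only a subset (e.g. routed experts) would otherwise be touched per step.

```python
import torch
from deepspeed.utils import set_z3_leaf_modules

class ToyMoEBlock(torch.nn.Module):  # stand-in for e.g. a Mixtral MoE block
    def __init__(self):
        super().__init__()
        self.experts = torch.nn.ModuleList(torch.nn.Linear(8, 8) for _ in range(4))

    def forward(self, x):
        return sum(expert(x) for expert in self.experts)

model = torch.nn.Sequential(ToyMoEBlock(), torch.nn.Linear(8, 2))
# ZeRO-3 will now fetch ToyMoEBlock's parameters as a single leaf unit.
set_z3_leaf_modules(model, [ToyMoEBlock])
```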
New Contributors
- @chiragjn made their first contribution in #4858
- @li-plus made their first contribution in #4865
- @misstek made their first contribution in #4809
- @dwyatte made their first contribution in #4916
- @ZonePG made their first contribution in #4913
- @yzhblind made their first contribution in #4714
- @ShukantPal made their first contribution in #4962
- @haileyschoelkopf made their first contribution in #4141
Full Changelog: v0.12.6...v0.13.0