All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
v0.6.0 - 2024-12-17
- A bunch of annealing configs
- constant_with_warmup learning rate schedule
- one_in_eight configuration for activation checkpointing
- New tokenizer in the source instead of from huggingface
- Improved support for GCS
- torch.compile() now only compiles each block, not the whole model.
- Support for torch.compile() with dynamic=True
- Resetting the torch.compile() after every evaluation, because evaluation messes with the compiled versions
- Added more in-loop evaluation tasks to pick from, mostly for scaling law.
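
As a rough illustration of what a constant-with-warmup learning rate schedule looks like, the sketch below ramps the learning rate linearly and then holds it flat. The function name, its parameters, and the linear ramp shape are assumptions made for illustration; they are not taken from the project's code.

```python
def constant_with_warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Hypothetical helper: linear warmup from 0 to peak_lr, then constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        # Past the warmup phase (or no warmup at all): hold the peak value.
        return peak_lr
    # During warmup: scale linearly with the fraction of warmup completed.
    return peak_lr * step / warmup_steps


if __name__ == "__main__":
    for s in (0, 50, 100, 1000):
        print(s, constant_with_warmup_lr(s, peak_lr=1e-3, warmup_steps=100))
```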
v0.5.1 - 2024-10-17
- Added ability to try loading latest checkpoint from save folder using --try_load_latest_save.
- Added support for flash attention and gradient checkpointing to hf_olmo.
- Added to scripts.compare_wandb_configs.py the ability to more easily compare differences in data mixes and evaluation tasks.
- Added effective_n_kv_heads to OLMoConfig for hacky VLLM support.
v0.5.0 - 2024-08-26
- Fixed conversion to HuggingFace model for DDP-trained models.
- Added support for remote source and destination for HuggingFace model conversion.
- Added support for document masking via flash-attn during training with --data.generate_doc_lengths.
- Added config options for model.norm_after, model.scale_emb_init, and auxiliary_loss_multiplier (used with zloss).
- Added scripts for running experiments on qk_norm, norm reordering, and zloss.
- Added model.rope_theta configuration option.
- Added model.embedding_layer_norm configuration option for adding a LN to the embeddings.
- Added model.emb_init_std configuration option to override the standard deviation used to initialize the embeddings.
- Added downstream eval task for requests dumped from oe-eval tasks
- Added CosLinearEnvelope scheduler, which is a pointwise product of a cosine schedule and a linear decay.
- Added ability to save outputs of submodules for debugging purposes.
- Added a number of tasks from oe-eval to the downstream eval tasks.
- Version dolma flan change in named_data_mix.py
- Changed default distributed training strategy from single-GPU to FSDP
- Fixed behavior of effective_memmap_dtype to prevent unrecognized dtypes from being parsed as uint16.
- Fixed restarting a training run in later epochs so that we no longer need to set the flag --epoch=INT.
- Swapped in correct flan data mix.
- Fixed a bug where the attention norm, when applied before the attention block, was modifying the residual stream.
- Fixed OLMo.from_checkpoint() so that it correctly loads olmo_core and torch_new style checkpoints.
- Fixed preserve_rng_state being incorrectly set to False when doing gradient checkpointing with dropout
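
The CosLinearEnvelope entry above describes a pointwise product of a cosine schedule and a linear decay. A minimal sketch of that idea follows. The function names, the alpha_f floor parameter, and the omission of warmup are all assumptions for illustration and may differ from the project's scheduler.

```python
import math


def cosine_factor(step: int, t_max: int, alpha_f: float = 0.1) -> float:
    """Cosine decay factor: 1.0 at step 0, alpha_f at step t_max (assumed form)."""
    step = min(step, t_max)
    return alpha_f + (1.0 - alpha_f) * 0.5 * (1.0 + math.cos(math.pi * step / t_max))


def linear_envelope(step: int, t_max: int) -> float:
    """Linear decay factor: 1.0 at step 0, 0.0 at step t_max."""
    return 1.0 - min(step, t_max) / t_max


def cos_linear_envelope_lr(step: int, peak_lr: float, t_max: int, alpha_f: float = 0.1) -> float:
    """Hypothetical schedule: pointwise product of the two factors above."""
    return peak_lr * cosine_factor(step, t_max, alpha_f) * linear_envelope(step, t_max)
```

Because the linear envelope reaches zero at t_max, the product also reaches zero there, regardless of the floor the cosine term would otherwise settle at.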
v0.4.0 - 2024-07-11
- Added clipping fix to Optimizer class to make it work with FSDP no_shard and DDP.
- Added tests to compare grad norm differences between torch optimizer and clipping and OLMo optimizer and clipping on both CPU and GPU.
- Expose memmap dtype in data config
- Added support for DDP training.
- Added caching to disk of HF datasets used in downstream evals
- Added FLOPs logging
- Added configs for OLMo tiny set of models
- Added configuration field optimizer.record_update_metrics, which defaults to False, but when set to True will trigger AdamW to collect the step size norm and absolute max for each parameter.
- Added configuration field optimizer.selective_updates, which defaults to False, but when set to True will tell the optimizer to skip updating the parameter and state when the corresponding gradient is 0.
- Added olmo_data, a package holding data files like tokenizers.
- Added ability to load tokenizers from olmo_data package data.
- Added a script that can run a series of models with predictable scaling properties.
- Added original legacy unsharding implementation back, as the default. The new shared memory implementation can be used by passing use_legacy_shared_mem_impl to unshard.py.
- Refactor weight initialization. IMPORTANT: this does not maintain backwards-compatibility with older configs; the jobs will still run, but may produce different outputs.
- Changed the behavior of the Lion optimizer to only record the update cosine similarity when optimizer.record_update_metrics is True in order to be consistent with the API.
- Added HF datasets into olmo_data, and changed downstream eval to load from the package.
- Changed from ignored_index to ignore_index for cross_entropy_loss when flash-attn>=2.5.8.
- Make hf_olmo support AutoModelForCausalLM and similar HF methods again.
v0.3.0 - 2024-04-25
- Added support for Grouped Query Attention.
- Added commonsense_qa and social_iqa downstream evaluation tasks
- Added ce_loss metric, with TriviaQA and NaturalQuestions tasks
- Makes it possible to read from http/https the same way we read from s3/r2.
- Added MMLU multiple choice (A/B/C/D) 5-shot variant downstream tasks
- Tokenizer patch
- Added option to specify number of model replicas when using hybrid sharding.
- Rename Olmo to OLMo everywhere in the codebase
- Disabled automatic garbage collection during training; instead we run it manually at regular intervals to avoid ranks getting out-of-sync with their own gc.
- Removed AMDLayerNorm, since the original layer norm bug has been fixed and we don't need this workaround anymore.
- Removed OLMoParallelBlock.
- Don't log garbage on nodes that aren't rank 0
- Don't crash in the HF code when we are referring to a tokenizer in a local file
- Point official training scripts to publicly available URLs
- Corrected the resize_token_embeddings method in the OLMoForCausalLM class to properly update the token embeddings when resizing the vocabulary.
- Changed tie_weights method to a no-op as weight tying is handled in olmo/model.py
- Fixed the size calculation for qk layer norm
- Fixed pipeline test failure that occurs due to a bug in transformers version 4.39.1
- Make hf_olmo compatible with transformers versions >=4.40.0
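
The garbage collection change in this release (automatic collection disabled, manual collection at regular intervals) can be sketched with Python's standard gc module. The loop structure, function name, and gc_interval parameter below are hypothetical and only show the general pattern: if every rank collects on the same step numbers, the pauses line up across ranks instead of occurring at unpredictable times.

```python
import gc
from typing import List


def train_loop(num_steps: int, gc_interval: int) -> List[int]:
    """Hypothetical loop; returns the steps at which a manual collection ran."""
    gc.disable()  # switch off automatic (threshold-triggered) collection
    collected_at = []
    try:
        for step in range(1, num_steps + 1):
            # ... one training step would run here ...
            if step % gc_interval == 0:
                gc.collect()  # explicit collection on a fixed schedule
                collected_at.append(step)
    finally:
        gc.enable()  # restore the default behavior once training ends
    return collected_at
```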
v0.2.5 - 2024-03-06
- Fixed default value of --tokenizer argument to scripts/prepare_tulu_data.py to be an absolute path, not relative path, so the script can be run from other directories.
- Added the option to directly pass input embeddings to OLMo and OLMoForCausalLM.
- Added support for Python 3.8.
- Added code to throw an error if output_attentions is set to True in forward call to OLMoForCausalLM. This functionality hasn't been implemented yet.
- Correct scheme displayed in error messages that come from R2
- Fixed running with multiple data loading workers in LUMI
- Minor bug fix: uninitialized prompts variable
- Added output_hidden_states argument and associated functionality to OLMo and OLMoForCausalLM to return model intermediate hidden states.
- Ability to read from R2 like we read from S3
- Added MMLU downstream evaluation tasks, with prompt variations.
- Added support for PyTorch v2.2.
- Added ability to show logs from all ranks
- Added option for QKV clipping.
- Added basic_arithmetic downstream evaluation task
- Changed legacy checkpoint unsharding to use processes and shared memory instead of threads
v0.2.4 - 2024-02-02
- Fixed an issue with the HuggingFace integration where we were inadvertently using a feature that was introduced in Python 3.10, causing an error for older Python versions.
v0.2.3 - 2024-01-31
v0.2.2 - 2023-12-10
v0.2.1 - 2023-12-10
v0.2.0 - 2023-12-08
- GPT-based model.
- Tokenizer and data pre-processing pipeline.
- Training script.
- Triton-based FlashAttention.