Update dependency lm-eval to v0.4.7 #292
Open
This PR contains the following updates:
lm-eval: ==0.4.4 -> ==0.4.7
Release Notes
EleutherAI/lm-evaluation-harness (lm-eval)
v0.4.7
lm-eval v0.4.7 Release Notes
This release includes several bug fixes, minor improvements to model handling, and task additions.
Python 3.8 support will be dropped in future releases as it has reached its end of life. Users are encouraged to upgrade to Python 3.9 or newer.
Backwards Incompatibilities
Chat Template Delimiter Handling (in v0.4.6)
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
New Benchmarks & Tasks
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
metrics and filter to logged sample by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2517
until by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2518
loglikelihood_rolling across requests by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2559
DeprecationWarning: invalid escape sequence '\s' for whitespace filter by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2560
New Contributors
Full Changelog: EleutherAI/lm-evaluation-harness@v0.4.6...v0.4.7
v0.4.5
lm-eval v0.4.5 Release Notes
New Additions
Prototype Support for Vision Language Models (VLMs)
We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types hf-multimodal and vllm-vlm. This allows for evaluation of models that can process text and image inputs and produce text outputs. Currently we have added support for the MMMU (mmmu_val) task and we welcome contributions and feedback from the community!
New VLM-Specific Arguments
VLM models can be configured with several new arguments within --model_args to support their specific requirements:
max_images (int): Set the maximum number of images for each prompt.
interleave (bool): Determines the positioning of image inputs. When True (default) images are interleaved with the text. When False all images are placed at the front of the text. This is model dependent.
hf-multimodal specific args:
image_token_id (int) or image_string (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an "<image>" string to indicate the location of images in the input, while Qwen2-VL models expect an "<|image_pad|>" sentinel string instead. This will be inferred based on model configuration files whenever possible, but we recommend confirming that an override is needed when testing a new model family.
convert_img_format (bool): Whether to convert the images to RGB format.
Example usage:
lm_eval --model hf-multimodal --model_args 'pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=<image>' --tasks mmmu_val --apply_chat_template
lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template
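As the commands above show, --model_args is a single comma-separated string of key=value pairs. A minimal sketch of assembling such a command from Python follows. The helper names build_model_args and build_command are hypothetical and are not part of lm-eval; only the flags and values that appear in the examples above are used. Passing the arguments as a list also avoids a shell interpreting the angle brackets in a value such as image_string=<image> as redirections.

```python
def build_model_args(options: dict) -> str:
    """Join options into the comma-separated key=value form used by --model_args."""
    return ",".join(f"{key}={value}" for key, value in options.items())


def build_command(model: str, options: dict, tasks: list) -> list:
    """Return an argument list for the lm_eval CLI (hypothetical helper)."""
    return [
        "lm_eval",
        "--model", model,
        "--model_args", build_model_args(options),
        "--tasks", ",".join(tasks),
        "--apply_chat_template",
    ]


command = build_command(
    "hf-multimodal",
    {
        "pretrained": "llava-hf/llava-1.5-7b-hf",
        "max_images": 1,
        "interleave": True,
        "image_string": "<image>",
    },
    ["mmmu_val"],
)
print(command)

# To actually launch the evaluation (requires lm-eval to be installed):
# import subprocess
# subprocess.run(command, check=True)
```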
Important considerations
--apply_chat_template flag to ensure proper input formatting according to the model's expected chat template.
max_images=1. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring interleave=False.
Tested VLM Models
We have currently most notably tested the implementation with the following models:
transformers from source)
New Tasks
Several new tasks have been contributed to the library for this version!
New tasks as of v0.4.5 include:
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Backwards Incompatibilities
Finalizing group versus tag split
We've now fully deprecated the use of group keys directly within a task's configuration file. The appropriate key to use is now solely tag for many cases. See the v0.4.4 patchnotes for more info on migration, if you have a set of task YAMLs maintained outside the Eval Harness repository.
Handling of Causal vs. Seq2seq backend in HFLM
In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, and the vast majority of current LMs) models previously hinged on a check for self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM. Some users may want to use causal model behavior, but set self.AUTO_MODEL_CLASS to a different factory class, such as transformers.AutoModelForVision2Seq.
As a result, those users who subclass HFLM but do not call HFLM.__init__() may now also need to set the self.backend attribute to either "causal" or "seq2seq" during initialization themselves.
While this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see https://github.com/EleutherAI/lm-evaluation-harness/pull/2353 for the full set of changes.
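The subclassing caveat above can be illustrated with a small, self-contained sketch. HFLMStandIn is a hypothetical stand-in written only for this example; it is not the real HFLM class and mirrors none of its internals beyond the backend attribute described in the notes. The point is only the pattern: a subclass that skips the parent initializer has to assign self.backend itself.

```python
class HFLMStandIn:
    """Hypothetical stand-in for lm-eval's HFLM, for illustration only."""

    def __init__(self, backend: str = "causal") -> None:
        # The real initializer does far more; this one only records the backend.
        self.backend = backend

    def describe_backend(self) -> str:
        # Code that branches on the backend breaks if the attribute was never set.
        if self.backend not in ("causal", "seq2seq"):
            raise ValueError(f"unexpected backend: {self.backend!r}")
        return self.backend


class CustomLM(HFLMStandIn):
    def __init__(self) -> None:
        # The parent initializer is deliberately not called, so nothing
        # sets self.backend unless it is assigned here.
        self.backend = "causal"


lm = CustomLM()
print(lm.describe_backend())
```

Without the assignment in CustomLM.__init__, describe_backend would fail with an AttributeError in this sketch, which is the kind of breakage the release notes warn about for subclasses that bypass HFLM.__init__().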
Future Plans
We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!
Thanks, the LM Eval Harness team (@baberabb @haileyschoelkopf @lintangsutawika)
What's Changed
evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2351
eus_exams task configs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2320
self.backend from AUTO_MODEL_CLASS by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2353
limit_mm_per_prompt by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/2387
New Contributors
Full Changelog: EleutherAI/lm-evaluation-harness@v0.4.4...v0.4.5
Configuration
📅 Schedule: Branch creation - "after 5am on saturday" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
To execute skipped test pipelines write comment /ok-to-test.
This PR has been generated by MintMaker (powered by Renovate Bot).