Flexible evaluation tool for language models. Easy to extend, highly customizable!
With FlexEval, you can evaluate language models with:
- Zero/few-shot in-context learning tasks
- Open-ended text-generation benchmarks such as MT-Bench with automatic evaluation using GPT-4
- Log-probability-based multiple-choice tasks
- Computing perplexity of text data
For more use cases, see the documentation.
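To make the last two bullets concrete: a log-probability-based multiple-choice task scores each candidate answer by the probability the model assigns to it, and perplexity is the exponentiated average negative log-probability per token. A minimal sketch of the perplexity formula (independent of flexeval's internals; the function name is illustrative):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum of per-token log-probabilities)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns probability 0.25 to each of 4 tokens
# has a perplexity of 4 (it is as "surprised" as a uniform
# choice over 4 options).
print(perplexity([math.log(0.25)] * 4))  # ≈ 4.0
```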
- Flexibility: `flexeval` is flexible in terms of the evaluation setup and the language model to be evaluated.
- Modularity: The core components of `flexeval` are easily extensible and replaceable.
- Clarity: The results of evaluation are clear, and all the details are saved.
- Reproducibility: `flexeval` should be reproducible, with the ability to save and load configurations and results.
```bash
pip install flexeval
```

The following minimal example evaluates the Hugging Face model sbintuitions/sarashina2.2-0.5b with the commonsense_qa task.
```bash
flexeval_lm \
  --language_model HuggingFaceLM \
  --language_model.model "sbintuitions/sarashina2.2-0.5b" \
  --eval_setup "commonsense_qa" \
  --save_dir "results/commonsense_qa"
```

```
...
2025-09-03 16:22:58.434 | INFO | flexeval.core.evaluate_generation:evaluate_generation:92 - {'exact_match': 0.3185913185913186, 'finish_reason_ratio-stop': 1.0, 'avg_output_length': 9.095004095004095, 'max_output_length': 69, 'min_output_length': 2}
...
```
The results saved in `--save_dir` contain:
- `config.json`: The configuration of the evaluation, which can be used to replicate the evaluation.
- `metrics.json`: The evaluation metrics.
- `outputs.jsonl`: The outputs of the language model, along with instance-level metrics.
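As a sketch of how these files can be inspected afterwards (the directory layout follows the description above, but the instance-level field names used here are assumptions and may differ by task; the example fabricates a small results directory so it is runnable on its own):

```python
import json
import tempfile
from pathlib import Path

# Illustrative only: build a fake results directory with the same layout,
# then read it back the way you would read real flexeval output.
save_dir = Path(tempfile.mkdtemp()) / "commonsense_qa"
save_dir.mkdir(parents=True)

# metrics.json holds the aggregate metrics (name taken from the log above).
(save_dir / "metrics.json").write_text(json.dumps({"exact_match": 0.3186}))

# outputs.jsonl holds one JSON object per instance; the keys here
# ("lm_output", "exact_match") are hypothetical placeholders.
with (save_dir / "outputs.jsonl").open("w") as f:
    f.write(json.dumps({"lm_output": "bank", "exact_match": 1}) + "\n")
    f.write(json.dumps({"lm_output": "river", "exact_match": 0}) + "\n")

metrics = json.loads((save_dir / "metrics.json").read_text())
outputs = [json.loads(line) for line in (save_dir / "outputs.jsonl").open()]
print(metrics["exact_match"], len(outputs))  # aggregate score, instance count
```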
You can flexibly customize the evaluation by specifying command-line arguments or configuration files. Besides Transformers models, you can also evaluate models via OpenAI ChatGPT and vLLM, and other models can be readily added!
- Run `flexeval_presets` to check the list of off-the-shelf presets in addition to `commonsense_qa`. You can find the details in the Preset Configs section.
- See Getting Started for tutorial examples of other kinds of tasks.
- See the Configuration Guide to set up your evaluation.
