Add binary scoring #37


Open · wants to merge 4 commits into main

Conversation

sumukshashidhar (Member)

Add a very simple LLM-as-a-judge, which does binary scoring of responses: whether or not they align with the produced ground-truth answer. Intended only for purely factual questions.
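
For purely factual questions, the judging step can be as simple as asking a model whether the candidate answer matches the ground truth and mapping the verdict to 1 or 0. The sketch below only illustrates that idea and is not the code in this PR; the judge model, prompt wording, and OpenAI client are assumptions for the example.

# Illustrative binary LLM-as-judge sketch (not the PR's implementation).
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment;
# any chat-completion client and judge model could be substituted.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict grader.
Question: {question}
Ground-truth answer: {ground_truth}
Model answer: {answer}

Reply with exactly one word: CORRECT if the model answer is factually
consistent with the ground truth, otherwise INCORRECT."""

def binary_score(question: str, ground_truth: str, answer: str) -> int:
    """Return 1 if the judge deems the answer aligned with the ground truth, else 0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, ground_truth=ground_truth, answer=answer
            ),
        }],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return 1 if verdict.startswith("CORRECT") else 0

A model's accuracy on a yourbench dataset would then just be the mean of these 0/1 scores over the question set.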

@clefourrier (Member)

I'm not fond of having eval code in yourbench itself as we want to showcase it as a dataset generation library, and I feel it could muddle the point - but if people ask for it we can add it later

@sumukshashidhar (Member, Author)

Hmm, but this would be a simple (binary) way to test out a given model. It would simplify the entire end-to-end pipeline for LLM evals in particular.

@RonanKMcGovern

> I'm not fond of having eval code in yourbench itself as we want to showcase it as a dataset generation library, and I feel it could muddle the point - but if people ask for it we can add it later

Yeah having clear separation could make sense.

But it should probably be straightforward to run evals either way. Right now, it's not straightforward to just take a dataset from yourbench and run it via lighteval (at least for the common case of wanting LLM-as-judge for ground-truth-based evaluations, which requires a custom task and metric and is harder than it looks on the surface).

@clefourrier (Member)

There's a command in lighteval to automatically create a custom task file from a yourbench dataset path, and then you just need to run lighteval.
cc @NathanHB if you could share the command and @sumukshashidhar if you could add it to the readme?

@RonanKMcGovern

Thanks @clefourrier, I was able to find the command. Quite a few things failed though, and it took me an hour to get things running.

Specifically:

  1. One has to install "lighteval[math]" (otherwise there's an error about a missing LaTeX-related package), which is odd and doesn't match the documentation.
  2. Pydantic needs to be installed separately; it doesn't seem to be packaged with lighteval.
  3. The default subset of the dataset is not set to "lighteval", so dataset loading fails.
  4. openai and tiktoken also don't seem to be packaged.

In the end I was able to get things working with hf-inference, but not with an OpenRouter approach. Details below:


Evaluation using LightEval

Get set up by running:

git clone https://github.com/huggingface/lighteval
cd lighteval
uv venv
uv pip install "lighteval[math]" pydantic openai tiktoken

To run an eval, you need a task file. You can generate one automatically for yourbench using this command:

uv run lighteval tasks create examples/custom_tasks_templates/custom_yourbench_task.py YOUR_TASK_NAME HF_DATASET_REPO

e.g. (warning: do not use dashes in the task name):

uv run lighteval tasks create examples/custom_tasks_templates/custom_yourbench_task.py "yourbench_touch_rugby" Trelis/touch-rugby-benchmark

You'll then need to go into that file and update the following to:

    hf_subset="lighteval",
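
If you regenerate the task file repeatedly, a small patch script saves redoing that edit by hand. This is purely a convenience sketch; the default file name below is just an example, so pass the path that `lighteval tasks create` actually produced:

# Convenience sketch: force hf_subset="lighteval" in a generated task file.
# Usage: python patch_subset.py custom_<task_name>_task.py
import re
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "custom_yourbench_touch_rugby_task.py"
with open(path) as f:
    source = f.read()
# Replace whatever value the template assigned to hf_subset.
patched = re.sub(r'hf_subset\s*=\s*[^,\n]+', 'hf_subset="lighteval"', source)
with open(path, "w") as f:
    f.write(patched)
print(f'set hf_subset="lighteval" in {path}')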

With that change in place, the custom task file can be run with lighteval using:

uv run lighteval endpoint inference-providers "model=<model_name>,provider=<provider>" "custom|<task_name>|0|0" --custom-tasks custom_<task_name>_task.py

which will work for any of the supported inference providers, e.g.:

uv run lighteval endpoint inference-providers "model=Qwen/Qwen2.5-72B-Instruct,provider=hf-inference" "custom|yourbench_touch_rugby|0|0" --custom-tasks custom_yourbench_touch_rugby_task.py --output-dir .

In principle, you can also run it like this, although as of April 13th there are issues with the LiteLLM integration failing with 'No module named 'litellm.caching.caching'; 'litellm.caching' is not a package':

uv run lighteval endpoint inference-providers "model=<model_name>,base_url=https://openrouter.ai/api/v1,api_key=$OPENROUTER_API_KEY,max_concurrent_requests=128" "custom|<task_name>|0|0" --custom-tasks custom_<task_name>_task.py

e.g.

lighteval endpoint inference-providers "model=google/gemini-2.0-flash-lite-001,base_url=https://openrouter.ai/api/v1,api_key=$OPENROUTER_API_KEY,max_concurrent_requests=128" "custom|yourbench_touch_rugby|0|0" --custom-tasks custom_yourbench_touch_rugby_task.py --output-dir .

@clefourrier (Member) commented Apr 14, 2025

Thanks a lot for this detailed feedback! We'll use it to fix the integration ASAP.

@NathanHB (Member)

Hey @RonanKMcGovern! Indeed, there are a few things wrong with the dependencies of lighteval; we are working on fixing them!
For the wrong hf_subset issue, you can directly modify the template file, or create another one that reflects your needs, so that you do not have to make the edit every time.
