Add binary scoring #37


Open · wants to merge 4 commits into main

Conversation

sumukshashidhar (Member)

Add a very simple LLM-as-a-judge, which does binary scoring of responses: whether or not they align with the produced ground-truth answer. Intended only for purely factual questions.
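
For purely factual questions, the judging step can be as simple as asking a model whether the candidate answer matches the ground truth and mapping the verdict to 1 or 0. The sketch below only illustrates that idea and is not the code in this PR; the judge model, prompt wording, and OpenAI client are assumptions for the example.

# Illustrative binary LLM-as-judge sketch (not the PR's implementation).
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment;
# any chat-completion client and judge model could be substituted.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are a strict grader.
Question: {question}
Ground-truth answer: {ground_truth}
Model answer: {answer}

Reply with exactly one word: CORRECT if the model answer is factually
consistent with the ground truth, otherwise INCORRECT."""

def binary_score(question: str, ground_truth: str, answer: str) -> int:
    """Return 1 if the judge deems the answer aligned with the ground truth, else 0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, ground_truth=ground_truth, answer=answer
            ),
        }],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return 1 if verdict.startswith("CORRECT") else 0

A model's accuracy on a yourbench dataset would then just be the mean of these 0/1 scores over the question set.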

@clefourrier (Member)

I'm not fond of having eval code in yourbench itself as we want to showcase it as a dataset generation library, and I feel it could muddle the point - but if people ask for it we can add it later

@sumukshashidhar (Member, Author)

Hmm, but this would be a simple (binary) way to test out a given model. It would simplify the entire end-to-end pipeline for LLM evals in particular.

@RonanKMcGovern

> I'm not fond of having eval code in yourbench itself as we want to showcase it as a dataset generation library, and I feel it could muddle the point - but if people ask for it we can add it later

Yeah having clear separation could make sense.

But it should probably be straightforward to run evals either way. Right now, it's not straightforward to just take a dataset from yourbench and run it via lighteval (at least for the common case of wanting LLM-as-judge for ground-truth-based evaluations, which requires a custom task and metric and is harder than it looks on the surface).

@clefourrier (Member)

There's a command in lighteval to automatically create a custom task file from a yourbench dataset path, and then you just need to run lighteval.
cc @NathanHB if you could share the command and @sumukshashidhar if you could add it to the readme?

@RonanKMcGovern

Thanks @clefourrier, I was able to find the command. Quite a few things failed though, and it took me an hour to get things running.

Specifically:

  1. One has to install "lighteval[math]" (otherwise there's an error about a missing LaTeX-related package), which is odd and doesn't match the documentation.
  2. Pydantic needs to be installed separately; it doesn't seem to be packaged with lighteval.
  3. The default subset of the dataset is not set to "lighteval", so dataset loading fails.
  4. openai and tiktoken also don't seem to be packaged.

In the end I was able to get things working with hf-inference, but not with an OpenRouter approach. Details below:


Evaluation using LightEval

Get set up by running:

git clone https://github.com/huggingface/lighteval
cd lighteval
uv venv
uv pip install "lighteval[math]" pydantic openai tiktoken

To run an eval, you need a task file. You can generate one automatically for yourbench using this command:

uv run lighteval tasks create examples/custom_tasks_templates/custom_yourbench_task.py YOUR_TASK_NAME HF_DATASET_REPO

e.g. (warning: do not use dashes in the task name):

uv run lighteval tasks create examples/custom_tasks_templates/custom_yourbench_task.py "yourbench_touch_rugby" Trelis/touch-rugby-benchmark

You'll then need to go into that file and update the following to:

    hf_subset="lighteval",
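
If you regenerate the task file repeatedly, a small patch script saves redoing that edit by hand. This is purely a convenience sketch; the default file name below is just an example, so pass the path that `lighteval tasks create` actually produced:

# Convenience sketch: force hf_subset="lighteval" in a generated task file.
# Usage: python patch_subset.py custom_<task_name>_task.py
import re
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "custom_yourbench_touch_rugby_task.py"
with open(path) as f:
    source = f.read()
# Replace whatever value the template assigned to hf_subset.
patched = re.sub(r'hf_subset\s*=\s*[^,\n]+', 'hf_subset="lighteval"', source)
with open(path, "w") as f:
    f.write(patched)
print(f'set hf_subset="lighteval" in {path}')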

With that change in place, the custom task file can be run with lighteval using:

uv run lighteval endpoint inference-providers "model=<model_name>,provider=<provider>" "custom|<task_name>|0|0" --custom-tasks custom_<task_name>_task.py

which will work for any of the supported inference providers, e.g.:

uv run lighteval endpoint inference-providers "model=Qwen/Qwen2.5-72B-Instruct,provider=hf-inference" "custom|yourbench_touch_rugby|0|0" --custom-tasks custom_yourbench_touch_rugby_task.py --output-dir .

In principle, you can also run it like this, although as of April 13th there are issues with the LiteLLM integration failing with 'No module named 'litellm.caching.caching'; 'litellm.caching' is not a package':

uv run lighteval endpoint inference-providers "model=<model_name>,base_url=https://openrouter.ai/api/v1,api_key=$OPENROUTER_API_KEY,max_concurrent_requests=128" "custom|<task_name>|0|0" --custom-tasks custom_<task_name>_task.py

e.g.

lighteval endpoint inference-providers "model=google/gemini-2.0-flash-lite-001,base_url=https://openrouter.ai/api/v1,api_key=$OPENROUTER_API_KEY,max_concurrent_requests=128" "custom|yourbench_touch_rugby|0|0" --custom-tasks custom_yourbench_touch_rugby_task.py --output-dir .

@clefourrier (Member) commented Apr 14, 2025

Thanks a lot for this detailed feedback! We'll use it to fix the integration ASAP.

@NathanHB (Member)

Hey @RonanKMcGovern! Indeed, there are a few things wrong with the dependencies of lighteval; we are working on fixing them!
For the wrong hf_subset issue, you can directly modify the template file, or create another one that reflects your needs, so that you do not have to make the edit every time.
