Add binary scoring #37
Conversation
I'm not fond of having eval code in yourbench itself as we want to showcase it as a dataset generation library, and I feel it could muddle the point - but if people ask for it we can add it later.
Hmm, but this would be a simple (binary) way to test out a given model. It would simplify the entire end-to-end pipeline for LLM evals in particular.
Yeah, having clear separation could make sense. But it should probably be straightforward to run evals either way. Right now, it's not straightforward to just take a dataset from yourbench and run it via lighteval (at least for the common case of wanting LLM-as-judge for ground-truth-based evaluations, which requires a custom task and metric and is harder than it looks on the surface).
There's a command in lighteval to automatically create a custom task file from a yourbench path, and then you just need to run lighteval.
Thanks @clefourrier, I was able to find the command. Quite a few things failed though, and it took me an hour to get things running.
In the end I was able to get things working with hf-inference, but not using an OpenRouter approach. Details below:

**Evaluation using LightEval**

Get set up by running:

```bash
git clone https://github.com/huggingface/lighteval
cd lighteval
uv venv
uv pip install "litellm[math]" pydantic openai tiktoken
```

To run an eval, you need a task file. You can generate one automatically for yourbench using this command:

```bash
uv run lighteval tasks create examples/custom_tasks_templates/custom_yourbench_task.py YOUR_TASK_NAME HF_DATASET_REPO
```

e.g. (and warning, do not use

```bash
uv run lighteval tasks create examples/custom_tasks_templates/custom_yourbench_task.py "yourbench_touch_rugby" Trelis/touch-rugby-benchmark
```

You'll then need to go into that file and update the following to: `hf_subset="lighteval"`.

That will create a custom task file that can now be run with lighteval using:

```bash
uv run lighteval endpoint inference-providers "model=<model_name>,provider=<provider>" "custom|<task_name>|0|0" --custom-tasks custom_<task_name>_task.py
```

which will work for this list of providers. For example:

```bash
uv run lighteval endpoint inference-providers "model=Qwen/Qwen2.5-72B-Instruct,provider=hf-inference" "custom|yourbench_touch_rugby|0|0" --custom-tasks custom_yourbench_touch_rugby_task.py --output-dir .
```

In principle, you can also run like this, although as of April 13th there are issues with the LiteLLM integration failing with 'No module named 'litellm.caching.caching'; 'litellm.caching' is not a package':

```bash
uv run lighteval endpoint inference-providers "model=<model_name>,base_url=https://openrouter.ai/api/v1,api_key=$OPENROUTER_API_KEY,max_concurrent_requests=128" "custom|<task_name>|0|0" --custom-tasks custom_<task_name>_task.py
```

e.g.

```bash
lighteval endpoint inference-providers "model=google/gemini-2.0-flash-lite-001,base_url=https://openrouter.ai/api/v1,api_key=$OPENROUTER_API_KEY,max_concurrent_requests=128" "custom|yourbench_touch_rugby|0|0" --custom-tasks custom_yourbench_touch_rugby_task.py --output-dir .
```
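As a quick sanity check before launching the eval, it can help to confirm that the dataset actually exposes the `lighteval` subset the task file points at. A minimal sketch, assuming the example repo from this thread and a `train` split (both of which you would swap for your own setup):

```python
# Sanity check: confirm the "lighteval" subset of the yourbench dataset loads,
# and inspect its columns before pointing a lighteval task at it.
# Repo id and split name are assumptions taken from the example in this thread.
from datasets import load_dataset

ds = load_dataset("Trelis/touch-rugby-benchmark", "lighteval", split="train")
print(ds)      # row count and column names
print(ds[0])   # first example: question, ground-truth answer, etc.
```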
Thanks a lot for this detailed feedback! We'll use it to fix the integration ASAP.
Hey @RonanKMcGovern! Indeed, there are a few things wrong with the dependencies of lighteval; we are working on fixing them!
Add a very simple LLM-as-a-judge stage that performs binary scoring of responses: whether or not they align with the generated ground-truth answer. Intended only for purely factual questions.
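For illustration, a minimal sketch of the kind of binary judging described above, assuming an OpenAI-compatible client; the prompt wording, model name, and the `judge_answer` / `binary_accuracy` helpers are illustrative assumptions, not the code added in this PR:

```python
# Illustrative sketch only: a binary LLM-as-a-judge that checks whether a model's
# answer agrees with the ground-truth answer. Names and prompt are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any OpenAI-compatible endpoint works

JUDGE_PROMPT = """You are grading a purely factual question.
Question: {question}
Ground-truth answer: {ground_truth}
Model answer: {answer}

Reply with a single word: CORRECT if the model answer agrees with the
ground-truth answer, INCORRECT otherwise."""


def judge_answer(question: str, ground_truth: str, answer: str,
                 model: str = "gpt-4o-mini") -> int:
    """Return 1 if the judge deems the answer correct, else 0."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, ground_truth=ground_truth, answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1 if verdict.startswith("CORRECT") else 0


def binary_accuracy(rows: list[dict]) -> float:
    """Aggregate 0/1 judgments over rows with question/ground_truth/answer keys."""
    scores = [judge_answer(r["question"], r["ground_truth"], r["answer"]) for r in rows]
    return sum(scores) / len(scores) if scores else 0.0
```

A binary verdict keeps aggregation trivial (just an accuracy over 0/1 scores), which is also why it only makes sense for purely factual questions where agreement with the ground truth is unambiguous.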