feat: Test-based Evaluation with LLM-as-a-judge #225
Conversation
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
avinash2692 left a comment
@keshavramji : Thanks for your contribution!
This might be due to my ignorance, but I'm having some trouble understanding the overall goal of the PR and what a "Test based Evaluation" entails. It might be helpful if you add:
- More details to your PR description that might later translate as an entry in the library docs.
- Some tests (which could include sample .json files to show the user what you are looking for in terms of overall schema of input).
Also left some comments on the files here.
.gitignore
Outdated
```
# KR files
kr_results/
kr_data/
xet/
job.sh
hub/
```
Could this be removed? I think it's user-specific.
mellea/__init__.py
Outdated
```diff
+from mellea.stdlib.test_based_eval import TestBasedEval

-__all__ = ["MelleaSession", "generative", "model_ids", "start_session"]
+__all__ = ["MelleaSession", "TestBasedEval", "generative", "model_ids", "start_session"]
```
Is there a reason for this to be a package-level import?
Just tested it without the package-level import, confirmed it's fine to remove.
mellea/stdlib/test_based_eval.py
Outdated
```python
self.judge_prompt = """**Input to the model**
{input}
**Model output to be rated**
{prediction}
**Ground truth text**
{target}
**Rating Guidelines**
The model output should adhere to the following guidelines:
{guidelines}
**Scoring Criteria**
* Score 0: The model output violates any of the guidelines.
* Score 1: The model output is well aligned with the ground truth - if it exists, the input to the model, and adheres to all guidelines.
**Return Your Rating**
Return your rating in the following format:
{{\"score\": your_score, \"justification\": \"your_justification\"}}
Your rating:
"""
```
This might be because I don't have a proper understanding of what you're trying to achieve here, but could this be moved to a Jinja template under mellea/templates/prompts that generates a mellea.stdlib.base.TemplateRepresentation? You might also want to use the format_for_llm method in tandem with the TemplateRepresentation to create the Component. A rough sketch of the suggested pattern is below.
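(Illustrative sketch only: Component, format_for_llm, and TemplateRepresentation are named in this thread, but the exact constructor fields and template path shown here are assumptions, not the confirmed mellea API.)

```python
# Sketch of the suggested pattern; the TemplateRepresentation fields (obj, args)
# and the template location are assumptions for illustration.
from mellea.stdlib.base import Component, TemplateRepresentation


class TestBasedEval(Component):
    """One evaluation test case, rendered through a Jinja template."""

    def __init__(self, input: str, prediction: str, target: str, guidelines: str):
        self.input = input
        self.prediction = prediction
        self.target = target
        self.guidelines = guidelines

    def format_for_llm(self) -> TemplateRepresentation:
        # Intended to render against a template such as
        # mellea/templates/prompts/TestBasedEval.jinja2 (path assumed).
        return TemplateRepresentation(
            obj=self,
            args={
                "input": self.input,
                "prediction": self.prediction,
                "target": self.target,
                "guidelines": self.guidelines,
            },
        )
```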
I have added a TestBasedEval.jinja2 template and changed test_based_eval.py to use TemplateRepresentation, per your suggestion.
Force-pushed from 9d79c79 to 90b8f49.
Hi @avinash2692, thanks very much for the feedback! I've addressed each of your comments above and have also updated the PR description for further clarity on the intended usage. For additional context, here is an example of the intended schema for a test file (in JSON format):
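(The file below is an illustrative sketch of such a schema, with field names chosen to mirror the judge prompt's input, target, and guidelines placeholders; it is not the exact schema from the PR.)

```json
{
  "task": "summarization",
  "guidelines": [
    "The summary must be at most three sentences long.",
    "The summary must not introduce facts that are absent from the input."
  ],
  "examples": [
    {
      "input": "Full text of the document to be summarized ...",
      "target": "Optional reference summary ..."
    }
  ]
}
```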
avinash2692 left a comment
LGTM
This PR adds a flexible paradigm for evaluating one or more desired tasks using LLM-as-a-judge evaluation on a set of tests for those tasks, with a corresponding (configurable) command-line interface. Given a desired task and a set of instructions (guidelines) specifying intended model behavior, a user can provide a list of examples (consisting of inputs and, optionally, references/targets) and generate candidate responses from an LLM. A judge LLM is simultaneously instantiated and used to determine whether the candidates satisfy the guidelines.
- TestBasedEval component for evaluation with generator and judge sessions.
- m eval run CLI command with configurable models and backends for the generator and judge.
- TestBasedEval.jinja2 template for pass/fail scoring.
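For illustration, an invocation might look like the sketch below; only the m eval run command itself is named in this PR, and every flag shown is a hypothetical placeholder rather than the actual CLI surface.

```bash
# Hypothetical invocation: only `m eval run` comes from the PR description;
# all flag names below are illustrative placeholders.
m eval run \
  --test-file tests/summarization.json \
  --generator-model my-generator-model \
  --judge-model my-judge-model \
  --output-dir results/
```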