Conversation

@keshavramji (Contributor) commented Nov 6, 2025

This PR adds a flexible paradigm for evaluating one or more tasks with LLM-as-a-judge evaluation over a set of tests, along with a configurable command-line interface. Given a task and a set of instructions (guidelines) specifying intended model behavior, a user can provide a list of examples (each consisting of an input and, optionally, references/targets) and generate candidate responses from an LLM. A judge LLM is instantiated alongside the generator and used to determine whether the candidates satisfy the guidelines.

  • Adds a TestBasedEval component for evaluation with generator and judge sessions (usage sketched below).
  • Implements the m eval run CLI command, with configurable models and backends for the generator and judge.
  • Supports a flexible test format with multiple examples per test, each containing input-target pairs evaluated against custom instructions.
  • Adds a Jinja template corresponding to the component -- TestBasedEval.jinja2 -- for pass/fail scoring.
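For illustration, a minimal sketch of how the component might be driven from Python. Only TestBasedEval, start_session, and the module path appear in this PR; every constructor argument below is an assumption, not the merged API:

import json

# Illustrative sketch only: TestBasedEval, start_session, and the module path
# come from this PR; all constructor argument names below are assumptions.
from mellea import start_session
from mellea.stdlib.test_based_eval import TestBasedEval

generator = start_session()  # candidate-generating session (defaults assumed)
judge = start_session()      # judge session; model/backend choices omitted

# Hypothetical construction mirroring the test schema discussed below.
evaluator = TestBasedEval(
    instructions="1. The answer should produce correct calculations.",
    examples=[{"input": "2 + 2 = ?", "targets": ["2 + 2 = 4"]}],
)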

@mergify (bot) commented Nov 6, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@avinash2692 avinash2692 self-assigned this Nov 7, 2025
@avinash2692 avinash2692 self-requested a review November 7, 2025 20:05
@avinash2692 avinash2692 removed their assignment Nov 7, 2025
@avinash2692 (Contributor) left a comment:

@keshavramji : Thanks for your contribution!

This might be due to my ignorance, but I'm having some trouble understanding the overall goal of the PR and what a "Test based Evaluation" entails. It might be helpful if you add:

  • More details in your PR description that could later translate into an entry in the library docs.
  • Some tests (which could include sample .json files to show the user what you are looking for in terms of overall schema of input).

Also left some comments on the files here.

.gitignore Outdated
Comment on lines 1 to 7
# KR files
kr_results/
kr_data/
xet/
job.sh
hub/

Contributor (avinash2692):

Could this be removed? I think it's user-specific.

  from mellea.stdlib.test_based_eval import TestBasedEval

- __all__ = ["MelleaSession", "generative", "model_ids", "start_session"]
+ __all__ = ["MelleaSession", "TestBasedEval", "generative", "model_ids", "start_session"]
Contributor (avinash2692):

Is there a reason for this to be a package-level import?

Contributor Author (keshavramji):

Just tested it without having the package level import, confirmed it's fine to remove.

Comment on lines 32 to 57
self.judge_prompt = """**Input to the model**
{input}
**Model output to be rated**
{prediction}
**Ground truth text**
{target}
**Rating Guidelines**
The model output should adhere to the following guidelines:
{guidelines}
**Scoring Criteria**
* Score 0: The model output violates any of the guidelines.
* Score 1: The model output is well aligned with the ground truth - if it exists, the input to the model, and adheres to all guidelines.
**Return Your Rating**
Return your rating in the following format:
{{\"score\": your_score, \"justification\": \"your_justification\"}}
Your rating:
"""
Contributor (avinash2692):

This might be because I don't have a proper understanding of what you're trying to achieve here, but could this be moved to a Jinja template under mellea/templates/prompts to generate a mellea.stdlib.base.TemplateRepresentation? You might also want to use the format_for_llm method in tandem with the TemplateRepresentation to create the Component.

Contributor Author (keshavramji):

I have added a TestBasedEval.jinja2 template and changed test_based_eval.py to use TemplateRepresentation, per your suggestion.
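For readers following along, a rough sketch of that pattern, assuming TemplateRepresentation accepts the object and its template arguments (field names here are assumptions, not the merged code):

from mellea.stdlib.base import Component, TemplateRepresentation

class TestBasedEval(Component):
    # Sketch only: the merged component's fields and signature may differ.
    def __init__(self, input: str, prediction: str, target: str, guidelines: str):
        self.input = input
        self.prediction = prediction
        self.target = target
        self.guidelines = guidelines

    def format_for_llm(self) -> TemplateRepresentation:
        # Renders TestBasedEval.jinja2 with the fields the old inline
        # prompt interpolated by hand.
        return TemplateRepresentation(
            obj=self,
            args={
                "input": self.input,
                "prediction": self.prediction,
                "target": self.target,
                "guidelines": self.guidelines,
            },
        )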

@keshavramji (Contributor Author) commented Nov 25, 2025

Hi @avinash2692, thanks so much for the feedback! I've addressed each of your comments above and have also updated the PR description to further clarify the intended usage. For additional context, here is an example of the intended schema for a test file (in JSON format):

[
  {
    "id": "test1",
    "source": "example_tests",
    "name": "Basic Addition",
    "instructions": "1. The answer should produce correct calculations. 2. The answer format should be consistent with the format described in the input question.",
    "examples": [
      {
        "input_id": "test1.1",
        "input": [
          {
            "role": "user",
            "content": "2 + 2 = ?"
          }
        ],
        "targets": [
          {
            "role": "assistant",
            "content": "2 + 2 = 4"
          }
        ]
      },
      {
        "input_id": "test1.2",
        "input": [
          {
            "role": "user",
            "content": "3 + 5 = ?"
          }
        ],
        "targets": [
          {
            "role": "assistant",
            "content": "3 + 5 = 8"
          }
        ]
      }
    ]
  }
]
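A file in this shape can be loaded and sanity-checked with the standard library alone; a sketch (not code from this PR), with a hypothetical filename:

import json

def load_tests(path: str) -> list[dict]:
    # Load a test file in the schema above and verify required keys.
    # "targets" is left optional, per the PR description.
    with open(path) as f:
        tests = json.load(f)
    for test in tests:
        for key in ("id", "source", "name", "instructions", "examples"):
            if key not in test:
                raise KeyError(f"test {test.get('id', '?')!r} is missing {key!r}")
        for example in test["examples"]:
            if "input_id" not in example or "input" not in example:
                raise KeyError(f"example in test {test['id']!r} lacks 'input_id' or 'input'")
    return tests

tests = load_tests("example_tests.json")  # hypothetical filename
total = sum(len(t["examples"]) for t in tests)
print(f"{len(tests)} test(s), {total} example(s)")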

@avinash2692 (Contributor) left a comment:

LGTM

@avinash2692 avinash2692 merged commit 0f1f0f8 into generative-computing:main Dec 9, 2025
4 checks passed
