Conversation

@keshavramji (Contributor) commented Nov 6, 2025

This PR adds a flexible paradigm for evaluating one or more tasks with LLM-as-a-judge evaluation over a set of tests, along with a configurable command-line interface. Given a task and a set of instructions (guidelines) specifying intended model behavior, a user can provide a list of examples (each consisting of an input and, optionally, references/targets) and generate candidate responses from an LLM. A judge LLM is instantiated alongside the generator and used to determine whether the candidates satisfy the guidelines.

  • Adds a TestBasedEval component for evaluation with generator and judge sessions (usage sketched below).
  • Implements the m eval run CLI command, with configurable models and backends for the generator and judge.
  • Supports a flexible test format with multiple examples per test, each containing input-target pairs evaluated against custom instructions.
  • Adds a Jinja template corresponding to the component -- TestBasedEval.jinja2 -- for pass/fail scoring.
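For illustration, a minimal sketch of how the component might be driven from Python. Only TestBasedEval, start_session, and the module path appear in this PR; every constructor argument below is an assumption, not the merged API:

import json

# Illustrative sketch only: TestBasedEval, start_session, and the module path
# come from this PR; all constructor argument names below are assumptions.
from mellea import start_session
from mellea.stdlib.test_based_eval import TestBasedEval

generator = start_session()  # candidate-generating session (defaults assumed)
judge = start_session()      # judge session; model/backend choices omitted

# Hypothetical construction mirroring the test schema discussed below.
evaluator = TestBasedEval(
    instructions="1. The answer should produce correct calculations.",
    examples=[{"input": "2 + 2 = ?", "targets": ["2 + 2 = 4"]}],
)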

@mergify (bot) commented Nov 6, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@avinash2692 avinash2692 self-assigned this Nov 7, 2025
@avinash2692 avinash2692 self-requested a review November 7, 2025 20:05
@avinash2692 avinash2692 removed their assignment Nov 7, 2025
@avinash2692 (Contributor) left a comment:

@keshavramji : Thanks for your contribution!

This might be due to my ignorance, but I'm having some trouble understanding the overall goal of the PR and what a "Test based Evaluation" entails. It might be helpful if you add:

  • More details in your PR description that could later translate into an entry in the library docs.
  • Some tests (which could include sample .json files to show the user what you are looking for in terms of overall schema of input).

Also left some comments on the files here.

.gitignore Outdated
Comment on lines 1 to 7
# KR files
kr_results/
kr_data/
xet/
job.sh
hub/

Contributor (avinash2692):

Could this be removed? I think it's user-specific.

  from mellea.stdlib.test_based_eval import TestBasedEval

- __all__ = ["MelleaSession", "generative", "model_ids", "start_session"]
+ __all__ = ["MelleaSession", "TestBasedEval", "generative", "model_ids", "start_session"]
Contributor (avinash2692):

Is there a reason for this to be a package-level import?

Contributor Author (keshavramji):

Just tested it without having the package level import, confirmed it's fine to remove.

Comment on lines 32 to 57
self.judge_prompt = """**Input to the model**
{input}
**Model output to be rated**
{prediction}
**Ground truth text**
{target}
**Rating Guidelines**
The model output should adhere to the following guidelines:
{guidelines}
**Scoring Criteria**
* Score 0: The model output violates any of the guidelines.
* Score 1: The model output is well aligned with the ground truth - if it exists, the input to the model, and adheres to all guidelines.
**Return Your Rating**
Return your rating in the following format:
{{\"score\": your_score, \"justification\": \"your_justification\"}}
Your rating:
"""
Contributor (avinash2692):

This might be because I don't have a proper understanding of what you're trying to achieve here, but could this be moved to a Jinja template under mellea/templates/prompts to generate a mellea.stdlib.base.TemplateRepresentation? You might also want to use the format_for_llm method in tandem with the TemplateRepresentation to create the Component.

Contributor Author (keshavramji):

I have added a TestBasedEval.jinja2 template and changed test_based_eval.py to use TemplateRepresentation, per your suggestion.
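For readers following along, a rough sketch of that pattern, assuming TemplateRepresentation accepts the object and its template arguments (field names here are assumptions, not the merged code):

from mellea.stdlib.base import Component, TemplateRepresentation

class TestBasedEval(Component):
    # Sketch only: the merged component's fields and signature may differ.
    def __init__(self, input: str, prediction: str, target: str, guidelines: str):
        self.input = input
        self.prediction = prediction
        self.target = target
        self.guidelines = guidelines

    def format_for_llm(self) -> TemplateRepresentation:
        # Renders TestBasedEval.jinja2 with the fields the old inline
        # prompt interpolated by hand.
        return TemplateRepresentation(
            obj=self,
            args={
                "input": self.input,
                "prediction": self.prediction,
                "target": self.target,
                "guidelines": self.guidelines,
            },
        )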

@keshavramji (Contributor Author) commented Nov 25, 2025

Hi @avinash2692, thanks so much for the feedback! I've addressed each of your comments above and have also updated the PR description to further clarify the intended usage. For additional context, here is an example of the intended schema for a test file (in JSON format):

[
  {
    "id": "test1",
    "source": "example_tests",
    "name": "Basic Addition",
    "instructions": "1. The answer should produce correct calculations. 2. The answer format should be consistent with the format described in the input question.",
    "examples": [
      {
        "input_id": "test1.1",
        "input": [
          {
            "role": "user",
            "content": "2 + 2 = ?"
          }
        ],
        "targets": [
          {
            "role": "assistant",
            "content": "2 + 2 = 4"
          }
        ]
      },
      {
        "input_id": "test1.2",
        "input": [
          {
            "role": "user",
            "content": "3 + 5 = ?"
          }
        ],
        "targets": [
          {
            "role": "assistant",
            "content": "3 + 5 = 8"
          }
        ]
      }
    ]
  }
]
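A file in this shape can be loaded and sanity-checked with the standard library alone; a sketch (not code from this PR), with a hypothetical filename:

import json

def load_tests(path: str) -> list[dict]:
    # Load a test file in the schema above and verify required keys.
    # "targets" is left optional, per the PR description.
    with open(path) as f:
        tests = json.load(f)
    for test in tests:
        for key in ("id", "source", "name", "instructions", "examples"):
            if key not in test:
                raise KeyError(f"test {test.get('id', '?')!r} is missing {key!r}")
        for example in test["examples"]:
            if "input_id" not in example or "input" not in example:
                raise KeyError(f"example in test {test['id']!r} lacks 'input_id' or 'input'")
    return tests

tests = load_tests("example_tests.json")  # hypothetical filename
total = sum(len(t["examples"]) for t in tests)
print(f"{len(tests)} test(s), {total} example(s)")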

@avinash2692 (Contributor) left a comment:

LGTM

@avinash2692 avinash2692 merged commit 0f1f0f8 into generative-computing:main Dec 9, 2025
4 checks passed
