eval/top level evals #204
base: develop
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,5 @@ | ||
| """Common helpers for evaluation suites.""" | ||
|
|
||
| - from sre_agent.eval.common.case_loader import load_json_case_models | ||
| + from evals.common.case_loader import load_json_case_models | ||
|
|
||
| __all__ = ["load_json_case_models"] |
|
Contributor
I think the command in the run section needs to be updated. The following does not work:
uv sync --group eval
uv run sre-agent-run-diagnosis-quality-eval
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| """Mock tools for diagnosis quality evaluation.""" | ||
|
|
||
| from evals.diagnosis_quality.mocks.runtime import MockToolRuntime | ||
| from evals.diagnosis_quality.mocks.toolset import build_mock_toolset | ||
|
|
||
| __all__ = [ | ||
| "MockToolRuntime", | ||
| "build_mock_toolset", | ||
| ] |
|
Contributor
I think the command in the run section needs to be updated. The following does not work:
uv sync --group eval
uv run sre-agent-run-tool-call-eval
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,9 +1,9 @@ | ||
| """Dataset for tool call evaluation.""" | ||
|
|
||
| - from sre_agent.eval.tool_call.dataset.create_and_populate import ( | ||
| + from evals.tool_call.dataset.create_and_populate import ( | ||
| DEFAULT_DATASET_NAME, | ||
| create_and_populate_dataset, | ||
| ) | ||
| - from sre_agent.eval.tool_call.dataset.schema import ToolCallEvalCase | ||
| + from evals.tool_call.dataset.schema import ToolCallEvalCase | ||
|
|
||
| __all__ = ["create_and_populate_dataset", "ToolCallEvalCase", "DEFAULT_DATASET_NAME"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| """Metrics for tool call evaluation.""" | ||
|
|
||
| from evals.tool_call.metrics.expected_tool_select_order import ExpectedToolSelectOrder | ||
| from evals.tool_call.metrics.expected_tool_selection import ExpectedToolSelection | ||
|
|
||
| __all__ = ["ExpectedToolSelection", "ExpectedToolSelectOrder"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| """Mock tools for tool call evaluation.""" | ||
|
|
||
| from evals.tool_call.mocks.runtime import MockToolRuntime | ||
| from evals.tool_call.mocks.toolset import build_mock_toolset | ||
|
|
||
| __all__ = ["MockToolRuntime", "build_mock_toolset"] |
This file was deleted.
Let’s add `uv sync --group eval` above this to ensure Opik is installed before running the eval suites. We should also make it clear that Opik needs to be set up first. Let's add the below:
"""
Assuming you already have Opik up and running. If not, please refer to the README in either of the eval suites for setup instructions. Once ready, run the following to install prerequisites:
"""