Nuclia Evaluation Library v1.0.0 (#1)
* First commit

* Progress

* Refactor to work with split ctx rel and groundedness

* Improve testing, add CI workflow

* Add temp branch

* Add docstrings

* Validate function call name

* Test CI

* Format

* Update mistral libraries

* Better test

* Fix

* Test new workflow

* Add missing param

* Update workflow

* Update README and CI

* Lint

* Add changelog, improve readme

* Update readme

* Add release script

* Correct capitalization
carlesonielfa authored Jul 23, 2024
1 parent 58b3f97 commit a1b8e4d
Showing 26 changed files with 1,110 additions and 2 deletions.
58 changes: 58 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,58 @@
name: CI

on:
  push:
    branches:
      - main
    paths:
      - 'src/**'
      - 'tests/**'
  pull_request:
    branches:
      - main
    paths:
      - 'src/**'
      - 'tests/**'

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number }}
  cancel-in-progress: true

jobs:
  test:
    name: "Test"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: make install-no-cuda
      - name: Run pre-checks
        run: make lint
      - name: Run tests
        run: make test
      - name: Get coverage
        uses: orgoro/[email protected]
        if: github.event_name == 'pull_request'
        with:
          coverageFile: coverage.xml
          token: ${{ secrets.GITHUB_TOKEN }}
          thresholdAll: 0.9
      - name: Update and push coverage badge
        # Only update the badge on the main branch
        if: github.ref == 'refs/heads/main'
        run: |
          pip install genbadge[coverage]
          genbadge coverage -i coverage.xml
          git config --global user.name 'github-actions[bot]'
          git config --global user.email 'github-actions[bot]@users.noreply.github.com'
          git clone --single-branch --branch gh-pages https://x-access-token:${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }} gh-pages
          mkdir -p gh-pages/badges
          mv coverage-badge.svg gh-pages/badges/coverage.svg
          cd gh-pages
          git add badges/coverage.svg
          git commit -m 'Update coverage badge'
          git push origin gh-pages
1 change: 1 addition & 0 deletions .gitignore
@@ -45,6 +45,7 @@ htmlcov/
.cache
nosetests.xml
coverage.xml
coverage.json
*.cover
*.py,cover
.hypothesis/
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,5 @@
# Changelog

## 1.0.0 (unreleased)

- Initial release
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Nuclia

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
24 changes: 24 additions & 0 deletions Makefile
@@ -0,0 +1,24 @@
install:
	python -m pip install --upgrade pip
	python -m pip install -e .
	python -m pip install -e ".[dev]"

install-no-cuda:
	python -m pip install --upgrade pip
	python -m pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu
	python -m pip install -e ".[dev]"

fmt:
	ruff format .
	ruff check . --select I --fix

lint:
	ruff check .
	ruff format --check .
	mypy .

test:
	pytest -svx . --tb=native

release:
	pip install zest.releaser
	fullrelease
126 changes: 124 additions & 2 deletions README.md
@@ -1,2 +1,124 @@
# nuclia-eval
Library for evaluating RAG using Nuclia's models
<!--- BADGES: START --->
[![Slack](https://img.shields.io/badge/Slack-nuclia-magenta?logo=slack)](https://join.slack.com/t/nuclia-community/shared_invite/zt-2l7jlgi6c-Oohv8j3ygdKOvD_PwZhfdg)
[![HF Nuclia](https://img.shields.io/badge/%F0%9F%A4%97_%20Hugging_Face-nuclia-yellow)](https://huggingface.co/nuclia)
[![GitHub - License](https://img.shields.io/github/license/nuclia/nuclia-eval?logo=github&style=flat&color=green)][#github-license]
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/nuclia-eval?logo=pypi&style=flat&color=blue)][#pypi-package]
[![PyPI - Package Version](https://img.shields.io/pypi/v/nuclia-eval?logo=pypi&style=flat&color=orange)][#pypi-package]
[![Code coverage](https://nuclia.github.io/nuclia-eval/badges/coverage.svg)](https://github.com/nuclia/nuclia-eval/actions)


[#github-license]: https://github.com/nuclia/nuclia-eval/blob/master/LICENSE
[#pypi-package]: https://pypi.org/project/nuclia-eval/
<!--- BADGES: END --->

# nuclia-eval: Evaluate your RAG with nuclia's models
<p align="center">
<img src="assets/Nuclia_vertical.png" width="350" title="nuclia logo" alt="nuclia, the all-in-one RAG as a service platform.">
</p>

Library for evaluating RAG using **nuclia**'s models

Its evaluation follows the RAG triad as proposed by [TruLens](https://www.trulens.org/trulens_eval/getting_started/core_concepts/rag_triad/):

![rag triad](assets/RAG_Triad.jpg)

In summary, the metrics **nuclia-eval** provides for a RAG experience involving a **question**, an **answer**, and N pieces of **context** are:

* **Answer Relevance**: The directness and appropriateness of the response in addressing the specific question asked, providing accurate, complete, and contextually suitable information.
    * **score**: A number between 0 and 5 rating how relevant the answer is to the question.
    * **reason**: A string explaining the reason for the score.
* For each of the N pieces of context:
    * **Context Relevance Score**: The relevance of the **context** piece to the **question**, on a scale of 0 to 5.
    * **Groundedness Score**: The degree to which the **answer** contains information that is substantially similar or identical to that in the **context** piece, on a scale of 0 to 5.
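
All three metrics share the same 0 to 5 scale, so results are easy to post-process. The snippet below is a purely hypothetical helper, not part of **nuclia-eval**: it normalizes the scores to the 0 to 1 range and flags answers whose best groundedness falls below a threshold, using the scores from the Usage example further down.

```python
# Hypothetical helper, not part of nuclia-eval: normalize the 0-5 scores
# returned by the evaluator to the 0-1 range and flag weak groundedness.
from typing import List, Tuple


def summarize_scores(
    answer_relevance: int,
    context_relevances: List[int],
    groundednesses: List[int],
    threshold: float = 0.6,
) -> Tuple[float, float, float, bool]:
    """Return normalized triad scores and whether the answer looks poorly grounded."""
    ans = answer_relevance / 5.0
    best_ctx = max(context_relevances, default=0) / 5.0  # best supporting context
    best_grd = max(groundednesses, default=0) / 5.0      # best groundedness across contexts
    return ans, best_ctx, best_grd, best_grd < threshold


# With the scores from the Usage example below:
print(summarize_scores(4, [5, 1, 0], [2, 0, 0]))  # (0.8, 1.0, 0.4, True)
```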

## Installation

```bash
pip install nuclia-eval
```

## Available Models

### REMi-v0

[REMi-v0](https://huggingface.co/nuclia/REMi-v0) (RAG Evaluation MetrIcs) is a LoRA adapter for the
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) model.

It has been fine-tuned by the team at [**nuclia**](https://nuclia.com) to evaluate the quality of all parts of the RAG experience.
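
The adapter weights live on the Hugging Face Hub. As a minimal sketch only, and not a setup step documented by the library, they could be pre-fetched with `huggingface_hub`, which **nuclia-eval** declares as a dependency; the `local_dir` path is an arbitrary example.

```python
# Sketch only: pre-download the REMi-v0 adapter weights from the Hugging Face Hub.
# The local_dir path is an arbitrary example, not a location the library expects.
from huggingface_hub import snapshot_download

adapter_dir = snapshot_download(
    repo_id="nuclia/REMi-v0",      # LoRA adapter published by nuclia
    local_dir="./models/REMi-v0",  # local cache directory (illustrative)
)
print(f"REMi-v0 adapter downloaded to {adapter_dir}")
```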

## Usage

```python
from nuclia_eval import REMi

evaluator = REMi()

query = "By how many Octaves can I shift my OXYGEN PRO 49 keyboard?"

context1 = """\
* Oxygen Pro 49's keyboard can be shifted 3 octaves down or 4 octaves up.
* Oxygen Pro 61's keyboard can be shifted 3 octaves down or 3 octaves up.
To change the transposition of the keyboard, press and hold Shift, and then use the Key Octave –/+ buttons to lower or raise the keybed by one one, respectively.
The display will temporarily show TRANS and the current transposition (-12 to 12)."""
context2 = """\
To change the octave of the keyboard, use the Key Octave –/+ buttons to lower or raise the octave, respectively
The display will temporarily show OCT and the current octave shift.\n\nOxygen Pro 25's keyboard can be shifted 4 octaves down or 5 octaves up"""
context3 = """\
If your DAW does not automatically configure your Oxygen Pro series keyboard, please follow the setup steps listed in the Oxygen Pro DAW Setup Guides.
To set the keyboard to operate in Preset Mode, press the DAW/Preset Button (on the Oxygen Pro 25) or Preset Button (on the Oxygen Pro 49 and 61).
On the Oxygen Pro 25 the DAW/Preset button LED will be off to show that Preset Mode is selected.
On the Oxygen Pro 49 and 61 the Preset button LED will be lit to show that Preset Mode is selected."""

answer = "Based on the context provided, The Oxygen Pro 49's keyboard can be shifted 3 octaves down or 4 octaves up."

result = evaluator.evaluate_rag(query=query, answer=answer, contexts=[context1, context2, context3])
answer_relevance, context_relevances, groundednesses = result

print(f"{answer_relevance.score}, {answer_relevance.reason}")
# 4, The response is relevant to the entire query and answers it completely, but it could be more specific about the limitations of the keyboard.
print([cr.score for cr in context_relevances]) # [5, 1, 0]
print([g.score for g in groundednesses]) # [2, 0, 0]
```
### Granularity

The **REMi** evaluator provides a fine-grained and strict evaluation of the RAG triad. For instance, if we slightly modify the answer to the query:

```diff
- answer = "Based on the context provided, The Oxygen Pro 49's keyboard can be shifted 3 octaves down or 4 octaves up."
+ answer = "Based on the context provided, the Oxygen Pro 49's keyboard can be shifted 4 octaves down or 4 octaves up."

...

print([g.score for g in groundednesses]) # [0, 0, 0]
```

As the information provided in the answer is not present in any of the contexts, the groundedness score is 0 for all contexts.

What if the answer does not actually address the question?

```diff
- answer = "Based on the context provided, The Oxygen Pro 49's keyboard can be shifted 3 octaves down or 4 octaves up."
+ answer = "Based on the context provided, the Oxygen Pro 61's keyboard can be shifted 3 octaves down or 4 octaves up."

...

print(f"{answer_relevance.score}, {answer_relevance.reason}")
# 1, The response is relevant to the entire query but incorrectly mentions the Oxygen Pro 61 instead of the Oxygen Pro 49
```
### Individual Metrics

We can also compute each metric separately:

```python
...

answer_relevance = evaluator.answer_relevance(query=query, answer=answer)
context_relevances = evaluator.context_relevance(query=query, contexts=[context1, context2, context3])
groundednesses = evaluator.groundedness(answer=answer, contexts=[context1, context2, context3])
...
```

## Feedback and Community

For feedback, questions, or to get in touch with the **nuclia** team, we are available on our [community Slack channel](https://join.slack.com/t/nuclia-community/shared_invite/zt-2l7jlgi6c-Oohv8j3ygdKOvD_PwZhfdg).
Binary file added assets/Nuclia_vertical.png
Binary file added assets/RAG_Triad.jpg
60 changes: 60 additions & 0 deletions pyproject.toml
@@ -0,0 +1,60 @@
[build-system]
requires = ["pdm-backend"]
build-backend = "pdm.backend"

[tool.pdm]
includes = [
    "src/nuclia_eval/py.typed"
]
source = "src"

[tool.pytest.ini_options]
testpaths = ["./tests"]
addopts = "--cov=nuclia_eval --cov-report=xml --cov-report term"

[tool.mypy]
ignore_missing_imports = true

[tool.ruff.lint.isort]
known-first-party = ["nuclia_eval"]

[pytest]
log_cli=true

[project]
name = "nuclia_eval"
version = "1.0.0"
authors = [
    { name="Carmen Iniesta", email="[email protected]" },
    { name="Ramon Navarro", email="[email protected]" },
    { name="Carles Onielfa", email="[email protected]" },
]
description = "Library for evaluating RAG using Nuclia's models"
readme = "README.md"
requires-python = ">=3.8"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
dependencies = [
    "huggingface-hub>=0.23.4",
    "mistral-common>=1.3.1",
    "mistral-inference>=1.3.0",
    "pydantic>=2.6.1",
    "pydantic-settings>=2.2.1",
]

[project.optional-dependencies]
dev = [
    "pytest",
    "pytest-cov",
    "ruff",
    "mypy",
]

[project.urls]
homepage = "https://nuclia.com"
repository = "https://github.com/nuclia/nuclia-eval"
changelog = "https://github.com/nuclia/nuclia-eval/blob/main/CHANGELOG.md"
issues = "https://github.com/nuclia/nuclia-eval/issues"
5 changes: 5 additions & 0 deletions pyrightconfig.json
@@ -0,0 +1,5 @@
{
    "extraPaths": [
        "./src"
    ]
}
5 changes: 5 additions & 0 deletions src/nuclia_eval/__init__.py
@@ -0,0 +1,5 @@
"""nuclia-eval is a library that simplifies evaluating the RAG experience using nuclia's models."""

import logging

logger = logging.getLogger(__name__)
10 changes: 10 additions & 0 deletions src/nuclia_eval/exceptions.py
@@ -0,0 +1,10 @@
class ModelException(Exception):
    """Generic exception for model errors."""

    pass


class InvalidToolCallException(ModelException):
    """Exception for when a model does not generate an output that can be mapped to the desired metric."""

    pass
11 changes: 11 additions & 0 deletions src/nuclia_eval/metrics/__init__.py
@@ -0,0 +1,11 @@
"""This module contains definition of the metrics used to evaluate the quality of the generated answers."""

from nuclia_eval.metrics.answer_relevance import AnswerRelevance
from nuclia_eval.metrics.context_relevance import ContextRelevance
from nuclia_eval.metrics.groundedness import Groundedness

__all__ = [
    "AnswerRelevance",
    "ContextRelevance",
    "Groundedness",
]
44 changes: 44 additions & 0 deletions src/nuclia_eval/metrics/answer_relevance.py
@@ -0,0 +1,44 @@
from nuclia_eval.metrics.base import DiscreteScoreReasonResponse, Metric

ANSWER_RELEVANCE_TEMPLATE = """You are a RELEVANCE grader, tasked with assessing the relevance of a given RESPONSE to a given QUERY and providing a score along with a brief REASON. Relevance refers to the directness and appropriateness of the response in addressing the specific question asked, providing accurate, complete, and contextually suitable information.
Respond by reporting the answer relevance metric with the provided function
Additional scoring guidelines:
- Long and short responses should be scored equally.
- Relevance score should increase as the response provides relevant context to more parts of the query.
- SCORE 0: RESPONSE is relevant to none of the QUERY.
- SCORE 1: RESPONSE is relevant to some parts of the QUERY.
- SCORE 2: RESPONSE is relevant to most parts of the QUERY but contains superfluous information.
- SCORE 3: RESPONSE is relevant to almost all parts of the QUERY or to the entire QUERY but contains superfluous information.
- SCORE 4: RESPONSE is relevant to the entire QUERY.
- SCORE 5: RESPONSE is relevant to the entire QUERY and answers it completely.
The REASON should be brief and clear, explaining why the RESPONSE received the given SCORE. If the SCORE is not a 5, the REASON should contain how the ANSWER could be improved to a 5.
QUERY: {query}
RESPONSE: {answer}
ANSWER RELEVANCE: """

ANSWER_RELEVANCE_TOOL = {
    "type": "function",
    "function": {
        "name": "answer_relevance",
        "description": "The relevance of an answer is its directness and appropriateness in addressing the specific question asked, providing accurate, complete, and contextually suitable information. It ensures clarity and specificity, avoiding extraneous details while fully satisfying the inquiry.",
        "parameters": {
            "type": "object",
            "properties": DiscreteScoreReasonResponse.model_json_schema()["properties"],
            "required": DiscreteScoreReasonResponse.model_json_schema()["required"],
        },
    },
}

AnswerRelevance = Metric(
    template=ANSWER_RELEVANCE_TEMPLATE,
    response_model=DiscreteScoreReasonResponse,
    tool=ANSWER_RELEVANCE_TOOL,
)
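
The `nuclia_eval.metrics.base` module imported above is among the files not shown in this excerpt of the diff. As a rough sketch only, and not the committed implementation, the two names it exports could plausibly look like the following, consistent with how they are used above: a Pydantic model whose `model_json_schema()` feeds the tool definition, and a `Metric` container holding the template, response model, and tool.

```python
# Hypothetical sketch of src/nuclia_eval/metrics/base.py (not shown in this diff).
# It only needs to support the usage above: a Pydantic response model whose JSON
# schema feeds the tool definition, and a container bundling a metric's parts.
from typing import Type

from pydantic import BaseModel, Field


class DiscreteScoreReasonResponse(BaseModel):
    """A discrete 0-5 score together with a brief justification."""

    score: int = Field(..., ge=0, le=5, description="Score between 0 and 5.")
    reason: str = Field(..., description="Brief explanation of the score.")


class Metric(BaseModel):
    """Bundles a metric's prompt template, response model, and tool schema."""

    template: str
    response_model: Type[BaseModel]
    tool: dict
```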
