# Nuclia Evaluation Library v1.0.0 (#1)
* First commit
* Progress
* Refactor to work with split ctx rel and groundedness
* Improve testing, add CI workflow
* Add temp branch
* Add docstrings
* Validate function call name
* Test CI
* Format
* Update mistral libraries
* Better test
* Fix
* Test new workflow
* Add missing param
* Update workflow
* Update README and CI
* Lint
* Add changelog, improve readme
* Update readme
* Add release script
* Correct capitalization
Parent: `58b3f97` · Commit: `a1b8e4d` · 26 changed files with 1,110 additions and 2 deletions.
CI workflow (new file):

```yaml
name: CI

on:
  push:
    branches:
      - main
    paths:
      - 'src/**'
      - 'tests/**'
  pull_request:
    branches:
      - main
    paths:
      - 'src/**'
      - 'tests/**'

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number }}
  cancel-in-progress: true

jobs:
  test:
    name: "Test"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: make install-no-cuda
      - name: Run pre-checks
        run: make lint
      - name: Run tests
        run: make test
      - name: Get coverage
        uses: orgoro/[email protected]
        if: github.event_name == 'pull_request'
        with:
          coverageFile: coverage.xml
          token: ${{ secrets.GITHUB_TOKEN }}
          thresholdAll: 0.9
      - name: Update and push coverage badge
        # Only update the badge on the main branch
        if: github.ref == 'refs/heads/main'
        run: |
          pip install genbadge[coverage]
          genbadge coverage -i coverage.xml
          git config --global user.name 'github-actions[bot]'
          git config --global user.email 'github-actions[bot]@users.noreply.github.com'
          git clone --single-branch --branch gh-pages https://x-access-token:${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }} gh-pages
          mkdir -p gh-pages/badges
          mv coverage-badge.svg gh-pages/badges/coverage.svg
          cd gh-pages
          git add badges/coverage.svg
          git commit -m 'Update coverage badge'
          git push origin gh-pages
```
`.gitignore` (one addition, ignoring the generated `coverage.json`):

```diff
@@ -45,6 +45,7 @@ htmlcov/
 .cache
 nosetests.xml
 coverage.xml
+coverage.json
 *.cover
 *.py,cover
 .hypothesis/
```
`CHANGELOG.md` (new file):

```markdown
# Changelog

## 1.0.0 (unreleased)

- Initial release
```
`LICENSE` (new file, MIT):

```text
MIT License

Copyright (c) 2024 Nuclia

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
`Makefile` (new file):

```makefile
install:
	python -m pip install --upgrade pip
	python -m pip install -e .
	python -m pip install -e ".[dev]"

install-no-cuda:
	python -m pip install --upgrade pip
	python -m pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu
	python -m pip install -e ".[dev]"

fmt:
	ruff format .
	ruff check . --select I --fix

lint:
	ruff check .
	ruff format --check .
	mypy .

test:
	pytest -svx . --tb=native

release:
	pip install zest.releaser
	fullrelease
```
`README.md` (expanded from a two-line stub into the full document below; the badge image texts were lost in extraction and are left empty):

<!--- BADGES: START --->
[](https://join.slack.com/t/nuclia-community/shared_invite/zt-2l7jlgi6c-Oohv8j3ygdKOvD_PwZhfdg)
[](https://huggingface.co/nuclia)
[][#github-license]
[][#pypi-package]
[][#pypi-package]
[](https://github.com/nuclia/nuclia-eval/actions)

[#github-license]: https://github.com/nuclia/nuclia-eval/blob/master/LICENSE
[#pypi-package]: https://pypi.org/project/nuclia-eval/
<!--- BADGES: END --->

# nuclia-eval: Evaluate your RAG with nuclia's models

<p align="center">
  <img src="assets/Nuclia_vertical.png" width="350" title="nuclia logo" alt="nuclia, the all-in-one RAG as a service platform.">
</p>

Library for evaluating RAG using **nuclia**'s models.

Its evaluation follows the RAG triad as proposed by [TruLens](https://www.trulens.org/trulens_eval/getting_started/core_concepts/rag_triad/):



In summary, the metrics **nuclia-eval** provides for a RAG experience involving a **question**, an **answer**, and N pieces of **context** are:

* **Answer Relevance**: the directness and appropriateness of the response in addressing the specific question asked, providing accurate, complete, and contextually suitable information.
  * **score**: a number between 0 and 5 rating the relevance of the answer to the question.
  * **reason**: a string explaining the reason for the score.
* For each of the N pieces of context:
  * **Context Relevance Score**: the relevance of the **context** to the **question**, on a scale of 0 to 5.
  * **Groundedness Score**: the degree of information overlap to which the **answer** contains information that is substantially similar or identical to that in the **context** piece, on a scale of 0 to 5.

## Installation

```bash
pip install nuclia-eval
```

## Available Models

### REMi-v0

[REMi-v0](https://huggingface.co/nuclia/REMi-v0) (RAG Evaluation MetrIcs) is a LoRA adapter for the
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) model.

It has been fine-tuned by the team at [**nuclia**](https://nuclia.com) to evaluate the quality of all parts of the RAG experience.

## Usage

```python
from nuclia_eval import REMi

evaluator = REMi()

query = "By how many Octaves can I shift my OXYGEN PRO 49 keyboard?"

context1 = """\
* Oxygen Pro 49's keyboard can be shifted 3 octaves down or 4 octaves up.
* Oxygen Pro 61's keyboard can be shifted 3 octaves down or 3 octaves up.
To change the transposition of the keyboard, press and hold Shift, and then use the Key Octave -/+ buttons to lower or raise the keybed by one, respectively.
The display will temporarily show TRANS and the current transposition (-12 to 12)."""
context2 = """\
To change the octave of the keyboard, use the Key Octave -/+ buttons to lower or raise the octave, respectively.
The display will temporarily show OCT and the current octave shift.\n\nOxygen Pro 25's keyboard can be shifted 4 octaves down or 5 octaves up"""
context3 = """\
If your DAW does not automatically configure your Oxygen Pro series keyboard, please follow the setup steps listed in the Oxygen Pro DAW Setup Guides.
To set the keyboard to operate in Preset Mode, press the DAW/Preset Button (on the Oxygen Pro 25) or Preset Button (on the Oxygen Pro 49 and 61).
On the Oxygen Pro 25 the DAW/Preset button LED will be off to show that Preset Mode is selected.
On the Oxygen Pro 49 and 61 the Preset button LED will be lit to show that Preset Mode is selected."""

answer = "Based on the context provided, The Oxygen Pro 49's keyboard can be shifted 3 octaves down or 4 octaves up."

result = evaluator.evaluate_rag(query=query, answer=answer, contexts=[context1, context2, context3])
answer_relevance, context_relevances, groundednesses = result

print(f"{answer_relevance.score}, {answer_relevance.reason}")
# 4, The response is relevant to the entire query and answers it completely, but it could be more specific about the limitations of the keyboard.
print([cr.score for cr in context_relevances])  # [5, 1, 0]
print([g.score for g in groundednesses])  # [2, 0, 0]
```
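The three triad scores can be aggregated however suits your pipeline. As a minimal illustrative sketch only — `triad_score` and its equal weighting are hypothetical, not part of the nuclia-eval API:

```python
# Illustrative only: collapse RAG-triad scores into one 0-5 summary.
# `triad_score` and the equal weighting are hypothetical helpers,
# not part of the nuclia-eval API.
def triad_score(answer_relevance, context_relevances, groundednesses):
    # Average the per-context scores, then weight the three
    # triad components equally.
    avg_context = sum(context_relevances) / len(context_relevances)
    avg_groundedness = sum(groundednesses) / len(groundednesses)
    return (answer_relevance + avg_context + avg_groundedness) / 3

# Scores from the example above: answer relevance 4,
# context relevances [5, 1, 0], groundednesses [2, 0, 0].
print(round(triad_score(4, [5, 1, 0], [2, 0, 0]), 2))  # 2.22
```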
### Granularity

The **REMi** evaluator provides a fine-grained and strict evaluation of the RAG triad. For instance, if we slightly modify the answer to the query:

```diff
- answer = "Based on the context provided, The Oxygen Pro 49's keyboard can be shifted 3 octaves down or 4 octaves up."
+ answer = "Based on the context provided, the Oxygen Pro 49's keyboard can be shifted 4 octaves down or 4 octaves up."

...

print([g.score for g in groundednesses])  # [0, 0, 0]
```

As the information provided in the answer is not present in any of the contexts, the groundedness score is 0 for all contexts.

What if the information in the answer does not answer the question?

```diff
- answer = "Based on the context provided, The Oxygen Pro 49's keyboard can be shifted 3 octaves down or 4 octaves up."
+ answer = "Based on the context provided, the Oxygen Pro 61's keyboard can be shifted 3 octaves down or 4 octaves up."

...

print(f"{answer_relevance.score}, {answer_relevance.reason}")
# 1, The response is relevant to the entire query but incorrectly mentions the Oxygen Pro 61 instead of the Oxygen Pro 49
```

### Individual Metrics

We can also compute each metric separately:

```python
...

answer_relevance = evaluator.answer_relevance(query=query, answer=answer)
context_relevances = evaluator.context_relevance(query=query, contexts=[context1, context2, context3])
groundednesses = evaluator.groundedness(answer=answer, contexts=[context1, context2, context3])
...
```

## Feedback and Community

For feedback, questions, or to get in touch with the **nuclia** team, we are available on our [community Slack channel](https://join.slack.com/t/nuclia-community/shared_invite/zt-2l7jlgi6c-Oohv8j3ygdKOvD_PwZhfdg).
(Two binary image files were also added; they cannot be rendered in the diff view.)
`pyproject.toml` (new file):

```toml
[build-system]
requires = ["pdm-backend"]
build-backend = "pdm.backend"

[tool.pdm]
includes = [
    "src/nuclia_eval/py.typed"
]
source = "src"

[tool.pytest.ini_options]
testpaths = ["./tests"]
addopts = "--cov=nuclia_eval --cov-report=xml --cov-report term"

[tool.mypy]
ignore_missing_imports = true

[tool.ruff.lint.isort]
known-first-party = ["nuclia_eval"]

[pytest]
log_cli = true

[project]
name = "nuclia_eval"
version = "1.0.0"
authors = [
    { name="Carmen Iniesta", email="[email protected]" },
    { name="Ramon Navarro", email="[email protected]" },
    { name="Carles Onielfa", email="[email protected]" },
]
description = "Library for evaluating RAG using Nuclia's models"
readme = "README.md"
requires-python = ">=3.8"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
dependencies = [
    "huggingface-hub>=0.23.4",
    "mistral-common>=1.3.1",
    "mistral-inference>=1.3.0",
    "pydantic>=2.6.1",
    "pydantic-settings>=2.2.1",
]

[project.optional-dependencies]
dev = [
    "pytest",
    "pytest-cov",
    "ruff",
    "mypy",
]

[project.urls]
homepage = "https://nuclia.com"
repository = "https://github.com/nuclia/nuclia-eval"
changelog = "https://github.com/nuclia/nuclia-eval/blob/main/CHANGELOG.md"
issues = "https://github.com/nuclia/nuclia-eval/issues"
```
Pyright configuration (new file):

```json
{
    "extraPaths": [
        "./src"
    ]
}
```
Package `__init__.py`:

```python
"""nuclia-eval is a library that simplifies evaluating the RAG experience using nuclia's models."""

import logging

logger = logging.getLogger(__name__)
```
Exceptions module:

```python
class ModelException(Exception):
    """Generic exception for model errors."""

    pass


class InvalidToolCallException(ModelException):
    """Exception for when a model does not generate an output that can be mapped to the desired metric."""

    pass
```
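The commit log above mentions "Validate function call name", which suggests `InvalidToolCallException` guards the tool call the model emits. The sketch below illustrates that idea under stated assumptions: only the two exception classes come from the diff; the `parse_tool_call` helper and the JSON tool-call shape are hypothetical.

```python
import json


class ModelException(Exception):
    """Generic exception for model errors."""


class InvalidToolCallException(ModelException):
    """Exception for when a model does not generate an output that can be mapped to the desired metric."""


def parse_tool_call(raw: str, expected_name: str) -> dict:
    # Hypothetical helper: expects the model to reply with a JSON tool
    # call such as {"name": "answer_relevance", "arguments": {...}}.
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidToolCallException(f"model output is not valid JSON: {exc}") from exc
    if call.get("name") != expected_name:
        raise InvalidToolCallException(
            f"expected tool call {expected_name!r}, got {call.get('name')!r}"
        )
    return call.get("arguments", {})


args = parse_tool_call(
    '{"name": "answer_relevance", "arguments": {"score": 4, "reason": "ok"}}',
    "answer_relevance",
)
print(args)  # {'score': 4, 'reason': 'ok'}
```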
Metrics package `__init__.py`:

```python
"""This module contains the definitions of the metrics used to evaluate the quality of the generated answers."""

from nuclia_eval.metrics.answer_relevance import AnswerRelevance
from nuclia_eval.metrics.context_relevance import ContextRelevance
from nuclia_eval.metrics.groundedness import Groundedness

__all__ = [
    "AnswerRelevance",
    "ContextRelevance",
    "Groundedness",
]
```
Answer relevance metric module:

```python
from nuclia_eval.metrics.base import DiscreteScoreReasonResponse, Metric

ANSWER_RELEVANCE_TEMPLATE = """You are a RELEVANCE grader, tasked with assessing the relevance of a given RESPONSE to a given QUERY and providing a score along with a brief REASON. Relevance refers to the directness and appropriateness of the response in addressing the specific question asked, providing accurate, complete, and contextually suitable information.
Respond by reporting the answer relevance metric with the provided function
Additional scoring guidelines:
- Long and short responses should be scored equally.
- Relevance score should increase as the response provides relevant context to more parts of the query.
- SCORE 0: RESPONSE is relevant to none of the QUERY.
- SCORE 1: RESPONSE is relevant to some parts of the QUERY.
- SCORE 2: RESPONSE is relevant to most parts of the QUERY but contains superfluous information.
- SCORE 3: RESPONSE is relevant to almost all parts of the QUERY or to the entire QUERY but contains superfluous information.
- SCORE 4: RESPONSE is relevant to the entire QUERY.
- SCORE 5: RESPONSE is relevant to the entire QUERY and answers it completely.
The REASON should be brief and clear, explaining why the RESPONSE received the given SCORE. If the SCORE is not a 5, the REASON should contain how the ANSWER could be improved to a 5.
QUERY: {query}
RESPONSE: {answer}
ANSWER RELEVANCE: """

ANSWER_RELEVANCE_TOOL = {
    "type": "function",
    "function": {
        "name": "answer_relevance",
        "description": "The relevance of an answer is its directness and appropriateness in addressing the specific question asked, providing accurate, complete, and contextually suitable information. It ensures clarity and specificity, avoiding extraneous details while fully satisfying the inquiry.",
        "parameters": {
            "type": "object",
            "properties": DiscreteScoreReasonResponse.model_json_schema()["properties"],
            "required": DiscreteScoreReasonResponse.model_json_schema()["required"],
        },
    },
}

AnswerRelevance = Metric(
    template=ANSWER_RELEVANCE_TEMPLATE,
    response_model=DiscreteScoreReasonResponse,
    tool=ANSWER_RELEVANCE_TOOL,
)
```
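The tool definition above embeds the JSON schema of the response model. `DiscreteScoreReasonResponse` lives in `nuclia_eval.metrics.base`, which this diff does not include, so the sketch below substitutes a guessed Pydantic model (the 0-5 bound on `score` and the field descriptions are assumptions) purely to show where the `properties`/`required` fragments come from:

```python
from pydantic import BaseModel, Field


# Hypothetical stand-in for nuclia_eval.metrics.base.DiscreteScoreReasonResponse,
# which is not part of this diff: a discrete 0-5 score plus a textual reason.
class DiscreteScoreReasonResponse(BaseModel):
    score: int = Field(ge=0, le=5, description="Score between 0 and 5")
    reason: str = Field(description="Brief explanation of the score")


schema = DiscreteScoreReasonResponse.model_json_schema()
# These two fragments are exactly what ANSWER_RELEVANCE_TOOL embeds
# under "parameters".
print(sorted(schema["properties"]))  # ['reason', 'score']
print(sorted(schema["required"]))    # ['reason', 'score']
```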