
Add notebook for "Evaluating AI Search Engines with the judges Library" #270

Merged
merged 12 commits into from
Jan 31, 2025

Conversation

freddiev4
Contributor

Description

This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.

This PR is a continuation of #257 -- shepherding the PR across!

What is judges?
judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across dimensions such as correctness, quality, and harmfulness. It supports both:

  1. Classifiers (binary evaluations like True/False).
  2. Graders (scored evaluations on numerical scales).

The library also provides an integration with litellm, allowing access to most open- and closed-source models and providers.
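To make the classifier/grader distinction concrete, here is a small self-contained Python sketch. This is NOT the judges API itself — the `Judgment` dataclass and the two functions are toy stand-ins illustrating the two evaluator shapes (binary verdict vs. numerical score); a real judges evaluator would prompt an LLM via litellm instead of using these heuristics.

```python
# Toy illustration of the two evaluator shapes: a classifier returns a
# binary True/False verdict, a grader returns a score on a numerical scale.
# These heuristics stand in for the LLM call a real evaluator would make.
from dataclasses import dataclass

@dataclass
class Judgment:
    score: object   # bool for classifiers, number for graders
    reasoning: str

def classifier_judge(output: str, expected: str) -> Judgment:
    # Classifier: binary evaluation (True/False).
    correct = expected.lower() in output.lower()
    reason = "expected answer found" if correct else "expected answer not found"
    return Judgment(score=correct, reasoning=reason)

def grader_judge(output: str) -> Judgment:
    # Grader: scored evaluation on a 1-5 scale (length heuristic as a stub).
    n_words = len(output.split())
    score = min(5, max(1, n_words // 10 + 1))
    return Judgment(score=score, reasoning=f"{n_words} words")

print(classifier_judge("Paris is the capital of France.", "Paris").score)  # True
print(grader_judge("Paris is the capital of France.").score)
```

With the real library, both kinds of evaluators return a judgment object you can inspect rather than a bare value, which is the pattern this sketch mimics.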

What This Notebook Does

  • Demonstrates how to use judges with litellm to evaluate AI search engine responses.
  • Uses LLaMA 3 (together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo) as the LLM evaluator to assess:
    • Correctness (factual accuracy).
    • Quality (clarity, helpfulness).
  • Provides a step-by-step workflow to evaluate outputs generated by search engines.
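The workflow above can be sketched as a small loop over search-engine responses. Note that `judge_correctness` and `judge_quality` below are deterministic placeholder stubs, not the judges library's evaluators — in the notebook, each would be a judges classifier or grader calling the LLaMA 3 model through litellm. The query and responses are illustrative.

```python
# Sketch of the evaluation loop: run each search engine's response through
# a correctness check and a quality score, then tabulate the results.
# Both judge functions are stubs so this runs without API keys.
QUERY = "What is the capital of France?"
RESPONSES = {
    "gemini": "The capital of France is Paris.",
    "perplexity": "Paris.",
    "exa": "France's capital city is Paris, located on the Seine.",
}

def judge_correctness(query: str, response: str) -> bool:
    # Placeholder for a classifier judge (binary factual-accuracy verdict).
    return "paris" in response.lower()

def judge_quality(query: str, response: str) -> int:
    # Placeholder for a grader judge (clarity/helpfulness on a 1-5 scale).
    return min(5, 2 + len(response.split()) // 4)

results = {
    engine: {
        "correct": judge_correctness(QUERY, resp),
        "quality": judge_quality(QUERY, resp),
    }
    for engine, resp in RESPONSES.items()
}

for engine, scores in results.items():
    print(f"{engine:>10}: correct={scores['correct']} quality={scores['quality']}/5")
```

Swapping the stubs for real judges evaluators keeps the surrounding loop unchanged, which is what makes the comparison across engines straightforward to extend.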

Open-Source Tools & Resources

Why This Notebook?

This notebook provides a practical example of using judges with an open-source model (LLaMA 3) to evaluate real-world AI outputs. It highlights the library's flexibility, ease of integration with litellm, and usefulness for benchmarking AI systems in a transparent, reproducible manner.

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@merveenoyan
Collaborator

I wish you hadn't made this into a new PR; it's harder to track our comments and follow the changes. Can you re-open your former PR and commit these files there instead so we can see the changes clearly?

@freddiev4
Contributor Author

freddiev4 commented Jan 12, 2025

I wish you hadn't made this into a new PR; it's harder to track our comments and follow the changes. Can you re-open your former PR and commit these files there instead so we can see the changes clearly?

👋 @merveenoyan sorry about that! All of the commits from that PR are the same in this one except the most recent one. James won’t be able to finish up that PR for us so I needed to make a new one to ensure it gets the attention it needs — please let me know how else I can help make this smoother.

I’m happy to copy over the comments from the previous PR as well if that helps!

Otherwise, I think the only other option would be to open a PR -on top- of the other one, but you would need to merge as a repo owner since the PR was made by James and not me.

@@ -0,0 +1,1680 @@
{
Collaborator

@merveenoyan merveenoyan Jan 13, 2025


you can use notebook_login instead



@@ -0,0 +1,1680 @@
{
Collaborator

@merveenoyan merveenoyan Jan 13, 2025


you could explain the error or ask users to ignore imo, otherwise it's confusing



Collaborator

@merveenoyan merveenoyan left a comment


I just left some nits, otherwise looks good! @stevhliu should review too

@@ -0,0 +1,1680 @@
{
Member

@stevhliu stevhliu Jan 13, 2025


"...research-backed evaluator prompts..."



@@ -0,0 +1,1680 @@
{
Member

@stevhliu stevhliu Jan 13, 2025


It may be easier to consume this content in table-form

| Judge | What | Why | Source | When to use |
|---|---|---|---|---|
| PollMultihopCorrectness | | | | |
| PrometheusAbsoluteCoarseCorrectness | | | | |
| MTBenchChatBotResponseQuality | | | | |



Member

@stevhliu stevhliu left a comment


Thanks, just a few more comments and then we can merge! 🤗

@@ -12,6 +12,7 @@ Check out the recently added notebooks:
- [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
- [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library)
Member


I'd put this notebook at the top of the list since it's the most recent, and then remove "Fine-tuning SmolVLM with TRL on a consumer GPU" to keep the list tidy

@freddiev4 freddiev4 force-pushed the evaluating-search-engines branch from a757aa3 to d470712 on January 16, 2025 at 18:20
@freddiev4
Contributor Author

Think this should be cleaned up now!

@@ -0,0 +1,1649 @@
{
Member

@stevhliu stevhliu Jan 17, 2025


Can you remove the output here? I think it's giving the doc-builder some issues 😅



Contributor Author


@stevhliu sorry for the delay! should be fixed now :D

@freddiev4 freddiev4 force-pushed the evaluating-search-engines branch from 098d3e0 to 8499a94 on January 31, 2025 at 17:50
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@stevhliu stevhliu left a comment


Thanks again!

@stevhliu stevhliu merged commit 52e7130 into huggingface:main Jan 31, 2025
1 check passed