
Add notebook for "Evaluating AI Search Engines with the judges Library" #270

Merged
merged 12 commits into from
Jan 31, 2025

Conversation

freddiev4
Contributor

Description

This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.

This PR is a continuation of #257 -- shepherding the PR across!

What is judges?
judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across dimensions such as correctness, quality, and harmfulness. It supports both:

  1. Classifiers (binary evaluations like True/False).
  2. Graders (scored evaluations on numerical scales).

The library also provides an integration with litellm, allowing access to most open- and closed-source models and providers.
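To make the classifier/grader distinction concrete, here is a small self-contained Python sketch. This is NOT the judges API itself — the `Judgment` dataclass and the two functions are toy stand-ins illustrating the two evaluator shapes (binary verdict vs. numerical score); a real judges evaluator would prompt an LLM via litellm instead of using these heuristics.

```python
# Toy illustration of the two evaluator shapes: a classifier returns a
# binary True/False verdict, a grader returns a score on a numerical scale.
# These heuristics stand in for the LLM call a real evaluator would make.
from dataclasses import dataclass

@dataclass
class Judgment:
    score: object   # bool for classifiers, number for graders
    reasoning: str

def classifier_judge(output: str, expected: str) -> Judgment:
    # Classifier: binary evaluation (True/False).
    correct = expected.lower() in output.lower()
    reason = "expected answer found" if correct else "expected answer not found"
    return Judgment(score=correct, reasoning=reason)

def grader_judge(output: str) -> Judgment:
    # Grader: scored evaluation on a 1-5 scale (length heuristic as a stub).
    n_words = len(output.split())
    score = min(5, max(1, n_words // 10 + 1))
    return Judgment(score=score, reasoning=f"{n_words} words")

print(classifier_judge("Paris is the capital of France.", "Paris").score)  # True
print(grader_judge("Paris is the capital of France.").score)
```

With the real library, both kinds of evaluators return a judgment object you can inspect rather than a bare value, which is the pattern this sketch mimics.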

What This Notebook Does

  • Demonstrates how to use judges with litellm to evaluate AI search engine responses.
  • Uses LLaMA 3 (together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo) as the LLM evaluator to assess:
    • Correctness (factual accuracy).
    • Quality (clarity, helpfulness).
  • Provides a step-by-step workflow to evaluate outputs generated by search engines.
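The workflow above can be sketched as a small loop over search-engine responses. Note that `judge_correctness` and `judge_quality` below are deterministic placeholder stubs, not the judges library's evaluators — in the notebook, each would be a judges classifier or grader calling the LLaMA 3 model through litellm. The query and responses are illustrative.

```python
# Sketch of the evaluation loop: run each search engine's response through
# a correctness check and a quality score, then tabulate the results.
# Both judge functions are stubs so this runs without API keys.
QUERY = "What is the capital of France?"
RESPONSES = {
    "gemini": "The capital of France is Paris.",
    "perplexity": "Paris.",
    "exa": "France's capital city is Paris, located on the Seine.",
}

def judge_correctness(query: str, response: str) -> bool:
    # Placeholder for a classifier judge (binary factual-accuracy verdict).
    return "paris" in response.lower()

def judge_quality(query: str, response: str) -> int:
    # Placeholder for a grader judge (clarity/helpfulness on a 1-5 scale).
    return min(5, 2 + len(response.split()) // 4)

results = {
    engine: {
        "correct": judge_correctness(QUERY, resp),
        "quality": judge_quality(QUERY, resp),
    }
    for engine, resp in RESPONSES.items()
}

for engine, scores in results.items():
    print(f"{engine:>10}: correct={scores['correct']} quality={scores['quality']}/5")
```

Swapping the stubs for real judges evaluators keeps the surrounding loop unchanged, which is what makes the comparison across engines straightforward to extend.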

Open-Source Tools & Resources

Why This Notebook?

This notebook provides a practical example of using judges with an open-source model (LLaMA 3) to evaluate real-world AI outputs. It highlights the library's flexibility, ease of integration with litellm, and usefulness for benchmarking AI systems in a transparent, reproducible manner.

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@merveenoyan
Collaborator

I wish you hadn't made this into a new PR; it's harder to track our comments and follow the changes. Can you re-open your former PR and commit these files there instead so we can see the changes clearly?

@freddiev4
Contributor Author

freddiev4 commented Jan 12, 2025

I wish you hadn't made this into a new PR; it's harder to track our comments and follow the changes. Can you re-open your former PR and commit these files there instead so we can see the changes clearly?

👋 @merveenoyan sorry about that! All of the commits from that PR are the same in this one except the most recent one. James won’t be able to finish up that PR for us so I needed to make a new one to ensure it gets the attention it needs — please let me know how else I can help make this smoother.

I’m happy to copy over the comments from the previous PR as well if that helps!

Otherwise, I think the only other option would be to open a PR -on top- of the other one, but you would need to merge as a repo owner since the PR was made by James and not me.

@@ -0,0 +1,1680 @@
{
Collaborator

@merveenoyan merveenoyan Jan 13, 2025


you can use notebook_login instead



@@ -0,0 +1,1680 @@
{
Collaborator

@merveenoyan merveenoyan Jan 13, 2025


you could explain the error or ask users to ignore imo, otherwise it's confusing



Collaborator

@merveenoyan merveenoyan left a comment


I just left some nits, otherwise looks good! @stevhliu should review too

@@ -0,0 +1,1680 @@
{
Member

@stevhliu stevhliu Jan 13, 2025


"...research-backed evaluator prompts..."



@@ -0,0 +1,1680 @@
{
Member

@stevhliu stevhliu Jan 13, 2025


It may be easier to consume this content in table-form

| Judge | What | Why | Source | When to use |
|---|---|---|---|---|
| PollMultihopCorrectness | | | | |
| PrometheusAbsoluteCoarseCorrectness | | | | |
| MTBenchChatBotResponseQuality | | | | |



Member

@stevhliu stevhliu left a comment


Thanks, just a few more comments and then we can merge! 🤗

@@ -12,6 +12,7 @@ Check out the recently added notebooks:
- [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
- [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library)
Member


I'd put this notebook at the top of the list since it's the most recent, and then remove "Fine-tuning SmolVLM with TRL on a consumer GPU" to keep the list tidy

@freddiev4 freddiev4 force-pushed the evaluating-search-engines branch from a757aa3 to d470712 on January 16, 2025 at 18:20
@freddiev4
Contributor Author

Think this should be cleaned up now!

@@ -0,0 +1,1649 @@
{
Member

@stevhliu stevhliu Jan 17, 2025


Can you remove the output here? I think it's giving the doc-builder some issues 😅



Contributor Author


@stevhliu sorry for the delay! should be fixed now :D

@freddiev4 freddiev4 force-pushed the evaluating-search-engines branch from 098d3e0 to 8499a94 on January 31, 2025 at 17:50
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@stevhliu stevhliu left a comment


Thanks again!

@stevhliu stevhliu merged commit 52e7130 into huggingface:main Jan 31, 2025
1 check passed