Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
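As a rough illustration of the underlying idea (not UQLM's actual API), a consistency-based check samples several answers to the same prompt and treats low mutual agreement as a hallucination signal; generate() below is a hypothetical stand-in for a real LLM client.

    from difflib import SequenceMatcher
    from itertools import combinations

    def generate(prompt: str) -> str:
        """Hypothetical stand-in for an LLM call; wire up a real client here."""
        raise NotImplementedError

    def consistency_score(prompt: str, n_samples: int = 5) -> float:
        # Sample several answers and compute mean pairwise string similarity.
        # Scores near 0 mean the model keeps changing its answer, a common
        # proxy signal for confabulation.
        answers = [generate(prompt) for _ in range(n_samples)]
        pairs = list(combinations(answers, 2))
        return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)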
Open multiple AI sites with one click and view their results.
Code scanner to check for issues in prompts and LLM calls
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Cost-of-Pass: An Economic Framework for Evaluating Language Models
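A minimal sketch of the headline quantity as I understand the framing: the expected inference cost of obtaining one correct solution, i.e. cost per attempt divided by pass rate (names and exact definitions in the paper may differ).

    def cost_of_pass(cost_per_attempt_usd: float, pass_rate: float) -> float:
        # Expected spend to obtain one correct solution; unbounded if the
        # model never passes the task.
        if pass_rate <= 0:
            return float("inf")
        return cost_per_attempt_usd / pass_rate

    # Example: $0.02 per attempt at a 40% pass rate -> $0.05 per correct answer.
    print(cost_of_pass(0.02, 0.40))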
LLM-as-a-judge for Extractive QA datasets
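As a generic illustration of the judging step (not this repository's implementation), an extractive-QA judge asks a stronger model whether the predicted span matches the gold span, which is more forgiving than exact-match scoring; judge_llm() is a hypothetical client call.

    def judge_llm(prompt: str) -> str:
        """Hypothetical call to a judge model; replace with a real client."""
        raise NotImplementedError

    def judge_extractive_answer(question: str, gold_span: str, predicted: str) -> bool:
        # Ask the judge to compare the predicted span against the gold span.
        prompt = (
            "You are grading an extractive QA system.\n"
            f"Question: {question}\n"
            f"Gold answer span: {gold_span}\n"
            f"Predicted answer: {predicted}\n"
            "Reply with exactly CORRECT or INCORRECT."
        )
        return judge_llm(prompt).strip().upper().startswith("CORRECT")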
Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)
JudgeGPT: a research project on (fake) news evaluation.
A playbook and Colab-based demo for helping engineering teams adopt Generative AI. Includes a working RAG pipeline using LangChain + Chroma with OpenAI embeddings and GPT-4o, prompt-engineering best practices, and automated LLM evaluations with Ragas.
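A minimal sketch of such a pipeline, assuming current langchain-openai / langchain-chroma package layouts (imports, model names, and the demo's exact code may differ from the repo); Ragas would then score the resulting question/answer/context triples.

    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_chroma import Chroma
    from langchain_core.documents import Document

    # Index a handful of documents in a local Chroma store.
    docs = [Document(page_content="LLM evaluation measures quality, safety and cost."),
            Document(page_content="RAG grounds answers in retrieved context.")]
    store = Chroma.from_documents(docs, OpenAIEmbeddings())
    retriever = store.as_retriever(search_kwargs={"k": 2})

    # Retrieve context and ask GPT-4o to answer only from it.
    question = "What does RAG add to an LLM answer?"
    context = "\n".join(d.page_content for d in retriever.invoke(question))
    llm = ChatOpenAI(model="gpt-4o")
    answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
    print(answer.content)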
RJafroc quick start for those already familiar with Windows JAFROC.
CLI tool to evaluate LLM factuality on the MMLU benchmark.
Final-year project repository containing a peer-review evaluator.
Enter the chamber. Face the gators. Earn the truth.