Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
Ranking LLMs on agentic tasks
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
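As a rough illustration of the underlying idea (not UQLM's actual API), a consistency-based check samples several answers to the same prompt and treats low mutual agreement as a hallucination signal; generate() below is a hypothetical stand-in for a real LLM client.

    from difflib import SequenceMatcher
    from itertools import combinations

    def generate(prompt: str) -> str:
        """Hypothetical stand-in for an LLM call; wire up a real client here."""
        raise NotImplementedError

    def consistency_score(prompt: str, n_samples: int = 5) -> float:
        # Sample several answers and compute mean pairwise string similarity.
        # Scores near 0 mean the model keeps changing its answer, a common
        # proxy signal for confabulation.
        answers = [generate(prompt) for _ in range(n_samples)]
        pairs = list(combinations(answers, 2))
        return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)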
Open multiple AI sites with one click and view their results.
Code scanner to check for issues in prompts and LLM calls
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
Cost-of-Pass: An Economic Framework for Evaluating Language Models
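A minimal sketch of the headline quantity as I understand the framing: the expected inference cost of obtaining one correct solution, i.e. cost per attempt divided by pass rate (names and exact definitions in the paper may differ).

    def cost_of_pass(cost_per_attempt_usd: float, pass_rate: float) -> float:
        # Expected spend to obtain one correct solution; unbounded if the
        # model never passes the task.
        if pass_rate <= 0:
            return float("inf")
        return cost_per_attempt_usd / pass_rate

    # Example: $0.02 per attempt at a 40% pass rate -> $0.05 per correct answer.
    print(cost_of_pass(0.02, 0.40))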
LLM-as-a-judge for Extractive QA datasets
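As a generic illustration of the judging step (not this repository's implementation), an extractive-QA judge asks a stronger model whether the predicted span matches the gold span, which is more forgiving than exact-match scoring; judge_llm() is a hypothetical client call.

    def judge_llm(prompt: str) -> str:
        """Hypothetical call to a judge model; replace with a real client."""
        raise NotImplementedError

    def judge_extractive_answer(question: str, gold_span: str, predicted: str) -> bool:
        # Ask the judge to compare the predicted span against the gold span.
        prompt = (
            "You are grading an extractive QA system.\n"
            f"Question: {question}\n"
            f"Gold answer span: {gold_span}\n"
            f"Predicted answer: {predicted}\n"
            "Reply with exactly CORRECT or INCORRECT."
        )
        return judge_llm(prompt).strip().upper().startswith("CORRECT")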
Adaptive Testing Framework for AI Models (Psychometrics in AI Evaluation)
JudgeGPT: a research project on (fake) news evaluation.
A playbook and Colab-based demo for helping engineering teams adopt Generative AI. Includes a working RAG pipeline using LangChain + Chroma with OpenAI embeddings and GPT-4o, prompt-engineering best practices, and automated LLM evaluations with Ragas.
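A minimal sketch of such a pipeline, assuming current langchain-openai / langchain-chroma package layouts (imports, model names, and the demo's exact code may differ from the repo); Ragas would then score the resulting question/answer/context triples.

    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_chroma import Chroma
    from langchain_core.documents import Document

    # Index a handful of documents in a local Chroma store.
    docs = [Document(page_content="LLM evaluation measures quality, safety and cost."),
            Document(page_content="RAG grounds answers in retrieved context.")]
    store = Chroma.from_documents(docs, OpenAIEmbeddings())
    retriever = store.as_retriever(search_kwargs={"k": 2})

    # Retrieve context and ask GPT-4o to answer only from it.
    question = "What does RAG add to an LLM answer?"
    context = "\n".join(d.page_content for d in retriever.invoke(question))
    llm = ChatOpenAI(model="gpt-4o")
    answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
    print(answer.content)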
RJafroc quick start for those already familiar with Windows JAFROC.
CLI tool to evaluate LLM factuality on the MMLU benchmark.
Final-year project repository containing a peer-review evaluator.
Enter the chamber. Face the gators. Earn the truth.