
HALLMARK

HALLucination benchMARK: A benchmark for evaluating citation hallucination detection tools.


Why HALLMARK?

The NeurIPS 2025 incident, in which 53 papers were found to contain fabricated citations that passed peer review, exposed a critical gap: we have no standardized way to measure how well tools detect citation hallucinations. HALLMARK fills this gap.

HALLMARK draws on best practices from established benchmarks:

  • HumanEval: Multi-criteria sub-tests per entry (~6 checks per citation)
  • SWE-bench: Contamination awareness via temporal segmentation
  • LiveCodeBench: Continuous updates and post-cutoff evaluation
  • ONEBench: Sample-level atomic evaluation with ever-expanding pool

Features

  • Hallucination taxonomy: 14 types across 3 difficulty tiers (Easy / Medium / Hard)
  • 2,525 annotated entries: 773 valid (from DBLP) + 1,177 hallucinated with ground truth (public splits)
  • 6 sub-tests per entry: DOI resolution, title matching, author consistency, venue verification, field completeness, cross-database agreement
  • Evaluation metrics: Detection Rate, F1, tier-weighted F1, detect@k, ECE
  • Built-in baselines: DOI-only, bibtex-updater, LLM-based (OpenAI, Anthropic, OpenRouter), ensemble, HaRC, and verify-citations; CiteVerifier and hallucinator ship as wrapper modules but are not registered in the default registry
  • Baseline registry: Central discovery, availability checking, and dispatch for all baselines (17 variants)
  • Plackett-Luce ranking: ONEBench-inspired ranking that handles incomplete evaluation data
  • Automated execution: Orchestrator script and CI workflow for batch baseline evaluation
  • Temporal analysis: Contamination detection via pre/post-cutoff comparison
  • Community contributions: ONEBench-style ever-expanding sample pool

Installation

# Recommended: clone and install in development mode
git clone https://github.com/rpatrik96/hallmark.git
cd hallmark
uv pip install -e ".[dev]"

# With LLM baseline SDKs (openai, anthropic)
uv pip install -e ".[baselines]"

# With ranking support (Plackett-Luce model via choix)
uv pip install -e ".[ranking]"

# All optional dependencies
uv pip install -e ".[all]"

Note: hallmark is not yet published to PyPI, so pip install hallmark will not work. Use the clone + install path above.

Baseline Installation Guide

The [baselines] extra installs only the LLM SDKs (openai, anthropic). External CLI tools require separate installation due to a bibtexparser 1.x dependency conflict:

# HaRC
pipx install harcx

# bibtex-updater
pipx install bibtex-updater

# verify-citations
pipx install verify-citations

# CiteVerifier (GhostCite) — clone required
git clone https://github.com/NKU-AOSP-Lab/CiteVerifier

# hallucinator — clone required
git clone https://github.com/gianlucasb/hallucinator

Using pipx isolates each tool's bibtexparser 1.x from your project environment.

Quick Start

Evaluate a built-in baseline

# Run DOI-only baseline on the dev split
hallmark evaluate --split dev_public --baseline doi_only

# Run with custom predictions
hallmark evaluate --split dev_public --predictions my_predictions.jsonl --tool-name my-tool

Show dataset statistics

hallmark stats --split dev_public

Run all baselines at once

# Run all free baselines and generate leaderboard
python scripts/run_all_baselines.py --split dev_public --output-dir results/

# Run specific baselines in parallel
python scripts/run_all_baselines.py --baselines doi_only,bibtexupdater --parallel

# Run only free (no API key) baselines, skip unavailable
python scripts/run_all_baselines.py --baselines free --skip-unavailable

View the leaderboard

hallmark leaderboard --results-dir results/

See examples/ for full walkthroughs, including writing a custom baseline and per-type analysis.

Evaluate Your Tool

To evaluate any external tool against HALLMARK, produce a JSONL file with one prediction per line and run:

hallmark evaluate --predictions my_preds.jsonl --split dev_public

Each prediction must include:

{
  "bibtex_key": "a3f9c2b1...",
  "label": "HALLUCINATED",
  "confidence": 0.87,
  "reason": "DOI does not resolve",
  "subtest_results": {"doi_resolves": false},
  "api_sources_queried": ["crossref"],
  "wall_clock_seconds": 1.2,
  "api_calls": 1
}

bibtex_key format: Keys in the benchmark are hex hashes (e.g., a3f9c2b1d4e7...), not human-readable keys like vaswani2017attention. Your predictions must use the exact keys from the loaded entries — use entry.bibtex_key when iterating over load_split() results.

See examples/03_custom_baseline.py for a complete end-to-end example.

Prediction Fields

| Field | Required | Affects |
| --- | --- | --- |
| bibtex_key | Yes | Entry matching |
| label | Yes | All metrics |
| confidence | Yes | ECE, AUROC, AUPRC |
| reason | No | Diagnose output |
| subtest_results | No | Subtest accuracy |
| api_sources_queried | No | Source-stratified metrics |
| wall_clock_seconds | No | Cost efficiency |
| api_calls | No | Mean API calls |

UNCERTAIN label: UNCERTAIN is accepted as a prediction label. UNCERTAIN predictions are treated as VALID for confusion-matrix metrics (conservative default) and excluded from AUROC/AUPRC. Prefer VALID or HALLUCINATED with calibrated confidence when possible.

Confidence semantics: confidence = P(your predicted label is correct). If you predict HALLUCINATED with 0.9, you claim 90% certainty it is hallucinated. If you predict VALID with 0.8, you claim 80% certainty it is valid. This is NOT P(HALLUCINATED).
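Because confidence is attached to the predicted label, AUROC-style metrics need a conversion step. A hypothetical helper (`p_hallucinated` is not part of the package) that maps a prediction to P(HALLUCINATED) under the semantics above:

```python
def p_hallucinated(label: str, confidence: float) -> float:
    """Convert (label, confidence) into P(HALLUCINATED).

    confidence = P(predicted label is correct), per the benchmark's semantics."""
    if label == "HALLUCINATED":
        return confidence
    if label == "VALID":
        return 1.0 - confidence
    # UNCERTAIN has no probabilistic reading and is excluded from AUROC/AUPRC
    raise ValueError(f"no probabilistic reading for label {label!r}")
```

For example, a VALID prediction at confidence 0.8 corresponds to P(HALLUCINATED) = 0.2.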

Hallucination Taxonomy

Tier 1: Easy (detectable by simple API lookup)

| Type | Description | Example |
| --- | --- | --- |
| fabricated_doi | DOI that doesn't resolve | doi = {10.9999/fake.2024.001} |
| nonexistent_venue | Invented journal/conference | booktitle = {Intl. Conf. on Advanced AI Systems} |
| placeholder_authors | Generic/fake author names | author = {John Doe and Jane Smith} |
| future_date | Publication year in the future | year = {2030} |

Tier 2: Medium (requires cross-referencing metadata)

| Type | Description | Example |
| --- | --- | --- |
| chimeric_title | Real author + fabricated title | Real authors, plausible but non-existent paper |
| wrong_venue | Real paper, wrong venue/year | Correct title but at ICML, not NeurIPS |
| author_mismatch | Author list swapped or fabricated (data value: swapped_authors) | Correct title, wrong author list |
| preprint_as_published | arXiv paper cited as venue paper | Correct paper, fabricated venue acceptance |
| hybrid_fabrication | Real DOI + fabricated metadata | Valid DOI resolves but authors/title don't match |
| merged_citation | Metadata from 2-3 papers merged | Authors from paper A, title from paper B |
| partial_author_list | Subset of real author list | First and last author only, middle dropped |

Tier 3: Hard (requires deep verification)

| Type | Description | Example |
| --- | --- | --- |
| near_miss_title | Title off by 1-2 words | "Attention Is All You Want" vs "...Need" |
| plausible_fabrication | Entirely fabricated but realistic | Realistic author + plausible title |
| arxiv_version_mismatch | Mixed preprint/published metadata | arXiv ID with conference venue claim |

Dataset

Splits

| Split | Valid | Hallucinated | Total | Purpose |
| --- | --- | --- | --- | --- |
| dev_public | 486 | 633 | 1,119 | Development and tuning |
| test_public | 287 | 544 | 831 | Public leaderboard |
| test_hidden | (hidden) | (hidden) | 453 | Anti-gaming evaluation |
| stress_test | 1 | 121 | 122 | Edge-case stress testing |

stress_test design note: The stress_test split is all-hallucinated by design. It contains challenging edge cases (merged citations, partial author lists, arXiv version mismatches) intended to stress-test detection robustness beyond the main splits. Because there are no valid entries, FPR and specificity are undefined for this split. Use detection rate as the primary metric when reporting stress_test results.

Tier distribution per split: ~27% Tier 1, ~47% Tier 2, ~26% Tier 3 (hallucinated entries).
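The per-tier shares can be recomputed directly from the raw JSONL. A sketch using the `label` and `difficulty_tier` fields from the data format below (`tier_distribution` is an illustrative helper, not a package function, and the exact tier values stored in the data may differ):

```python
from collections import Counter

def tier_distribution(entries):
    """Fraction of hallucinated entries in each difficulty tier.

    `entries` are dicts in the benchmark's JSONL schema (see Data Format)."""
    tiers = Counter(e["difficulty_tier"] for e in entries if e["label"] == "HALLUCINATED")
    total = sum(tiers.values())
    return {tier: count / total for tier, count in tiers.items()}
```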

Subtest Definitions

| Subtest | Definition |
| --- | --- |
| doi_resolves | DOI returns HTTP 200 from doi.org (redirects count as resolved) |
| title_exists | Title found in Semantic Scholar or DBLP via exact or fuzzy match (threshold 0.9) |
| authors_match | Author last names match the record retrieved via DOI or title lookup |
| venue_correct | The venue/journal is correct for this specific paper (not just "a real venue") |
| fields_complete | All standard BibTeX fields for this entry type are present and non-empty |
| cross_db_agreement | Metadata from DOI resolution matches metadata from title/author search in DBLP/S2 |
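For intuition on the fuzzy side of title_exists, here is a sketch of a 0.9-threshold matcher built on Python's difflib (the benchmark's actual matcher and normalization may differ); note how a near_miss_title perturbation falls just below the threshold:

```python
from difflib import SequenceMatcher

def title_matches(candidate: str, record: str, threshold: float = 0.9) -> bool:
    """Exact-or-fuzzy title comparison after lowercasing and whitespace collapsing."""
    a = " ".join(candidate.lower().split())
    b = " ".join(record.lower().split())
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold
```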

Data Format

Each entry is a JSON object in JSONL format:

bibtex_key format: Keys are hex hashes (e.g., a3f9c2b1d4e7...), not human-readable keys. When writing predictions, always use entry.bibtex_key directly — do not construct keys manually.

{
  "bibtex_key": "a3f9c2b1d4e76f85",
  "bibtex_type": "inproceedings",
  "fields": {
    "title": "Attention Is All You Need",
    "author": "Ashish Vaswani and Noam Shazeer and ...",
    "year": "2017",
    "booktitle": "NeurIPS",
    "doi": "10.5555/3295222.3295349"
  },
  "label": "VALID",
  "hallucination_type": null,
  "difficulty_tier": null,
  "explanation": "Valid entry scraped from DBLP and verified",
  "subtests": {
    "doi_resolves": true,
    "title_exists": true,
    "authors_match": true,
    "venue_correct": true,
    "fields_complete": true,
    "cross_db_agreement": true
  }
}

Evaluation Metrics

| Metric | Description |
| --- | --- |
| Detection Rate (DR) | Recall on hallucinated entries |
| False Positive Rate (FPR) | Valid entries incorrectly flagged |
| F1-Hallucination | Harmonic mean of precision and recall on the HALLUCINATED class |
| Tier-weighted F1 | F1 weighted by difficulty (Tier 3 = 3x weight) |
| ECE | Expected Calibration Error; measures confidence calibration quality |
| detect@k | Fraction detected using k verification strategies (deterministic and order-dependent, unlike the stochastic pass@k) |
| MCC | Matthews Correlation Coefficient; prevalence-invariant, so use it as the primary metric when comparing results across splits |
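For reference, MCC computed from confusion-matrix counts (standard formula, with HALLUCINATED treated as the positive class):

```python
import math

def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews Correlation Coefficient; returns 0.0 when any marginal is empty."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction), and unlike F1 it does not inflate on class-imbalanced splits.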

Title-Oracle Baseline (Diagnostic)

The title_oracle baseline quantifies the ceiling of a perturbation-structure shortcut present in HALLMARK's design. Because most HALLUCINATED entries are generated by perturbing real (VALID) papers, they inherit the original title. This means a title that appears as VALID in the dev split almost certainly belongs to a perturbed — hence hallucinated — entry when it reappears in another split.

The oracle exploits this directly: if a blind entry's title matches any VALID title in the dev split, it predicts HALLUCINATED.

Empirical results on v1.0 data:

  • ~33% of unique titles appear as both VALID and HALLUCINATED across dev/test splits.
  • Applied to the hidden split: F1 = 0.389 at perfect precision (P = 1.0, recall = ~0.24).
  • Titles absent from any valid pool are 100% HALLUCINATED in the dataset.

This is not a legitimate detection method — it requires access to dev ground-truth labels as a look-up table, which constitutes label leakage when evaluating on dev itself. Report it alongside real baselines to make the shortcut visible. Any real tool that achieves F1 below the title oracle on the hidden split is arguably exploiting benchmark structure rather than performing genuine citation verification.

from hallmark.baselines.title_oracle import run_title_oracle
from hallmark.dataset.loader import load_split

dev_entries  = load_split("dev_public")
test_entries = load_split("test_public")
blind_test   = [e.to_blind() for e in test_entries]

predictions = run_title_oracle(blind_test, reference_pool=dev_entries)

Baseline Results (dev_public, 1,119 entries)

| Baseline | Detection Rate | F1 | Tier-weighted F1 | FPR | ECE |
| --- | --- | --- | --- | --- | --- |
| HaRC* | 0.155 | 0.268 | 0.188 | 0.000 | 0.361 |
| bibtex-updater | 0.124 | 0.220 | 0.131 | 0.000 | 0.018 |

*Partial evaluation due to API rate limiting (HaRC: 521/1,063 entries completed).

External Tool Baselines

HALLMARK also wraps several external citation verification tools as baselines:

| Baseline | Tool | Databases | Install |
| --- | --- | --- | --- |
| HaRC | harcx | Semantic Scholar, DBLP, Google Scholar, Open Library | pipx install harcx |
| CiteVerifier | GhostCite | DBLP (local), Google Scholar, Google Search | Clone repo |
| hallucinator | hallucinator | CrossRef, arXiv, DBLP, Semantic Scholar, ACL Anthology, PubMed, OpenAlex | Clone repo |
| verify-citations | verify-citations | arXiv, ACL Anthology, Semantic Scholar, DBLP, Google Scholar, DuckDuckGo | pipx install verify-citations |

LLM Baselines

| Baseline | Model | Provider | API Key Env Var |
| --- | --- | --- | --- |
| llm_openai | GPT-5.1 | OpenAI | OPENAI_API_KEY |
| llm_anthropic | Claude Sonnet 4.5 | Anthropic | ANTHROPIC_API_KEY |
| llm_openrouter_deepseek_r1 | DeepSeek R1 | OpenRouter | OPENROUTER_API_KEY |
| llm_openrouter_deepseek_v3 | DeepSeek V3.2 | OpenRouter | OPENROUTER_API_KEY |
| llm_openrouter_qwen | Qwen 3 235B | OpenRouter | OPENROUTER_API_KEY |
| llm_openrouter_mistral | Mistral Large | OpenRouter | OPENROUTER_API_KEY |
| llm_openrouter_gemini_flash | Gemini 2.5 Flash | OpenRouter | OPENROUTER_API_KEY |

# Use the baseline registry to discover and run any baseline
from hallmark.baselines.registry import list_baselines, check_available, run_baseline
from hallmark.dataset.loader import load_split

entries = load_split("dev_public")

# List all registered baselines (or just the free ones)
print(list_baselines(free_only=True))

# Check if a baseline's dependencies are installed
available, msg = check_available("harc")

# Run a baseline by name
predictions = run_baseline("harc", entries)


Python API

from hallmark.dataset.loader import load_split
from hallmark.evaluation.metrics import evaluate
from hallmark.dataset.schema import Prediction

# Load benchmark entries
entries = load_split("dev_public")

# Create predictions (your tool's output)
predictions = [
    Prediction(bibtex_key=e.bibtex_key, label="VALID", confidence=0.5)
    for e in entries
]

# Evaluate
result = evaluate(entries, predictions, tool_name="my-tool", split_name="dev_public")
print(f"F1: {result.f1_hallucination:.3f}")
print(f"Detection Rate: {result.detection_rate:.3f}")

Ranking

HALLMARK includes an ONEBench-inspired ranking system based on the Plackett-Luce model that handles incomplete evaluation data (not all tools evaluated on all entries):

from hallmark.evaluation.ranking import rank_tools_plackett_luce, rank_tools_mean_score

# Inputs (shapes assumed): entry_keys lists entry keys, tool_names lists tool
# identifiers, and matrix holds per-entry scores with one row per entry and one
# column per tool; entries a tool never evaluated may simply be missing.

# Rank tools using Plackett-Luce (requires choix: uv pip install -e ".[ranking]")
pl_ranking = rank_tools_plackett_luce(entry_keys, tool_names, matrix)

# Fallback: simple mean-score ranking (no extra dependencies)
mean_ranking = rank_tools_mean_score(entry_keys, tool_names, matrix)
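To see what the Plackett-Luce model does with possibly partial rankings, here is a self-contained sketch of the standard MM (minorization-maximization) estimator in pure Python; the package's choix-backed implementation will differ in details, and this is illustration only:

```python
def fit_plackett_luce(rankings, iters=200, tol=1e-9):
    """Fit Plackett-Luce strengths from rankings (each a list of items, best first).

    Partial rankings are fine: each list may cover only a subset of items.
    Returns strengths normalized to sum to 1; higher means stronger."""
    items = {item for ranking in rankings for item in ranking}
    w = {item: 1.0 / len(items) for item in items}
    for _ in range(iters):
        wins = dict.fromkeys(items, 0.0)
        denom = dict.fromkeys(items, 0.0)
        for ranking in rankings:
            for j in range(len(ranking) - 1):
                rest = ranking[j:]                 # items still in the running
                z = sum(w[k] for k in rest)
                wins[ranking[j]] += 1.0            # ranking[j] "wins" this stage
                for k in rest:
                    denom[k] += 1.0 / z
        new_w = {i: wins[i] / denom[i] if denom[i] else w[i] for i in items}
        s = sum(new_w.values())
        new_w = {i: v / s for i, v in new_w.items()}
        if max(abs(new_w[i] - w[i]) for i in items) < tol:
            return new_w
        w = new_w
    return w
```

Because each ranking contributes its own stagewise terms, tools missing from some rankings simply contribute nothing there, which is how the model tolerates incomplete evaluation data.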

CI/CD

HALLMARK includes two GitHub Actions workflows:

  • tests.yml: Runs the full test suite across Python 3.10-3.13 on every push/PR
  • baselines.yml: Runs live free baselines (doi_only, verify_citations) weekly and on demand; harc and bibtexupdater use pre-computed result validation (checksum checks) instead of live re-execution due to API rate limiting

Contributing Entries

HALLMARK uses an ever-expanding pool inspired by ONEBench. To contribute new entries:

hallmark contribute --file my_entries.jsonl --contributor "Your Name"

See CONTRIBUTING.md for details on entry format, validation requirements, and the review process.

Project Structure

hallmark/
├── hallmark/                  # Python package
│   ├── dataset/               # Schema, loader, scraper, generator
│   ├── evaluation/            # Metrics, subtests, aggregator, temporal, ranking
│   ├── baselines/             # Registry + baselines (DOI-only, bibtex-updater, LLM×6, ensemble, HaRC, CiteVerifier, hallucinator, verify-citations)
│   │   └── registry.py        # Central baseline discovery, availability, dispatch
│   ├── contribution/          # Pool manager, entry validation
│   └── cli.py                 # Command-line interface
├── data/
│   ├── v1.0/                  # Benchmark splits (dev_public, test_public)
│   ├── hidden/                # Hidden test set (not public)
│   └── raw/                   # Raw scraped/generated entries
├── scripts/
│   └── run_all_baselines.py   # Batch orchestrator for baseline evaluation
├── .github/workflows/
│   ├── tests.yml              # CI: test suite across Python versions
│   └── baselines.yml          # CI: weekly free baseline evaluation
├── tests/                     # Test suite (562 tests)
├── figures/                   # Evaluation figures
└── examples/                  # Usage examples

Citation

If you use HALLMARK in your research, please cite:

@misc{hallmark2026,
    title={HALLMARK: A HALLucination benchMARK for Citation Verification},
    author={Reizinger, Patrik},
    year={2026},
    url={https://github.com/rpatrik96/hallmark}
}

License

MIT

