HALLMARK (HALLucination benchMARK): a benchmark for evaluating citation hallucination detection tools.
The NeurIPS 2025 incident, in which 53 papers were found to contain fabricated citations that had passed peer review, exposed a critical gap: there is no standardized way to measure how well tools detect citation hallucinations. HALLMARK fills this gap.
HALLMARK draws on best practices from established benchmarks:
- HumanEval: Multi-criteria sub-tests per entry (~6 checks per citation)
- SWE-bench: Contamination awareness via temporal segmentation
- LiveCodeBench: Continuous updates and post-cutoff evaluation
- ONEBench: Sample-level atomic evaluation with ever-expanding pool
- Hallucination taxonomy: 14 types across 3 difficulty tiers (Easy / Medium / Hard)
- 2,525 annotated entries: 773 valid (from DBLP) + 1,177 hallucinated with ground truth (public splits)
- 6 sub-tests per entry: DOI resolution, title matching, author consistency, venue verification, field completeness, cross-database agreement
- Evaluation metrics: Detection Rate, F1, tier-weighted F1, detect@k, ECE
- Built-in baselines: DOI-only, bibtex-updater, LLM-based (OpenAI, Anthropic, OpenRouter), ensemble, HaRC, verify-citations (CiteVerifier and hallucinator are available as wrapper modules but not registered in the default registry)
- Baseline registry: Central discovery, availability checking, and dispatch for all baselines (17 variants)
- Plackett-Luce ranking: ONEBench-inspired ranking that handles incomplete evaluation data
- Automated execution: Orchestrator script and CI workflow for batch baseline evaluation
- Temporal analysis: Contamination detection via pre/post-cutoff comparison
- Community contributions: ONEBench-style ever-expanding sample pool
```bash
# Recommended: clone and install in development mode
git clone https://github.com/rpatrik96/hallmark.git
cd hallmark
uv pip install -e ".[dev]"

# With LLM baseline SDKs (openai, anthropic)
uv pip install -e ".[baselines]"

# With ranking support (Plackett-Luce model via choix)
uv pip install -e ".[ranking]"

# All optional dependencies
uv pip install -e ".[all]"
```

Note: the package is not yet published to PyPI, so `pip install hallmark` will not work. Use the clone + install path above.
The [baselines] extra installs only the LLM SDKs (openai, anthropic). External CLI tools require separate installation due to a bibtexparser 1.x dependency conflict:
```bash
# HaRC
pipx install harcx

# bibtex-updater
pipx install bibtex-updater

# verify-citations
pipx install verify-citations

# CiteVerifier (GhostCite): clone required
git clone https://github.com/NKU-AOSP-Lab/CiteVerifier

# hallucinator: clone required
git clone https://github.com/gianlucasb/hallucinator
```

Using pipx isolates each tool's bibtexparser 1.x from your project environment.
```bash
# Run DOI-only baseline on the dev split
hallmark evaluate --split dev_public --baseline doi_only

# Run with custom predictions
hallmark evaluate --split dev_public --predictions my_predictions.jsonl --tool-name my-tool

# Show split statistics
hallmark stats --split dev_public

# Run all free baselines and generate leaderboard
python scripts/run_all_baselines.py --split dev_public --output-dir results/

# Run specific baselines in parallel
python scripts/run_all_baselines.py --baselines doi_only,bibtexupdater --parallel

# Run only free (no API key) baselines, skip unavailable
python scripts/run_all_baselines.py --baselines free --skip-unavailable

# Generate the leaderboard from saved results
hallmark leaderboard --results-dir results/
```

See examples/ for full walkthroughs, including writing a custom baseline and per-type analysis.
To evaluate any external tool against HALLMARK, produce a JSONL file with one prediction per line and run:

```bash
hallmark evaluate --predictions my_preds.jsonl --split dev_public
```

Each prediction must include:

```json
{
  "bibtex_key": "a3f9c2b1...",
  "label": "HALLUCINATED",
  "confidence": 0.87,
  "reason": "DOI does not resolve",
  "subtest_results": {"doi_resolves": false},
  "api_sources_queried": ["crossref"],
  "wall_clock_seconds": 1.2,
  "api_calls": 1
}
```

bibtex_key format: Keys in the benchmark are hex hashes (e.g., `a3f9c2b1d4e7...`), not human-readable keys like `vaswani2017attention`. Your predictions must use the exact keys from the loaded entries; use `entry.bibtex_key` when iterating over `load_split()` results.
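As a minimal sketch of producing such a file using only the standard library (the entry dicts below are hypothetical stand-ins for `load_split()` output, whose real entries expose `bibtex_key` as an attribute):

```python
import json

# Hypothetical stand-ins for entries returned by load_split("dev_public").
entries = [
    {"bibtex_key": "a3f9c2b1d4e76f85"},
    {"bibtex_key": "b7e2d9c0a1f43e21"},
]

def predict(entry):
    # Placeholder logic: replace with your tool's actual verification.
    return {
        "bibtex_key": entry["bibtex_key"],  # must be the exact hex key
        "label": "VALID",
        "confidence": 0.5,
    }

# One JSON object per line, as required by `hallmark evaluate --predictions`
with open("my_preds.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(predict(entry)) + "\n")
```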
See examples/03_custom_baseline.py for a complete end-to-end example.
| Field | Required | Affects |
|---|---|---|
| `bibtex_key` | Yes | Entry matching |
| `label` | Yes | All metrics |
| `confidence` | Yes | ECE, AUROC, AUPRC |
| `reason` | No | Diagnose output |
| `subtest_results` | No | Subtest accuracy |
| `api_sources_queried` | No | Source-stratified metrics |
| `wall_clock_seconds` | No | Cost efficiency |
| `api_calls` | No | Mean API calls |
UNCERTAIN label: UNCERTAIN is accepted as a prediction label. UNCERTAIN predictions are treated as VALID for confusion-matrix metrics (conservative default) and excluded from AUROC/AUPRC. Prefer VALID or HALLUCINATED with calibrated confidence when possible.
Confidence semantics: confidence = P(your predicted label is correct). If you predict HALLUCINATED with 0.9, you claim 90% certainty it is hallucinated. If you predict VALID with 0.8, you claim 80% certainty it is valid. This is NOT P(HALLUCINATED).
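Concretely, if your tool scores P(hallucinated) internally, a small conversion (a sketch, not part of the HALLMARK API) maps that score onto this convention:

```python
def to_prediction(p_hallucinated, threshold=0.5):
    """Convert an internal P(hallucinated) score into (label, confidence),
    where confidence = P(the predicted label is correct)."""
    if p_hallucinated >= threshold:
        return "HALLUCINATED", p_hallucinated
    return "VALID", 1.0 - p_hallucinated
```

For example, an internal score of 0.9 becomes a HALLUCINATED prediction with confidence 0.9, while 0.2 becomes a VALID prediction with confidence 0.8.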
**Tier 1 (Easy)**

| Type | Description | Example |
|---|---|---|
| `fabricated_doi` | DOI that doesn't resolve | `doi = {10.9999/fake.2024.001}` |
| `nonexistent_venue` | Invented journal/conference | `booktitle = {Intl. Conf. on Advanced AI Systems}` |
| `placeholder_authors` | Generic/fake author names | `author = {John Doe and Jane Smith}` |
| `future_date` | Publication year in the future | `year = {2030}` |
**Tier 2 (Medium)**

| Type | Description | Example |
|---|---|---|
| `chimeric_title` | Real author + fabricated title | Real authors, plausible but non-existent paper |
| `wrong_venue` | Real paper, wrong venue/year | Correct title but at ICML, not NeurIPS |
| `author_mismatch` | Author list swapped or fabricated (data value: `swapped_authors`) | Correct title, wrong author list |
| `preprint_as_published` | arXiv paper cited as venue paper | Correct paper, fabricated venue acceptance |
| `hybrid_fabrication` | Real DOI + fabricated metadata | Valid DOI resolves, but authors/title don't match |
| `merged_citation` | Metadata from 2-3 papers merged | Authors from paper A, title from paper B |
| `partial_author_list` | Subset of real author list | First and last author only, middle dropped |
**Tier 3 (Hard)**

| Type | Description | Example |
|---|---|---|
| `near_miss_title` | Title off by 1-2 words | "Attention Is All You Want" vs. "...Need" |
| `plausible_fabrication` | Entirely fabricated but realistic | Realistic authors + plausible title |
| `arxiv_version_mismatch` | Mixed preprint/published metadata | arXiv ID with conference venue claim |
| Split | Valid | Hallucinated | Total | Purpose |
|---|---|---|---|---|
| `dev_public` | 486 | 633 | 1,119 | Development and tuning |
| `test_public` | 287 | 544 | 831 | Public leaderboard |
| `test_hidden` | — | — | 453 | Anti-gaming evaluation |
| `stress_test` | 1 | 121 | 122 | In-depth coverage of stress-test types |
stress_test design note: The `stress_test` split is almost entirely hallucinated by design (121 of its 122 entries). It contains challenging edge cases (merged citations, partial author lists, arXiv version mismatches) intended to stress-test detection robustness beyond the main splits. Because the split contains only a single valid entry, FPR and specificity are degenerate for this split; use detection rate as the primary metric when reporting `stress_test` results.
Tier distribution per split: ~27% Tier 1, ~47% Tier 2, ~26% Tier 3 (hallucinated entries).
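The tier distribution can be inspected with a short stdlib sketch (the entry dicts below are hypothetical stand-ins for loaded benchmark entries, which carry `label` and `difficulty_tier` fields):

```python
from collections import Counter

# Hypothetical stand-ins for loaded entries; real entries expose the same fields.
entries = [
    {"label": "HALLUCINATED", "difficulty_tier": 1},
    {"label": "HALLUCINATED", "difficulty_tier": 2},
    {"label": "HALLUCINATED", "difficulty_tier": 2},
    {"label": "VALID", "difficulty_tier": None},
]

# Count difficulty tiers over hallucinated entries only
tiers = Counter(
    e["difficulty_tier"] for e in entries if e["label"] == "HALLUCINATED"
)
total = sum(tiers.values())
for tier, count in sorted(tiers.items()):
    print(f"Tier {tier}: {count}/{total} ({100 * count / total:.0f}%)")
```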
| Subtest | Definition |
|---|---|
| `doi_resolves` | DOI returns HTTP 200 from doi.org (redirects count as resolved) |
| `title_exists` | Title found in Semantic Scholar or DBLP via exact or fuzzy match (threshold 0.9) |
| `authors_match` | Author last names match the record retrieved via DOI or title lookup |
| `venue_correct` | The venue/journal is correct for this specific paper (not just "a real venue") |
| `fields_complete` | All standard BibTeX fields for this entry type are present and non-empty |
| `cross_db_agreement` | Metadata from DOI resolution matches metadata from title/author search in DBLP/S2 |
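For intuition, a fuzzy title match at threshold 0.9 might look like the following sketch (using stdlib `difflib`; HALLMARK's actual matcher may normalize and score differently):

```python
from difflib import SequenceMatcher

def title_matches(candidate: str, reference: str, threshold: float = 0.9) -> bool:
    """Case- and whitespace-insensitive fuzzy title comparison (a sketch only)."""
    a = " ".join(candidate.lower().split())
    b = " ".join(reference.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Under this metric, a near-miss title such as "Attention Is All You Want" scores just below 0.9 against the real title, which is why Tier 3 `near_miss_title` cases sit close to the matcher's decision boundary.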
Each entry is a JSON object in JSONL format:
bibtex_key format: Keys are hex hashes (e.g., `a3f9c2b1d4e7...`), not human-readable keys. When writing predictions, always use `entry.bibtex_key` directly; do not construct keys manually.
```json
{
  "bibtex_key": "a3f9c2b1d4e76f85",
  "bibtex_type": "inproceedings",
  "fields": {
    "title": "Attention Is All You Need",
    "author": "Ashish Vaswani and Noam Shazeer and ...",
    "year": "2017",
    "booktitle": "NeurIPS",
    "doi": "10.5555/3295222.3295349"
  },
  "label": "VALID",
  "hallucination_type": null,
  "difficulty_tier": null,
  "explanation": "Valid entry scraped from DBLP and verified",
  "subtests": {
    "doi_resolves": true,
    "title_exists": true,
    "authors_match": true,
    "venue_correct": true,
    "fields_complete": true,
    "cross_db_agreement": true
  }
}
```

| Metric | Description |
|---|---|
| Detection Rate (DR) | Recall on hallucinated entries |
| False Positive Rate (FPR) | Valid entries incorrectly flagged |
| F1-Hallucination | Harmonic mean of precision and recall on HALLUCINATED class |
| Tier-weighted F1 | F1 weighted by difficulty (Tier 3 = 3x weight) |
| ECE | Expected Calibration Error; measures confidence calibration quality |
| detect@k | Fraction detected using k verification strategies (deterministic and order-dependent, unlike the stochastic pass@k) |
| MCC | Matthews Correlation Coefficient; prevalence-invariant, so use it as the primary metric when comparing results across splits |
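A sketch of the confusion-matrix metrics above in plain Python, treating HALLUCINATED as the positive class (the package's own `evaluate()` is the authoritative implementation):

```python
def confusion_metrics(labels, preds):
    """Compute detection rate, FPR, and F1 on the HALLUCINATED (positive) class."""
    pairs = list(zip(labels, preds))
    tp = sum(l == "HALLUCINATED" and p == "HALLUCINATED" for l, p in pairs)
    fn = sum(l == "HALLUCINATED" and p == "VALID" for l, p in pairs)
    fp = sum(l == "VALID" and p == "HALLUCINATED" for l, p in pairs)
    tn = sum(l == "VALID" and p == "VALID" for l, p in pairs)
    dr = tp / (tp + fn) if tp + fn else 0.0    # detection rate (recall)
    fpr = fp / (fp + tn) if fp + tn else 0.0   # false positive rate
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * dr / (prec + dr) if prec + dr else 0.0
    return dr, fpr, f1
```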
The title_oracle baseline quantifies the ceiling of a perturbation-structure shortcut present in HALLMARK's design.
Because most HALLUCINATED entries are generated by perturbing real (VALID) papers, they inherit the original title.
This means a title that appears as VALID in the dev split almost certainly belongs to a perturbed (hence hallucinated) entry when it reappears in another split.
The oracle exploits this directly: if a blind entry's title matches any VALID title in the dev split, it predicts HALLUCINATED.
Empirical results on v1.0 data:
- ~33% of unique titles appear as both VALID and HALLUCINATED across dev/test splits.
- Applied to the hidden split: F1 = 0.389 at perfect precision (P = 1.0, recall ≈ 0.24).
- Titles absent from any valid pool are 100% HALLUCINATED in the dataset.
This is not a legitimate detection method: it requires access to dev ground-truth labels as a look-up table, which constitutes label leakage when evaluating on dev itself. Report it alongside real baselines to make the shortcut visible. Any real tool that achieves F1 below the title oracle on the hidden split is arguably exploiting benchmark structure rather than performing genuine citation verification.
```python
from hallmark.baselines.title_oracle import run_title_oracle
from hallmark.dataset.loader import load_split

dev_entries = load_split("dev_public")
test_entries = load_split("test_public")

# Strip labels before prediction, then look titles up in the dev pool
blind_test = [e.to_blind() for e in test_entries]
predictions = run_title_oracle(blind_test, reference_pool=dev_entries)
```

| Baseline | Detection Rate | F1 | Tier-weighted F1 | FPR | ECE |
|---|---|---|---|---|---|
| HaRC* | 0.155 | 0.268 | 0.188 | 0.000 | 0.361 |
| bibtex-updater | 0.124 | 0.220 | 0.131 | 0.000 | 0.018 |
*Partial evaluation due to API rate limiting (HaRC: 521/1,063 entries completed).
HALLMARK also wraps several external citation verification tools as baselines:
| Baseline | Tool | Databases | Install |
|---|---|---|---|
| HaRC | harcx | Semantic Scholar, DBLP, Google Scholar, Open Library | `pipx install harcx` |
| CiteVerifier | GhostCite | DBLP (local), Google Scholar, Google Search | Clone repo |
| hallucinator | hallucinator | CrossRef, arXiv, DBLP, Semantic Scholar, ACL Anthology, PubMed, OpenAlex | Clone repo |
| verify-citations | verify-citations | arXiv, ACL Anthology, Semantic Scholar, DBLP, Google Scholar, DuckDuckGo | `pipx install verify-citations` |
| Baseline | Model | Provider | API Key Env Var |
|---|---|---|---|
| `llm_openai` | GPT-5.1 | OpenAI | `OPENAI_API_KEY` |
| `llm_anthropic` | Claude Sonnet 4.5 | Anthropic | `ANTHROPIC_API_KEY` |
| `llm_openrouter_deepseek_r1` | DeepSeek R1 | OpenRouter | `OPENROUTER_API_KEY` |
| `llm_openrouter_deepseek_v3` | DeepSeek V3.2 | OpenRouter | `OPENROUTER_API_KEY` |
| `llm_openrouter_qwen` | Qwen 3 235B | OpenRouter | `OPENROUTER_API_KEY` |
| `llm_openrouter_mistral` | Mistral Large | OpenRouter | `OPENROUTER_API_KEY` |
| `llm_openrouter_gemini_flash` | Gemini 2.5 Flash | OpenRouter | `OPENROUTER_API_KEY` |
```python
# Use the baseline registry to discover and run any baseline
from hallmark.baselines.registry import list_baselines, check_available, run_baseline
from hallmark.dataset.loader import load_split

entries = load_split("dev_public")

# List all registered baselines (or just the free ones)
print(list_baselines(free_only=True))

# Check if a baseline's dependencies are installed
available, msg = check_available("harc")

# Run a baseline by name
predictions = run_baseline("harc", entries)
```

See also:
- GhostCite paper — large-scale analysis of 2.2M citations across 56K papers
- HalluCitation paper — analysis of ~300 hallucinated papers in ACL conferences
- GPTZero Hallucination Detector — commercial API for citation verification
```python
from hallmark.dataset.loader import load_split
from hallmark.evaluation.metrics import evaluate
from hallmark.dataset.schema import Prediction

# Load benchmark entries
entries = load_split("dev_public")

# Create predictions (your tool's output)
predictions = [
    Prediction(bibtex_key=e.bibtex_key, label="VALID", confidence=0.5)
    for e in entries
]

# Evaluate
result = evaluate(entries, predictions, tool_name="my-tool", split_name="dev_public")
print(f"F1: {result.f1_hallucination:.3f}")
print(f"Detection Rate: {result.detection_rate:.3f}")
```

HALLMARK includes an ONEBench-inspired ranking system based on the Plackett-Luce model that handles incomplete evaluation data (not all tools are evaluated on all entries):
```python
from hallmark.evaluation.ranking import rank_tools_plackett_luce, rank_tools_mean_score

# Rank tools using Plackett-Luce (requires choix: pip install hallmark[ranking])
pl_ranking = rank_tools_plackett_luce(entry_keys, tool_names, matrix)

# Fallback: simple mean-score ranking (no extra dependencies)
mean_ranking = rank_tools_mean_score(entry_keys, tool_names, matrix)
```

HALLMARK includes two GitHub Actions workflows:
- `tests.yml`: runs the full test suite across Python 3.10-3.13 on every push/PR
- `baselines.yml`: runs live free baselines (doi_only, verify_citations) weekly and on demand; harc and bibtexupdater use pre-computed result validation (checksum checks) instead of live re-execution due to API rate limiting
HALLMARK uses an ever-expanding pool inspired by ONEBench. To contribute new entries:
```bash
hallmark contribute --file my_entries.jsonl --contributor "Your Name"
```

See CONTRIBUTING.md for details on entry format, validation requirements, and the review process.
```
hallmark/
├── hallmark/                     # Python package
│   ├── dataset/                  # Schema, loader, scraper, generator
│   ├── evaluation/               # Metrics, subtests, aggregator, temporal, ranking
│   ├── baselines/                # Registry + baselines (DOI-only, bibtex-updater, LLM×6, ensemble, HaRC, CiteVerifier, hallucinator, verify-citations)
│   │   └── registry.py           # Central baseline discovery, availability, dispatch
│   ├── contribution/             # Pool manager, entry validation
│   └── cli.py                    # Command-line interface
├── data/
│   ├── v1.0/                     # Benchmark splits (dev_public, test_public)
│   ├── hidden/                   # Hidden test set (not public)
│   └── raw/                      # Raw scraped/generated entries
├── scripts/
│   └── run_all_baselines.py      # Batch orchestrator for baseline evaluation
├── .github/workflows/
│   ├── tests.yml                 # CI: test suite across Python versions
│   └── baselines.yml             # CI: weekly free baseline evaluation
├── tests/                        # Test suite (562 tests)
├── figures/                      # Evaluation figures
└── examples/                     # Usage examples
```
If you use HALLMARK in your research, please cite:
```bibtex
@misc{hallmark2026,
  title={HALLMARK: A HALLucination benchMARK for Citation Verification},
  author={Reizinger, Patrik},
  year={2026},
  url={https://github.com/rpatrik96/hallmark}
}
```

MIT