HALLMARK (HALLucination benchMARK): a benchmark for evaluating citation hallucination detection tools.
The NeurIPS 2025 incident, in which 53 papers were found to contain fabricated citations that had passed peer review, exposed a critical gap: there is no standardized way to measure how well tools detect citation hallucinations. HALLMARK fills this gap.
HALLMARK draws on best practices from established benchmarks:
- HumanEval: Multi-criteria sub-tests per entry (~6 checks per citation)
- SWE-bench: Contamination awareness via temporal segmentation
- LiveCodeBench: Continuous updates and post-cutoff evaluation
- ONEBench: Sample-level atomic evaluation with ever-expanding pool
- Hallucination taxonomy: 14 types across 3 difficulty tiers (Easy / Medium / Hard)
- 2,525 annotated entries: 773 valid (from DBLP) + 1,177 hallucinated with ground truth (public splits)
- 6 sub-tests per entry: DOI resolution, title matching, author consistency, venue verification, field completeness, cross-database agreement
- Evaluation metrics: Detection Rate, F1, tier-weighted F1, detect@k, ECE
- Built-in baselines: DOI-only, bibtex-updater, LLM-based (OpenAI, Anthropic, OpenRouter), ensemble, HaRC, verify-citations (CiteVerifier and hallucinator are available as wrapper modules but not registered in the default registry)
- Baseline registry: Central discovery, availability checking, and dispatch for all baselines (17 variants)
- Plackett-Luce ranking: ONEBench-inspired ranking that handles incomplete evaluation data
- Automated execution: Orchestrator script and CI workflow for batch baseline evaluation
- Temporal analysis: Contamination detection via pre/post-cutoff comparison
- Community contributions: ONEBench-style ever-expanding sample pool
```bash
# Recommended: clone and install in development mode
git clone https://github.com/rpatrik96/hallmark.git
cd hallmark
uv pip install -e ".[dev]"

# With LLM baseline SDKs (openai, anthropic)
uv pip install -e ".[baselines]"

# With ranking support (Plackett-Luce model via choix)
uv pip install -e ".[ranking]"

# All optional dependencies
uv pip install -e ".[all]"
```

Note: the package is not yet published to PyPI, so `pip install hallmark` will not work. Use the clone + install path above.
The [baselines] extra installs only the LLM SDKs (openai, anthropic). External CLI tools require separate installation due to a bibtexparser 1.x dependency conflict:
```bash
# HaRC
pipx install harcx

# bibtex-updater
pipx install bibtex-updater

# verify-citations
pipx install verify-citations

# CiteVerifier (GhostCite): clone required
git clone https://github.com/NKU-AOSP-Lab/CiteVerifier

# hallucinator: clone required
git clone https://github.com/gianlucasb/hallucinator
```

Using pipx isolates each tool's bibtexparser 1.x from your project environment.
```bash
# Run DOI-only baseline on the dev split
hallmark evaluate --split dev_public --baseline doi_only

# Run with custom predictions
hallmark evaluate --split dev_public --predictions my_predictions.jsonl --tool-name my-tool

# Show split statistics
hallmark stats --split dev_public

# Run all free baselines and generate leaderboard
python scripts/run_all_baselines.py --split dev_public --output-dir results/

# Run specific baselines in parallel
python scripts/run_all_baselines.py --baselines doi_only,bibtexupdater --parallel

# Run only free (no API key) baselines, skip unavailable
python scripts/run_all_baselines.py --baselines free --skip-unavailable

# Generate the leaderboard from saved results
hallmark leaderboard --results-dir results/
```

See examples/ for full walkthroughs, including writing a custom baseline and per-type analysis.
To evaluate any external tool against HALLMARK, produce a JSONL file with one prediction per line and run:

```bash
hallmark evaluate --predictions my_preds.jsonl --split dev_public
```

Each prediction must include:

```json
{
  "bibtex_key": "a3f9c2b1...",
  "label": "HALLUCINATED",
  "confidence": 0.87,
  "reason": "DOI does not resolve",
  "subtest_results": {"doi_resolves": false},
  "api_sources_queried": ["crossref"],
  "wall_clock_seconds": 1.2,
  "api_calls": 1
}
```

bibtex_key format: Keys in the benchmark are hex hashes (e.g., `a3f9c2b1d4e7...`), not human-readable keys like `vaswani2017attention`. Your predictions must use the exact keys from the loaded entries; use `entry.bibtex_key` when iterating over `load_split()` results.
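As a minimal sketch of producing such a file using only the standard library (the entry dicts below are hypothetical stand-ins for `load_split()` output, whose real entries expose `bibtex_key` as an attribute):

```python
import json

# Hypothetical stand-ins for entries returned by load_split("dev_public").
entries = [
    {"bibtex_key": "a3f9c2b1d4e76f85"},
    {"bibtex_key": "b7e2d9c0a1f43e21"},
]

def predict(entry):
    # Placeholder logic: replace with your tool's actual verification.
    return {
        "bibtex_key": entry["bibtex_key"],  # must be the exact hex key
        "label": "VALID",
        "confidence": 0.5,
    }

# One JSON object per line, as required by `hallmark evaluate --predictions`
with open("my_preds.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(predict(entry)) + "\n")
```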
See examples/03_custom_baseline.py for a complete end-to-end example.
| Field | Required | Affects |
|---|---|---|
| `bibtex_key` | Yes | Entry matching |
| `label` | Yes | All metrics |
| `confidence` | Yes | ECE, AUROC, AUPRC |
| `reason` | No | Diagnose output |
| `subtest_results` | No | Subtest accuracy |
| `api_sources_queried` | No | Source-stratified metrics |
| `wall_clock_seconds` | No | Cost efficiency |
| `api_calls` | No | Mean API calls |
UNCERTAIN label: UNCERTAIN is accepted as a prediction label. UNCERTAIN predictions are treated as VALID for confusion-matrix metrics (conservative default) and excluded from AUROC/AUPRC. Prefer VALID or HALLUCINATED with calibrated confidence when possible.
Confidence semantics: confidence = P(your predicted label is correct). If you predict HALLUCINATED with 0.9, you claim 90% certainty it is hallucinated. If you predict VALID with 0.8, you claim 80% certainty it is valid. This is NOT P(HALLUCINATED).
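Concretely, if your tool scores P(hallucinated) internally, a small conversion (a sketch, not part of the HALLMARK API) maps that score onto this convention:

```python
def to_prediction(p_hallucinated, threshold=0.5):
    """Convert an internal P(hallucinated) score into (label, confidence),
    where confidence = P(the predicted label is correct)."""
    if p_hallucinated >= threshold:
        return "HALLUCINATED", p_hallucinated
    return "VALID", 1.0 - p_hallucinated
```

For example, an internal score of 0.9 becomes a HALLUCINATED prediction with confidence 0.9, while 0.2 becomes a VALID prediction with confidence 0.8.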
**Tier 1 (Easy)**

| Type | Description | Example |
|---|---|---|
| `fabricated_doi` | DOI that doesn't resolve | `doi = {10.9999/fake.2024.001}` |
| `nonexistent_venue` | Invented journal/conference | `booktitle = {Intl. Conf. on Advanced AI Systems}` |
| `placeholder_authors` | Generic/fake author names | `author = {John Doe and Jane Smith}` |
| `future_date` | Publication year in the future | `year = {2030}` |
**Tier 2 (Medium)**

| Type | Description | Example |
|---|---|---|
| `chimeric_title` | Real author + fabricated title | Real authors, plausible but non-existent paper |
| `wrong_venue` | Real paper, wrong venue/year | Correct title but at ICML, not NeurIPS |
| `author_mismatch` | Author list swapped or fabricated (data value: `swapped_authors`) | Correct title, wrong author list |
| `preprint_as_published` | arXiv paper cited as venue paper | Correct paper, fabricated venue acceptance |
| `hybrid_fabrication` | Real DOI + fabricated metadata | Valid DOI resolves, but authors/title don't match |
| `merged_citation` | Metadata from 2-3 papers merged | Authors from paper A, title from paper B |
| `partial_author_list` | Subset of real author list | First and last author only, middle dropped |
**Tier 3 (Hard)**

| Type | Description | Example |
|---|---|---|
| `near_miss_title` | Title off by 1-2 words | "Attention Is All You Want" vs. "...Need" |
| `plausible_fabrication` | Entirely fabricated but realistic | Realistic authors + plausible title |
| `arxiv_version_mismatch` | Mixed preprint/published metadata | arXiv ID with conference venue claim |
| Split | Valid | Hallucinated | Total | Purpose |
|---|---|---|---|---|
| `dev_public` | 486 | 633 | 1,119 | Development and tuning |
| `test_public` | 287 | 544 | 831 | Public leaderboard |
| `test_hidden` | — | — | 453 | Anti-gaming evaluation |
| `stress_test` | 1 | 121 | 122 | In-depth coverage of stress-test types |
stress_test design note: The `stress_test` split is almost entirely hallucinated by design (121 of its 122 entries). It contains challenging edge cases (merged citations, partial author lists, arXiv version mismatches) intended to stress-test detection robustness beyond the main splits. Because the split contains only a single valid entry, FPR and specificity are degenerate for this split; use detection rate as the primary metric when reporting `stress_test` results.
Tier distribution per split: ~27% Tier 1, ~47% Tier 2, ~26% Tier 3 (hallucinated entries).
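The tier distribution can be inspected with a short stdlib sketch (the entry dicts below are hypothetical stand-ins for loaded benchmark entries, which carry `label` and `difficulty_tier` fields):

```python
from collections import Counter

# Hypothetical stand-ins for loaded entries; real entries expose the same fields.
entries = [
    {"label": "HALLUCINATED", "difficulty_tier": 1},
    {"label": "HALLUCINATED", "difficulty_tier": 2},
    {"label": "HALLUCINATED", "difficulty_tier": 2},
    {"label": "VALID", "difficulty_tier": None},
]

# Count difficulty tiers over hallucinated entries only
tiers = Counter(
    e["difficulty_tier"] for e in entries if e["label"] == "HALLUCINATED"
)
total = sum(tiers.values())
for tier, count in sorted(tiers.items()):
    print(f"Tier {tier}: {count}/{total} ({100 * count / total:.0f}%)")
```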
| Subtest | Definition |
|---|---|
| `doi_resolves` | DOI returns HTTP 200 from doi.org (redirects count as resolved) |
| `title_exists` | Title found in Semantic Scholar or DBLP via exact or fuzzy match (threshold 0.9) |
| `authors_match` | Author last names match the record retrieved via DOI or title lookup |
| `venue_correct` | The venue/journal is correct for this specific paper (not just "a real venue") |
| `fields_complete` | All standard BibTeX fields for this entry type are present and non-empty |
| `cross_db_agreement` | Metadata from DOI resolution matches metadata from title/author search in DBLP/S2 |
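For intuition, a fuzzy title match at threshold 0.9 might look like the following sketch (using stdlib `difflib`; HALLMARK's actual matcher may normalize and score differently):

```python
from difflib import SequenceMatcher

def title_matches(candidate: str, reference: str, threshold: float = 0.9) -> bool:
    """Case- and whitespace-insensitive fuzzy title comparison (a sketch only)."""
    a = " ".join(candidate.lower().split())
    b = " ".join(reference.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Under this metric, a near-miss title such as "Attention Is All You Want" scores just below 0.9 against the real title, which is why Tier 3 `near_miss_title` cases sit close to the matcher's decision boundary.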
Each entry is a JSON object in JSONL format:
bibtex_key format: Keys are hex hashes (e.g., `a3f9c2b1d4e7...`), not human-readable keys. When writing predictions, always use `entry.bibtex_key` directly; do not construct keys manually.
```json
{
  "bibtex_key": "a3f9c2b1d4e76f85",
  "bibtex_type": "inproceedings",
  "fields": {
    "title": "Attention Is All You Need",
    "author": "Ashish Vaswani and Noam Shazeer and ...",
    "year": "2017",
    "booktitle": "NeurIPS",
    "doi": "10.5555/3295222.3295349"
  },
  "label": "VALID",
  "hallucination_type": null,
  "difficulty_tier": null,
  "explanation": "Valid entry scraped from DBLP and verified",
  "subtests": {
    "doi_resolves": true,
    "title_exists": true,
    "authors_match": true,
    "venue_correct": true,
    "fields_complete": true,
    "cross_db_agreement": true
  }
}
```

| Metric | Description |
|---|---|
| Detection Rate (DR) | Recall on hallucinated entries |
| False Positive Rate (FPR) | Valid entries incorrectly flagged |
| F1-Hallucination | Harmonic mean of precision and recall on HALLUCINATED class |
| Tier-weighted F1 | F1 weighted by difficulty (Tier 3 = 3x weight) |
| ECE | Expected Calibration Error; measures confidence calibration quality |
| detect@k | Fraction detected using k verification strategies (deterministic and order-dependent, unlike the stochastic pass@k) |
| MCC | Matthews Correlation Coefficient; prevalence-invariant, so use it as the primary metric when comparing results across splits |
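A sketch of the confusion-matrix metrics above in plain Python, treating HALLUCINATED as the positive class (the package's own `evaluate()` is the authoritative implementation):

```python
def confusion_metrics(labels, preds):
    """Compute detection rate, FPR, and F1 on the HALLUCINATED (positive) class."""
    pairs = list(zip(labels, preds))
    tp = sum(l == "HALLUCINATED" and p == "HALLUCINATED" for l, p in pairs)
    fn = sum(l == "HALLUCINATED" and p == "VALID" for l, p in pairs)
    fp = sum(l == "VALID" and p == "HALLUCINATED" for l, p in pairs)
    tn = sum(l == "VALID" and p == "VALID" for l, p in pairs)
    dr = tp / (tp + fn) if tp + fn else 0.0    # detection rate (recall)
    fpr = fp / (fp + tn) if fp + tn else 0.0   # false positive rate
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * dr / (prec + dr) if prec + dr else 0.0
    return dr, fpr, f1
```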
The title_oracle baseline quantifies the ceiling of a perturbation-structure shortcut present in HALLMARK's design.
Because most HALLUCINATED entries are generated by perturbing real (VALID) papers, they inherit the original title.
This means a title that appears as VALID in the dev split almost certainly belongs to a perturbed (hence hallucinated) entry when it reappears in another split.
The oracle exploits this directly: if a blind entry's title matches any VALID title in the dev split, it predicts HALLUCINATED.
Empirical results on v1.0 data:
- ~33% of unique titles appear as both VALID and HALLUCINATED across dev/test splits.
- Applied to the hidden split: F1 = 0.389 at perfect precision (P = 1.0, recall ≈ 0.24).
- Titles absent from any valid pool are 100% HALLUCINATED in the dataset.
This is not a legitimate detection method: it requires access to dev ground-truth labels as a look-up table, which constitutes label leakage when evaluating on dev itself. Report it alongside real baselines to make the shortcut visible. Any real tool that achieves F1 below the title oracle on the hidden split is arguably exploiting benchmark structure rather than performing genuine citation verification.
```python
from hallmark.baselines.title_oracle import run_title_oracle
from hallmark.dataset.loader import load_split

dev_entries = load_split("dev_public")
test_entries = load_split("test_public")

# Strip labels before prediction, then look titles up in the dev pool
blind_test = [e.to_blind() for e in test_entries]
predictions = run_title_oracle(blind_test, reference_pool=dev_entries)
```

| Baseline | Detection Rate | F1 | Tier-weighted F1 | FPR | ECE |
|---|---|---|---|---|---|
| HaRC* | 0.155 | 0.268 | 0.188 | 0.000 | 0.361 |
| bibtex-updater | 0.124 | 0.220 | 0.131 | 0.000 | 0.018 |
*Partial evaluation due to API rate limiting (HaRC: 521/1,063 entries completed).
HALLMARK also wraps several external citation verification tools as baselines:
| Baseline | Tool | Databases | Install |
|---|---|---|---|
| HaRC | harcx | Semantic Scholar, DBLP, Google Scholar, Open Library | `pipx install harcx` |
| CiteVerifier | GhostCite | DBLP (local), Google Scholar, Google Search | Clone repo |
| hallucinator | hallucinator | CrossRef, arXiv, DBLP, Semantic Scholar, ACL Anthology, PubMed, OpenAlex | Clone repo |
| verify-citations | verify-citations | arXiv, ACL Anthology, Semantic Scholar, DBLP, Google Scholar, DuckDuckGo | `pipx install verify-citations` |
| Baseline | Model | Provider | API Key Env Var |
|---|---|---|---|
| `llm_openai` | GPT-5.1 | OpenAI | `OPENAI_API_KEY` |
| `llm_anthropic` | Claude Sonnet 4.5 | Anthropic | `ANTHROPIC_API_KEY` |
| `llm_openrouter_deepseek_r1` | DeepSeek R1 | OpenRouter | `OPENROUTER_API_KEY` |
| `llm_openrouter_deepseek_v3` | DeepSeek V3.2 | OpenRouter | `OPENROUTER_API_KEY` |
| `llm_openrouter_qwen` | Qwen 3 235B | OpenRouter | `OPENROUTER_API_KEY` |
| `llm_openrouter_mistral` | Mistral Large | OpenRouter | `OPENROUTER_API_KEY` |
| `llm_openrouter_gemini_flash` | Gemini 2.5 Flash | OpenRouter | `OPENROUTER_API_KEY` |
```python
# Use the baseline registry to discover and run any baseline
from hallmark.baselines.registry import list_baselines, check_available, run_baseline
from hallmark.dataset.loader import load_split

entries = load_split("dev_public")

# List all registered baselines (or just the free ones)
print(list_baselines(free_only=True))

# Check if a baseline's dependencies are installed
available, msg = check_available("harc")

# Run a baseline by name
predictions = run_baseline("harc", entries)
```

See also:
- GhostCite paper — large-scale analysis of 2.2M citations across 56K papers
- HalluCitation paper — analysis of ~300 hallucinated papers in ACL conferences
- GPTZero Hallucination Detector — commercial API for citation verification
```python
from hallmark.dataset.loader import load_split
from hallmark.evaluation.metrics import evaluate
from hallmark.dataset.schema import Prediction

# Load benchmark entries
entries = load_split("dev_public")

# Create predictions (your tool's output)
predictions = [
    Prediction(bibtex_key=e.bibtex_key, label="VALID", confidence=0.5)
    for e in entries
]

# Evaluate
result = evaluate(entries, predictions, tool_name="my-tool", split_name="dev_public")
print(f"F1: {result.f1_hallucination:.3f}")
print(f"Detection Rate: {result.detection_rate:.3f}")
```

HALLMARK includes an ONEBench-inspired ranking system based on the Plackett-Luce model that handles incomplete evaluation data (not all tools are evaluated on all entries):
```python
from hallmark.evaluation.ranking import rank_tools_plackett_luce, rank_tools_mean_score

# Rank tools using Plackett-Luce (requires choix: pip install hallmark[ranking])
pl_ranking = rank_tools_plackett_luce(entry_keys, tool_names, matrix)

# Fallback: simple mean-score ranking (no extra dependencies)
mean_ranking = rank_tools_mean_score(entry_keys, tool_names, matrix)
```

HALLMARK includes two GitHub Actions workflows:
- `tests.yml`: runs the full test suite across Python 3.10-3.13 on every push/PR
- `baselines.yml`: runs live free baselines (doi_only, verify_citations) weekly and on demand; harc and bibtexupdater use pre-computed result validation (checksum checks) instead of live re-execution due to API rate limiting
HALLMARK uses an ever-expanding pool inspired by ONEBench. To contribute new entries:
```bash
hallmark contribute --file my_entries.jsonl --contributor "Your Name"
```

See CONTRIBUTING.md for details on entry format, validation requirements, and the review process.
```
hallmark/
├── hallmark/                     # Python package
│   ├── dataset/                  # Schema, loader, scraper, generator
│   ├── evaluation/               # Metrics, subtests, aggregator, temporal, ranking
│   ├── baselines/                # Registry + baselines (DOI-only, bibtex-updater, LLM×6, ensemble, HaRC, CiteVerifier, hallucinator, verify-citations)
│   │   └── registry.py           # Central baseline discovery, availability, dispatch
│   ├── contribution/             # Pool manager, entry validation
│   └── cli.py                    # Command-line interface
├── data/
│   ├── v1.0/                     # Benchmark splits (dev_public, test_public)
│   ├── hidden/                   # Hidden test set (not public)
│   └── raw/                      # Raw scraped/generated entries
├── scripts/
│   └── run_all_baselines.py      # Batch orchestrator for baseline evaluation
├── .github/workflows/
│   ├── tests.yml                 # CI: test suite across Python versions
│   └── baselines.yml             # CI: weekly free baseline evaluation
├── tests/                        # Test suite (562 tests)
├── figures/                      # Evaluation figures
└── examples/                     # Usage examples
```
If you use HALLMARK in your research, please cite:
```bibtex
@misc{hallmark2026,
  title={HALLMARK: A HALLucination benchMARK for Citation Verification},
  author={Reizinger, Patrik},
  year={2026},
  url={https://github.com/rpatrik96/hallmark}
}
```

MIT