Stop trusting citations blindly. Verify them.
CiteSage is an agentic verification pipeline designed to ensure academic integrity throughout the submission and peer-review processes.
Starting from a manuscript (PDF or Word), the pipeline automatically extracts references, cross-validates metadata across global academic databases, and rigorously audits whether an author's claims are genuinely supported by the cited sources—ultimately generating a reproducible and robust evidence trail.
If you have ever submitted a paper with a non-existent citation, read an article with a dead reference link, or encountered an AI writing tool that hallucinated a DOI, CiteSage was engineered specifically to solve that problem.
Tip
Phase 19 (The Great Rust Renaissance): CiteSage v6.0.0 has been fully rebuilt from the ground up in Rust (1.94). We have moved beyond our legacy Python architecture to a highly robust, memory-safe, and concurrent pipeline orchestration model, utilizing a citesage-rs workspace to achieve unparalleled reliability in verification and data integrity.
CiteSage serves two core modes:
| Mode | Functionality |
|---|---|
| Verify Mode | You upload a manuscript (PDF/Word). CiteSage scrutinizes every citation: Does this paper exist? Is the metadata accurate? Does the cited text genuinely align with the author’s claim? |
| Curate Mode | You describe a research topic in natural language. CiteSage queries multiple academic databases, downloads authentic PDFs, and compiles a strictly grounded, structured evidence report. |
The Core Problem: LLMs fabricate citations. Researchers inadvertently miscopy metadata. Papers unknowingly cite retracted works. Traditional Reference Managers (e.g., Zotero, Mendeley) simply store and format data—they do not verify authenticity.
What CiteSage actually does:
- Verifies if a cited DOI authentically exists in CrossRef, OpenAlex, or Semantic Scholar.
- Actively detects retracted papers via OpenAlex and CrossRef metadata signals.
- Preemptively catches GROBID hallucinations (fictional DOIs generated by underlying PDF parsers).
- Semantically compares the author's embedded claim against the actual content of the cited source.
- Flags paywall HTML intercepts disguised as PDFs before wasting bandwidth and context.
- Identifies and issues warnings against predatory publishers (e.g., MDPI).
- Docker & Docker Compose (Providing background graph and parsing infrastructure)
- Rust 1.94+ (Cargo Toolchain)
- Ollama running locally (for verification models)
git clone https://github.com/chenweichiang/citesage.git
cd citesage
cp .env.example .env # Configure your API keys (OpenAlex, Semantic Scholar)
docker compose up -d # Initializes GROBID + Neo4j + Ollama backend services
cd citesage-rs
cargo run --release -p citesage-cli -- research --file ../examples/your_paper.pdf
CiteSage will automatically:
- Extract all citation contexts from your manuscript (utilizing Dockerized GROBID).
- Look up each reference across multiple academic databases simultaneously.
- Flag missing, hallucinated, or fundamentally misrepresented citations.
- Auto-repair fractured citations leveraging LLM reasoning (via Ollama).
- Generate an interactive, exhaustive evidence report.
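The "simultaneous lookup" step above can be pictured as a fan-out over sources. The following is a minimal sketch using only the standard library, not CiteSage's actual code: the `Hit` type, the `lookup` stub, and the in-memory "indexes" are hypothetical stand-ins for real HTTP clients.

```rust
use std::thread;

/// Hypothetical outcome of asking one database about a DOI.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    source: &'static str,
    found: bool,
}

/// Stand-in for a real API call (assumption): a DOI counts as "found"
/// when the stub index for that source contains it.
fn lookup(source: &'static str, index: &[&str], doi: &str) -> Hit {
    Hit { source, found: index.contains(&doi) }
}

/// Query every source on its own scoped thread and return one Hit per
/// source, in the same order as `sources`.
fn lookup_all(sources: &[(&'static str, Vec<&str>)], doi: &str) -> Vec<Hit> {
    thread::scope(|s| {
        // Spawn all lookups first so they run concurrently.
        let handles: Vec<_> = sources
            .iter()
            .map(|(name, index)| s.spawn(move || lookup(*name, index, doi)))
            .collect();
        // Then join them in order.
        handles
            .into_iter()
            .map(|h| h.join().expect("lookup thread panicked"))
            .collect()
    })
}
```

Calling `lookup_all(&sources, "10.1000/b")` returns one `Hit` per source. A citation that no source reports as found would be a candidate for the "missing" flag.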
cd citesage-rs
cargo run --release -p citesage-cli -- research --query "large language models in accessibility design" -n 8
CiteSage will automatically:
- Concurrently query OpenAlex, Semantic Scholar, arXiv, Europe PMC, and CrossRef.
- Re-rank results intelligently using a Cross-Encoder (ms-marco-MiniLM-L-12-v2).
- Traverse a 7-source acquisition chain to download real PDF documents.
- Extract primary claims and synthesize Evidence Cards.
- Export everything neatly into output/<session>.tar.gz.
output/
└── verify_your_paper_20260401_154000/
    ├── reports/
    │   └── report.md          ← Detailed citation verification results
    ├── diagnostics/
    │   └── doctor_audit.json  ← Engineering pipeline health + LLM repair logs
    └── research_data/
        └── pdfs/              ← Downloaded primary sources (curate mode)
Each parsed citation receives a distinct verification grade:
- ✅ Verified — Exists globally, metadata synchronizes, content accurately aligns.
- ⚠️ Warning — Exists, but title/author metadata has drifted, or contextual support for the claim is uncertain.
- ❌ Failed — Untraceable across all databases, DOI shows hallucination signatures, or the paper has been retracted.
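One way to read the three grades is as a small decision function over per-citation signals. The sketch below is an illustration only: the `Signals` struct and its field names are hypothetical, and the real pipeline may weigh evidence differently.

```rust
/// The three grades described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Grade {
    Verified,
    Warning,
    Failed,
}

/// Hypothetical per-citation signals (names are assumptions for illustration).
#[derive(Debug, Clone, Copy)]
struct Signals {
    found_in_any_database: bool,
    doi_looks_hallucinated: bool,
    retracted: bool,
    metadata_matches: bool,
    claim_supported: bool,
}

/// Map signals to a grade: hard failures first, then soft warnings.
fn grade(s: Signals) -> Grade {
    if !s.found_in_any_database || s.doi_looks_hallucinated || s.retracted {
        Grade::Failed
    } else if !s.metadata_matches || !s.claim_supported {
        Grade::Warning
    } else {
        Grade::Verified
    }
}
```

Checking the failure conditions first means a retracted paper is graded Failed even when its metadata and content otherwise line up.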
Your Manuscript (PDF)
│
├─ Layout-aware Data Parsing (IBM Docling)
│ └─ Secures and isolates the References section, avoiding multi-column bleed.
│
├─ GROBID Citation Extraction
│ └─ Hallucination Defense Module: Fake DOIs are instantly discarded.
│
├─ Multi-source Synchronization Lookup (CrossRef + OpenAlex + Semantic Scholar)
│ └─ Retraction Shield Protocol active + Predatory publisher screening.
│
├─ Semantic Content Alignment Verification (E5 + LLM Dual-layer Analysis)
│ └─ "Did the author actually leverage this source contextually?"
│
└─ Auto-repair Mechanism (Agentic Flow)
   ├─ Tier 1: Core structured APIs (Authoritative Grounding)
   ├─ Tier 2: Web scraping → Identifies candidates only (Never authoritative)
   └─ Absolute validation sequence required before final resolution.
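As a rough illustration of what a first-pass filter in the Hallucination Defense step might look like, the sketch below checks only the shape of a DOI string (`10.`, a 4 to 9 digit registrant code, a slash, and a non-empty suffix without whitespace). The function name `looks_like_doi` is hypothetical, and this is not CiteSage's actual check. A shape test can only reject malformed strings; a well-formed but fictional DOI still has to be caught by the database lookups that follow.

```rust
/// Returns true when `doi` has the general shape
/// `10.<4-9 digits>/<non-empty suffix without whitespace>`.
/// This is a syntactic check only; it says nothing about whether
/// the DOI is actually registered.
fn looks_like_doi(doi: &str) -> bool {
    // Every DOI starts with the directory indicator "10.".
    let Some(rest) = doi.trim().strip_prefix("10.") else {
        return false;
    };
    // Split the registrant code from the suffix at the first slash.
    let Some((registrant, suffix)) = rest.split_once('/') else {
        return false;
    };
    let registrant_ok = (4..=9).contains(&registrant.len())
        && registrant.bytes().all(|b| b.is_ascii_digit());
    let suffix_ok = !suffix.is_empty() && !suffix.chars().any(char::is_whitespace);
    registrant_ok && suffix_ok
}
```

For example, `looks_like_doi("10.1038/nature12373")` passes, while `"10.12/abc"` (registrant too short) and `"doi:10.1038/x"` (unexpected prefix) are rejected.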
- Contextual Paragraph-level Verification (--paragraph) — Audits each distinct in-text citation independently against its parent reference.
- Double-Blind Review Proxy (--role reviewer) — Intelligently anonymizes provenance data ensuring zero privacy leaks.
- Deterministic Scientific Execution (--deterministic) — Freezes random seeds for reproducible pipelines.
- 5-source Broad Search — OpenAlex + Semantic Scholar + arXiv + Europe PMC + CrossRef parallelization.
- 7-source PDF Defense Chain — Unpaywall → Semantic Scholar → CrossRef → Search Index → CORE → DOAJ → BASE.
- Bot-Resistance Routing — Utilizes cloudscraper stealth capabilities to bypass publisher CDN roadblocks.
- Paywall Traffic Detection — Pre-emptive HTTP HEAD validation blocking counterfeit HTML paywalls.
- Hybrid Retrieval System — Dense E5 (1024d) + BM25 Sparse Index + 3-way RRF fusion running on LanceDB.
- Intelligent Graph Matrix — Neo4j stores persistent paper–author–institution ontological relationships.
- Resilient Pipeline (Fail-Safe) — Exponential API backoffs and fallback routing ensuring complete fault coverage.
- Smart Rate Limiting (429 Fail-Fast) — Key-aware throttling that instantly triggers circuit breakers upon HTTP 429 to re-route requests without blocking pipelines.
- Deep Connection Pooling — Scaled HTTP/2 multiplexing (pool=50) coupled with Brotli compression to maximize throughput during large batch verifications.
- Fast, Concurrent & Memory Safe — Built in Rust (citesage-rs), maximizing multi-threaded throughput while the compiler rules out data races, use-after-free bugs, and type errors before the code runs.
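The RRF fusion mentioned in the Hybrid Retrieval item can be illustrated with a few lines of standard-library Rust. This is a generic Reciprocal Rank Fusion sketch, not the project's implementation: each document scores the sum of `1 / (k + rank)` over every ranked list it appears in, with ranks starting at 1. The function name `rrf_fuse`, the document ids, and the choice of `k = 60` (a commonly used constant) are assumptions.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over any number of ranked lists.
/// score(d) = sum over lists containing d of 1 / (k + rank_of_d),
/// where rank starts at 1 for the top result.
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc) in ranking.iter().enumerate() {
            let rank = i as f64 + 1.0;
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(doc, score)| (doc.to_string(), score))
        .collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

With three input lists (a 3-way fusion), a document ranked near the top of several lists outranks one that tops only a single list, without the fusion step needing comparable raw scores from each retriever.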
Duplicate .env.example into .env and assign your credentials:
OPENALEX_EMAIL=you@example.com         # Joins the OpenAlex polite pool (faster, more consistent responses)
SEMANTIC_SCHOLAR_API_KEY=your_key # Mandatory for production volume queries
CORE_API_KEY=your_key                  # Discovers open access fallbacks
# Verify the entire Rust workspace topology (citesage-cli, citesage-core, citesage-verify)
cargo test --workspace
# Apply stringent Linting & Formatting rules
cargo clippy --workspace -- -D warnings
cargo fmt --all

| Phase | Delivered Scope | Status |
|---|---|---|
| 1–10 | Verification kernel, GROBID interface, format versatility, Docker CI orchestration. | ✅ Done |
| 11 | Hybrid RAG (BM25+RRF), structural chunk mapping, Cross-Encoder topological sorting. | ✅ Done |
| 12 | Precision Grounded Generation, Continuous Ragas testing, Cloudflare resistance setup. | ✅ Done |
| 13 | 5-way search hub, E5 geometry tracking, rigorous 7-source PDF supply chain. | ✅ Done |
| 14 | Architecture decomposition (CLI split, abstract Repair/Embedding), Docker isolation. | ✅ Done |
| 15 | Temporal Alignment Engine, graphical Evidence Cards, Web Portal (Chainlit). | ✅ Done |
| 16 | Output serialization standards, deterministic log aggregators, resolver stabilizers. | ✅ Done |
| 17 | Universal Code Audit, Gold Standard compliance metric, Dual-Push CI deployments. | ✅ Done |
| 17+ | LanceDB structural migration, Semantic Shared isolation limits. | ✅ Done |
| 18 | Multi-node asynchronous pipeline scaling. | ✅ Done |
| 19 | The Great Rust Renaissance: Total system evolution substituting Python orchestrators with native Rust (citesage-rs); integrating hyper-concurrent verification architectures. | 🔵 Active |
If CiteSage helps your research, please cite it using the entry below (optional under the MIT License):
@software{citesage2026,
author = {Chiang, Chenwei},
title = {CiteSage: An Agentic Verification Framework for Academic Integrity},
year = {2026},
url = {https://github.com/chenweichiang/citesage},
note = {v6.0.0-alpha, Phase 19}
}
Licensed under the MIT License. See LICENSE.
CiteSage is an ongoing academic research project. Priority and collaboration terms are described in RESEARCH-NOTICE.md. The architecture and project history are documented in SYSTEM_PLAN.md.
