CiteSage

Stop trusting citations blindly. Verify them.

What is CiteSage?

CiteSage is an agentic verification pipeline designed to ensure academic integrity throughout the submission and peer-review processes.

Starting from a manuscript (PDF or Word), the pipeline automatically extracts references, cross-validates metadata across global academic databases, and rigorously audits whether an author's claims are genuinely supported by the cited sources—ultimately generating a reproducible and robust evidence trail.

If you have ever submitted a paper with a non-existent citation, read an article with a dead reference link, or encountered an AI writing tool that hallucinated a DOI, CiteSage was engineered specifically to solve that problem.

Tip

Phase 19 (The Great Rust Renaissance): CiteSage v6.0.0 has been rebuilt from the ground up in Rust (1.94). The legacy Python architecture is replaced by a memory-safe, concurrent pipeline orchestration model, organized as the citesage-rs workspace, with the goal of more reliable verification and stronger data integrity.

CiteSage serves two core modes:

Mode	Functionality
Verify Mode	You upload a manuscript (PDF/Word). CiteSage scrutinizes every citation: Does this paper exist? Is the metadata accurate? Does the cited text genuinely align with the author’s claim?
Curate Mode	You describe a research topic in natural language. CiteSage queries multiple academic databases, downloads authentic PDFs, and compiles a strictly grounded, structured evidence report.

Why CiteSage?

The Core Problem: LLMs fabricate citations. Researchers inadvertently miscopy metadata. Papers unknowingly cite retracted works. Traditional Reference Managers (e.g., Zotero, Mendeley) simply store and format data—they do not verify authenticity.

What CiteSage actually does:

Verifies that a cited DOI exists in CrossRef, OpenAlex, or Semantic Scholar.
Detects retracted papers via OpenAlex and CrossRef metadata signals.
Catches GROBID hallucinations (fictional DOIs produced by the underlying PDF parser).
Semantically compares the author's claim against the actual content of the cited source.
Flags paywall HTML pages disguised as PDFs before they waste bandwidth and context.
Warns about publishers it classifies as predatory (e.g., MDPI).
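
As an illustration of the kind of cheap pre-check that can run before any database lookup, here is a minimal sketch of a DOI shape test. It is not taken from the CiteSage codebase: the function names `normalize_doi` and `is_plausible_doi` are hypothetical, and the rule used (prefix `10.`, a 4 to 9 digit registrant code, a non-empty suffix) is a common heuristic that can reject rare but valid DOIs. Passing this test says nothing about whether the DOI actually resolves; that still requires a lookup.

```rust
/// Hypothetical helper: lowercases a DOI and strips a resolver or "doi:" prefix.
pub fn normalize_doi(raw: &str) -> String {
    let s = raw.trim().to_ascii_lowercase();
    for prefix in [
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    ] {
        if let Some(rest) = s.strip_prefix(prefix) {
            return rest.trim().to_string();
        }
    }
    s
}

/// Hypothetical heuristic: true when the string looks like
/// "10.<4-9 digits>/<non-empty suffix without whitespace>".
pub fn is_plausible_doi(raw: &str) -> bool {
    let doi = normalize_doi(raw);
    let Some(rest) = doi.strip_prefix("10.") else {
        return false;
    };
    let Some((registrant, suffix)) = rest.split_once('/') else {
        return false;
    };
    let registrant_ok = (4..=9).contains(&registrant.len())
        && registrant.chars().all(|c| c.is_ascii_digit());
    let suffix_ok = !suffix.is_empty() && !suffix.chars().any(|c| c.is_whitespace());
    registrant_ok && suffix_ok
}
```

A string that fails this test can be dropped without spending an API call; a string that passes is only a candidate for the database lookups listed above.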

Quick Start

Prerequisites

Docker & Docker Compose (provides the background graph and parsing infrastructure)
Rust 1.94+ (Cargo Toolchain)
Ollama running locally (for verification models)

Setup

git clone https://github.com/chenweichiang/citesage.git
cd citesage
cp .env.example .env        # Configure your API keys (OpenAlex, Semantic Scholar)
docker compose up -d        # Initializes GROBID + Neo4j + Ollama backend services

Verify a Paper's Citations

cd citesage-rs
cargo run --release -p citesage-cli -- research --file ../examples/your_paper.pdf

CiteSage will automatically:

Extract all citation contexts from your manuscript (using Dockerized GROBID).
Look up each reference across multiple academic databases simultaneously.
Flag missing, hallucinated, or misrepresented citations.
Auto-repair broken citations using LLM reasoning (via Ollama).
Generate an interactive, exhaustive evidence report.
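
To make the lookup step more concrete, the sketch below shows one simple way to compare the title extracted from a manuscript with the title returned by a database. It is an assumption for illustration only: the README does not say how CiteSage measures metadata agreement, and the function name `title_similarity` and the token-overlap (Jaccard) measure are hypothetical choices.

```rust
use std::collections::HashSet;

/// Splits a title into lowercase alphanumeric tokens.
fn tokens(title: &str) -> HashSet<String> {
    title
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Hypothetical measure: Jaccard overlap of the two token sets, in [0, 1].
/// Returns 0.0 when neither title contains any token.
pub fn title_similarity(a: &str, b: &str) -> f64 {
    let (ta, tb) = (tokens(a), tokens(b));
    let total = ta.union(&tb).count();
    if total == 0 {
        return 0.0;
    }
    ta.intersection(&tb).count() as f64 / total as f64
}
```

A low score between the extracted title and the database title is one possible signal that the reference was miscopied or matched to the wrong record; any threshold would need tuning on real data.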

Curate Literature on a Topic

cd citesage-rs
cargo run --release -p citesage-cli -- research --query "large language models in accessibility design" -n 8

CiteSage will automatically:

Concurrently query OpenAlex, Semantic Scholar, arXiv, Europe PMC, and CrossRef.
Re-rank results intelligently using a Cross-Encoder (ms-marco-MiniLM-L-12-v2).
Traverse a 7-source acquisition chain to download real PDF documents.
Extract primary claims and synthesize Evidence Cards.
Export everything neatly into output/<session>.tar.gz.
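
The acquisition chain in the steps above is a fallback pattern: try each source in a fixed order and stop at the first one that returns a document. A minimal sketch of that pattern follows. The `Source` struct and `acquire` function are hypothetical names, and the fetchers are plain closures rather than real HTTP clients, so this shows the control flow only.

```rust
/// Hypothetical chain entry: a label plus a fetcher that may return PDF bytes.
pub struct Source<'a> {
    pub name: &'a str,
    pub fetch: Box<dyn Fn(&str) -> Option<Vec<u8>> + 'a>,
}

/// Tries each source in order and returns the first hit with the source name.
/// Sources after the first hit are never called.
pub fn acquire<'a>(doi: &str, chain: &[Source<'a>]) -> Option<(&'a str, Vec<u8>)> {
    chain
        .iter()
        .find_map(|s| (s.fetch)(doi).map(|bytes| (s.name, bytes)))
}
```

Returning the source name alongside the bytes keeps a record of where each PDF came from, which fits the evidence-trail goal described earlier.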

Architecture of the Output

output/
└── verify_your_paper_20260401_154000/
    ├── reports/
    │   └── report.md           ← Detailed citation verification results
    ├── diagnostics/
    │   └── doctor_audit.json   ← Engineering pipeline health + LLM repair logs
    └── research_data/
        └── pdfs/               ← Downloaded primary sources (curate mode)

Each parsed citation receives a distinct verification grade:

✅ Verified: Found in the databases, metadata matches, and the cited content supports the claim.
⚠️ Warning: Found, but the title/author metadata has drifted, or contextual support for the claim is uncertain.
❌ Failed: Not found in any database, the DOI shows hallucination signatures, or the paper has been retracted.
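
The three grades can be read as a small decision rule. The sketch below encodes the wording of the list above; the `Signals` struct, its field names, and the `grade` function are hypothetical, and the real pipeline may weigh its signals differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grade {
    Verified,
    Warning,
    Failed,
}

/// Hypothetical bundle of per-citation signals.
#[derive(Debug, Clone, Copy)]
pub struct Signals {
    pub found_in_any_database: bool,
    pub doi_looks_hallucinated: bool,
    pub retracted: bool,
    pub metadata_matches: bool,
    pub content_supports_claim: bool,
}

/// Any failure condition wins; otherwise every check must pass for Verified.
pub fn grade(s: Signals) -> Grade {
    if !s.found_in_any_database || s.doi_looks_hallucinated || s.retracted {
        return Grade::Failed;
    }
    if s.metadata_matches && s.content_supports_claim {
        Grade::Verified
    } else {
        Grade::Warning
    }
}
```

Checking the failure conditions first means a retracted paper is graded Failed even when its metadata and content would otherwise match.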

How It Works

Your Manuscript (PDF)
  │
  ├─ Layout-aware Data Parsing (IBM Docling)
  │   └─ Secures and isolates the References section, avoiding multi-column bleed.
  │
  ├─ GROBID Citation Extraction
  │   └─ Hallucination Defense Module: Fake DOIs are instantly discarded.
  │
  ├─ Multi-source Synchronization Lookup (CrossRef + OpenAlex + Semantic Scholar)
  │   └─ Retraction Shield Protocol active + Predatory publisher screening.
  │
  ├─ Semantic Content Alignment Verification (E5 + LLM Dual-layer Analysis)
  │   └─ "Did the author actually leverage this source contextually?"
  │
  └─ Auto-repair Mechanism (Agentic Flow)
      ├─ Tier 1: Core structured APIs (Authoritative Grounding)
      ├─ Tier 2: Web scraping → Identifies candidates only (Never authoritative)
      └─ Absolute validation sequence required before final resolution.
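
The auto-repair tiers in the diagram imply a rule: a candidate found by web scraping is never accepted on its own, and every candidate must pass validation before it is used. A minimal sketch of that rule, under stated assumptions: the `Tier`, `Candidate`, and `resolve` names are hypothetical, and `validate` stands in for a lookup against a structured API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    StructuredApi, // Tier 1: authoritative grounding
    WebScrape,     // Tier 2: candidates only
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub doi: String,
    pub tier: Tier,
}

/// Returns the first candidate that passes `validate`, trying Tier 1
/// candidates before Tier 2. Nothing is accepted without validation.
pub fn resolve<F>(candidates: &[Candidate], validate: F) -> Option<Candidate>
where
    F: Fn(&str) -> bool,
{
    let mut ordered: Vec<&Candidate> = candidates.iter().collect();
    // Stable sort: input order is kept within each tier.
    ordered.sort_by_key(|c| match c.tier {
        Tier::StructuredApi => 0,
        Tier::WebScrape => 1,
    });
    ordered.into_iter().find(|c| validate(&c.doi)).cloned()
}
```

In this sketch a scraped candidate can still win, but only after the validator has confirmed it, which mirrors the "never authoritative" note in the diagram.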

Key Features

Absolute Confidence for Researchers

Contextual Paragraph-level Verification (--paragraph): audits each in-text citation independently against its parent reference.
Double-Blind Review Proxy (--role reviewer): anonymizes provenance data to prevent privacy leaks.
Deterministic Scientific Execution (--deterministic): freezes random seeds so pipeline runs are reproducible.

Systematic Literature Discovery

5-source Broad Search: OpenAlex, Semantic Scholar, arXiv, Europe PMC, and CrossRef queried in parallel.
7-source PDF Defense Chain: Unpaywall → Semantic Scholar → CrossRef → Search Index → CORE → DOAJ → BASE.
Bot-Resistance Routing: uses cloudscraper to get past publisher CDN bot challenges.
Paywall Traffic Detection: an HTTP HEAD check before download rejects paywall HTML pages served in place of PDFs.
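
Paywall detection comes down to two questions: does the response claim to be a PDF, and do the first bytes agree? The sketch below shows both checks using only the standard library. The function names are hypothetical, and the byte-sniffing step is an extra safeguard added for illustration; the README itself only mentions the HEAD-based check.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum PayloadKind {
    Pdf,
    Html,
    Unknown,
}

/// True when a Content-Type header value names application/pdf,
/// ignoring case and any parameters after ';'.
pub fn content_type_is_pdf(value: &str) -> bool {
    value
        .split(';')
        .next()
        .map(|media| media.trim().eq_ignore_ascii_case("application/pdf"))
        .unwrap_or(false)
}

/// Classifies the first bytes of a body: PDFs start with "%PDF-",
/// HTML pages usually start with a doctype or an <html> tag.
pub fn sniff_payload(head: &[u8]) -> PayloadKind {
    // Skip leading ASCII whitespace.
    let start = head
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(head.len());
    let body = &head[start..];
    if body.starts_with(b"%PDF-") {
        return PayloadKind::Pdf;
    }
    let prefix: Vec<u8> = body.iter().take(15).map(|b| b.to_ascii_lowercase()).collect();
    if prefix.starts_with(b"<!doctype html") || prefix.starts_with(b"<html") {
        PayloadKind::Html
    } else {
        PayloadKind::Unknown
    }
}
```

The header check costs one HEAD request and no body download; the byte check catches servers that send a PDF content type with an HTML login page behind it.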

Under the Hood

Hybrid Retrieval System: dense E5 embeddings (1024d) + BM25 sparse index + 3-way RRF (reciprocal rank fusion), running on LanceDB.
Intelligent Graph Matrix: Neo4j stores persistent paper–author–institution relationships.
Resilient Pipeline (Fail-Safe): exponential API backoff and fallback routing keep the pipeline running when individual services fail.
Smart Rate Limiting (429 Fail-Fast): key-aware throttling that trips a circuit breaker as soon as an HTTP 429 arrives and re-routes requests without blocking the pipeline.
Deep Connection Pooling: HTTP/2 multiplexing (pool=50) with Brotli compression to maximize throughput during large batch verifications.
Fast, Concurrent & Memory Safe: built in Rust (citesage-rs) for multi-threaded throughput, with Rust's ownership model and static type system ruling out whole classes of memory-safety and type errors at compile time.
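
Reciprocal rank fusion (RRF) combines several ranked lists by giving each document a score of 1 / (k + rank) per list and summing. Here is a minimal sketch, assuming the conventional constant k = 60 and 1-based ranks; the function name `rrf_fuse` is hypothetical, and the README does not say which three rankings CiteSage fuses, so the sketch accepts any number of lists.

```rust
use std::collections::HashMap;

/// Fuses ranked lists of document ids with reciprocal rank fusion.
/// Each list contributes 1 / (k + rank) for a document, with rank starting at 1.
/// Returns (id, score) pairs sorted by descending score, ties broken by id.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc) in ranking.iter().enumerate() {
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + i as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(doc, score)| (doc.to_string(), score))
        .collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Because RRF uses only ranks, it can merge a dense-embedding ranking and a BM25 ranking without normalizing their raw scores, which sit on different scales.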

Configuration

Copy .env.example to .env and fill in your credentials:

OPENALEX_EMAIL=you@example.com         # Joins the OpenAlex polite pool (faster, more consistent responses)
SEMANTIC_SCHOLAR_API_KEY=your_key      # Required for production-volume queries
CORE_API_KEY=your_key                  # Enables CORE open-access fallbacks
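
For readers wiring these variables into their own tooling, the sketch below shows one way to read them in Rust with the standard library. It is not the CiteSage implementation: `ApiConfig` and its methods are hypothetical, and treating every key as optional is an assumption. The lookup is passed in as a closure so the parsing can be tested without touching the process environment.

```rust
/// Hypothetical container for the three variables shown above.
#[derive(Debug, PartialEq, Eq)]
pub struct ApiConfig {
    pub openalex_email: Option<String>,
    pub semantic_scholar_api_key: Option<String>,
    pub core_api_key: Option<String>,
}

impl ApiConfig {
    /// Builds the config from any key lookup; blank values count as unset.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |key: &str| {
            lookup(key)
                .map(|v| v.trim().to_string())
                .filter(|v| !v.is_empty())
        };
        ApiConfig {
            openalex_email: get("OPENALEX_EMAIL"),
            semantic_scholar_api_key: get("SEMANTIC_SCHOLAR_API_KEY"),
            core_api_key: get("CORE_API_KEY"),
        }
    }

    /// Reads the variables from the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Note that `std::env::var` only sees variables already exported into the process; loading the .env file itself is left to whatever mechanism the project uses.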

Development Toolchain

# Run the tests for the entire Rust workspace (citesage-cli, citesage-core, citesage-verify)
cargo test --workspace

# Lint with all warnings treated as errors, then format the code
cargo clippy --workspace -- -D warnings
cargo fmt --all

Technical Roadmap

Phase	Delivered Scope	Status
1–10	Verification kernel, GROBID interface, format versatility, Docker CI orchestration.	✅ Done
11	Hybrid RAG (BM25+RRF), structural chunk mapping, Cross-Encoder topological sorting.	✅ Done
12	Precision Grounded Generation, Continuous Ragas testing, Cloudflare resistance setup.	✅ Done
13	5-way search hub, E5 geometry tracking, rigorous 7-source PDF supply chain.	✅ Done
14	Architecture decomposition (CLI split, abstract Repair/Embedding), Docker isolation.	✅ Done
15	Temporal Alignment Engine, graphical Evidence Cards, Web Portal (Chainlit).	✅ Done
16	Output serialization standards, deterministic log aggregators, resolver stabilizers.	✅ Done
17	Universal Code Audit, Gold Standard compliance metric, Dual-Push CI deployments.	✅ Done
17+	LanceDB structural migration, Semantic Shared isolation limits.	✅ Done
18	Multi-node asynchronous pipeline scaling.	✅ Done
19	The Great Rust Renaissance: replacing the Python orchestrators with native Rust (`citesage-rs`) and integrating highly concurrent verification.	🔵 Active

Academic Attribution

If CiteSage helps your research, please cite it using the entry below (optional under the MIT License):

@software{citesage2026,
  author = {Chiang, Chenwei},
  title  = {CiteSage: An Agentic Verification Framework for Academic Integrity},
  year   = {2026},
  url    = {https://github.com/chenweichiang/citesage},
  note   = {v6.0.0-alpha, Phase 19}
}

License

Released under the MIT License. See LICENSE.

CiteSage is an ongoing academic research project. Priority and collaboration terms are described in RESEARCH-NOTICE.md. The architecture and project history are documented in SYSTEM_PLAN.md.
