CiteSage

Stop trusting citations blindly. Verify them.



What is CiteSage?

CiteSage is an agentic verification pipeline designed to ensure academic integrity throughout the submission and peer-review processes.

Starting from a manuscript (PDF or Word), the pipeline extracts the references, cross-validates their metadata against several academic databases, and audits whether the author's claims are actually supported by the cited sources. The result is a reproducible evidence trail.

If you have ever submitted a paper with a non-existent citation, read an article with a dead reference link, or encountered an AI writing tool that hallucinated a DOI, CiteSage was engineered specifically to solve that problem.

Tip

Phase 19 (The Great Rust Renaissance): CiteSage v6.0.0 is a ground-up rebuild in Rust (1.94). The legacy Python architecture is replaced by a memory-safe, concurrent pipeline orchestration model, organized as the citesage-rs workspace.

CiteSage serves two core modes:

Mode | Functionality
Verify Mode | You upload a manuscript (PDF/Word). CiteSage scrutinizes every citation: Does this paper exist? Is the metadata accurate? Does the cited text genuinely align with the author’s claim?
Curate Mode | You describe a research topic in natural language. CiteSage queries multiple academic databases, downloads authentic PDFs, and compiles a strictly grounded, structured evidence report.

Why CiteSage?

The Core Problem: LLMs fabricate citations. Researchers inadvertently miscopy metadata. Papers unknowingly cite retracted works. Traditional Reference Managers (e.g., Zotero, Mendeley) simply store and format data—they do not verify authenticity.

What CiteSage actually does:

  • Verifies if a cited DOI authentically exists in CrossRef, OpenAlex, or Semantic Scholar.
  • Actively detects retracted papers via OpenAlex and CrossRef metadata signals.
  • Preemptively catches GROBID hallucinations (fictional DOIs generated by underlying PDF parsers).
  • Semantically compares the author's embedded claim against the actual content of the cited source.
  • Flags paywall HTML intercepts disguised as PDFs before wasting bandwidth and context.
  • Identifies and issues warnings against predatory publishers (e.g., MDPI).

Quick Start

Prerequisites

  • Docker & Docker Compose (Providing background graph and parsing infrastructure)
  • Rust 1.94+ (Cargo Toolchain)
  • Ollama running locally (for verification models)

Setup

git clone https://github.com/chenweichiang/citesage.git
cd citesage
cp .env.example .env        # Configure your API keys (OpenAlex, Semantic Scholar)
docker compose up -d        # Initializes GROBID + Neo4j + Ollama backend services

Verify a Paper's Citations

cd citesage-rs
cargo run --release -p citesage-cli -- research --file ../examples/your_paper.pdf

CiteSage will automatically:

  1. Extract all citation contexts from your manuscript (utilizing Dockerized GROBID).
  2. Look up each reference across multiple academic databases simultaneously.
  3. Flag missing, hallucinated, or fundamentally misrepresented citations.
  4. Auto-repair broken citations using LLM reasoning (via Ollama).
  5. Generate an interactive, exhaustive evidence report.

Curate Literature on a Topic

cd citesage-rs
cargo run --release -p citesage-cli -- research --query "large language models in accessibility design" -n 8

CiteSage will automatically:

  1. Concurrently query OpenAlex, Semantic Scholar, arXiv, Europe PMC, and CrossRef.
  2. Re-rank results intelligently using a Cross-Encoder (ms-marco-MiniLM-L-12-v2).
  3. Traverse a 7-source acquisition chain to download real PDF documents.
  4. Extract primary claims and synthesize Evidence Cards.
  5. Export everything neatly into output/<session>.tar.gz.
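
The acquisition chain in step 3 can be pictured as an ordered fallback: try each source in turn and stop at the first one that yields a document. The sketch below is only an illustration of that idea, not CiteSage's actual code. The `Source` struct, the `acquire_pdf` function, and the closure-based fetchers are hypothetical names introduced here; a real fetcher would perform an HTTP request.

```rust
/// A hypothetical PDF source: a display name plus a fetch function that
/// returns the raw bytes on success. A real fetcher would call a remote API.
pub struct Source {
    pub name: &'static str,
    pub fetch: Box<dyn Fn(&str) -> Option<Vec<u8>>>,
}

/// Tries each source in order and returns the first hit together with the
/// name of the source that produced it. Later sources are never called
/// once an earlier one succeeds.
pub fn acquire_pdf(doi: &str, chain: &[Source]) -> Option<(&'static str, Vec<u8>)> {
    chain
        .iter()
        .find_map(|source| (source.fetch)(doi).map(|bytes| (source.name, bytes)))
}
```

Ordering the chain from the most reliable source to the least reliable keeps the common case cheap, and returning the source name alongside the bytes lets the report record where each document came from.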

Architecture of the Output

output/
└── verify_your_paper_20260401_154000/
    ├── reports/
    │   └── report.md           ← Detailed citation verification results
    ├── diagnostics/
    │   └── doctor_audit.json   ← Engineering pipeline health + LLM repair logs
    └── research_data/
        └── pdfs/               ← Downloaded primary sources (curate mode)

Each parsed citation receives a distinct verification grade:

  • Verified: Found in the databases, the metadata matches, and the cited content supports the claim.
  • ⚠️ Warning: Found, but the title/author metadata has drifted, or the contextual support is uncertain.
  • Failed: Not found in any database, the DOI shows hallucination signatures, or the paper has been retracted.
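
As a rough illustration of how the three grades relate to each other, the sketch below maps a set of per-citation check results onto a grade. The struct fields and the `grade` function are assumptions made for this example; the README does not describe the real data model, and actual grading likely uses scores rather than booleans.

```rust
/// Results of the individual checks for one citation.
/// Field names are illustrative assumptions, not CiteSage's real schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CitationCheck {
    pub found_in_any_database: bool,
    pub doi_looks_fabricated: bool,
    pub retracted: bool,
    pub metadata_matches: bool,
    pub claim_supported: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grade {
    Verified,
    Warning,
    Failed,
}

/// Hard failures are checked first, then softer mismatches.
pub fn grade(check: &CitationCheck) -> Grade {
    if !check.found_in_any_database || check.doi_looks_fabricated || check.retracted {
        return Grade::Failed;
    }
    if !check.metadata_matches || !check.claim_supported {
        return Grade::Warning;
    }
    Grade::Verified
}
```

Checking the failure conditions first means a retracted paper is reported as Failed even when its metadata is otherwise perfect.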

How It Works

Your Manuscript (PDF)
  │
  ├─ Layout-aware Data Parsing (IBM Docling)
  │   └─ Secures and isolates the References section, avoiding multi-column bleed.
  │
  ├─ GROBID Citation Extraction
  │   └─ Hallucination Defense Module: Fake DOIs are instantly discarded.
  │
  ├─ Multi-source Synchronization Lookup (CrossRef + OpenAlex + Semantic Scholar)
  │   └─ Retraction Shield Protocol active + Predatory publisher screening.
  │
  ├─ Semantic Content Alignment Verification (E5 + LLM Dual-layer Analysis)
  │   └─ "Did the author actually leverage this source contextually?"
  │
  └─ Auto-repair Mechanism (Agentic Flow)
      ├─ Tier 1: Core structured APIs (Authoritative Grounding)
      ├─ Tier 2: Web scraping → Identifies candidates only (Never authoritative)
      └─ Absolute validation sequence required before final resolution.
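
One cheap first line of defense against fabricated DOIs is a purely syntactic check before any database lookup. The sketch below assumes a simplified DOI shape of `10.<4 to 9 digits>/<non-empty suffix>`. Whether CiteSage's Hallucination Defense Module performs this particular check is an assumption, the function name is hypothetical, and real registrant codes can be more varied than this pattern allows.

```rust
/// Returns true when `doi` has the basic shape `10.<registrant>/<suffix>`.
/// This is a simplified filter: it can reject obviously malformed strings,
/// but passing it does not prove that the DOI exists.
pub fn has_valid_doi_shape(doi: &str) -> bool {
    let doi = doi.trim();
    // Split at the first slash: the prefix is on the left, the suffix on the right.
    let Some((prefix, suffix)) = doi.split_once('/') else {
        return false;
    };
    let Some(registrant) = prefix.strip_prefix("10.") else {
        return false;
    };
    let registrant_ok = (4..=9).contains(&registrant.len())
        && registrant.chars().all(|c| c.is_ascii_digit());
    let suffix_ok = !suffix.is_empty() && !suffix.chars().any(char::is_whitespace);
    registrant_ok && suffix_ok
}
```

A string that passes still has to be resolved against CrossRef, OpenAlex, or Semantic Scholar, since a well-formed DOI can be entirely fictional.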

Key Features

Absolute Confidence for Researchers

  • Contextual Paragraph-level Verification (--paragraph) — Audits each distinct in-text citation independently against its parent reference.
  • Double-Blind Review Proxy (--role reviewer) — Intelligently anonymizes provenance data ensuring zero privacy leaks.
  • Deterministic Scientific Execution (--deterministic) — Freezes random seeds for reproducible pipelines.

Systematic Literature Discovery

  • 5-source Broad Search — OpenAlex + Semantic Scholar + arXiv + Europe PMC + CrossRef parallelization.
  • 7-source PDF Defense Chain — Unpaywall → Semantic Scholar → CrossRef → Search Index → CORE → DOAJ → BASE.
  • Bot-Resistance Routing — Utilizes cloudscraper stealth capabilities to bypass publisher CDN roadblocks.
  • Paywall Traffic Detection — Pre-emptive HTTP HEAD validation blocking counterfeit HTML paywalls.
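
Paywall detection of the kind described above usually combines two signals: the Content-Type header from a HEAD response, and the first bytes of the body once it is downloaded. The following is a minimal sketch of both checks using only the standard library. The function and enum names are hypothetical, and the HTTP request itself is left out.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Payload {
    Pdf,
    Html,
    Unknown,
}

/// Pre-download check: does the Content-Type header value claim a PDF?
/// Parameters after ';' (such as charset) are ignored.
pub fn content_type_is_pdf(content_type: &str) -> bool {
    content_type
        .split(';')
        .next()
        .map(|mime| mime.trim().eq_ignore_ascii_case("application/pdf"))
        .unwrap_or(false)
}

/// Post-download check: classify the body by its leading bytes.
/// PDF files start with "%PDF-"; HTML paywall pages start with a doctype or <html> tag.
pub fn sniff_payload(body: &[u8]) -> Payload {
    // Skip leading whitespace, which HTML pages often contain.
    let start = body
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(body.len());
    let head = &body[start..];
    if head.starts_with(b"%PDF-") {
        return Payload::Pdf;
    }
    let probe: Vec<u8> = head.iter().take(15).map(|b| b.to_ascii_lowercase()).collect();
    if probe.starts_with(b"<!doctype html") || probe.starts_with(b"<html") {
        return Payload::Html;
    }
    Payload::Unknown
}
```

The byte check matters because a server can label an HTML login page as `application/pdf`, so the header alone is not sufficient evidence.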

Under the Hood

  • Hybrid Retrieval System: Dense E5 embeddings (1024d) + BM25 sparse index, combined with 3-way RRF fusion on LanceDB.
  • Intelligent Graph Matrix: Neo4j persists the paper–author–institution relationships.
  • Resilient Pipeline (Fail-Safe): Exponential API backoff and fallback routing keep a run going when individual services fail.
  • Smart Rate Limiting (429 Fail-Fast): Key-aware throttling trips a circuit breaker as soon as an HTTP 429 arrives and re-routes requests instead of blocking the pipeline.
  • Deep Connection Pooling: HTTP/2 multiplexing (pool=50) with Brotli compression to maximize throughput during large batch verifications.
  • Fast, Concurrent & Memory Safe: Built in Rust (citesage-rs) for multi-threaded throughput, with compile-time protection against memory-safety bugs and type errors.
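
Reciprocal Rank Fusion (RRF), mentioned in the first bullet, merges several ranked lists by giving each document a score of 1 / (k + rank) in every list where it appears and summing those scores. The sketch below is a generic version of that formula, not CiteSage's implementation. The function name is hypothetical, and the constant k = 60 used in the usage note is the commonly cited default rather than a value taken from this project.

```rust
use std::collections::HashMap;

/// Fuses ranked lists of document IDs with Reciprocal Rank Fusion.
/// Each list is ordered best-first; ranks start at 1.
/// Returns (id, score) pairs sorted by descending score, ties broken by id.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, doc_id) in ranking.iter().enumerate() {
            let rank = idx as f64 + 1.0;
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[dense, sparse, third], 60.0)` with three best-first lists yields one merged ranking. Because RRF uses only ranks, the dense and BM25 scores never need to be normalized against each other.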

Configuration

Copy .env.example to .env and fill in your credentials:

OPENALEX_EMAIL=your_email@example.com  # Joins the OpenAlex polite pool for more consistent response times
SEMANTIC_SCHOLAR_API_KEY=your_key      # Mandatory for production volume queries
CORE_API_KEY=your_key                  # Discovers open access fallbacks
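
For readers wondering how such variables might be consumed on the Rust side, here is a minimal sketch that reads the three keys above into a struct. The `ApiConfig` type and its methods are hypothetical. The sketch also assumes the variables are already present in the process environment (for example exported by the shell or by a dotenv loader), since `std::env` does not read `.env` files itself.

```rust
/// Credentials for the external services. All fields are optional so that
/// a missing key can degrade gracefully instead of aborting the run.
#[derive(Debug, PartialEq, Eq)]
pub struct ApiConfig {
    pub openalex_email: Option<String>,
    pub semantic_scholar_api_key: Option<String>,
    pub core_api_key: Option<String>,
}

impl ApiConfig {
    /// Builds the config from any key-to-value lookup.
    /// Values are trimmed, and blank values are treated as unset.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        let get = |key: &str| {
            lookup(key)
                .map(|value| value.trim().to_string())
                .filter(|value| !value.is_empty())
        };
        ApiConfig {
            openalex_email: get("OPENALEX_EMAIL"),
            semantic_scholar_api_key: get("SEMANTIC_SCHOLAR_API_KEY"),
            core_api_key: get("CORE_API_KEY"),
        }
    }

    /// Reads the same keys from the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Separating `from_lookup` from `from_env` keeps the parsing logic testable without touching the real environment.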

Development Toolchain

# Run the test suite across the whole Rust workspace (citesage-cli, citesage-core, citesage-verify)
cargo test --workspace

# Lint with warnings treated as errors, then format the code
cargo clippy --workspace -- -D warnings
cargo fmt --all

Technical Roadmap

Phase | Delivered Scope | Status
1–10 | Verification kernel, GROBID interface, format versatility, Docker CI orchestration. | ✅ Done
11 | Hybrid RAG (BM25+RRF), structural chunk mapping, Cross-Encoder topological sorting. | ✅ Done
12 | Precision Grounded Generation, Continuous Ragas testing, Cloudflare resistance setup. | ✅ Done
13 | 5-way search hub, E5 geometry tracking, rigorous 7-source PDF supply chain. | ✅ Done
14 | Architecture decomposition (CLI split, abstract Repair/Embedding), Docker isolation. | ✅ Done
15 | Temporal Alignment Engine, graphical Evidence Cards, Web Portal (Chainlit). | ✅ Done
16 | Output serialization standards, deterministic log aggregators, resolver stabilizers. | ✅ Done
17 | Universal Code Audit, Gold Standard compliance metric, Dual-Push CI deployments. | ✅ Done
17+ | LanceDB structural migration, Semantic Shared isolation limits. | ✅ Done
18 | Multi-node asynchronous pipeline scaling. | ✅ Done
19 | The Great Rust Renaissance: Total system evolution substituting Python orchestrators with native Rust (citesage-rs); integrating hyper-concurrent verification architectures. | 🔵 Active

Academic Attribution

If CiteSage helps your research, please cite it using the entry below (optional under the MIT License):

@software{citesage2026,
  author = {Chiang, Chenwei},
  title  = {CiteSage: An Agentic Verification Framework for Academic Integrity},
  year   = {2026},
  url    = {https://github.com/chenweichiang/citesage},
  note   = {v6.0.0-alpha, Phase 19}
}

License

Released under the MIT License. See LICENSE.

CiteSage is an ongoing academic research project. Priority and collaboration terms are described in RESEARCH-NOTICE.md. The architecture and project history are documented in SYSTEM_PLAN.md.

About

Automated citation verification system with Agentic self-learning. Validates academic metadata and semantic consistency using ML and multi-source databases.
