Summary
Detect references that appear in the reference list but are never cited in the manuscript body. This is the highest-priority v4 heuristic — high value, low complexity, immediately testable.
Background
Sneaked references are a paper mill signature. Adding uncited references inflates citation counts for specific papers (often other paper mill products) or establishes fake citation networks. A managing editor currently has to manually cross-check the reference list against in-text citations — tedious and error-prone for a 40+ reference manuscript.
See roadmap/v4-features.md (Priority 1) for full specification.
Policy Constraint: Manuscript Confidentiality
Elsevier (which publishes AWHONN's journals — JOGNN, MCN, Nursing for Women's Health) has instructed editorial teams not to use publicly available AI chatbots for reference checking. This policy is aimed at consumer LLM interfaces (claude.ai, ChatGPT, etc.) and reflects legitimate confidentiality concerns around pre-publication manuscripts.
The reference list itself is low-sensitivity — it's metadata about already-published work. The auditor's current reference-list-only design is probably defensible under this policy, though editorial teams should confirm with their publisher.
The manuscript body is high-sensitivity. Sneaked-reference detection requires cross-referencing in-text citations against the reference list, which means touching manuscript content. This creates a policy tension that the implementation must address.
Implementation Options
Option A: Local Script (Recommended for Now)
Sneaked-reference detection is a mechanical cross-check, not a forensic judgment call. It doesn't need AI at all.
Ship a lightweight Python script that runs locally on the editor's machine:
Accepts manuscript body text (PDF, DOCX, or pasted text)
Parses in-text citation markers
Compares against the reference list
Reports orphaned references (in reference list but never cited)
Reports phantom citations (cited in text but not in reference list)
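The core cross-check is small enough to sketch. A minimal version, assuming numeric bracket markers like [14] (author-year styles such as (Smith, 2024) would need additional patterns; all names here are illustrative, not the shipped script):

```python
import re

# Matches numeric in-text markers like [14] or [3, 7]. Author-year
# styles such as (Smith, 2024) would need additional patterns.
NUMERIC_MARKER = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")

def cross_check(body_text, reference_numbers):
    """Compare in-text citation markers against the reference list.

    body_text: manuscript body as plain text (already extracted locally,
    so nothing leaves the editor's machine).
    reference_numbers: set of ints, one per numbered reference-list entry.
    Returns (orphaned, phantom): entries listed but never cited, and
    citations with no matching reference-list entry.
    """
    cited = set()
    for match in NUMERIC_MARKER.finditer(body_text):
        for num in match.group(1).split(","):
            cited.add(int(num.strip()))
    orphaned = sorted(reference_numbers - cited)  # in list, never cited
    phantom = sorted(cited - reference_numbers)   # cited, not in list
    return orphaned, phantom
```

For example, `cross_check("See [2] and [5, 7].", {1, 2, 3, 5, 7})` returns `([1, 3], [])`: references 1 and 3 are orphaned, and no phantom citations exist.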
Pros: Zero policy concerns — nothing leaves the editor's machine. Simple to build, simple to deploy. Separates the mechanical task from the forensic task cleanly.
Cons: Two tools instead of one. The auditor doesn't know about sneaked references, so it can't combine that signal with other heuristics (e.g., a sneaked reference plus a shadow paper = very-high-confidence fabrication).
Option B: Extract-and-Discard
A local preprocessing step extracts only the in-text citation markers from the manuscript body — (Smith, 2024), [14], etc. — and sends that extracted list (not the manuscript text) to the auditor alongside the reference list.
Pros: The auditor can integrate sneaked-reference signals with other heuristics. Minimal exposure surface — citation markers contain no manuscript content.
Cons: Requires a local preprocessing tool. The extracted citation list is still derived from the manuscript, so policy teams may have opinions about it. Adds friction to the workflow.
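The extraction step could be a single regex pass that keeps only the markers and discards the body text. A sketch under assumed citation styles (the patterns below are illustrative; real manuscripts would need broader coverage):

```python
import re

# Illustrative patterns: numeric markers like [14] and simple
# author-year markers like (Smith, 2024) or (Smith & Jones, 2023).
PATTERNS = [
    re.compile(r"\[\d+(?:\s*,\s*\d+)*\]"),
    re.compile(r"\([A-Z][A-Za-z'-]+(?:\s*(?:&|and)\s*[A-Z][A-Za-z'-]+)?,\s*\d{4}[a-z]?\)"),
]

def extract_markers(body_text):
    """Return only the citation markers, in order of appearance.

    The manuscript body itself is discarded; only this marker list
    would be sent to the auditor alongside the reference list.
    """
    hits = []
    for pattern in PATTERNS:
        hits.extend((m.start(), m.group()) for m in pattern.finditer(body_text))
    return [text for _, text in sorted(hits)]
```

Because only the matched markers survive, the exposure surface stays limited to citation metadata, which is the point of this option.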
Option C: Full Manuscript via API with DPA
In a productized deployment, the publisher signs a data processing agreement (DPA) with the API provider. Full manuscript access, full sneaked-reference detection, fully compliant.
Pros: Cleanest integration. Full heuristic interaction. No workflow friction.
Cons: Requires business relationships, legal agreements, and a product that doesn't exist yet. This is the production-grade path, not the current-state path.
Recommended Sequencing
Now: Build Option A (local Python script) as a standalone companion tool. Ships fast, zero policy risk, immediately useful.
If productizing: Build Option B into the workflow — local extraction feeds the auditor. Test whether editorial teams find the two-step process acceptable.
At scale: Option C with API access and a DPA. The tool handles everything.
Acceptance Criteria (Option A — Local Script)
Python script accepts manuscript as PDF, DOCX, or plain text
Acceptance Criteria (Option B — Extract-and-Discard, Future)
Test Cases Needed
Dependencies