A working extension to the IEEE paper project (“AI Librarian”) that adds a Neo4j knowledge graph on top of the RAG stack. It ingests ISAW Papers into Article/Chunk nodes, ingests Pleiades places into Place nodes, and links chunks to places with MENTIONS edges. Designed so you can bulk-ingest documents first, then iterate on linking without re-ingesting them. Wikidata alignment is implemented via the Wikidata Query Service (SPARQL) using property P1584 (Pleiades ID).
ISAW Papers is an open-access journal from NYU’s Institute for the Study of the Ancient World, published on the web (HTML) with permissive licensing and stable URLs—good for clean text extraction and citation.
Pleiades is the community-curated gazetteer of ancient places, providing stable identifiers (pleiadesId), canonical titles, multilingual alternate names, place types, and URIs—exactly what you want to normalize messy place mentions in text.
Wikidata is a general, CC0 knowledge base that cross-links to many authority files (including Pleiades), adding identifiers, aliases, and statements you can leverage for enrichment and disambiguation.
Putting them together is sound because:
- Your text (ISAW chunks) produces ambiguous surface forms.
- Pleiades gives you the authoritative place entity backbone.
- Wikidata extends those entities with broader graph context and cross-IDs for downstream linking.
A graph-augmented RAG pipeline:
- Your existing vector/RAG system (“AI Librarian”).
- A Neo4j knowledge graph that models Articles, Chunks, People, Concepts, and Places from Pleiades.
- A deterministic linker that connects `Chunk → Place` via surface forms (`title` + `altNames`) using a full-text shortlist + regex boundaries.
Goal: Add entity-level structure and repeatable linking over the same corpus used by the RAG backend, so you can explore, audit, and enrich relationships that matter to Ancient World research.
A small Python job (`wd_enrich_places.py`) batches all `Place.pleiadesId` values and queries the Wikidata Query Service (WDQS) with SPARQL. For each hit where `?item wdt:P1584 ?pleiadesId`, we:
- `MERGE` a `(:WikidataEntity {qid})` and set `uri`, `label`, optional `instanceOf` (P31), and `lat`/`lon` from P625.
- `MERGE (p:Place {pleiadesId})-[:SAME_AS {property:'P1584', source:'wikidata', matchedBy:'pleiadesId'}]->(w:WikidataEntity)`.
It’s idempotent and safe to re-run. Batching and polite delays are included to respect WDQS rate limits.
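A sketch of one WDQS batch lookup follows; the function name, batch handling, and User-Agent string are illustrative rather than copied from the script, and the returned QIDs/labels feed the `MERGE` statements above.

# Illustrative WDQS batch lookup for wd_enrich_places.py-style enrichment (sketch only).
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def fetch_wikidata_matches(pleiades_ids):
    """Return {pleiadesId: {"qid": ..., "label": ...}} for one batch of Pleiades IDs."""
    values = " ".join(f'"{pid}"' for pid in pleiades_ids)
    sparql = f"""
    SELECT ?item ?itemLabel ?pleiadesId WHERE {{
      VALUES ?pleiadesId {{ {values} }}
      ?item wdt:P1584 ?pleiadesId .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    resp = requests.post(
        WDQS_ENDPOINT,
        data={"query": sparql},
        headers={
            "Accept": "application/sparql-results+json",
            # WDQS asks for a descriptive User-Agent; adjust to your project/contact.
            "User-Agent": "ai-librarian-graph/0.1 (contact: you@example.org)",
        },
        timeout=60,
    )
    resp.raise_for_status()
    matches = {}
    for row in resp.json()["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]  # entity URI -> QID
        matches[row["pleiadesId"]["value"]] = {"qid": qid, "label": row["itemLabel"]["value"]}
    return matches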
- ISAW Papers (articles & chunks) → `:Article`, `:Chunk`
- Pleiades (gazetteer of ancient places) → `:Place` (with `pleiadesId`, `title`, `altNames`, etc.)
- Wikidata (live enrichment) → `:WikidataEntity` nodes created by resolving `Place.pleiadesId` through P1584; we add `[:SAME_AS {property:'P1584', source:'wikidata', matchedBy:'pleiadesId'}]` from `Place → WikidataEntity` and optionally store `label`, `instanceOf` (P31), and coordinates (P625 → `lat`, `lon`).
- Nodes: 42,577
- Labels used: `Article`, `Chunk`, `Concept`, `Person`, `Place`
- Relationships: 32,057
- Types used: `AUTHORED`, `CONNECTED`, `HAS_CHUNK`, `MENTIONS`, `NEXT`, `PART_OF`

After running the Wikidata job you will also see `WikidataEntity` nodes and `SAME_AS` relationships. Counts will depend on how many batches you’ve completed.
Property keys (seen in use):
aliases, altNames, articleId, associationCertainty, by, chunkId, connectionType, corresponding, data, description, id, journal, languages, matched, name, nodes, order, placeTypes, pleiadesId, relationships, review_state, role, seq, source, style, subject, text, textEmbedding, title, uri, url, visualisation, year
“Required” = property present on all nodes/relationships of that label/type in your DB snapshot (not an enforced constraint unless you add one). “Optional” = present on a subset.
Place — 42,139 nodes
- Required:
  - `pleiadesId` (string)
  - `uri` (string, Pleiades place URL)
  - `source` (string, e.g., "Pleiades")
- Optional:
  - `title` (string) — present on ~41,777
  - `altNames` (list) — ~41,777
  - `description` (string) — ~41,777
  - `placeTypes` (list) — ~41,777
  - `subject` (list) — ~41,777
  - `languages` (list) — ~41,777
  - `review_state` (string) — ~41,777
Article — 2 nodes
- Required:
  - `articleId` (string)
  - `title` (string)
  - `year` (int)
  - `journal` (string)
  - `url` (string)
Chunk — 218 nodes
- Required:
  - `chunkId` (string)
  - `seq` (int)
  - `text` (string)
  - `textEmbedding` (list)
Person — 213 nodes
- Required:
  - `name` (string)
  - `aliases` (list)
Concept — 5 nodes
- Required:
  - `name` (string)
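These tallies come from counting property presence per label in the snapshot. A quick way to reproduce them is a key-coverage query like the following (an illustrative helper, not one of the repo's scripts); compare each count against the label's total (42,139 for `Place`) to decide Required vs. Optional.

# Count how many nodes of a label carry each property key (illustrative; swap the
# label to audit Article, Chunk, Person, or Concept the same way).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))

COVERAGE_QUERY = """
MATCH (n:Place)
UNWIND keys(n) AS key
RETURN key, count(*) AS nodes_with_key
ORDER BY nodes_with_key DESC
"""

with driver.session() as session:
    for record in session.run(COVERAGE_QUERY):
        print(record["key"], record["nodes_with_key"])
driver.close()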
- Neo4j 5.x / 4.x
- Neo4j Driver for Python
- APOC optional (linker provided without APOC)
- Python 3.10+ for ingestion & linking scripts
- Requests + python-dotenv (for the WDQS job; see the environment-loading sketch after this list)
- Vector/RAG: your AI Librarian stack (OpenAI embeddings + Chroma, LangChain RetrievalQA, GPT-4) remains unchanged
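The scripts read their Neo4j connection from the environment. A minimal loading pattern with python-dotenv, assuming a local `.env` file and the variable names used in the Wikidata step below:

# Load Neo4j connection settings from the environment / a local .env file (sketch).
import os
from dotenv import load_dotenv

load_dotenv()  # no-op if there is no .env file

NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.getenv("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]          # required; no safe default
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE", "neo4j")  # only matters if non-default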
.
├─ data/
│ ├─ articles/ # raw article exports (optional)
│ └─ chunks/ # JSON arrays of strings, one file per article
├─ graph/
│ ├─ constraints.cypher # uniqueness constraints & indexes
│ ├─ ingest_articles.py # upsert Article + Chunk + HAS_CHUNK + NEXT
│ ├─ ingest_pleiades.py # upsert Place nodes from Pleiades dump
│ ├─ build_fulltext.cypher # full-text index on Chunk(text)
│ ├─ linker_places.py # Chunk -> Place MENTIONS creation
│ └─ sanity.cypher # verification queries
├─ tools/
│ ├─ to_jsonl_fix_unicode.py # robust JSONL converter for article chunks
│ └─ wd_enrich_places.py # Wikidata enrichment job (P1584 → SAME_AS)
└─ README.md
CREATE CONSTRAINT article_id IF NOT EXISTS
FOR (a:Article) REQUIRE a.articleId IS UNIQUE;
CREATE CONSTRAINT chunk_id IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE;
CREATE CONSTRAINT place_pid IF NOT EXISTS
FOR (p:Place) REQUIRE p.pleiadesId IS UNIQUE;
CREATE CONSTRAINT wd_qid IF NOT EXISTS
FOR (w:WikidataEntity) REQUIRE w.qid IS UNIQUE;

// Neo4j 4.x procedure; on Neo4j 5.x it has been removed. Use instead:
// CREATE FULLTEXT INDEX chunkText IF NOT EXISTS FOR (c:Chunk) ON EACH [c.text];
CALL db.index.fulltext.createNodeIndex(
  'chunkText',
  ['Chunk'],
  ['text']
);

Input is a JSON array of strings (each string = a chunk). The converter normalizes Unicode and writes JSONL with `articleId`, `chunkId`, `seq`, and `text`.
python tools/to_jsonl_fix_unicode.py data/chunks/isaw2.txt data/chunks/isaw2.jsonl --article-id isaw-papers-2-2012

Upsert both `Article` and `Chunk` nodes and wire `HAS_CHUNK` & `NEXT`. Your script may already do this; keep the idempotent `MERGE` pattern.
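A minimal sketch of that upsert, consuming the JSONL rows produced by the converter (`articleId`, `chunkId`, `seq`, `text`); it is not the repo's `ingest_articles.py` verbatim, and article metadata (`title`, `year`, `journal`, `url`) would be merged separately.

# Minimal sketch of the Article/Chunk upsert (idempotent MERGE pattern).
import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))

UPSERT_CHUNK = """
MERGE (a:Article {articleId: $articleId})
MERGE (c:Chunk {chunkId: $chunkId})
  SET c.seq = $seq, c.text = $text
MERGE (a)-[:HAS_CHUNK]->(c)
WITH c
// Wire NEXT from the previous chunk in the sequence, if one was passed.
OPTIONAL MATCH (prev:Chunk {chunkId: $prevChunkId})
FOREACH (_ IN CASE WHEN prev IS NULL THEN [] ELSE [1] END |
  MERGE (prev)-[:NEXT]->(c))
"""

def upsert_chunk(tx, row, prev_chunk_id):
    # row: {"articleId": ..., "chunkId": ..., "seq": ..., "text": ...}
    tx.run(UPSERT_CHUNK, prevChunkId=prev_chunk_id, **row).consume()

def ingest(jsonl_path):
    with driver.session() as session, open(jsonl_path, encoding="utf-8") as fh:
        prev_id = None
        for line in fh:
            row = json.loads(line)
            session.execute_write(upsert_chunk, row, prev_id)
            prev_id = row["chunkId"]

if __name__ == "__main__":
    ingest("data/chunks/isaw2.jsonl")
    driver.close()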
Upsert `:Place` with at least:
- `pleiadesId`
- `title`
- `altNames` (array)
Keep other Pleiades fields if available.
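A minimal sketch of the Place upsert, assuming the Pleiades dump has already been flattened to one dict per place; property names mirror the schema above, and this is not the repo's `ingest_pleiades.py` verbatim.

# Minimal sketch of the Place upsert with UNWIND-based batching.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))

UPSERT_PLACE = """
UNWIND $rows AS row
MERGE (p:Place {pleiadesId: row.pleiadesId})
  SET p.title    = row.title,
      p.altNames = row.altNames,
      p.uri      = row.uri,
      p.source   = 'Pleiades'
"""

def upsert_places(rows, batch_size=1000):
    # rows: list of dicts with pleiadesId, title, altNames (list), uri
    with driver.session() as session:
        for i in range(0, len(rows), batch_size):
            batch = rows[i:i + batch_size]
            session.execute_write(lambda tx, b=batch: tx.run(UPSERT_PLACE, rows=b).consume())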
The provided `linker_places.py` does:
- Full-text shortlist on `Chunk.text`
- Boundary regex on the matched name
- `MERGE (c)-[:MENTIONS {matched:$name, source:'fulltext+regex'}]->(p)`
It is idempotent and safe to re-run after tuning.
Resolve Place → Wikidata via P1584 and add SAME_AS:
# Make sure NEO4J_URI / NEO4J_USER / NEO4J_PASSWORD (and NEO4J_DATABASE if non-default) are set
python tools/wd_enrich_places.py

from neo4j import GraphDatabase
import re

URI = "bolt://localhost:7687"
USER = "neo4j"
PASSWORD = "your_password"

driver = GraphDatabase.driver(URI, auth=(USER, PASSWORD))

def fetch_places(tx):
    cypher = """
    MATCH (p:Place)
    RETURN p.pleiadesId AS pid,
           p.title AS title,
           coalesce(p.altNames, []) AS altNames
    """
    return list(tx.run(cypher))

def link_name_to_place(tx, pid, name, rx):
    # Shortlist candidate chunks with the full-text index, confirm each hit with a
    # boundary regex, then MERGE the MENTIONS edge (idempotent).
    cypher = """
    CALL db.index.fulltext.queryNodes('chunkText', $q) YIELD node, score
    WITH node WHERE node.text =~ $rx
    MATCH (p:Place {pleiadesId:$pid})
    MERGE (node)-[r:MENTIONS {matched:$name, source:'fulltext+regex'}]->(p)
    RETURN count(r) AS created
    """
    return tx.run(
        cypher,
        q=f'"{name}"',  # quoted Lucene phrase; names containing '"' would need escaping
        rx=rx,
        pid=pid,
        name=name,
    ).single()[0]

def compile_boundary_regex(term):
    esc = re.escape(term)
    # Cypher's =~ must match the whole string, so pad both sides with .*;
    # (?s) lets . cross newlines and (?i) makes the match case-insensitive.
    return f"(?si).*(^|[^A-Za-z0-9_]){esc}([^A-Za-z0-9_]|$).*"

def main():
    with driver.session() as s:
        places = s.execute_read(fetch_places)
        for rec in places:
            pid = rec["pid"]
            names = [rec["title"], *(rec["altNames"] or [])]
            # Drop empty/very short strings, then de-duplicate case-insensitively.
            seen = set()
            names = [n.strip() for n in names if n and len(n.strip()) >= 3]
            names = [n for n in names if not (n.lower() in seen or seen.add(n.lower()))]
            for name in names:
                rx = compile_boundary_regex(name)
                s.execute_write(link_name_to_place, pid, name, rx)
    driver.close()

if __name__ == "__main__":
    main()

MATCH (a:Article) RETURN count(a) AS articles;
MATCH (c:Chunk) RETURN count(c) AS chunks;
MATCH (p:Place) RETURN count(p) AS places;
MATCH ()-[r:MENTIONS]->(:Place) RETURN count(r) AS mentions;
MATCH (w:WikidataEntity) RETURN count(w) AS wd_items;
MATCH (:Place)-[r:SAME_AS {property:'P1584'}]->(:WikidataEntity) RETURN count(r) AS same_as_p1584;

MATCH (a:Article)-[:HAS_CHUNK]->(c)
OPTIONAL MATCH (c)-[m:MENTIONS]->(p:Place)
RETURN a.articleId,
count(DISTINCT c) AS chunks,
count(DISTINCT p) AS places,
count(DISTINCT m) AS links
ORDER BY links DESC
LIMIT 50;

MATCH (:Article {articleId:$aid})-[:HAS_CHUNK]->(c)-[r:MENTIONS]->(p:Place)
RETURN c.chunkId, p.pleiadesId, p.title, r.matched
ORDER BY c.chunkId, p.title
LIMIT 100;

Check Wikidata joins for a sample place:
MATCH (p:Place {pleiadesId:$pid})-[:SAME_AS {property:'P1584'}]->(w:WikidataEntity)
RETURN p.pleiadesId, p.title, w.qid, w.label, w.instanceOf, w.lat, w.lon;

The RAG app continues to use vector search over chunk text to ground answers with citations.
The graph adds:
- Explicit `MENTIONS` (Chunk → Place) edges you can facet or filter by.
- Navigation across `Article` / `Chunk` / `Place` / `Person` / `Concept` via Cypher.
- A stable substrate for experimenting with entity resolution and cross-corpus links (e.g., DCAA / AWDL later).
- Wikidata-backed enrichment for `Place` via `SAME_AS`, enabling cross-IDs and additional metadata (type, coords).
You can expose graph-powered filters in your UI (for example: “restrict to chunks mentioning Babylon”).
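The Babylon example reduces to a single Cypher lookup whose result can pre-filter (or post-filter) vector retrieval; the exact-title match below is illustrative.

# Collect chunkIds that mention a given place, for use as a retrieval filter (sketch).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))

FILTER_QUERY = """
MATCH (c:Chunk)-[:MENTIONS]->(p:Place)
WHERE toLower(p.title) = toLower($placeName)
RETURN collect(DISTINCT c.chunkId) AS chunkIds
"""

with driver.session() as session:
    chunk_ids = session.run(FILTER_QUERY, placeName="Babylon").single()["chunkIds"]
# chunk_ids can then be passed to the vector store as a metadata filter.
driver.close()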
- Extend Wikidata enrichment: beyond places (P1584), add person and concept alignment (e.g., VIAF/ORCID/ULAN, topical items), and pull selected statements (e.g., P625, P279, P131, P17) for analysis.
- Disambiguation: optional context cues (e.g., Mesopotamia | Euphrates | Assyria nearby) for high-ambiguity names.
- Accent-folding & fuzzy: accent-insensitive matching and cautious fuzzy matching for long names (≥ 6–7 chars).
- People/Concept linkers: parallel pipelines for `Person` and `Concept`.
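A minimal accent-folding helper of the kind such a linker could use (a sketch; nothing like it ships in the repo yet):

# Accent folding: "Koilē Syría" -> "koile syria" (sketch for a future linker pass).
import unicodedata

def fold_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()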
- ISAW Papers: open access, CC-BY (site-hosted).
- Pleiades: open, with clear attribution requirements; keep `pleiadesId` and source `uri` in your nodes.
- Wikidata: CC0; store QIDs and source edges.
- Keep `MERGE`-idempotent Cypher; don’t introduce write patterns that duplicate nodes.
- Treat linkers as pure functions over existing nodes: re-runnable, measurable, and auditable.
- Add sanity queries for every new entity type and every new linker.