Graph Extension Research #192

@Deodat-Lawson

[Feature] Neo4j Knowledge Graph Integration for Document Processing and RAG Retrieval

Summary

Integrate Neo4j as an optional knowledge graph backend to:

  1. Document processing: Update the knowledge graph when documents are processed (when Neo4j is available)
  2. RAG retrieval: Use Neo4j for graph-based retrieval during document Q&A (when Neo4j is available)

The integration should be optional and gracefully degradable—the system continues to work without Neo4j, falling back to the existing PostgreSQL-based knowledge graph and BM25+Vector ensemble.


Background / Current State

Existing Knowledge Graph (PostgreSQL)

The codebase already has a PostgreSQL-based knowledge graph:

| Component | Location | Purpose |
| --- | --- | --- |
| Schema | src/server/db/schema/knowledge-graph.ts | kg_entities, kg_entity_mentions, kg_relationships |
| Entity extraction | src/lib/ingestion/entity-extraction.ts | Calls sidecar /extract-entities, stores in PostgreSQL |
| Graph retriever | src/lib/tools/rag/retrievers/graph-retriever.ts | Traverses 1–2 hops via Drizzle/SQL |
| Pipeline hook | src/lib/tools/doc-ingestion/index.ts | maybeExtractEntities() runs after storeDocument() when sidecar is available |

Current RAG Retrieval

  • Ensemble: BM25 + Vector only (weights [0.4, 0.6])
  • Graph retriever: Implemented but not wired into the ensemble
  • Reranking: Optional sidecar cross-encoder when SIDECAR_URL is set

Document Processing Flow

Upload → Ingest → Chunk → Embed → Store (pgvector) → [Optional] Extract Entities → [Proposed] Sync to Neo4j

Motivation

  1. Graph-native traversal: Neo4j excels at multi-hop graph traversal and path queries; PostgreSQL uses recursive CTEs which can be slower on large graphs.
  2. Cypher expressiveness: Cypher allows concise, readable graph queries (e.g. variable-length paths, pattern matching).
  3. Scalability: For companies with large document corpora (100K+ entities), Neo4j can provide better query performance.
  4. Future extensibility: Enables graph algorithms (PageRank, community detection) and richer relationship types without schema changes.

Proposed Solution

1. Document Processing: Sync to Neo4j When Available

Trigger: After entity extraction completes (PostgreSQL write), if NEO4J_URI is configured, sync entities and relationships to Neo4j.

Data flow:

  • Read from kg_entities, kg_entity_mentions, kg_relationships (or use the in-memory result from extraction)
  • Batch-write to Neo4j using MERGE for idempotency
  • Link nodes to documentId and companyId via properties for scoping

Cypher model (conceptual):

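// Statement 1: upsert an entity and the section that mentions it
MERGE (e:Entity {name: $name, label: $label, companyId: $companyId})
ON CREATE SET e.displayName = $displayName, e.confidence = $confidence
MERGE (s:Section {id: $sectionId, documentId: $documentId})
MERGE (e)-[m:MENTIONED_IN]->(s)
ON CREATE SET m.confidence = $conf

// Statement 2: bind both entities first, otherwise MERGE creates new unlabeled nodes
// ($sourceName and $targetName are illustrative parameter names)
MATCH (e1:Entity {name: $sourceName, companyId: $companyId})
MATCH (e2:Entity {name: $targetName, companyId: $companyId})
MERGE (e1)-[r:CO_OCCURS]->(e2)
ON CREATE SET r.weight = 0.5, r.evidenceCount = 1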
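
One way to batch the entity upserts is to send a list of rows per statement and let UNWIND apply the MERGE to each row. The sketch below only builds the statements and their parameters; it does not talk to Neo4j. The EntityRow shape, the string type of companyId, and the default batch size of 500 are assumptions for illustration, not part of the existing schema.

```typescript
// Hypothetical row shape; field names mirror the parameters in the Cypher model above.
interface EntityRow {
  name: string;
  label: string;
  companyId: string; // assumption: the real column type may differ
  displayName: string;
  confidence: number;
}

interface CypherBatch {
  cypher: string;
  params: { rows: EntityRow[] };
}

// One statement per batch: UNWIND expands the $rows list so MERGE runs once per row.
const UPSERT_ENTITIES = `
UNWIND $rows AS row
MERGE (e:Entity {name: row.name, label: row.label, companyId: row.companyId})
ON CREATE SET e.displayName = row.displayName, e.confidence = row.confidence
`.trim();

// Split rows into fixed-size chunks so a large document does not become one huge transaction.
function buildEntityBatches(rows: EntityRow[], batchSize = 500): CypherBatch[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: CypherBatch[] = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push({ cypher: UPSERT_ENTITIES, params: { rows: rows.slice(i, i + batchSize) } });
  }
  return batches;
}
```

Because every statement is a MERGE keyed on (name, label, companyId), re-running the same batches after a partial failure should not create duplicate entities.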

Implementation:

  • Add src/lib/graph/neo4j-client.ts — Neo4j driver wrapper, connection pooling, health check
  • Add src/lib/graph/neo4j-sync.ts — maps entities/relationships to Cypher MERGE statements
  • In maybeExtractEntities() or new step maybeSyncToNeo4j(), call sync after PostgreSQL write
  • Keep PostgreSQL as source of truth; Neo4j as optional read-optimized layer
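
The graceful-degradation requirement for the sync step could look roughly like this. It is a sketch under assumptions: the SyncOutcome type, the numeric documentId, and the injected sync function are hypothetical, and the real Neo4j write would live behind that function.

```typescript
// Hypothetical result type so callers can log the outcome without handling exceptions.
type SyncOutcome =
  | { status: "skipped"; reason: string }
  | { status: "synced"; entityCount: number }
  | { status: "failed"; error: string };

// The actual Neo4j write is injected; it resolves to the number of entities written.
type SyncFn = (documentId: number) => Promise<number>;

async function maybeSyncToNeo4j(
  documentId: number,
  neo4jUri: string | undefined,
  sync: SyncFn,
): Promise<SyncOutcome> {
  // Neo4j is optional: without a URI the step is a no-op.
  if (!neo4jUri) {
    return { status: "skipped", reason: "NEO4J_URI is not set" };
  }
  try {
    const entityCount = await sync(documentId);
    return { status: "synced", entityCount };
  } catch (err) {
    // PostgreSQL already holds the data, so a Neo4j failure is reported, not rethrown.
    const message = err instanceof Error ? err.message : String(err);
    return { status: "failed", error: message };
  }
}
```

Returning an outcome instead of throwing keeps document ingestion on the success path when Neo4j is down, and gives the metrics/logging task a single value to record.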

2. RAG Retrieval: Neo4j-Aware Graph Retrieval

Trigger: When performing ensemble search, if ENABLE_GRAPH_RETRIEVAL is set, include a graph retriever: Neo4j when NEO4J_URI is configured and reachable, otherwise the existing PostgreSQL GraphRetriever.

Flow:

  1. Extract query terms from user question
  2. If Neo4j available: run Cypher traversal in Neo4j
  3. Else if graph enabled: use existing PostgreSQL GraphRetriever
  4. Else: skip graph retrieval
  5. Fuse graph results with BM25 + Vector via RRF (Reciprocal Rank Fusion)
  6. Optional reranking via sidecar
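
Step 5 can be illustrated with a weighted form of Reciprocal Rank Fusion, where each retriever contributes weight / (k + rank) for every result it returns. This is a minimal sketch, not the ensemble's actual code: the function name, the use of plain section IDs, and k = 60 (a commonly used RRF constant) are assumptions.

```typescript
// Weighted Reciprocal Rank Fusion:
//   score(id) = sum over retrievers of weight / (k + rank), with rank starting at 1.
// Each inner array is one retriever's result IDs, best first.
function reciprocalRankFusion(
  rankings: string[][],
  weights: number[],
  k = 60,
): { id: string; score: number }[] {
  if (rankings.length !== weights.length) {
    throw new Error("rankings and weights must have the same length");
  }
  const scores = new Map<string, number>();
  rankings.forEach((ranking, i) => {
    const weight = weights[i] ?? 0;
    ranking.forEach((id, position) => {
      const contribution = weight / (k + position + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  });
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort((a, b) => b.score - a.score);
}

// Usage: BM25, Vector, Graph with the example weights from this proposal.
const fused = reciprocalRankFusion(
  [["a", "b"], ["b", "c"], ["c"]],
  [0.3, 0.5, 0.2],
);
console.log(fused.map((r) => r.id));
```

When graph retrieval is skipped (step 4), the same function can be called with only the BM25 and Vector rankings and their two weights.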

Example Cypher for retrieval:

MATCH (e:Entity)
WHERE e.companyId = $companyId AND toLower(e.name) CONTAINS toLower($term)
MATCH (e)-[:MENTIONED_IN]->(s:Section)
WHERE s.documentId IN $documentIds
WITH DISTINCT s LIMIT $topK
RETURN s.id AS sectionId

Implementation:

  • Add src/lib/tools/rag/retrievers/neo4j-graph-retriever.ts: Neo4jGraphRetriever, extending LangChain BaseRetriever
  • Same interface as GraphRetriever: _getRelevantDocuments(query) returns Document[]
  • Section content fetched from PostgreSQL documentSections (Neo4j stores section IDs only)
  • Wire into createDocumentEnsembleRetriever, createCompanyEnsembleRetriever, createMultiDocEnsembleRetriever with configurable weight (e.g. [0.3, 0.5, 0.2] for BM25, Vector, Graph)
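
Since Neo4j would return section IDs only, the retriever needs a step that joins those IDs back to section content from PostgreSQL while keeping the graph's ranking. A possible shape for that step is sketched below. SectionRow and its fields are hypothetical, and RetrievedDocument is declared locally with the same pageContent/metadata shape as a LangChain Document so the example runs without LangChain installed.

```typescript
// Hypothetical row shape for the PostgreSQL documentSections table.
interface SectionRow {
  id: number;
  documentId: number;
  content: string;
}

// Same shape as a LangChain Document (pageContent + metadata), declared locally.
interface RetrievedDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Turn graph-ranked section IDs into documents, preserving graph order,
// dropping duplicates and IDs that no longer exist in PostgreSQL.
function hydrateSections(rankedSectionIds: number[], rows: SectionRow[]): RetrievedDocument[] {
  const byId = new Map<number, SectionRow>();
  rows.forEach((row) => byId.set(row.id, row));

  const seen = new Set<number>();
  const docs: RetrievedDocument[] = [];
  rankedSectionIds.forEach((id) => {
    const row = byId.get(id);
    if (!row || seen.has(id)) return;
    seen.add(id);
    docs.push({
      pageContent: row.content,
      metadata: { sectionId: row.id, documentId: row.documentId, source: "graph" },
    });
  });
  return docs;
}
```

Silently dropping IDs that are missing from PostgreSQL fits the source-of-truth rule: a stale Neo4j node should never surface content that PostgreSQL no longer has.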

3. Configuration

| Env var | Required | Description |
| --- | --- | --- |
| NEO4J_URI | No | Neo4j connection URI (e.g. neo4j://localhost:7687). If unset, Neo4j features are disabled. |
| NEO4J_USER | No | Neo4j username (default: neo4j) |
| NEO4J_PASSWORD | No | Neo4j password |
| ENABLE_GRAPH_RETRIEVAL | No | If true, include graph retriever in ensemble. Default: false until validated. |
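
The variables above could be read into a small config object, with a helper that applies the backend choice from the retrieval flow in section 2. This is a sketch: GraphConfig, readGraphConfig, and pickGraphBackend are hypothetical names, and treating only the literal string "true" as enabled is an assumption.

```typescript
interface GraphConfig {
  // null means Neo4j features are disabled.
  neo4j: { uri: string; user: string; password: string | undefined } | null;
  graphRetrievalEnabled: boolean;
}

// Accepts any env-like record so it can be tested without touching process.env.
function readGraphConfig(env: Record<string, string | undefined>): GraphConfig {
  const uri = env["NEO4J_URI"]?.trim();
  return {
    neo4j: uri
      ? { uri, user: env["NEO4J_USER"] ?? "neo4j", password: env["NEO4J_PASSWORD"] }
      : null,
    // Assumption: only the string "true" (case-insensitive) enables graph retrieval.
    graphRetrievalEnabled: env["ENABLE_GRAPH_RETRIEVAL"]?.trim().toLowerCase() === "true",
  };
}

type GraphBackend = "neo4j" | "postgres" | "none";

// Mirrors the retrieval flow: Neo4j if available, else PostgreSQL if graph retrieval
// is enabled, else no graph retrieval. neo4jReachable would come from a health check.
function pickGraphBackend(config: GraphConfig, neo4jReachable: boolean): GraphBackend {
  if (!config.graphRetrievalEnabled) return "none";
  if (config.neo4j !== null && neo4jReachable) return "neo4j";
  return "postgres";
}
```

In the application this would typically be called once at startup as readGraphConfig(process.env), with pickGraphBackend evaluated per query so that a Neo4j outage falls back to PostgreSQL.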

Implementation Tasks

  • Add neo4j-driver dependency
  • Create src/lib/graph/neo4j-client.ts (driver, health check, connection handling)
  • Create src/lib/graph/neo4j-sync.ts (entity/relationship sync from PostgreSQL to Neo4j)
  • Add maybeSyncToNeo4j() step in document ingestion pipeline (after maybeExtractEntities)
  • Create src/lib/tools/rag/retrievers/neo4j-graph-retriever.ts
  • Extend ensemble search to optionally include graph retriever (Neo4j or PostgreSQL)
  • Add integration tests for Neo4j sync and retrieval (with testcontainers or mock)
  • Document Neo4j setup in README or docs
  • Add metrics/logging for Neo4j sync and retrieval latency

Acceptance Criteria

  • When NEO4J_URI is set and a document is processed with entity extraction, entities and relationships are synced to Neo4j
  • When Neo4j is unavailable, document processing completes successfully (no failure)
  • When NEO4J_URI and ENABLE_GRAPH_RETRIEVAL are set, RAG queries include graph-based results in the ensemble
  • When Neo4j is unavailable at query time, retrieval falls back to PostgreSQL graph retriever or BM25+Vector only
  • No regression in existing RAG behavior when Neo4j is not configured

Alternatives Considered

| Approach | Pros | Cons |
| --- | --- | --- |
| Stay with PostgreSQL only | No new infra, schema exists | Traversal may slow on large graphs; no Cypher |
| Apache AGE | Cypher in PostgreSQL, single DB | Less mature, operational unknowns |
| Neo4j as primary graph | Best graph performance | Migration from PostgreSQL KG, more infra |
| Neo4j as optional sync (chosen) | Gradual adoption, fallback to PG | Dual-write complexity, eventual consistency |

Risks and Trade-offs

Operational Complexity

  • Risk: Neo4j is another service to deploy, monitor, backup.
  • Mitigation: Make Neo4j strictly optional. System works without it.

Data Consistency

  • Risk: Dual-write (PostgreSQL + Neo4j) can diverge if Neo4j write fails.
  • Mitigation: PostgreSQL is source of truth. Consider async sync via queue for resilience.
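
To make the queue idea concrete, here is a minimal in-memory sketch with bounded retries. It is illustrative only: Neo4jSyncQueue and its methods are hypothetical, a production queue would need to be durable and add backoff between attempts, and the worker stands in for the real Neo4j sync.

```typescript
interface SyncJob {
  documentId: number;
  attempts: number;
}

class Neo4jSyncQueue {
  private pending: SyncJob[] = [];
  // Jobs that exhausted their attempts; candidates for a later re-sync from PostgreSQL.
  readonly deadLetter: SyncJob[] = [];
  private readonly maxAttempts: number;

  constructor(maxAttempts = 3) {
    this.maxAttempts = maxAttempts;
  }

  enqueue(documentId: number): void {
    this.pending.push({ documentId, attempts: 0 });
  }

  get size(): number {
    return this.pending.length;
  }

  // Runs every currently pending job once. Failed jobs are re-queued for the
  // next drain until they reach maxAttempts, then moved to deadLetter.
  async drainOnce(worker: (documentId: number) => Promise<void>): Promise<void> {
    const batch = this.pending;
    this.pending = [];
    for (const job of batch) {
      try {
        await worker(job.documentId);
      } catch {
        const retried: SyncJob = { documentId: job.documentId, attempts: job.attempts + 1 };
        if (retried.attempts >= this.maxAttempts) {
          this.deadLetter.push(retried);
        } else {
          this.pending.push(retried);
        }
      }
    }
  }
}
```

Because PostgreSQL remains the source of truth, a dead-lettered job loses nothing: the document can be re-synced from kg_entities and kg_relationships once Neo4j is healthy again.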

Graph Retriever Not Yet in Ensemble

  • Note: The existing GraphRetriever is not wired into the ensemble. Phase 1 could be: wire PostgreSQL GraphRetriever first, measure impact, then add Neo4j.

Entity Extraction Quality

  • Note: Graph retrieval is only as good as extracted entities. Current extraction uses NER + CO_OCCURS. Improving relation extraction may yield more benefit than Neo4j alone.

Phased Rollout (Recommended)

  1. Phase 1: Wire existing PostgreSQL GraphRetriever into ensemble. Measure recall/latency. Validate graph retrieval adds value.
  2. Phase 2: Add Neo4j sync in document processing. Keep PostgreSQL as source of truth.
  3. Phase 3: Implement Neo4jGraphRetriever; use when NEO4J_URI is set.
  4. Phase 4: Consider improving entity/relationship extraction before scaling.
