[Feature] Neo4j Knowledge Graph Integration for Document Processing and RAG Retrieval
Summary
Integrate Neo4j as an optional knowledge graph backend to:
- Document processing: Update the knowledge graph when documents are processed (when Neo4j is available)
- RAG retrieval: Use Neo4j for graph-based retrieval during document Q&A (when Neo4j is available)
The integration should be optional and degrade gracefully: the system continues to work without Neo4j, falling back to the existing PostgreSQL-based knowledge graph and BM25+Vector ensemble.
Background / Current State
Existing Knowledge Graph (PostgreSQL)
The codebase already has a PostgreSQL-based knowledge graph:
| Component | Location | Purpose |
|---|---|---|
| Schema | `src/server/db/schema/knowledge-graph.ts` | `kg_entities`, `kg_entity_mentions`, `kg_relationships` |
| Entity extraction | `src/lib/ingestion/entity-extraction.ts` | Calls sidecar `/extract-entities`, stores in PostgreSQL |
| Graph retriever | `src/lib/tools/rag/retrievers/graph-retriever.ts` | Traverses 1–2 hops via Drizzle/SQL |
| Pipeline hook | `src/lib/tools/doc-ingestion/index.ts` | `maybeExtractEntities()` runs after `storeDocument()` when sidecar is available |
Current RAG Retrieval
- Ensemble: BM25 + Vector only (weights `[0.4, 0.6]`)
- Graph retriever: Implemented but not wired into the ensemble
- Reranking: Optional sidecar cross-encoder when `SIDECAR_URL` is set
Document Processing Flow
Upload → Ingest → Chunk → Embed → Store (pgvector) → [Optional] Extract Entities → [Proposed] Sync to Neo4j
Motivation
- Graph-native traversal: Neo4j excels at multi-hop graph traversal and path queries; PostgreSQL uses recursive CTEs which can be slower on large graphs.
- Cypher expressiveness: Cypher allows concise, readable graph queries (e.g. variable-length paths, pattern matching).
- Scalability: For companies with large document corpora (100K+ entities), Neo4j can provide better query performance.
- Future extensibility: Enables graph algorithms (PageRank, community detection) and richer relationship types without schema changes.
Proposed Solution
1. Document Processing: Sync to Neo4j When Available
Trigger: After entity extraction completes (PostgreSQL write), if NEO4J_URI is configured, sync entities and relationships to Neo4j.
Data flow:
- Read from `kg_entities`, `kg_entity_mentions`, `kg_relationships` (or use the in-memory result from extraction)
- Batch-write to Neo4j using `MERGE` for idempotency
- Link nodes to `documentId` and `companyId` via properties for scoping
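The batch-write step above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `ExtractedEntity`, `CypherBatch`, `chunk`, `buildEntityBatches`, the string-typed `companyId`, and the batch size of 500 are all hypothetical. It only builds query/parameter pairs; running them through the Neo4j driver is left out.

```typescript
// Hypothetical shape of an extracted entity; field names mirror the
// Cypher parameters used elsewhere in this proposal.
interface ExtractedEntity {
  name: string;
  label: string;
  displayName: string;
  confidence: number;
}

// One Cypher statement plus the parameters for a single batch.
interface CypherBatch {
  query: string;
  params: { companyId: string; rows: ExtractedEntity[] };
}

// UNWIND expands $rows into one row per entity; MERGE keeps the write
// idempotent, so re-running the same batch does not create duplicates.
const MERGE_ENTITIES = `
UNWIND $rows AS row
MERGE (e:Entity {name: row.name, label: row.label, companyId: $companyId})
ON CREATE SET e.displayName = row.displayName, e.confidence = row.confidence
`.trim();

// Split an array into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build one batch per slice; batchSize = 500 is an arbitrary assumed default.
function buildEntityBatches(
  entities: ExtractedEntity[],
  companyId: string,
  batchSize = 500,
): CypherBatch[] {
  return chunk(entities, batchSize).map((rows) => ({
    query: MERGE_ENTITIES,
    params: { companyId, rows },
  }));
}
```

Batching keeps each transaction bounded, which matters when a single document yields many entities.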
Cypher model (conceptual):

    MERGE (e:Entity {name: $name, label: $label, companyId: $companyId})
    ON CREATE SET e.displayName = $displayName, e.confidence = $confidence
    MERGE (s:Section {id: $sectionId, documentId: $documentId})
    MERGE (e)-[:MENTIONED_IN {confidence: $conf}]->(s)
    // e1 and e2 stand for two Entity nodes already bound as above
    MERGE (e1)-[r:CO_OCCURS]->(e2)
    ON CREATE SET r.weight = 0.5, r.evidenceCount = 1

Implementation:
- Add `src/lib/graph/neo4j-client.ts`: Neo4j driver wrapper, connection pooling, health check
- Add `src/lib/graph/neo4j-sync.ts`: maps entities/relationships to Cypher `MERGE` statements
- In `maybeExtractEntities()` or a new step `maybeSyncToNeo4j()`, call sync after the PostgreSQL write
- Keep PostgreSQL as source of truth; Neo4j as optional read-optimized layer
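A possible shape for the `maybeSyncToNeo4j()` step is sketched below. It assumes the sync function and logger are injected; `SyncOutcome`, `SyncDeps`, and the string-typed `documentId` are hypothetical and not taken from the codebase. The point it illustrates is that a Neo4j failure is logged and reported but never thrown, so ingestion still completes.

```typescript
// Hypothetical result type; the real pipeline may report status differently.
type SyncOutcome = "skipped" | "synced" | "failed";

interface SyncDeps {
  // Value of NEO4J_URI; undefined or empty means Neo4j is not configured.
  neo4jUri: string | undefined;
  // Hypothetical: writes the document's entities/relationships to Neo4j.
  sync: (documentId: string) => Promise<void>;
  log: (message: string) => void;
}

// Runs after the PostgreSQL write. Never throws: a Neo4j outage must not
// fail document processing, because PostgreSQL remains the source of truth.
async function maybeSyncToNeo4j(
  documentId: string,
  deps: SyncDeps,
): Promise<SyncOutcome> {
  if (!deps.neo4jUri) return "skipped";
  try {
    await deps.sync(documentId);
    return "synced";
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    deps.log(`Neo4j sync failed for document ${documentId}: ${reason}`);
    return "failed";
  }
}
```

Returning an outcome instead of throwing lets the caller record metrics without wrapping the call in its own `try`/`catch`.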
2. RAG Retrieval: Neo4j-Aware Graph Retrieval
Trigger: When performing ensemble search, if ENABLE_GRAPH_RETRIEVAL is set, include a graph retriever: Neo4j when NEO4J_URI is configured and reachable, otherwise the PostgreSQL fallback.
Flow:
- Extract query terms from user question
- If Neo4j available: run Cypher traversal in Neo4j
- Else if graph enabled: use existing PostgreSQL `GraphRetriever`
- Else: skip graph retrieval
- Fuse graph results with BM25 + Vector via RRF (Reciprocal Rank Fusion)
- Optional reranking via sidecar
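The fusion step in this flow can be illustrated with a weighted variant of Reciprocal Rank Fusion. This is a sketch, not the ensemble's actual code: plain RRF has no weights, and the per-retriever weights here are an assumption added to match the configurable weights this proposal mentions. `reciprocalRankFusion` is a hypothetical name, results are represented by section IDs only, and `k = 60` is a commonly used default.

```typescript
// Weighted Reciprocal Rank Fusion over ranked lists of section IDs.
// score(id) = sum over lists i of weights[i] / (k + rank_i(id)),
// where rank is 1-based and lists that do not contain the id add nothing.
function reciprocalRankFusion(
  rankedLists: string[][],
  weights: number[],
  k = 60,
): string[] {
  if (rankedLists.length !== weights.length) {
    throw new Error("expected one weight per ranked list");
  }
  const scores = new Map<string, number>();
  rankedLists.forEach((list, i) => {
    const weight = weights[i] ?? 0;
    list.forEach((id, index) => {
      const contribution = weight / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  });
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF works on ranks rather than raw scores, the graph retriever's results can be fused with BM25 and vector results without normalizing their score scales.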
Example Cypher for retrieval:

    MATCH (e:Entity)
    WHERE e.companyId = $companyId AND toLower(e.name) CONTAINS toLower($term)
    MATCH (e)-[:MENTIONED_IN]->(s:Section)
    WHERE s.documentId IN $documentIds
    // DISTINCT: a section mentioned by several matching entities is returned once
    WITH DISTINCT s LIMIT $topK
    RETURN s.id AS sectionId

Implementation:
- Add `src/lib/tools/rag/retrievers/neo4j-graph-retriever.ts`: `Neo4jGraphRetriever` extending LangChain `BaseRetriever`
- Same interface as `GraphRetriever`: `_getRelevantDocuments(query)` returns `Document[]`
- Section content fetched from PostgreSQL `documentSections` (Neo4j stores section IDs only)
- Wire into `createDocumentEnsembleRetriever`, `createCompanyEnsembleRetriever`, `createMultiDocEnsembleRetriever` with configurable weight (e.g. `[0.3, 0.5, 0.2]` for BM25, Vector, Graph)
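Since Neo4j would hold section IDs only, the retriever has to join those IDs back to section content from PostgreSQL while preserving the graph's ranking. A minimal sketch of that join follows. `SectionRow` and its column names are assumptions about `documentSections`, and `RetrievedDoc` is a dependency-free stand-in for LangChain's `Document` (which has `pageContent` and `metadata`).

```typescript
// Hypothetical row shape for the PostgreSQL documentSections table;
// the real column names may differ.
interface SectionRow {
  id: string;
  documentId: string;
  content: string;
}

// Stand-in for LangChain's Document so the sketch has no dependencies.
interface RetrievedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Rows come back from SQL in arbitrary order; rebuild the graph ranking,
// skipping IDs with no matching row (stale graph data) and duplicates.
function hydrateSections(
  rankedSectionIds: string[],
  rows: SectionRow[],
): RetrievedDoc[] {
  const byId = new Map<string, SectionRow>(
    rows.map((row): [string, SectionRow] => [row.id, row]),
  );
  const seen = new Set<string>();
  const docs: RetrievedDoc[] = [];
  for (const id of rankedSectionIds) {
    const row = byId.get(id);
    if (!row || seen.has(id)) continue;
    seen.add(id);
    docs.push({
      pageContent: row.content,
      metadata: { sectionId: row.id, documentId: row.documentId, source: "graph" },
    });
  }
  return docs;
}
```

Skipping unknown IDs rather than throwing tolerates the eventual consistency between the two stores.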
3. Configuration
| Env var | Required | Description |
|---|---|---|
| `NEO4J_URI` | No | Neo4j connection URI (e.g. `neo4j://localhost:7687`). If unset, Neo4j features are disabled. |
| `NEO4J_USER` | No | Neo4j username (default: `neo4j`) |
| `NEO4J_PASSWORD` | No | Neo4j password |
| `ENABLE_GRAPH_RETRIEVAL` | No | If `true`, include graph retriever in ensemble. Default: `false` until validated. |
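One way these variables might be read and turned into a backend choice is sketched below. `loadNeo4jConfig`, `selectGraphBackend`, and the `neo4jHealthy` flag are hypothetical names. The empty-string default for the password and the strict `"true"` comparison are assumptions, not documented behavior.

```typescript
type GraphBackend = "neo4j" | "postgres" | "none";

interface Neo4jConfig {
  uri: string;
  user: string;
  password: string;
}

// Same shape as process.env, passed in so the logic is easy to test.
type Env = Record<string, string | undefined>;

// Returns null when NEO4J_URI is unset or blank, i.e. Neo4j features are off.
function loadNeo4jConfig(env: Env): Neo4jConfig | null {
  const uri = (env["NEO4J_URI"] ?? "").trim();
  if (uri === "") return null;
  return {
    uri,
    user: env["NEO4J_USER"] ?? "neo4j",
    // Assumption: an unset password becomes an empty string.
    password: env["NEO4J_PASSWORD"] ?? "",
  };
}

// Mirrors the retrieval flow: Neo4j when configured and healthy, the
// PostgreSQL GraphRetriever otherwise, nothing when graph retrieval is off.
function selectGraphBackend(env: Env, neo4jHealthy: boolean): GraphBackend {
  if (env["ENABLE_GRAPH_RETRIEVAL"] !== "true") return "none";
  return loadNeo4jConfig(env) !== null && neo4jHealthy ? "neo4j" : "postgres";
}
```

In the application this would presumably be called with `process.env` and the result of the client's health check.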
Implementation Tasks
- Add `neo4j-driver` dependency
- Create `src/lib/graph/neo4j-client.ts` (driver, health check, connection handling)
- Create `src/lib/graph/neo4j-sync.ts` (entity/relationship sync from PostgreSQL to Neo4j)
- Add `maybeSyncToNeo4j()` step in document ingestion pipeline (after `maybeExtractEntities`)
- Create `src/lib/tools/rag/retrievers/neo4j-graph-retriever.ts`
- Extend ensemble search to optionally include graph retriever (Neo4j or PostgreSQL)
- Add integration tests for Neo4j sync and retrieval (with testcontainers or mock)
- Document Neo4j setup in README or docs
- Add metrics/logging for Neo4j sync and retrieval latency
Acceptance Criteria
- When `NEO4J_URI` is set and a document is processed with entity extraction, entities and relationships are synced to Neo4j
- When Neo4j is unavailable, document processing completes successfully (no failure)
- When `NEO4J_URI` and `ENABLE_GRAPH_RETRIEVAL` are set, RAG queries include graph-based results in the ensemble
- When Neo4j is unavailable at query time, retrieval falls back to PostgreSQL graph retriever or BM25+Vector only
- No regression in existing RAG behavior when Neo4j is not configured
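The query-time fallback criterion could be met with a small chain like the one below. This is a sketch only: `Retrieve` and `retrieveWithFallback` are hypothetical, and retrievers are reduced to functions returning section IDs. An empty result means no graph contribution, in which case the BM25 + Vector ensemble would run unchanged.

```typescript
// A graph retriever reduced to its essentials: query in, section IDs out.
type Retrieve = (query: string) => Promise<string[]>;

// Try each retriever in order (e.g. Neo4j first, then PostgreSQL) and
// return the first result that does not throw.
async function retrieveWithFallback(
  query: string,
  retrievers: Retrieve[],
): Promise<string[]> {
  for (const retrieve of retrievers) {
    try {
      return await retrieve(query);
    } catch {
      // This backend is unavailable; fall through to the next one.
    }
  }
  // Every graph backend failed: contribute nothing to the ensemble.
  return [];
}
```

A real implementation would likely also log each failure and apply a timeout so a slow Neo4j instance cannot stall the query.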
Alternatives Considered
| Approach | Pros | Cons |
|---|---|---|
| Stay with PostgreSQL only | No new infra, schema exists | Traversal may slow on large graphs; no Cypher |
| Apache AGE | Cypher in PostgreSQL, single DB | Less mature, operational unknowns |
| Neo4j as primary graph | Best graph performance | Migration from PostgreSQL KG, more infra |
| Neo4j as optional sync (chosen) | Gradual adoption, fallback to PG | Dual-write complexity, eventual consistency |
Risks and Trade-offs
Operational Complexity
- Risk: Neo4j is another service to deploy, monitor, backup.
- Mitigation: Make Neo4j strictly optional. System works without it.
Data Consistency
- Risk: Dual-write (PostgreSQL + Neo4j) can diverge if Neo4j write fails.
- Mitigation: PostgreSQL is source of truth. Consider async sync via queue for resilience.
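The queue-based mitigation might look roughly like this in-memory sketch. `SyncJob`, `drainSyncQueue`, and the limit of three attempts are hypothetical. A production setup would presumably use a durable queue with backoff between passes, both omitted here.

```typescript
// Hypothetical queued unit of work: one document to sync to Neo4j.
interface SyncJob {
  documentId: string;
  attempts: number;
}

// Make one pass over the queue. Jobs that fail are returned in `retry`
// for a later pass until they reach maxAttempts, after which they land
// in `dropped`. PostgreSQL is unaffected either way.
async function drainSyncQueue(
  queue: SyncJob[],
  sync: (documentId: string) => Promise<void>,
  maxAttempts = 3,
): Promise<{ retry: SyncJob[]; dropped: SyncJob[] }> {
  const retry: SyncJob[] = [];
  const dropped: SyncJob[] = [];
  for (const job of queue) {
    try {
      await sync(job.documentId);
    } catch {
      const next: SyncJob = { documentId: job.documentId, attempts: job.attempts + 1 };
      if (next.attempts >= maxAttempts) {
        dropped.push(next);
      } else {
        retry.push(next);
      }
    }
  }
  return { retry, dropped };
}
```

Because the sync uses `MERGE`, replaying a job that partially succeeded should be safe.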
Graph Retriever Not Yet in Ensemble
- Note: The existing `GraphRetriever` is not wired into the ensemble. Phase 1 could be: wire PostgreSQL `GraphRetriever` first, measure impact, then add Neo4j.
Entity Extraction Quality
- Note: Graph retrieval is only as good as extracted entities. Current extraction uses NER + CO_OCCURS. Improving relation extraction may yield more benefit than Neo4j alone.
Phased Rollout (Recommended)
- Phase 1: Wire existing PostgreSQL `GraphRetriever` into ensemble. Measure recall/latency. Validate graph retrieval adds value.
- Phase 2: Add Neo4j sync in document processing. Keep PostgreSQL as source of truth.
- Phase 3: Implement `Neo4jGraphRetriever`; use when `NEO4J_URI` is set.
- Phase 4: Consider improving entity/relationship extraction before scaling.
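For the Phase 1 measurement, recall could be computed per query with something like the helper below. This assumes a labeled set of relevant section IDs per query exists, which this proposal does not specify. `recallAtK` is a hypothetical name.

```typescript
// recall@k = |relevant ∩ top-k retrieved| / |relevant|
// Returns 0 when there are no relevant items, to avoid dividing by zero.
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  const relevantSet = new Set(relevant);
  if (relevantSet.size === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  let hits = 0;
  relevantSet.forEach((id) => {
    if (topK.has(id)) hits += 1;
  });
  return hits / relevantSet.size;
}
```

Comparing this value for the ensemble with and without the graph retriever, on the same queries, gives a direct read on whether graph retrieval adds value.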
References
- Neo4j JavaScript Driver
- LangChain Graph Retriever
- Existing schema: `src/server/db/schema/knowledge-graph.ts`
- Existing graph retriever: `src/lib/tools/rag/retrievers/graph-retriever.ts`