Graph Extension Research

# [Feature] Neo4j Knowledge Graph Integration for Document Processing and RAG Retrieval

## Summary

Integrate Neo4j as an optional knowledge graph backend to:
1. **Document processing:** Update the knowledge graph when documents are processed (when Neo4j is available)
2. **RAG retrieval:** Use Neo4j for graph-based retrieval during document Q&A (when Neo4j is available)

The integration should be **optional** and **gracefully degradable**: the system continues to work without Neo4j, falling back to the existing PostgreSQL-based knowledge graph and the BM25+Vector ensemble.

---

## Background / Current State

### Existing Knowledge Graph (PostgreSQL)

The codebase already has a PostgreSQL-based knowledge graph:

| Component | Location | Purpose |
|-----------|----------|---------|
| Schema | `src/server/db/schema/knowledge-graph.ts` | `kg_entities`, `kg_entity_mentions`, `kg_relationships` |
| Entity extraction | `src/lib/ingestion/entity-extraction.ts` | Calls sidecar `/extract-entities`, stores in PostgreSQL |
| Graph retriever | `src/lib/tools/rag/retrievers/graph-retriever.ts` | Traverses 1–2 hops via Drizzle/SQL |
| Pipeline hook | `src/lib/tools/doc-ingestion/index.ts` | `maybeExtractEntities()` runs after `storeDocument()` when sidecar is available |

### Current RAG Retrieval

- **Ensemble:** BM25 + Vector only (weights `[0.4, 0.6]`)
- **Graph retriever:** Implemented but **not wired into the ensemble**
- **Reranking:** Optional sidecar cross-encoder when `SIDECAR_URL` is set

### Document Processing Flow

```
Upload → Ingest → Chunk → Embed → Store (pgvector) → [Optional] Extract Entities → [Proposed] Sync to Neo4j
```

---

## Motivation

1. **Graph-native traversal:** Neo4j excels at multi-hop graph traversal and path queries; PostgreSQL relies on recursive CTEs, which can be slower on large graphs.
2. **Cypher expressiveness:** Cypher allows concise, readable graph queries (e.g. variable-length paths, pattern matching).
3. **Scalability:** For companies with large document corpora (100K+ entities), Neo4j can provide better query performance.
4. **Future extensibility:** Enables graph algorithms (PageRank, community detection) and richer relationship types without schema changes.

---

## Proposed Solution

### 1. Document Processing: Sync to Neo4j When Available

**Trigger:** After entity extraction completes (PostgreSQL write), if `NEO4J_URI` is configured, sync entities and relationships to Neo4j.

**Data flow:**
- Read from `kg_entities`, `kg_entity_mentions`, `kg_relationships` (or use the in-memory result from extraction)
- Batch-write to Neo4j using `MERGE` for idempotency
- Link nodes to `documentId` and `companyId` via properties for scoping

**Cypher model (conceptual):**
```cypher
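// Statement 1: upsert an entity, its section, and the mention edge
MERGE (e:Entity {name: $name, label: $label, companyId: $companyId})
ON CREATE SET e.displayName = $displayName, e.confidence = $confidence
MERGE (s:Section {id: $sectionId, documentId: $documentId})
MERGE (e)-[:MENTIONED_IN {confidence: $conf}]->(s);

// Statement 2: bind both entities first, otherwise MERGE would create
// blank nodes for the unbound e1 and e2 ($e1Name, $e2Name, etc. are placeholders)
MATCH (e1:Entity {name: $e1Name, label: $e1Label, companyId: $companyId})
MATCH (e2:Entity {name: $e2Name, label: $e2Label, companyId: $companyId})
MERGE (e1)-[r:CO_OCCURS]->(e2)
ON CREATE SET r.weight = 0.5, r.evidenceCount = 1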
```

**Implementation:**
- Add `src/lib/graph/neo4j-client.ts` — Neo4j driver wrapper, connection pooling, health check
- Add `src/lib/graph/neo4j-sync.ts` — maps entities/relationships to Cypher `MERGE` statements
- In `maybeExtractEntities()` or new step `maybeSyncToNeo4j()`, call sync after PostgreSQL write
- Keep PostgreSQL as **source of truth**; Neo4j as optional read-optimized layer
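
As a rough illustration of the batching step, the sketch below groups extracted entities into fixed-size batches and pairs each batch with a single `UNWIND ... MERGE` statement. The `ExtractedEntity` shape, the `buildEntityBatches` helper, and the default batch size are assumptions made for this example, not types or functions from the existing codebase.

```typescript
// Hypothetical shape of one extracted entity; the real type lives in the
// extraction module and may differ.
interface ExtractedEntity {
  name: string;
  label: string;
  displayName: string;
  confidence: number;
}

interface CypherBatch {
  cypher: string;
  params: { companyId: string; rows: ExtractedEntity[] };
}

// One statement per batch: UNWIND turns the $rows list into individual rows,
// and MERGE keeps the write idempotent when a document is re-processed.
const UPSERT_ENTITIES = [
  "UNWIND $rows AS row",
  "MERGE (e:Entity {name: row.name, label: row.label, companyId: $companyId})",
  "ON CREATE SET e.displayName = row.displayName, e.confidence = row.confidence",
].join("\n");

function buildEntityBatches(
  entities: ExtractedEntity[],
  companyId: string,
  batchSize: number = 500 // assumed default; tune against real ingestion sizes
): CypherBatch[] {
  if (batchSize < 1) {
    throw new RangeError("batchSize must be >= 1");
  }
  const batches: CypherBatch[] = [];
  for (let i = 0; i < entities.length; i += batchSize) {
    batches.push({
      cypher: UPSERT_ENTITIES,
      params: { companyId, rows: entities.slice(i, i + batchSize) },
    });
  }
  return batches;
}
```

Each `{ cypher, params }` pair would then be handed to whatever query method the driver wrapper in `neo4j-client.ts` ends up exposing, ideally inside a write transaction per batch.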

### 2. RAG Retrieval: Neo4j-Aware Graph Retrieval

**Trigger:** When performing ensemble search, if `ENABLE_GRAPH_RETRIEVAL` is set, include a graph retriever: Neo4j when `NEO4J_URI` is configured and the instance is reachable, otherwise the PostgreSQL fallback.

**Flow:**
1. Extract query terms from user question
2. If Neo4j available: run Cypher traversal in Neo4j
3. Else if graph enabled: use existing PostgreSQL `GraphRetriever`
4. Else: skip graph retrieval
5. Fuse graph results with BM25 + Vector via RRF (Reciprocal Rank Fusion)
6. Optional reranking via sidecar
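
Steps 2 to 4 of the flow amount to a small decision function. The sketch below is one possible reading, under the assumption that `ENABLE_GRAPH_RETRIEVAL` gates all graph retrieval and that Neo4j availability is determined by a health check at query time; the `GraphRetrievalState` type and `selectGraphBackend` name are hypothetical.

```typescript
type GraphBackend = "neo4j" | "postgres" | "none";

// Hypothetical snapshot of configuration and health at query time.
interface GraphRetrievalState {
  graphEnabled: boolean; // e.g. ENABLE_GRAPH_RETRIEVAL is "true"
  neo4jConfigured: boolean; // e.g. NEO4J_URI is set
  neo4jHealthy: boolean; // result of a health check against Neo4j
}

function selectGraphBackend(state: GraphRetrievalState): GraphBackend {
  // Assumption: with graph retrieval disabled, neither backend is queried.
  if (!state.graphEnabled) {
    return "none";
  }
  // Prefer Neo4j only when it is both configured and currently reachable.
  if (state.neo4jConfigured && state.neo4jHealthy) {
    return "neo4j";
  }
  // Otherwise fall back to the existing PostgreSQL GraphRetriever.
  return "postgres";
}
```

Keeping the decision in a pure function makes the fallback behavior easy to unit-test without a running Neo4j instance.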

**Example Cypher for retrieval:**
```cypher
MATCH (e:Entity)
WHERE e.companyId = $companyId AND toLower(e.name) CONTAINS toLower($term)
MATCH (e)-[:MENTIONED_IN]->(s:Section)
WHERE s.documentId IN $documentIds
WITH DISTINCT s LIMIT $topK
RETURN s.id AS sectionId
```

**Implementation:**
- Add `src/lib/tools/rag/retrievers/neo4j-graph-retriever.ts` — `Neo4jGraphRetriever` extending LangChain `BaseRetriever`
- Same interface as `GraphRetriever`: `_getRelevantDocuments(query)` returns `Document[]`
- Section content fetched from PostgreSQL `documentSections` (Neo4j stores section IDs only)
- Wire into `createDocumentEnsembleRetriever`, `createCompanyEnsembleRetriever`, `createMultiDocEnsembleRetriever` with configurable weight (e.g. `[0.3, 0.5, 0.2]` for BM25, Vector, Graph)
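
To make the weighting concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion over ranked lists of section IDs. It is an illustration of the technique, not the ensemble retriever's actual implementation; the `weightedRrf` name and the smoothing constant `k = 60` (a commonly used default) are assumptions.

```typescript
// Each retriever contributes a ranked list of section IDs, best first.
// An item's fused score is the sum over lists of weight / (k + rank),
// where rank is its 1-based position in that list.
function weightedRrf(
  rankedLists: string[][],
  weights: number[],
  k: number = 60 // assumed smoothing constant
): string[] {
  if (rankedLists.length !== weights.length) {
    throw new Error("rankedLists and weights must have the same length");
  }
  const scores = new Map<string, number>();
  rankedLists.forEach((list, listIndex) => {
    list.forEach((id, zeroBasedRank) => {
      const contribution = weights[listIndex] / (k + zeroBasedRank + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  });
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Usage with the example weights for BM25, Vector, Graph:
const fused = weightedRrf(
  [
    ["a", "b", "c"], // BM25
    ["b", "a", "d"], // Vector
    ["d", "b"], // Graph
  ],
  [0.3, 0.5, 0.2]
);
```

Because sections missing from a list simply contribute nothing for that list, the same function works when the graph retriever is skipped: pass two lists and two weights.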

### 3. Configuration

| Env var | Required | Description |
|---------|----------|-------------|
| `NEO4J_URI` | No | Neo4j connection URI (e.g. `neo4j://localhost:7687`). If unset, Neo4j features are disabled. |
| `NEO4J_USER` | No | Neo4j username (default: `neo4j`) |
| `NEO4J_PASSWORD` | No | Neo4j password |
| `ENABLE_GRAPH_RETRIEVAL` | No | If `true`, include graph retriever in ensemble. Default: `false` until validated. |
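
A minimal sketch of how these variables could be parsed into a typed config object follows. The `GraphConfig` shape and `loadGraphConfig` helper are hypothetical, and treating a missing `NEO4J_PASSWORD` as an empty string is an assumption rather than a documented default.

```typescript
interface GraphConfig {
  // null means NEO4J_URI is unset and all Neo4j features are disabled.
  neo4j: { uri: string; user: string; password: string } | null;
  graphRetrievalEnabled: boolean;
}

function loadGraphConfig(env: Record<string, string | undefined>): GraphConfig {
  const uri = (env["NEO4J_URI"] ?? "").trim();
  return {
    neo4j:
      uri.length > 0
        ? {
            uri,
            user: env["NEO4J_USER"] ?? "neo4j", // default from the table above
            password: env["NEO4J_PASSWORD"] ?? "", // assumption: empty when unset
          }
        : null,
    // Only the literal string "true" (any casing) enables graph retrieval.
    graphRetrievalEnabled:
      (env["ENABLE_GRAPH_RETRIEVAL"] ?? "").toLowerCase() === "true",
  };
}
```

In application code this would typically be called once at startup with `process.env`.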

---

## Implementation Tasks

- [x] Add `neo4j-driver` dependency
- [x] Create `src/lib/graph/neo4j-client.ts` (driver, health check, connection handling)
- [x] Create `src/lib/graph/neo4j-sync.ts` (entity/relationship sync from PostgreSQL to Neo4j)
- [x] Add `maybeSyncToNeo4j()` step in document ingestion pipeline (after `maybeExtractEntities`)
- [x] Create `src/lib/tools/rag/retrievers/neo4j-graph-retriever.ts`
- [x] Extend ensemble search to optionally include graph retriever (Neo4j or PostgreSQL)
- [ ] Add integration tests for Neo4j sync and retrieval (with testcontainers or mock)
- [x] Document Neo4j setup in README or docs
- [ ] Add metrics/logging for Neo4j sync and retrieval latency

---

## Acceptance Criteria

- [ ] When `NEO4J_URI` is set and a document is processed with entity extraction, entities and relationships are synced to Neo4j
- [ ] When Neo4j is unavailable, document processing completes successfully (no failure)
- [ ] When `NEO4J_URI` and `ENABLE_GRAPH_RETRIEVAL` are set, RAG queries include graph-based results in the ensemble
- [ ] When Neo4j is unavailable at query time, retrieval falls back to PostgreSQL graph retriever or BM25+Vector only
- [ ] No regression in existing RAG behavior when Neo4j is not configured

---

## Alternatives Considered

| Approach | Pros | Cons |
|----------|------|------|
| **Stay with PostgreSQL only** | No new infra, schema exists | Traversal may slow on large graphs; no Cypher |
| **Apache AGE** | Cypher in PostgreSQL, single DB | Less mature, operational unknowns |
| **Neo4j as primary graph** | Best graph performance | Migration from PostgreSQL KG, more infra |
| **Neo4j as optional sync (chosen)** | Gradual adoption, fallback to PG | Dual-write complexity, eventual consistency |

---

## Risks and Trade-offs

### Operational Complexity
- **Risk:** Neo4j is another service to deploy, monitor, and back up.
- **Mitigation:** Make Neo4j strictly optional. System works without it.

### Data Consistency
- **Risk:** Dual-write (PostgreSQL + Neo4j) can diverge if Neo4j write fails.
- **Mitigation:** PostgreSQL is source of truth. Consider async sync via queue for resilience.
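
One way to picture the queue-based mitigation is the bookkeeping sketch below: documents are enqueued after the PostgreSQL write, and a worker marks each Neo4j sync as succeeded or failed. The `Neo4jSyncQueue` class, its method names, and the retry limit are all hypothetical, and the in-memory map stands in for what would more likely be a durable queue in production.

```typescript
// In-memory stand-in for a real job queue; pending syncs here would not
// survive a process restart.
class Neo4jSyncQueue {
  // documentId -> number of failed attempts so far
  private readonly attempts = new Map<string, number>();
  private readonly maxAttempts: number;

  constructor(maxAttempts: number = 3) {
    this.maxAttempts = maxAttempts;
  }

  // Called after the PostgreSQL write succeeds.
  enqueue(documentId: string): void {
    if (!this.attempts.has(documentId)) {
      this.attempts.set(documentId, 0);
    }
  }

  // Document IDs still waiting for a successful Neo4j sync.
  pending(): string[] {
    return Array.from(this.attempts.keys());
  }

  markSucceeded(documentId: string): void {
    this.attempts.delete(documentId);
  }

  // Returns true if the job stays queued for another attempt, false once it
  // has been dropped after maxAttempts failures.
  markFailed(documentId: string): boolean {
    const failures = (this.attempts.get(documentId) ?? 0) + 1;
    if (failures >= this.maxAttempts) {
      this.attempts.delete(documentId);
      return false;
    }
    this.attempts.set(documentId, failures);
    return true;
  }
}
```

Since PostgreSQL remains the source of truth, a dropped job loses nothing permanently: the affected document can be re-synced later from the `kg_*` tables.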

### Graph Retriever Not Yet in Ensemble
- **Note:** The existing `GraphRetriever` is not wired into the ensemble. Phase 1 could be: wire PostgreSQL `GraphRetriever` first, measure impact, then add Neo4j.

### Entity Extraction Quality
- **Note:** Graph retrieval is only as good as the extracted entities. Current extraction uses NER plus co-occurrence (`CO_OCCURS`) relationships. Improving relation extraction may yield more benefit than Neo4j alone.

---

## Phased Rollout (Recommended)

1. **Phase 1:** Wire existing PostgreSQL `GraphRetriever` into ensemble. Measure recall/latency. Validate graph retrieval adds value.
2. **Phase 2:** Add Neo4j sync in document processing. Keep PostgreSQL as source of truth.
3. **Phase 3:** Implement `Neo4jGraphRetriever`; use when `NEO4J_URI` is set.
4. **Phase 4:** Consider improving entity/relationship extraction before scaling.

---

## References

- [Neo4j JavaScript Driver](https://neo4j.com/docs/javascript-manual/current/)
- [LangChain Graph Retriever](https://js.langchain.com/docs/modules/data_connection/retrievers/)
- Existing schema: `src/server/db/schema/knowledge-graph.ts`
- Existing graph retriever: `src/lib/tools/rag/retrievers/graph-retriever.ts`

