Refine Entity/Relationship Extraction & Implement Query-Side Extraction #241

@EricYiming

Summary

To maximize the value of the upcoming Neo4j graph integration (and the existing PostgreSQL graph), we need to improve the quality of the data entering the graph and how we query it. This sub-issue focuses on two key enhancements:

  • Refining Document Ingestion: Moving beyond simple CO_OCCURS relationships by identifying richer, semantic relationships between entities, and refining/deduplicating the entities themselves.
  • Query-Side Entity Extraction: Applying Named Entity Recognition (NER) to the user's query at runtime to identify precise seed nodes for graph traversal, improving retrieval accuracy.

Proposed Solution

1. Refine Entity & Relationship Extraction (Ingestion)

  • Semantic Relationships: Update the sidecar /extract-entities logic (or the LLM prompt powering it) to identify specific relationship types rather than just co-occurrence. Examples include BELONGS_TO, DEPENDS_ON, REPORTS_TO, SIMILAR_TO, etc.
  • Entity Refinement & Resolution: Introduce a deduplication or canonicalization step during ingestion. For example, resolving "AWS" and "Amazon Web Services" to the same underlying entity, or merging entities with high embedding similarity before syncing to the database/Neo4j.
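
The two ingestion refinements above can be sketched as a small post-processing pass over extracted relations. This is a minimal sketch under stated assumptions, not the project's implementation: the `RELATION_TYPES` list, the `ALIASES` table, and the function names are hypothetical, and a plain alias lookup stands in for the embedding-similarity merge mentioned above.

```typescript
// Hypothetical vocabulary: the relationship types named in this issue, plus the CO_OCCURS fallback.
const RELATION_TYPES = ["BELONGS_TO", "DEPENDS_ON", "REPORTS_TO", "SIMILAR_TO", "CO_OCCURS"] as const;
type RelationType = (typeof RELATION_TYPES)[number];

interface RawRelation { source: string; target: string; type: string; }
interface Relation { source: string; target: string; type: RelationType; }

// Hypothetical alias table (normalized key -> canonical name).
// In practice it could be curated by hand or derived from embedding similarity.
const ALIASES = new Map<string, string>([
  ["aws", "Amazon Web Services"],
  ["amazon web services", "Amazon Web Services"],
]);

// Lowercase, trim, and collapse whitespace so surface variants share one lookup key.
function normalizeKey(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, " ");
}

// Map a surface form to its canonical name; unknown names pass through trimmed.
function canonicalizeEntity(name: string): string {
  return ALIASES.get(normalizeKey(name)) ?? name.trim();
}

// Coerce free-form extractor output ("depends on") to a known type, else fall back to CO_OCCURS.
function normalizeRelationType(type: string): RelationType {
  const candidate = type.trim().toUpperCase().replace(/[\s-]+/g, "_");
  return (RELATION_TYPES as readonly string[]).includes(candidate)
    ? (candidate as RelationType)
    : "CO_OCCURS";
}

// Canonicalize both endpoints, drop self-loops created by merging, and deduplicate.
function refineRelations(raw: readonly RawRelation[]): Relation[] {
  const seen = new Set<string>();
  const refined: Relation[] = [];
  for (const r of raw) {
    const rel: Relation = {
      source: canonicalizeEntity(r.source),
      target: canonicalizeEntity(r.target),
      type: normalizeRelationType(r.type),
    };
    if (rel.source === rel.target) continue;
    const key = JSON.stringify([rel.source, rel.type, rel.target]);
    if (seen.has(key)) continue;
    seen.add(key);
    refined.push(rel);
  }
  return refined;
}
```

Falling back to CO_OCCURS for unrecognized types keeps the existing behavior as a floor, so a noisy extractor output would degrade to the current graph rather than introduce arbitrary edge labels.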

2. Query-Side Entity Extraction (Retrieval)

  • User Query NER: Before hitting the RAG ensemble, pass the user query through a lightweight extraction step (either a fast LLM call or a dedicated NER model in the sidecar) to identify key entities.
  • Targeted Graph Traversal: Pass these extracted entities to the GraphRetriever and the proposed Neo4jGraphRetriever. Use these exact entity names/labels as the starting nodes for multi-hop graph traversals.
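
As an illustration of the query-side step, the sketch below finds seed entities by matching the query against a list of known entity names. It deliberately swaps in a dictionary (gazetteer) matcher for the LLM call or NER model proposed above, so it only shows the shape of the input and output. The second parameter `knownEntities` is an assumption that is not part of the proposed `extractQueryEntities(query: string)` signature.

```typescript
// Escape regex metacharacters so entity names such as "C++" are matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Dictionary-based stand-in for NER: returns the known entity names that appear
// in the query as whole tokens (case-insensitive), ordered by first occurrence.
// `knownEntities` is a hypothetical parameter, e.g. canonical names loaded from kg_entities.
function extractQueryEntities(query: string, knownEntities: readonly string[]): string[] {
  const hits: { name: string; index: number }[] = [];
  for (const name of knownEntities) {
    if (name.trim().length === 0) continue;
    // Require a non-alphanumeric boundary on both sides so "AWS" does not match inside "laws".
    const pattern = new RegExp(
      `(^|[^A-Za-z0-9])${escapeRegExp(name)}(?![A-Za-z0-9])`,
      "i",
    );
    const match = pattern.exec(query);
    if (match) {
      // match.index points at the boundary character (if any), so skip past it.
      hits.push({ name, index: match.index + match[1].length });
    }
  }
  hits.sort((a, b) => a.index - b.index);
  return hits.map((hit) => hit.name);
}

// Example: the returned names would serve as starting nodes for graph traversal.
const seeds = extractQueryEntities(
  "Which services depend on AWS and who owns Kafka?",
  ["Kafka", "AWS", "Redis"],
);
console.log(seeds);
```

A matcher like this can only find entities that are already in the graph, which is also exactly the set that can act as seed nodes. An LLM or NER model would additionally catch paraphrases and unseen names, at the cost of extra latency per query.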

Implementation Tasks

  • Update the ingestion extraction prompt/logic in src/lib/ingestion/entity-extraction.ts (and the sidecar) to output semantic relationship types.
  • Add entity canonicalization logic to merge synonymous entities before writing to kg_entities.
  • Create an extractQueryEntities(query: string) utility function to process user queries.
  • Update the RAG retrieval pipeline (src/lib/tools/rag/retrievers/graph-retriever.ts and the future Neo4j retriever) to accept extracted entities as search parameters.
  • Update existing Cypher/SQL traversal queries to anchor their searches on the newly extracted query entities.
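
For the last task, one possible shape of a Cypher traversal anchored on the extracted entities is sketched below. The `Entity` label, the `name` property, and the `buildSeedTraversal` helper are assumptions for illustration only; the actual Neo4j schema is not specified in this issue.

```typescript
interface CypherQuery {
  text: string;
  params: { names: string[] };
}

// Build a multi-hop traversal that starts from the extracted query entities.
// Entity names are passed as a query parameter, never interpolated into the text.
// The hop bound is inlined because Cypher does not accept parameters inside
// variable-length patterns; the limit is inlined the same way. Both are
// validated as small positive integers first.
function buildSeedTraversal(
  seedNames: readonly string[],
  maxHops = 2,
  limit = 25,
): CypherQuery {
  if (!Number.isInteger(maxHops) || maxHops < 1 || maxHops > 5) {
    throw new RangeError("maxHops must be an integer between 1 and 5");
  }
  if (!Number.isInteger(limit) || limit < 1 || limit > 1000) {
    throw new RangeError("limit must be an integer between 1 and 1000");
  }
  // Trim, drop empty names, and deduplicate while preserving order.
  const names = Array.from(
    new Set(seedNames.map((n) => n.trim()).filter((n) => n.length > 0)),
  );
  const text = [
    "MATCH (seed:Entity) WHERE seed.name IN $names",
    `MATCH path = (seed)-[*1..${maxHops}]-(neighbor:Entity)`,
    "RETURN DISTINCT neighbor.name AS name, length(path) AS hops",
    "ORDER BY hops ASC",
    `LIMIT ${limit}`,
  ].join("\n");
  return { text, params: { names } };
}

// Example: seeds as produced by the query-side extraction step.
const query = buildSeedTraversal(["Amazon Web Services", "Kafka"], 2);
console.log(query.text);
```

The returned `text` and `params` pair could then be handed to whichever driver call the future Neo4j retriever uses. The PostgreSQL path would need an equivalent recursive query anchored on the same names.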

Acceptance Criteria

  • When a document is ingested, relationships other than CO_OCCURS are successfully identified and stored in PostgreSQL/Neo4j.
  • When a user submits a RAG question, entities are actively extracted from the query text.
  • Graph retrieval uses the extracted query entities as starting points for graph traversal, so that the retrieved document sections are more relevant to the query.
  • Entities representing the exact same concept are deduplicated/merged during the ingestion phase.

Labels

enhancement: New feature or request
