docs: Add an example of kgrag - knowledge graph-enhanced RAG #290
Status: Closed (+105,149 −3)
Commits:
- feat: add an example of KGRag
- adapt the knowledge graph to the theme of mellea
- Rewrite the mellea style code
- …_cmp Update the mellea style code and replace the original run_* scripts
Contributor comment: Closing as wont-merge after connecting on Slack re: how to factor this code.
KGRag - Knowledge Graph-Enhanced RAG with Mellea
This example demonstrates a Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) system built with the Mellea framework, adapted from the Bidirection project for temporal reasoning over movie domain knowledge. It has been rewritten to follow Mellea's design patterns and modern Python best practices.
Overview
KGRag combines the power of Knowledge Graphs with Large Language Models to answer complex questions that require multi-hop reasoning over structured knowledge. The system uses a Neo4j graph database to store and query entities, relationships, and temporal information, enabling more accurate and explainable answers compared to traditional RAG approaches.
Documentation:
What Problem Does It Solve?
Traditional LLMs and RAG systems struggle with questions that require chaining several linked facts (multi-hop reasoning) and with questions whose answers depend on when events occurred (temporal reasoning).
KGRag addresses these challenges by grounding retrieval in a structured Neo4j knowledge graph, so each reasoning step follows explicit entities and relationships and the resulting answers are more accurate and explainable.
Architecture
The system consists of several key components:
Key Components
Core Modules:
Configuration Models (Pydantic):
Run Scripts:
Utilities:
Data Preparation Scripts:
Prerequisites
System Requirements
Required Software
Neo4j Database
Python Dependencies
Neo4j Configuration
After starting Neo4j, you need to create vector indices:
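As an illustration, the indices can be created with the official `neo4j` Python driver using Neo4j 5.x `CREATE VECTOR INDEX` syntax. The index name, node label, property, and embedding dimensions below are assumptions; adapt them to the labels and embeddings this example actually writes to Neo4j.

```python
# Sketch: build a Neo4j 5.x vector-index statement. Index name, label,
# property, and dimensions are assumptions for illustration.
def build_vector_index_cypher(name: str, label: str, prop: str,
                              dims: int, similarity: str = "cosine") -> str:
    """Return a CREATE VECTOR INDEX statement for the given label/property."""
    return (
        f"CREATE VECTOR INDEX {name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dims}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )

# To run it against a live database (requires `pip install neo4j`):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687",
#                                 auth=("neo4j", "password"))
#   with driver.session() as session:
#       session.run(build_vector_index_cypher(
#           "entity_embedding", "Entity", "embedding", 1536))
#   driver.close()
```

The statement is idempotent (`IF NOT EXISTS`), so it is safe to include in setup scripts that may run more than once.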
Setup
1. Environment Configuration
Create a `.env` file in the `kgrag` directory based on the `.env_template`.

2. Dataset Preparation
This example uses the CRAG (Comprehensive RAG) Benchmark for evaluation. The knowledge graph is built from movie domain data including structured databases and question-answer pairs.
Download CRAG Benchmark and Mock API
Dataset Structure
After setup, your dataset directory should contain:
JSONL Dataset Format: Each line in `crag_movie_dev.jsonl` contains:
- `domain`: "movie"
- `query`: The question to answer
- `query_time`: Timestamp of the query
- `search_results`: List of web pages with content
- `answer`: Ground truth answer
- `interaction_id`: Unique identifier

Mock API Format: The `*_db.json` files contain structured knowledge graph data:
- `movie_db.json`: Movie entities with properties (title, release date, cast, awards, etc.)
- `person_db.json`: Person entities (actors, directors, producers, etc.)
- `year_db.json`: Temporal information and year-specific events

Creating a Demo Dataset (Optional but Recommended)
The full database is quite large (225 MB+). For faster demos and testing, create a smaller, focused dataset.
Document Truncation for Faster Processing
For even faster KG updates during development, truncate long documents to reduce processing time:
```shell
# Truncate documents to 50k characters (88.9% size reduction)
python3 run/create_truncated_dataset.py \
  --input dataset/crag_movie_tiny.jsonl.bz2 \
  --output dataset/crag_movie_tiny_truncated.jsonl.bz2 \
  --max-chars 50000
```

Recommended settings:
3. Knowledge Graph Construction
Build the knowledge graph from the dataset.
Note: The preprocessing and graph construction can take several hours depending on dataset size and hardware.
Usage
Running Question Answering
After building the knowledge graph, run QA inference:
Parameters:
- `--dataset`: Path to dataset file (default: uses KG_BASE_DIRECTORY)
- `--domain`: Knowledge domain (default: movie)
- `--num-workers`: Number of parallel workers for inference (default: 128)
- `--queue-size`: Size of the data loading queue (default: 128)
- `--split`: Dataset split index (default: 0)
- `--config`: Override model configuration (e.g., `route=5 width=30 depth=3`)
  - `route`: Number of solving routes to explore (default: 5)
  - `width`: Maximum number of relations to consider at each step (default: 30)
  - `depth`: Maximum graph traversal depth (default: 3)
- `--prefix`: Prefix for output file names
- `--postfix`: Postfix for output file names
- `--keep`: Keep progress file after completion
- `--eval-batch-size`: Batch size for evaluation (default: 64)
- `--eval-method`: Evaluation method (default: llama)
- `--verbose` or `-v`: Enable verbose logging

Using the Convenience Script
```shell
# Edit run.sh to uncomment the desired step
bash run.sh
```

Interactive Demo (Optional)
For a quick demonstration of the KGRag pipeline with example queries:
```shell
# Run the interactive demo
uv run --with mellea python demo/demo.py
```

Note: The demo is a standalone demonstration tool separate from the main QA evaluation pipeline. It is useful for quickly exploring the pipeline on example queries without running the full benchmark.
For production use and benchmark evaluation, use `run/run_qa.py` instead.

How It Works
The KGRag system follows a multi-step reasoning pipeline:
1. Question Breakdown
The system breaks down complex questions into multiple solving routes; the `route` parameter controls how many routes are explored.
2. Topic Entity Extraction
Extract relevant entities from the question, taking entity types into account.
3. Entity Alignment
Align extracted entities with knowledge graph entities, matching question mentions against the entities stored in Neo4j (e.g., via the vector indices created during setup).
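One common alignment strategy, sketched here with toy embeddings (an illustration, not necessarily the exact method this example ships), is cosine similarity between a question mention's embedding and candidate KG entity embeddings:

```python
# Sketch: align a question mention to a KG entity by cosine similarity.
# The embeddings and threshold are toy values; the real pipeline would
# query the Neo4j vector index rather than scan an in-memory dict.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def align(mention_vec, kg_entities, threshold=0.8):
    """Return the best-matching KG entity name, or None if no candidate
    clears the similarity threshold."""
    best_name, best_score = None, threshold
    for name, vec in kg_entities.items():
        score = cosine(mention_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

kg = {
    "The Boy and the Heron": [0.9, 0.1, 0.3],
    "Spirited Away": [0.2, 0.8, 0.1],
}
print(align([0.88, 0.12, 0.28], kg))  # -> The Boy and the Heron
```

The threshold keeps weak matches out: mentions that resemble no stored entity return `None` and can fall back to other alignment signals.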
4. Multi-Hop Graph Traversal
For each aligned entity, traverse the graph to collect relevant information.
At each depth, the traversal expands a bounded set of candidate relations before moving one hop outward.
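The traversal is shaped by the `width` and `depth` parameters described under the QA script's options. A minimal sketch over an in-memory adjacency list (the real system issues graph queries against Neo4j, and ranks relations with the LLM rather than keeping listed order):

```python
# Sketch: width/depth-bounded traversal from an aligned topic entity.
# `width` caps how many relations are kept per expansion (cf. --config
# width=30); `depth` caps the hop count (cf. --config depth=3).
def traverse(graph, start, width=2, depth=2):
    """Collect (head, relation, tail) triples reachable within `depth`
    hops, keeping at most `width` relations per expansion."""
    triples, frontier, seen = [], [start], {start}
    for _ in range(depth):
        next_frontier = []
        for head in frontier:
            # Stand-in for LLM ranking: keep relations in listed order.
            for relation, tail in graph.get(head, [])[:width]:
                triples.append((head, relation, tail))
                if tail not in seen:
                    seen.add(tail)
                    next_frontier.append(tail)
        frontier = next_frontier
    return triples

movie_graph = {
    "The Boy and the Heron": [("directed_by", "Hayao Miyazaki"),
                              ("won", "Best Animated Feature 2024")],
    "Hayao Miyazaki": [("founded", "Studio Ghibli")],
}
for t in traverse(movie_graph, "The Boy and the Heron"):
    print(t)
```

Bounding both dimensions keeps the candidate set tractable: without the `width` cap, high-degree entities (a prolific actor, a busy award year) would flood each expansion.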
5. Answer Synthesis
Synthesize the final answer from the evidence gathered along each solving route.
Output Format
Results are saved to `results/*_results.json`:

```json
[
  {
    "accuracy": 0.85,
    "inf_prompt_tokens": 125000,
    "inf_completion_tokens": 15000,
    "eval_prompt_tokens": 50000,
    "eval_completion_tokens": 5000
  },
  {
    "id": 0,
    "query": "Which animated film won the best animated feature Oscar in 2024?",
    "query_time": "03/19/2024, 23:49:30 PT",
    "ans": "The Boy and the Heron",
    "prediction": "The Boy and the Heron",
    "processing_time": 12.34,
    "token_usage": {
      "prompt_tokens": 2500,
      "completion_tokens": 150
    },
    "score": 1.0,
    "explanation": "The prediction correctly identifies the winner..."
  }
]
```
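Given this format, a small helper can summarize a run. The field names follow the sample above, where the first element holds aggregate statistics and the remaining elements are per-question records:

```python
# Sketch: summarize a *_results.json file in the format shown above.
import json

def summarize(results):
    """Return aggregate accuracy, mean per-question score, and total
    inference token usage from a loaded results list."""
    header, records = results[0], results[1:]
    avg_score = sum(r["score"] for r in records) / len(records)
    total_tokens = (header["inf_prompt_tokens"]
                    + header["inf_completion_tokens"])
    return {"accuracy": header["accuracy"],
            "avg_score": avg_score,
            "inference_tokens": total_tokens}

# Toy two-question sample in the documented shape:
sample = json.loads("""[
  {"accuracy": 0.85, "inf_prompt_tokens": 125000,
   "inf_completion_tokens": 15000,
   "eval_prompt_tokens": 50000, "eval_completion_tokens": 5000},
  {"id": 0, "score": 1.0},
  {"id": 1, "score": 0.0}
]""")
print(summarize(sample))
```

For a real run, replace `sample` with `json.load(open("results/<name>_results.json"))`.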