Conversation


@ydzhu98 commented on Jan 7, 2026

KGRag - Knowledge Graph-Enhanced RAG with Mellea

This example demonstrates a Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) system built with the Mellea framework, adapted from the Bidirection project for temporal reasoning over movie domain knowledge. It has been rewritten to follow Mellea's design patterns and modern Python best practices.

Overview

KGRag combines the power of Knowledge Graphs with Large Language Models to answer complex questions that require multi-hop reasoning over structured knowledge. The system uses a Neo4j graph database to store and query entities, relationships, and temporal information, enabling more accurate and explainable answers compared to traditional RAG approaches.

Documentation:

What Problem Does It Solve?

Traditional LLMs and RAG systems struggle with:

  • Multi-hop reasoning: Questions requiring multiple inference steps
  • Temporal reasoning: Questions involving time-sensitive information
  • Structured relationships: Understanding complex entity relationships
  • Knowledge provenance: Providing explainable reasoning paths

KGRag addresses these challenges by:

  1. Knowledge Graph Construction: Building a structured graph from unstructured documents
  2. Bidirectional Search: Traversing relationships in both forward and backward directions
  3. Temporal-Aware Reasoning: Incorporating query time and temporal constraints
  4. Multi-Route Exploration: Breaking down complex questions into multiple solving routes

Architecture

The system consists of several key components:

┌─────────────────────────────────────────────────────────────┐
│                        User Query                            │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              KGModel (kg_model.py)                          │
│  • Question breakdown into solving routes                   │
│  • Topic entity extraction                                  │
│  • Entity alignment with KG                                 │
│  • Multi-hop graph traversal                                │
│  • Answer synthesis and validation                          │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              Neo4j Knowledge Graph                          │
│  • Entities (Movies, Awards, Persons, etc.)                 │
│  • Relations (WON, NOMINATED_FOR, PRODUCED, etc.)           │
│  • Properties (temporal info, descriptions)                 │
│  • Vector embeddings for similarity search                  │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                 Answer + Reasoning Path                      │
└─────────────────────────────────────────────────────────────┘

Key Components

Core Modules:

  • kg_model.py — the KGModel class: question breakdown, topic entity extraction, entity alignment, multi-hop graph traversal, and answer synthesis (see Architecture above)

Configuration Models (Pydantic):

Run Scripts:

  • run/run_kg_preprocess.py — preprocess documents and extract entities/relations
  • run/run_kg_embed.py — generate embeddings for extracted entities
  • run/run_kg_update.py — write the extracted information into the Neo4j knowledge graph
  • run/run_qa.py — run question-answering inference and evaluation

Utilities:

Data Preparation Scripts:

  • run/create_demo_dataset.py — build a smaller, focused demo dataset
  • run/create_truncated_dataset.py — truncate long documents for faster KG updates

Prerequisites

System Requirements

  • Python 3.9+
  • Neo4j 5.x or later
  • 8GB+ RAM (16GB+ recommended)
  • GPU recommended for faster embedding generation

Required Software

  1. Neo4j Database

    # Install Neo4j Desktop or use Docker
    docker run \
        --name neo4j \
        -p7474:7474 -p7687:7687 \
        -e NEO4J_AUTH=neo4j/your_password \
        -e NEO4J_PLUGINS='["apoc"]' \
        neo4j:latest
  2. Python Dependencies

    # Install Mellea and dependencies
    uv sync --all-extras --all-groups
    
    # Or install specific dependencies
    pip install neo4j python-dotenv beautifulsoup4 trafilatura
    pip install sentence-transformers  # For local embeddings

Neo4j Configuration

After starting Neo4j, create the vector index for entity embeddings and the name index used for fuzzy search:

// Create vector index for entity embeddings
CREATE VECTOR INDEX entity_embedding IF NOT EXISTS
FOR (n:Entity)
ON n.embedding
OPTIONS {indexConfig: {
  `vector.dimensions`: 512,
  `vector.similarity_function`: 'cosine'
}};

// Create index for entity names (for fuzzy search)
CREATE INDEX entity_name IF NOT EXISTS FOR (n:Entity) ON (n.name);

Setup

1. Environment Configuration

Create a .env file in the kgrag directory based on the provided .env_template.
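
As a quick sanity check that the .env file is picked up, you can print the loaded variables with python-dotenv (installed above). The variable names other than KG_BASE_DIRECTORY are illustrative; use the keys defined in .env_template:

# check_env.py — verify the .env file is loaded before running the pipeline.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "KG_BASE_DIRECTORY"):
    value = os.getenv(key, "<missing>")
    # Avoid echoing secrets in full.
    print(key, "=", "***" if "PASSWORD" in key and value != "<missing>" else value)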

2. Dataset Preparation

This example uses the CRAG (Comprehensive RAG) Benchmark for evaluation. The knowledge graph is built from movie domain data including structured databases and question-answer pairs.

Download CRAG Benchmark and Mock API

# Navigate to the kgrag directory
cd docs/examples/kgrag

# Clone the CRAG Benchmark repository
# Note: You may need to install Git LFS to properly download all datasets
git lfs install
git clone https://github.com/facebookresearch/CRAG.git

# Copy the mock_api folder to the dataset directory
# The mock_api contains the knowledge graph databases (movie_db.json, person_db.json, year_db.json)
# These files are essential for building the knowledge graph
cp -r CRAG/mock_api/movie dataset/movie

# Download the CRAG movie dataset (questions and answers)
cd dataset
# The dataset file should be named crag_movie_dev.jsonl or crag_movie_dev.jsonl.bz2
# If compressed, extract it:
bunzip2 crag_movie_dev.jsonl.bz2  # if .bz2 format

Dataset Structure

After setup, your dataset directory should contain:

dataset/
├── crag_movie_dev.jsonl          # Questions and answers
└── movie/                         # Mock API databases
    ├── movie_db.json             # Movie entity database
    ├── person_db.json            # Person entity database
    └── year_db.json              # Year/temporal database

JSONL Dataset Format: Each line in crag_movie_dev.jsonl contains:

  • domain: "movie"
  • query: The question to answer
  • query_time: Timestamp of the query
  • search_results: List of web pages with content
  • answer: Ground truth answer
  • interaction_id: Unique identifier

Mock API Format: The *_db.json files contain structured knowledge graph data:

  • movie_db.json: Movie entities with properties (title, release date, cast, awards, etc.)
  • person_db.json: Person entities (actors, directors, producers, etc.)
  • year_db.json: Temporal information and year-specific events
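
To confirm the files are in place and readable before building the graph, here is a small inspection sketch. It assumes the uncompressed crag_movie_dev.jsonl; the internal layout of the *_db.json files is only summarized above, so the sketch just counts top-level entries:

# inspect_dataset.py — quick check that the dataset files are readable.
import json
from pathlib import Path

DATASET_DIR = Path("dataset")

# Questions and answers: one JSON object per line with the fields listed above.
with open(DATASET_DIR / "crag_movie_dev.jsonl") as f:
    first = json.loads(next(f))
print("query:", first["query"])
print("query_time:", first["query_time"])
print("answer:", first["answer"])

# Mock API databases: exact internal layout not shown here, so just count entries.
for name in ("movie_db.json", "person_db.json", "year_db.json"):
    with open(DATASET_DIR / "movie" / name) as f:
        db = json.load(f)
    print(f"{name}: {len(db)} top-level entries")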

Creating a Demo Dataset (Optional but Recommended)

The full database is quite large (225MB+). For faster demos and testing, create a smaller focused dataset:

# Create a demo dataset with ~100 recent movies (2020-2024)
cd docs/examples/kgrag
uv run python run/create_demo_dataset.py \
    --year-start 2020 \
    --year-end 2024 \
    --max-movies 100 \
    --topics "oscar,academy award" \
    --include-related

# Switch to the demo dataset
mv dataset/movie dataset/movie_full
mv dataset/movie_demo dataset/movie

Document Truncation for Faster Processing

For even faster KG updates during development, truncate long documents to reduce processing time:

# Truncate documents to 50k characters (88.9% size reduction)
python3 run/create_truncated_dataset.py \
  --input dataset/crag_movie_tiny.jsonl.bz2 \
  --output dataset/crag_movie_tiny_truncated.jsonl.bz2 \
  --max-chars 50000

Recommended settings:

Dataset           max-chars    Use Case
Tiny (10 docs)    30k-50k      Quick testing, debugging
Dev (565 docs)    50k-100k     Development, experimentation
Full dataset      100k-200k    Production (or no truncation)

3. Knowledge Graph Construction

Build the knowledge graph from the dataset:

# Set up environment
cd docs/examples/kgrag
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export KG_BASE_DIRECTORY="$(pwd)/dataset"

# Step 1: Preprocess documents and extract entities/relations
uv run --with mellea run/run_kg_preprocess.py

# Step 2: Generate embeddings for entities
uv run --with mellea run/run_kg_embed.py

# Step 3: Update the knowledge graph with extracted information
uv run --with mellea run/run_kg_update.py --num-workers 4 --queue-size 10

Note: The preprocessing and graph construction can take several hours depending on dataset size and hardware.

Usage

Running Question Answering

After building the knowledge graph, run QA inference:

# Run with default settings
uv run --with mellea run/run_qa.py --num-workers 4 --queue-size 10

# Run with custom configuration
uv run --with mellea run/run_qa.py \
    --num-workers 8 \
    --queue-size 16 \
    --config route=3 width=20 depth=2 \
    --prefix my_experiment \
    --postfix v1 \
    --verbose

# Run with specific dataset
uv run --with mellea run/run_qa.py \
    --dataset dataset/custom_questions.jsonl \
    --domain movie \
    --eval-batch-size 64 \
    --eval-method llama

Parameters:

  • --dataset: Path to dataset file (default: uses KG_BASE_DIRECTORY)
  • --domain: Knowledge domain (default: movie)
  • --num-workers: Number of parallel workers for inference (default: 128)
  • --queue-size: Size of the data loading queue (default: 128)
  • --split: Dataset split index (default: 0)
  • --config: Override model configuration (e.g., route=5 width=30 depth=3)
    • route: Number of solving routes to explore (default: 5)
    • width: Maximum number of relations to consider at each step (default: 30)
    • depth: Maximum graph traversal depth (default: 3)
  • --prefix: Prefix for output file names
  • --postfix: Postfix for output file names
  • --keep: Keep progress file after completion
  • --eval-batch-size: Batch size for evaluation (default: 64)
  • --eval-method: Evaluation method (default: llama)
  • --verbose or -v: Enable verbose logging

Using the Convenience Script

# Edit run.sh to uncomment the desired step
bash run.sh

Interactive Demo (Optional)

For a quick demonstration of the KGRag pipeline with example queries:

# Run the interactive demo
uv run --with mellea python demo/demo.py

Note: The demo is a standalone demonstration tool separate from the main QA evaluation pipeline. It's useful for:

  • Understanding how KGRag works with example queries
  • Testing the system with custom questions interactively
  • Debugging and exploring the reasoning process

For production use and benchmark evaluation, use run/run_qa.py instead.

How It Works

The KGRag system follows a multi-step reasoning pipeline:

1. Question Breakdown

The system breaks down complex questions into multiple solving routes:

Question: "Which animated film won the best animated feature Oscar in 2024?"

Routes:
1. ["Identify 2024 Oscars best animated feature award", "Find the winner"]
2. ["List 2024 Oscar nominees", "Filter animated features", "Identify winner"]
3. ["Search for 2024 Oscar results", "Extract best animated feature winner"]
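
The breakdown itself is produced by prompting the LLM; the actual prompts live in kg_model.py. Purely as an illustration, and assuming Mellea's start_session()/instruct() interface, a standalone sketch of the idea could look like this:

# breakdown_sketch.py — NOT the KGModel implementation; a toy illustration of
# route generation, assuming Mellea exposes start_session() and instruct().
import mellea

m = mellea.start_session()

question = "Which animated film won the best animated feature Oscar in 2024?"
response = m.instruct(
    "Break the following question into 3 alternative solving routes. "
    "Each route is a short ordered list of reasoning steps; output one route per line.\n"
    f"Question: {question}"
)
print(response)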

2. Topic Entity Extraction

Extract relevant entities from the question considering entity types:

Extracted: ["2024 Oscars best animated feature award"]
Entity Type: Award

3. Entity Alignment

Align extracted entities with knowledge graph entities using:

  • Fuzzy string matching for exact name matches
  • Vector similarity search for semantic matching
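
The vector-similarity side of alignment can be exercised directly against the entity_embedding index created earlier. The sketch below is not the project's alignment code: the embedding model is a placeholder (any model works as long as it outputs 512-dimensional vectors to match the index), and the query uses Neo4j 5.x's db.index.vector.queryNodes procedure:

# align_sketch.py — vector-similarity lookup against the entity_embedding index.
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")  # 512-dim embeddings
query_vec = model.encode("2024 Oscars best animated feature award").tolist()

CYPHER = """
CALL db.index.vector.queryNodes('entity_embedding', $k, $vec)
YIELD node, score
RETURN node.name AS name, score
"""

with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password")) as driver:
    records, _, _ = driver.execute_query(CYPHER, k=5, vec=query_vec)
    for record in records:
        print(f"{record['score']:.3f}  {record['name']}")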

4. Multi-Hop Graph Traversal

For each aligned entity, traverse the graph to find relevant information:

Depth 0: Start entity → (Award: 2024 OSCARS BEST ANIMATED FEATURE)
Depth 1: Find relations → [WON, NOMINATED_FOR]
Depth 2: Follow WON relation → (Movie: THE BOY AND THE HERON)

At each depth:

  • Relation Pruning: Select relevant relation types using LLM
  • Triplet Pruning: Score individual relation instances
  • Relevance Scoring: Rank entities and relations by relevance
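
Under the hood, each traversal step is a neighbourhood lookup in Neo4j followed by LLM-based pruning. The pruning and scoring are omitted here; the following generic Cypher expansion (not the project's traversal code) shows the raw one-hop lookup, capped at the default width of 30 relations:

# expand_sketch.py — a generic one-hop expansion; the LLM-based relation/triplet
# pruning and relevance scoring performed by KGModel are intentionally omitted.
from neo4j import GraphDatabase

CYPHER = """
MATCH (e:Entity {name: $name})-[r]->(neighbor)
RETURN type(r) AS relation, neighbor.name AS entity
LIMIT $width
"""

with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password")) as driver:
    records, _, _ = driver.execute_query(
        CYPHER, name="2024 OSCARS BEST ANIMATED FEATURE", width=30
    )
    for record in records:
        print(f"{record['relation']:<20} -> {record['entity']}")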

5. Answer Synthesis

Synthesize the final answer using:

  • Retrieved entities and relations
  • Multi-route validation for consensus
  • Temporal alignment verification

Output Format

Results are saved to results/*_results.json:

[
  {
    "accuracy": 0.85,
    "inf_prompt_tokens": 125000,
    "inf_completion_tokens": 15000,
    "eval_prompt_tokens": 50000,
    "eval_completion_tokens": 5000
  },
  {
    "id": 0,
    "query": "Which animated film won the best animated feature Oscar in 2024?",
    "query_time": "03/19/2024, 23:49:30 PT",
    "ans": "The Boy and the Heron",
    "prediction": "The Boy and the Heron",
    "processing_time": 12.34,
    "token_usage": {
      "prompt_tokens": 2500,
      "completion_tokens": 150
    },
    "score": 1.0,
    "explanation": "The prediction correctly identifies the winner..."
  }
]
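
Based on the schema shown above (a summary object first, followed by one entry per question), a small sketch for inspecting a results file:

# summarize_results.py — inspect a results file in the format shown above.
import json
from pathlib import Path

results_path = sorted(Path("results").glob("*_results.json"))[0]
with open(results_path) as f:
    results = json.load(f)

summary, per_question = results[0], results[1:]
print(f"{results_path.name}: accuracy {summary['accuracy']:.2%} over {len(per_question)} questions")

# Show a few misses for error analysis.
for item in [q for q in per_question if q["score"] < 1.0][:5]:
    print("MISS:", item["query"])
    print("  predicted:", item["prediction"], "| gold:", item["ans"])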


mergify bot commented Jan 7, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@ydzhu98 changed the title from "Add an example of kgrag" to "feat: Add an example of kgrag" on Jan 7, 2026
@ydzhu98 changed the title from "feat: Add an example of kgrag" to "docs: Add an example of kgrag - Knowledge Graph-Enhanced RAG" on Jan 7, 2026
@ydzhu98 changed the title from "docs: Add an example of kgrag - Knowledge Graph-Enhanced RAG" to "docs: Add an example of kgrag - knowledge graph-enhanced RAG" on Jan 7, 2026

@nrfulton (Contributor) commented Jan 8, 2026

Closing as wont-merge after connecting on Slack re: how to factor this code.

@nrfulton closed this on Jan 8, 2026