Conversation


@ydzhu98 commented on Jan 7, 2026

KGRag - Knowledge Graph-Enhanced RAG with Mellea

This example demonstrates a Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) system built with the Mellea framework, adapted from the Bidirection project for temporal reasoning over movie domain knowledge. It has been rewritten to follow Mellea's design patterns and modern Python best practices.

Overview

KGRag combines the power of Knowledge Graphs with Large Language Models to answer complex questions that require multi-hop reasoning over structured knowledge. The system uses a Neo4j graph database to store and query entities, relationships, and temporal information, enabling more accurate and explainable answers compared to traditional RAG approaches.

Documentation:

What Problem Does It Solve?

Traditional LLMs and RAG systems struggle with:

  • Multi-hop reasoning: Questions requiring multiple inference steps
  • Temporal reasoning: Questions involving time-sensitive information
  • Structured relationships: Understanding complex entity relationships
  • Knowledge provenance: Providing explainable reasoning paths

KGRag addresses these challenges by:

  1. Knowledge Graph Construction: Building a structured graph from unstructured documents
  2. Bidirectional Search: Traversing relationships in both forward and backward directions
  3. Temporal-Aware Reasoning: Incorporating query time and temporal constraints
  4. Multi-Route Exploration: Breaking down complex questions into multiple solving routes

Architecture

The system consists of several key components:

┌─────────────────────────────────────────────────────────────┐
│                        User Query                            │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              KGModel (kg_model.py)                          │
│  • Question breakdown into solving routes                   │
│  • Topic entity extraction                                  │
│  • Entity alignment with KG                                 │
│  • Multi-hop graph traversal                                │
│  • Answer synthesis and validation                          │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              Neo4j Knowledge Graph                          │
│  • Entities (Movies, Awards, Persons, etc.)                 │
│  • Relations (WON, NOMINATED_FOR, PRODUCED, etc.)           │
│  • Properties (temporal info, descriptions)                 │
│  • Vector embeddings for similarity search                  │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                 Answer + Reasoning Path                      │
└─────────────────────────────────────────────────────────────┘

Key Components

Core Modules:

  • kg_model.py — the KGModel class: question breakdown, topic entity extraction, entity alignment, multi-hop graph traversal, and answer synthesis (see Architecture above)

Configuration Models (Pydantic):

Run Scripts:

  • run/run_kg_preprocess.py — preprocess documents and extract entities/relations
  • run/run_kg_embed.py — generate embeddings for extracted entities
  • run/run_kg_update.py — write the extracted information into the Neo4j knowledge graph
  • run/run_qa.py — run question-answering inference and evaluation

Utilities:

Data Preparation Scripts:

  • run/create_demo_dataset.py — build a smaller, focused demo dataset
  • run/create_truncated_dataset.py — truncate long documents for faster KG updates

Prerequisites

System Requirements

  • Python 3.9+
  • Neo4j 5.x or later
  • 8GB+ RAM (16GB+ recommended)
  • GPU recommended for faster embedding generation

Required Software

  1. Neo4j Database

    # Install Neo4j Desktop or use Docker
    docker run \
        --name neo4j \
        -p7474:7474 -p7687:7687 \
        -e NEO4J_AUTH=neo4j/your_password \
        -e NEO4J_PLUGINS='["apoc"]' \
        neo4j:latest
  2. Python Dependencies

    # Install Mellea and dependencies
    uv sync --all-extras --all-groups
    
    # Or install specific dependencies
    pip install neo4j python-dotenv beautifulsoup4 trafilatura
    pip install sentence-transformers  # For local embeddings

Neo4j Configuration

After starting Neo4j, create the vector index for entity embeddings and the name index used for fuzzy search:

// Create vector index for entity embeddings
CREATE VECTOR INDEX entity_embedding IF NOT EXISTS
FOR (n:Entity)
ON n.embedding
OPTIONS {indexConfig: {
  `vector.dimensions`: 512,
  `vector.similarity_function`: 'cosine'
}};

// Create index for entity names (for fuzzy search)
CREATE INDEX entity_name IF NOT EXISTS FOR (n:Entity) ON (n.name);

Setup

1. Environment Configuration

Create a .env file in the kgrag directory based on the provided .env_template.
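
As a quick sanity check that the .env file is picked up, you can print the loaded variables with python-dotenv (installed above). The variable names other than KG_BASE_DIRECTORY are illustrative; use the keys defined in .env_template:

# check_env.py — verify the .env file is loaded before running the pipeline.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for key in ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "KG_BASE_DIRECTORY"):
    value = os.getenv(key, "<missing>")
    # Avoid echoing secrets in full.
    print(key, "=", "***" if "PASSWORD" in key and value != "<missing>" else value)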

2. Dataset Preparation

This example uses the CRAG (Comprehensive RAG) Benchmark for evaluation. The knowledge graph is built from movie domain data including structured databases and question-answer pairs.

Download CRAG Benchmark and Mock API

# Navigate to the kgrag directory
cd docs/examples/kgrag

# Clone the CRAG Benchmark repository
# Note: You may need to install Git LFS to properly download all datasets
git lfs install
git clone https://github.com/facebookresearch/CRAG.git

# Copy the mock_api folder to the dataset directory
# The mock_api contains the knowledge graph databases (movie_db.json, person_db.json, year_db.json)
# These files are essential for building the knowledge graph
cp -r CRAG/mock_api/movie dataset/movie

# Download the CRAG movie dataset (questions and answers)
cd dataset
# The dataset file should be named crag_movie_dev.jsonl or crag_movie_dev.jsonl.bz2
# If compressed, extract it:
bunzip2 crag_movie_dev.jsonl.bz2  # if .bz2 format

Dataset Structure

After setup, your dataset directory should contain:

dataset/
├── crag_movie_dev.jsonl          # Questions and answers
└── movie/                         # Mock API databases
    ├── movie_db.json             # Movie entity database
    ├── person_db.json            # Person entity database
    └── year_db.json              # Year/temporal database

JSONL Dataset Format: Each line in crag_movie_dev.jsonl contains:

  • domain: "movie"
  • query: The question to answer
  • query_time: Timestamp of the query
  • search_results: List of web pages with content
  • answer: Ground truth answer
  • interaction_id: Unique identifier

Mock API Format: The *_db.json files contain structured knowledge graph data:

  • movie_db.json: Movie entities with properties (title, release date, cast, awards, etc.)
  • person_db.json: Person entities (actors, directors, producers, etc.)
  • year_db.json: Temporal information and year-specific events
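
To confirm the files are in place and readable before building the graph, here is a small inspection sketch. It assumes the uncompressed crag_movie_dev.jsonl; the internal layout of the *_db.json files is only summarized above, so the sketch just counts top-level entries:

# inspect_dataset.py — quick check that the dataset files are readable.
import json
from pathlib import Path

DATASET_DIR = Path("dataset")

# Questions and answers: one JSON object per line with the fields listed above.
with open(DATASET_DIR / "crag_movie_dev.jsonl") as f:
    first = json.loads(next(f))
print("query:", first["query"])
print("query_time:", first["query_time"])
print("answer:", first["answer"])

# Mock API databases: exact internal layout not shown here, so just count entries.
for name in ("movie_db.json", "person_db.json", "year_db.json"):
    with open(DATASET_DIR / "movie" / name) as f:
        db = json.load(f)
    print(f"{name}: {len(db)} top-level entries")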

Creating a Demo Dataset (Optional but Recommended)

The full database is quite large (225MB+). For faster demos and testing, create a smaller focused dataset:

# Create a demo dataset with ~100 recent movies (2020-2024)
cd docs/examples/kgrag
uv run python run/create_demo_dataset.py \
    --year-start 2020 \
    --year-end 2024 \
    --max-movies 100 \
    --topics "oscar,academy award" \
    --include-related

# Switch to the demo dataset
mv dataset/movie dataset/movie_full
mv dataset/movie_demo dataset/movie

Document Truncation for Faster Processing

For even faster KG updates during development, truncate long documents to reduce processing time:

# Truncate documents to 50k characters (88.9% size reduction)
python3 run/create_truncated_dataset.py \
  --input dataset/crag_movie_tiny.jsonl.bz2 \
  --output dataset/crag_movie_tiny_truncated.jsonl.bz2 \
  --max-chars 50000

Recommended settings:

Dataset           max-chars    Use Case
Tiny (10 docs)    30k-50k      Quick testing, debugging
Dev (565 docs)    50k-100k     Development, experimentation
Full dataset      100k-200k    Production (or no truncation)

3. Knowledge Graph Construction

Build the knowledge graph from the dataset:

# Set up environment
cd docs/examples/kgrag
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export KG_BASE_DIRECTORY="$(pwd)/dataset"

# Step 1: Preprocess documents and extract entities/relations
uv run --with mellea run/run_kg_preprocess.py

# Step 2: Generate embeddings for entities
uv run --with mellea run/run_kg_embed.py

# Step 3: Update the knowledge graph with extracted information
uv run --with mellea run/run_kg_update.py --num-workers 4 --queue-size 10

Note: The preprocessing and graph construction can take several hours depending on dataset size and hardware.

Usage

Running Question Answering

After building the knowledge graph, run QA inference:

# Run with default settings
uv run --with mellea run/run_qa.py --num-workers 4 --queue-size 10

# Run with custom configuration
uv run --with mellea run/run_qa.py \
    --num-workers 8 \
    --queue-size 16 \
    --config route=3 width=20 depth=2 \
    --prefix my_experiment \
    --postfix v1 \
    --verbose

# Run with specific dataset
uv run --with mellea run/run_qa.py \
    --dataset dataset/custom_questions.jsonl \
    --domain movie \
    --eval-batch-size 64 \
    --eval-method llama

Parameters:

  • --dataset: Path to dataset file (default: uses KG_BASE_DIRECTORY)
  • --domain: Knowledge domain (default: movie)
  • --num-workers: Number of parallel workers for inference (default: 128)
  • --queue-size: Size of the data loading queue (default: 128)
  • --split: Dataset split index (default: 0)
  • --config: Override model configuration (e.g., route=5 width=30 depth=3)
    • route: Number of solving routes to explore (default: 5)
    • width: Maximum number of relations to consider at each step (default: 30)
    • depth: Maximum graph traversal depth (default: 3)
  • --prefix: Prefix for output file names
  • --postfix: Postfix for output file names
  • --keep: Keep progress file after completion
  • --eval-batch-size: Batch size for evaluation (default: 64)
  • --eval-method: Evaluation method (default: llama)
  • --verbose or -v: Enable verbose logging

Using the Convenience Script

# Edit run.sh to uncomment the desired step
bash run.sh

Interactive Demo (Optional)

For a quick demonstration of the KGRag pipeline with example queries:

# Run the interactive demo
uv run --with mellea python demo/demo.py

Note: The demo is a standalone demonstration tool separate from the main QA evaluation pipeline. It's useful for:

  • Understanding how KGRag works with example queries
  • Testing the system with custom questions interactively
  • Debugging and exploring the reasoning process

For production use and benchmark evaluation, use run/run_qa.py instead.

How It Works

The KGRag system follows a multi-step reasoning pipeline:

1. Question Breakdown

The system breaks down complex questions into multiple solving routes:

Question: "Which animated film won the best animated feature Oscar in 2024?"

Routes:
1. ["Identify 2024 Oscars best animated feature award", "Find the winner"]
2. ["List 2024 Oscar nominees", "Filter animated features", "Identify winner"]
3. ["Search for 2024 Oscar results", "Extract best animated feature winner"]
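
The breakdown itself is produced by prompting the LLM; the actual prompts live in kg_model.py. Purely as an illustration, and assuming Mellea's start_session()/instruct() interface, a standalone sketch of the idea could look like this:

# breakdown_sketch.py — NOT the KGModel implementation; a toy illustration of
# route generation, assuming Mellea exposes start_session() and instruct().
import mellea

m = mellea.start_session()

question = "Which animated film won the best animated feature Oscar in 2024?"
response = m.instruct(
    "Break the following question into 3 alternative solving routes. "
    "Each route is a short ordered list of reasoning steps; output one route per line.\n"
    f"Question: {question}"
)
print(response)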

2. Topic Entity Extraction

Extract relevant entities from the question considering entity types:

Extracted: ["2024 Oscars best animated feature award"]
Entity Type: Award

3. Entity Alignment

Align extracted entities with knowledge graph entities using:

  • Fuzzy string matching for exact name matches
  • Vector similarity search for semantic matching
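
The vector-similarity side of alignment can be exercised directly against the entity_embedding index created earlier. The sketch below is not the project's alignment code: the embedding model is a placeholder (any model works as long as it outputs 512-dimensional vectors to match the index), and the query uses Neo4j 5.x's db.index.vector.queryNodes procedure:

# align_sketch.py — vector-similarity lookup against the entity_embedding index.
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")  # 512-dim embeddings
query_vec = model.encode("2024 Oscars best animated feature award").tolist()

CYPHER = """
CALL db.index.vector.queryNodes('entity_embedding', $k, $vec)
YIELD node, score
RETURN node.name AS name, score
"""

with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password")) as driver:
    records, _, _ = driver.execute_query(CYPHER, k=5, vec=query_vec)
    for record in records:
        print(f"{record['score']:.3f}  {record['name']}")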

4. Multi-Hop Graph Traversal

For each aligned entity, traverse the graph to find relevant information:

Depth 0: Start entity → (Award: 2024 OSCARS BEST ANIMATED FEATURE)
Depth 1: Find relations → [WON, NOMINATED_FOR]
Depth 2: Follow WON relation → (Movie: THE BOY AND THE HERON)

At each depth:

  • Relation Pruning: Select relevant relation types using LLM
  • Triplet Pruning: Score individual relation instances
  • Relevance Scoring: Rank entities and relations by relevance
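
Under the hood, each traversal step is a neighbourhood lookup in Neo4j followed by LLM-based pruning. The pruning and scoring are omitted here; the following generic Cypher expansion (not the project's traversal code) shows the raw one-hop lookup, capped at the default width of 30 relations:

# expand_sketch.py — a generic one-hop expansion; the LLM-based relation/triplet
# pruning and relevance scoring performed by KGModel are intentionally omitted.
from neo4j import GraphDatabase

CYPHER = """
MATCH (e:Entity {name: $name})-[r]->(neighbor)
RETURN type(r) AS relation, neighbor.name AS entity
LIMIT $width
"""

with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password")) as driver:
    records, _, _ = driver.execute_query(
        CYPHER, name="2024 OSCARS BEST ANIMATED FEATURE", width=30
    )
    for record in records:
        print(f"{record['relation']:<20} -> {record['entity']}")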

5. Answer Synthesis

Synthesize the final answer using:

  • Retrieved entities and relations
  • Multi-route validation for consensus
  • Temporal alignment verification

Output Format

Results are saved to results/*_results.json:

[
  {
    "accuracy": 0.85,
    "inf_prompt_tokens": 125000,
    "inf_completion_tokens": 15000,
    "eval_prompt_tokens": 50000,
    "eval_completion_tokens": 5000
  },
  {
    "id": 0,
    "query": "Which animated film won the best animated feature Oscar in 2024?",
    "query_time": "03/19/2024, 23:49:30 PT",
    "ans": "The Boy and the Heron",
    "prediction": "The Boy and the Heron",
    "processing_time": 12.34,
    "token_usage": {
      "prompt_tokens": 2500,
      "completion_tokens": 150
    },
    "score": 1.0,
    "explanation": "The prediction correctly identifies the winner..."
  }
]
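
Based on the schema shown above (a summary object first, followed by one entry per question), a small sketch for inspecting a results file:

# summarize_results.py — inspect a results file in the format shown above.
import json
from pathlib import Path

results_path = sorted(Path("results").glob("*_results.json"))[0]
with open(results_path) as f:
    results = json.load(f)

summary, per_question = results[0], results[1:]
print(f"{results_path.name}: accuracy {summary['accuracy']:.2%} over {len(per_question)} questions")

# Show a few misses for error analysis.
for item in [q for q in per_question if q["score"] < 1.0][:5]:
    print("MISS:", item["query"])
    print("  predicted:", item["prediction"], "| gold:", item["ans"])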


mergify bot commented Jan 7, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@ydzhu98 changed the title from "Add an example of kgrag" to "feat: Add an example of kgrag" on Jan 7, 2026
@ydzhu98 changed the title from "feat: Add an example of kgrag" to "docs: Add an example of kgrag - Knowledge Graph-Enhanced RAG" on Jan 7, 2026
@ydzhu98 changed the title from "docs: Add an example of kgrag - Knowledge Graph-Enhanced RAG" to "docs: Add an example of kgrag - knowledge graph-enhanced RAG" on Jan 7, 2026

@nrfulton (Contributor) commented Jan 8, 2026

Closing as wont-merge after connecting on Slack re: how to factor this code.

@nrfulton closed this on Jan 8, 2026