Graph-Based-Data-Modeling-and-Query-System

Graph-based SAP Order-to-Cash data explorer with natural language querying powered by Groq LLM and Cytoscape.js visualization

Graph Query System

A context graph system with an LLM-powered natural language query interface for business operations data (sales orders, deliveries, billing, payments).

Live Demo

Add your deployed URL here

Architecture

┌─────────────────────────────────────────────────────────┐
│                     React Frontend                       │
│  Cytoscape.js graph viz │ Chat panel │ Node inspector   │
└────────────────────┬────────────────────────────────────┘
                     │ REST (JSON)
┌────────────────────▼────────────────────────────────────┐
│                  Node.js / Express API                   │
│  /api/graph  /api/chat  /api/node/:type/:id             │
└──────┬──────────────┬───────────────────────────────────┘
       │              │
┌──────▼──────┐  ┌────▼──────────────────────────────────┐
│  SQLite DB  │  │         Gemini 1.5 Flash               │
│  (graph.db) │  │  NL → SQL → execute → NL answer       │
└─────────────┘  └────────────────────────────────────────┘

Tech Stack

| Layer | Technology | Reason |
|---|---|---|
| Backend | Node.js + Express | Lightweight, same language as frontend |
| Database | SQLite (better-sqlite3) | Zero-setup, file-based, perfect for structured query generation |
| Graph (server) | In-memory adjacency built from FK joins | Simple, fast, no separate graph DB needed |
| Graph (UI) | Cytoscape.js | Battle-tested graph viz, good layout algorithms |
| LLM | Gemini 1.5 Flash (free tier) | Fast, generous free tier, excellent SQL generation |
| Frontend | React + Vite | Fast DX, component model works well for split-panel UI |

Why SQLite over a native graph database?

The LLM's strongest capability here is generating SQL, a language it has seen extensively in training. SQLite gives us:

Schema the LLM can reason about directly
Fast ad-hoc queries without a running server
Easy deployment (single file)
The graph structure is derived from FK relationships, not stored separately

A graph DB like Neo4j would add complexity without enough benefit at this dataset scale.
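
As a rough illustration of deriving the graph from FK relationships, the sketch below turns relational rows into Cytoscape-style `{ data: ... }` elements. The row data, the column names (`customer_id`, `sales_order_id`), and the `buildElements` helper are assumptions made for this example, not the project's actual schema or code.

```javascript
// Hypothetical rows, as they might come back from SQLite queries.
const rows = {
  customers: [{ id: "C001", name: "Acme" }],
  sales_orders: [{ id: "SO1", customer_id: "C001" }],
  deliveries: [{ id: "D1", sales_order_id: "SO1" }],
};

// Each FK column becomes one edge type. Edges point from the referenced
// (parent) row to the referencing (child) row, following the business flow.
const fkEdges = [
  { child: "sales_orders", fk: "customer_id", label: "placed" },
  { child: "deliveries", fk: "sales_order_id", label: "fulfilled by" },
];

function buildElements(rows, fkEdges) {
  const nodes = [];
  for (const [type, list] of Object.entries(rows)) {
    for (const row of list) {
      nodes.push({ data: { id: row.id, type, label: row.name ?? row.id } });
    }
  }
  const edges = [];
  for (const { child, fk, label } of fkEdges) {
    for (const row of rows[child] ?? []) {
      if (row[fk] == null) continue; // skip rows without a parent reference
      edges.push({
        data: { id: `${row[fk]}->${row.id}`, source: row[fk], target: row.id, label },
      });
    }
  }
  return { nodes, edges };
}

const elements = buildElements(rows, fkEdges);
console.log(elements.edges.map((e) => e.data.id));
```

This sketch assumes IDs are unique across tables. If they are not, prefixing each node ID with its type would avoid collisions.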

Graph Model

Nodes

| Entity | Color | Description |
|---|---|---|
| Customer | Purple | The buyer |
| Sales Order | Teal | The purchase agreement |
| Sales Order Item | Light Teal | Line items within an order |
| Delivery | Blue | Physical shipment |
| Billing Document | Amber | Invoice raised |
| Journal Entry | Coral | Financial posting |
| Material | Pink | Product/SKU |
| Payment | Green | Settlement of billing doc |

Edges (Relationships)

Customer ──placed──► Sales Order
Sales Order ──contains──► Sales Order Item
Sales Order Item ──references──► Material
Sales Order ──fulfilled by──► Delivery
Sales Order ──billed via──► Billing Document
Delivery ──triggers──► Billing Document
Billing Document ──posts to──► Journal Entry
Billing Document ──settled by──► Payment

Full Business Flow

Customer → Sales Order → Delivery → Billing Document → Journal Entry
                     ↘                              ↘
                      Sales Order Item → Material    Payment
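
The flow above can be walked programmatically, which is roughly what a "trace the full flow" query needs. This is a minimal sketch, assuming an in-memory adjacency map keyed by node ID. The IDs and the `buildAdjacency` and `traceFlow` helpers are made up for illustration.

```javascript
// Hypothetical directed edges following the business flow above.
const edges = [
  ["C001", "SO1"], // Customer placed Sales Order
  ["SO1", "SOI1"], // Sales Order contains Sales Order Item
  ["SOI1", "M001"], // Sales Order Item references Material
  ["SO1", "D1"], // Sales Order fulfilled by Delivery
  ["D1", "B1"], // Delivery triggers Billing Document
  ["B1", "J1"], // Billing Document posts to Journal Entry
  ["B1", "P1"], // Billing Document settled by Payment
];

// Build an adjacency map: source ID -> list of target IDs.
function buildAdjacency(edges) {
  const adj = new Map();
  for (const [source, target] of edges) {
    if (!adj.has(source)) adj.set(source, []);
    adj.get(source).push(target);
  }
  return adj;
}

// Breadth-first walk downstream from a start node.
function traceFlow(adj, start) {
  const visited = new Set([start]);
  const queue = [start];
  const order = [];
  while (queue.length > 0) {
    const node = queue.shift();
    order.push(node);
    for (const next of adj.get(node) ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

const flow = traceFlow(buildAdjacency(edges), "SO1");
console.log(flow.join(" → "));
```

Tracing from a billing document back to its customer would need the reverse adjacency as well, built by swapping source and target.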

LLM Prompting Strategy

Two-step prompting

Step 1 — Intent resolution: Send the user message with the full schema. Gemini decides whether to generate SQL or answer directly.
Step 2 — Synthesis: After SQL execution, send the raw results back to Gemini to produce a plain-English answer with referenced entity IDs.
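
The two steps can be sketched as a single async function. `callLLM` and `runSQL` are hypothetical injected dependencies, stubbed here so the control flow can be shown without a real model or database. The actual backend may structure this differently.

```javascript
// Two-step cycle: intent resolution, optional SQL execution, then synthesis.
async function answerQuestion(message, { callLLM, runSQL }) {
  // Step 1: the model decides whether to query or answer directly.
  const decision = JSON.parse(await callLLM({ step: "intent", message }));
  if (decision.action === "answer") {
    return { text: decision.text, sql: null, referenced_ids: [] };
  }

  // Execute the generated SQL on the backend.
  const rows = runSQL(decision.sql);

  // Step 2: raw rows go back to the model for a plain-English answer.
  const final = JSON.parse(await callLLM({ step: "synthesis", message, rows }));
  return {
    text: final.text,
    sql: decision.sql,
    referenced_ids: final.referenced_ids ?? [],
  };
}

// Stubs standing in for the model and the database.
const stubLLM = async ({ step, rows }) =>
  step === "intent"
    ? JSON.stringify({ action: "query", sql: "SELECT id FROM customers LIMIT 1" })
    : JSON.stringify({
        action: "answer",
        text: `Found ${rows.length} row(s).`,
        referenced_ids: rows.map((r) => r.id),
      });
const stubSQL = () => [{ id: "C001" }];

const demo = answerQuestion("Which customers exist?", { callLLM: stubLLM, runSQL: stubSQL });
demo.then((result) => console.log(result));
```

Injecting the model and database as parameters keeps the pipeline testable without spending LLM quota.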

System prompt design

The system prompt includes:

Full SQLite schema with FK relationships
Strict domain restriction instruction
JSON response format spec ({"action":"query","sql":"..."} or {"action":"answer","text":"..."})
The GUARDRAIL sentinel for off-topic detection
Instruction to include referenced_ids for graph highlighting
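
A prompt with these ingredients, plus a defensive parser for the reply, might look like the sketch below. The prompt wording and the `buildSystemPrompt` and `parseReply` helpers are assumptions. Only the two JSON shapes, the `GUARDRAIL` sentinel, and `referenced_ids` come from the design described above.

```javascript
const GUARDRAIL = "GUARDRAIL";

// Assemble a system prompt from a schema string. The wording is illustrative.
function buildSystemPrompt(schemaSql) {
  return [
    "You answer questions about the dataset described by this SQLite schema:",
    schemaSql,
    "Only answer questions about this dataset.",
    `If the question is unrelated, reply with exactly: ${GUARDRAIL}`,
    "Reply with ONLY JSON, no backticks, in one of these forms:",
    '{"action":"query","sql":"..."}',
    '{"action":"answer","text":"...","referenced_ids":["..."]}',
  ].join("\n");
}

// Parse a model reply into a tagged result, tolerating a stray code fence.
function parseReply(raw) {
  const trimmed = raw.trim();
  if (trimmed === GUARDRAIL) return { kind: "guardrail" };
  // Strip a leading and trailing triple-backtick fence if the model added one.
  const cleaned = trimmed.replace(/^`{3}(?:json)?\s*/i, "").replace(/\s*`{3}$/, "");
  try {
    const obj = JSON.parse(cleaned);
    if (obj.action === "query" && typeof obj.sql === "string") {
      return { kind: "query", sql: obj.sql };
    }
    if (obj.action === "answer" && typeof obj.text === "string") {
      return { kind: "answer", text: obj.text, referenced_ids: obj.referenced_ids ?? [] };
    }
  } catch {
    // Not valid JSON: fall through to the invalid case.
  }
  return { kind: "invalid" };
}

console.log(parseReply('{"action":"query","sql":"SELECT 1"}'));
```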

Guardrails (two layers)

Layer 1 — Pattern matching (fast, pre-LLM):

Regex patterns for off-topic requests (general knowledge, coding help, creative writing, jailbreak attempts)
Dataset keyword whitelist — if the message mentions sales orders, deliveries, etc., it bypasses the pattern check
Runs in <1ms, saves LLM quota

Layer 2 — LLM self-rejection:

The system prompt instructs Gemini to respond with the exact string GUARDRAIL for unrelated queries
The server detects this sentinel and returns the standard rejection message
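
The two layers can be sketched as two small predicates. The keyword list and regex patterns below are illustrative stand-ins, not the project's actual rules, and `preFilter` and `isGuardrailReply` are hypothetical names.

```javascript
// Layer 1 inputs. Both lists are assumptions for illustration.
const DATASET_KEYWORDS =
  /\b(sales orders?|deliver(y|ies)|billing|invoices?|payments?|customers?|materials?|journal)\b/i;
const OFF_TOPIC_PATTERNS = [
  /\b(write|compose)\b.*\b(poem|story|essay|song)\b/i, // creative writing
  /\bignore (all|your|previous) (instructions|rules)\b/i, // jailbreak attempts
  /\b(capital of|who (is|was) the president)\b/i, // general knowledge
];

// Layer 1: true when the message should be rejected before any LLM call.
function preFilter(message) {
  if (DATASET_KEYWORDS.test(message)) return false; // whitelist bypasses the patterns
  return OFF_TOPIC_PATTERNS.some((pattern) => pattern.test(message));
}

// Layer 2: true when the model's raw reply is the sentinel string.
function isGuardrailReply(raw) {
  return raw.trim() === "GUARDRAIL";
}

console.log(preFilter("Write a poem about the sea")); // rejected by Layer 1
console.log(preFilter("Write a poem about sales orders")); // passes Layer 1
```

The second call shows why Layer 2 matters: a dataset keyword lets an off-topic request through the pre-filter, so the model's own sentinel is the remaining check.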

Example prompt/response cycle

User: "Which products are associated with the most billing documents?"

→ Gemini Step 1:
  {"action":"query","sql":"SELECT m.name, COUNT(DISTINCT bd.id) AS billing_count
   FROM materials m
   JOIN sales_order_items soi ON soi.material_id = m.id
   JOIN billing_documents bd ON bd.sales_order_id = soi.sales_order_id
   GROUP BY m.id, m.name ORDER BY billing_count DESC LIMIT 10"}

→ Backend executes SQL, gets results

→ Gemini Step 2:
  {"action":"answer","text":"The top product is Widget A with 42 billing documents...","referenced_ids":["M001","M002"]}

→ Frontend highlights M001 and M002 nodes in the graph

Setup & Running

Prerequisites

Node.js 18+
A Google AI Studio API key (free)

1. Clone and install

git clone <your-repo-url>
cd graph-query-system

# Install backend dependencies
cd backend && npm install

# Install frontend dependencies
cd ../frontend && npm install

2. Configure environment

cd backend
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

3. Ingest the dataset

cd backend

# Create the data directory and copy your CSV/XLSX files into it
mkdir -p data
# Copy your dataset files to backend/data/

# Run ingestion
npm run ingest

The ingestion script auto-maps file/sheet names to tables using fuzzy name matching. Supported mappings:

| File/sheet name contains | Maps to table |
|---|---|
| sales_order, order | sales_orders |
| delivery, deliveries | deliveries |
| billing, invoice | billing_documents |
| journal, fi_document | journal_entries |
| customer | customers |
| material, product | materials |
| payment | payments |
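
One way such name matching could work is a first-match-wins lookup over ordered keyword rules, sketched below. The rule order, the normalisation step, and the `mapNameToTable` helper are assumptions. The real ingestion script may match differently.

```javascript
// Ordered rules mirroring the mapping table above; the first match wins.
const TABLE_RULES = [
  { table: "sales_orders", keywords: ["sales_order", "order"] },
  { table: "deliveries", keywords: ["delivery", "deliveries"] },
  { table: "billing_documents", keywords: ["billing", "invoice"] },
  { table: "journal_entries", keywords: ["journal", "fi_document"] },
  { table: "customers", keywords: ["customer"] },
  { table: "materials", keywords: ["material", "product"] },
  { table: "payments", keywords: ["payment"] },
];

function mapNameToTable(name) {
  // Lowercase, drop the extension, and unify separators to underscores.
  const normalised = name
    .toLowerCase()
    .replace(/\.(csv|xlsx)$/, "")
    .replace(/[\s-]+/g, "_");
  const rule = TABLE_RULES.find((r) => r.keywords.some((k) => normalised.includes(k)));
  return rule ? rule.table : null; // null means no table matched
}

console.log(mapNameToTable("Sales Order Header.xlsx"));
console.log(mapNameToTable("Invoices-2024.csv"));
```

With substring matching, rule order matters: in this sketch a file named `customer_orders.csv` would map to `sales_orders` because the `order` keyword is checked first.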

4. Start the backend

cd backend
npm run dev
# Runs on http://localhost:3001

5. Start the frontend

cd frontend
npm run dev
# Runs on http://localhost:5173

API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/health` | Health check |
| GET | `/api/graph` | Full graph (nodes + edges, up to 50 per type) |
| GET | `/api/stats` | Entity counts per table |
| GET | `/api/node/:type/:id` | Single node detail |
| GET | `/api/node/:type/:id/expand` | 1-hop neighbours |
| POST | `/api/chat` | `{ message, history[] }` → NL answer + SQL + referenced_ids |
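
A client call to the chat endpoint might look like the following sketch. It assumes the local backend URL from the setup steps and uses the global `fetch` available in Node.js 18+. `buildChatRequest` and `askChat` are hypothetical helpers, and the shape of the entries in `history` is not specified here, so it is passed through as given.

```javascript
// Build the fetch arguments for POST /api/chat.
function buildChatRequest(baseUrl, message, history = []) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/api/chat`, // tolerate a trailing slash
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message, history }),
    },
  };
}

// Send the request and return the parsed JSON response.
async function askChat(baseUrl, message, history = []) {
  const { url, options } = buildChatRequest(baseUrl, message, history);
  const response = await fetch(url, options);
  if (!response.ok) throw new Error(`Chat request failed: ${response.status}`);
  return response.json();
}

const request = buildChatRequest("http://localhost:3001/", "Show me all unpaid billing documents");
console.log(request.url);
```

With the backend running, `askChat("http://localhost:3001", "Show me all unpaid billing documents")` would resolve to the response body described in the table.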

Deployment

Render (recommended — free tier)

Backend:

New Web Service → connect repo → set root to backend/
Build command: npm install
Start command: npm start
Add env var: GEMINI_API_KEY
Add a persistent disk mounted at /opt/render/project/src/backend/data for the SQLite file

Frontend:

New Static Site → connect repo → set root to frontend/
Build command: npm run build
Publish directory: dist
Add env var: VITE_API_URL=https://your-backend.onrender.com

Update frontend/src/api.js baseURL to use import.meta.env.VITE_API_URL for production.
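
A minimal sketch of that change is shown below, written as a function over an env-like object so it can run outside Vite. In the frontend it would be called as `resolveApiBase(import.meta.env)`. The helper name and the localhost fallback are assumptions.

```javascript
// Resolve the API base URL from an env-like object.
function resolveApiBase(env) {
  // Fall back to the local backend when VITE_API_URL is not set (assumed default).
  const configured = (env && env.VITE_API_URL) || "http://localhost:3001";
  return configured.replace(/\/+$/, ""); // strip trailing slashes
}

console.log(resolveApiBase({ VITE_API_URL: "https://your-backend.onrender.com/" }));
console.log(resolveApiBase({}));
```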

Example Queries

Try these in the chat interface:

"Which products are associated with the highest number of billing documents?"
"Trace the full flow of billing document [ID]"
"Find sales orders that were delivered but never billed"
"Find orders that were billed without a delivery"
"Which customers have placed the most orders?"
"What is the total payment amount by currency?"
"Show me all unpaid billing documents"
"Which plant handles the most deliveries?"

AI Coding Session Logs

This repository contains session transcripts and summaries from Claude.ai sessions used during development of the Graph Query System.

Each file documents:

The prompts used
What the AI generated
What was manually changed and why
Key architectural decisions made during the session

Session Index

| # | Session | Key Topic | Outcome |
|---|---|---|---|
| 01 | Schema Design & DB Choice | SQLite vs Neo4j tradeoffs, normalized schema from SAP columns | `schema.sql` with 8 tables + FK indexes |
| 02 | Data Ingestion Script | Fuzzy filename matching, FK-ordered insertion, error handling | `ingest.js` with dry-run support |
| 03 | In-Memory Graph Construction | FK → Cytoscape node/edge derivation, 1-hop expansion API | `graph-builder.js`, `/api/graph`, `/api/node/:id/expand` |
| 04 | LLM Prompting Strategy | Two-step NL→SQL→NL pipeline, GUARDRAIL sentinel, result truncation | `prompts.js`, `groq-client.js`, `chat.js` |
| 05 | React Frontend + Cytoscape | Split layout, node highlighting, collapsible SQL display | Full React frontend |
| 06 | Guardrails | Regex pre-filter + LLM sentinel, whitelist design, 20-case test suite | `guardrail.js` |

How These Logs Were Generated

Primary tool: Claude.ai (claude.ai chat interface)
Secondary: Claude Code for file-level edits and debugging iterations

Sessions were conducted iteratively — each session built on the working output of the previous one. The logs document the actual prompts used, AI outputs, and manual modifications made during development.

Prompt Quality Observations

Things that improved AI output quality during this project:

Providing the full schema in context — SQL generation quality jumped significantly when the model had all table/column names
Asking for tradeoffs before asking for code — Sessions 01 and 04 both started with architecture questions, which led to better-justified decisions
Sending error messages back to the LLM — The FK constraint bug in Session 02 was fixed in one follow-up by pasting the exact error
Explicit format constraints — Specifying "return ONLY JSON, no backticks, no markdown" eliminated a whole class of parsing errors
Testing edge cases with the AI — Session 06's 20-case test prompt caught the whitelist-ordering bug before it shipped
