Date: 2026-03-24
Tool: Claude.ai
Duration: ~45 mins
Goal: Decide on storage layer and design a normalized SQLite schema from the raw dataset
After downloading and opening the dataset, I had 12+ sheets/files covering sales orders, deliveries, billing documents, journal entries, payments, customers, and materials. I needed to decide: graph DB, relational DB, or something else?
I have a dataset from an SAP Order-to-Cash process. It has entities like sales orders, deliveries, billing documents, journal entries, payments, customers, and materials. Each entity has IDs that reference others (e.g. a delivery references a sales order ID). I want to build a system where users can query this in natural language. The LLM should generate structured queries dynamically. Should I use Neo4j, SQLite, or something else? Give me tradeoffs.
The AI laid out three options:
| Option | Pro | Con |
|---|---|---|
| Neo4j | Native graph traversal, Cypher is expressive | Requires a running server, LLMs generate weaker Cypher than SQL |
| SQLite | LLMs know SQL deeply, zero setup, single file | No native graph ops, JOINs needed for traversal |
| In-memory graph + SQLite | Best of both — SQL for queries, graph for viz | Two layers to maintain |
Key insight the AI gave me:
"LLMs have seen vastly more SQL in training than Cypher or Gremlin. If your primary use case is natural language → structured query, SQLite will give you more reliable query generation with fewer hallucinations."
Went with SQLite + in-memory graph derived from FK relationships. The graph is for visualization; SQL is for querying. This separation made sense.
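The graph-from-FKs idea can be sketched roughly like this. This is an illustrative sketch, not my actual code: `fk_edges` and the two-table schema are made up here, and SQLite's `PRAGMA foreign_key_list` does the introspection. A real graph library (e.g. networkx) would consume these edges the same way.

```python
# Derive graph edges from the foreign keys declared in a SQLite schema.
import sqlite3

def fk_edges(conn: sqlite3.Connection) -> list[tuple[str, str, str]]:
    """Return (child_table, fk_column, parent_table) for every declared FK."""
    edges = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA foreign_key_list rows: (id, seq, parent_table, from_col, to_col, ...)
        for row in conn.execute(f"PRAGMA foreign_key_list({table})"):
            edges.append((table, row[3], row[2]))
    return edges

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id TEXT PRIMARY KEY);
CREATE TABLE sales_orders (
    id TEXT PRIMARY KEY,
    customer_id TEXT REFERENCES customers(id)
);
""")
print(fk_edges(conn))  # [('sales_orders', 'customer_id', 'customers')]
```

The nice property: the visualization layer needs no hand-maintained edge list, it falls out of the same DDL the SQL layer uses.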
Here is a sample of my dataset columns for each entity. Design a normalized SQLite schema with proper foreign keys. I'll paste the column headers now.
```
sales_orders: VBELN, KUNNR, ERDAT, NETWR, WAERK, VKORG, VTWEG, SPART
deliveries: VBELN, TKNUM, KUNNR, LFART, WADAT_IST, LGORT, WERKS
billing_documents: VBELN, FKART, FKDAT, NETWR, WAERK, KUNRG
journal_entries: BELNR, BUKRS, GJAHR, BLDAT, WAERS, DMBTR
payments: BELNR, BUKRS, AUGDT, DMBTR, WAERS
customers: KUNNR, NAME1, ORT01, LAND1, REGIO
materials: MATNR, MAKTX, MEINS, MTART, MATKL
```
Generated a full schema. Key parts:
```sql
CREATE TABLE customers (
    id TEXT PRIMARY KEY,  -- KUNNR
    name TEXT,
    city TEXT,
    country TEXT,
    region TEXT
);

CREATE TABLE sales_orders (
    id TEXT PRIMARY KEY,  -- VBELN
    customer_id TEXT REFERENCES customers(id),
    created_date TEXT,
    net_value REAL,
    currency TEXT,
    sales_org TEXT,
    distribution_channel TEXT,
    division TEXT
);

CREATE TABLE deliveries (
    id TEXT PRIMARY KEY,
    sales_order_id TEXT REFERENCES sales_orders(id),
    customer_id TEXT REFERENCES customers(id),
    delivery_type TEXT,
    actual_delivery_date TEXT,
    storage_location TEXT,
    plant TEXT
);

CREATE TABLE billing_documents (
    id TEXT PRIMARY KEY,
    sales_order_id TEXT REFERENCES sales_orders(id),
    delivery_id TEXT REFERENCES deliveries(id),
    billing_type TEXT,
    billing_date TEXT,
    net_value REAL,
    currency TEXT,
    payer_id TEXT REFERENCES customers(id)
);

CREATE TABLE journal_entries (
    id TEXT PRIMARY KEY,
    billing_document_id TEXT REFERENCES billing_documents(id),
    company_code TEXT,
    fiscal_year TEXT,
    posting_date TEXT,
    currency TEXT,
    amount REAL
);

CREATE TABLE payments (
    id TEXT PRIMARY KEY,
    billing_document_id TEXT REFERENCES billing_documents(id),
    company_code TEXT,
    clearing_date TEXT,
    amount REAL,
    currency TEXT
);

CREATE TABLE materials (
    id TEXT PRIMARY KEY,  -- MATNR
    name TEXT,
    unit TEXT,
    type TEXT,
    material_class TEXT
);

CREATE TABLE sales_order_items (
    id TEXT PRIMARY KEY,
    sales_order_id TEXT REFERENCES sales_orders(id),
    material_id TEXT REFERENCES materials(id),
    quantity REAL,
    net_value REAL
);
```

My changes to the AI's output:

- The AI used `INTEGER PRIMARY KEY` for some tables; I changed all IDs to `TEXT` since SAP IDs are alphanumeric (e.g. `0000012345`)
- Added `sales_order_items` as a junction table the AI initially missed
- Renamed columns to be more readable (`KUNNR` → `customer_id`); the AI suggested keeping SAP names, but I overrode this for LLM-friendliness
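The TEXT-over-INTEGER point is easy to demonstrate. SQLite's INTEGER type affinity silently coerces a zero-padded ID into a number, losing the padding; a TEXT column keeps it verbatim. (Throwaway table names, just for illustration.)

```python
# SAP IDs like '0000012345' survive in TEXT columns but not INTEGER ones.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE as_int (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE as_text (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO as_int VALUES ('0000012345')")
conn.execute("INSERT INTO as_text VALUES ('0000012345')")

print(conn.execute("SELECT id FROM as_int").fetchone()[0])   # 12345 (padding lost)
print(conn.execute("SELECT id FROM as_text").fetchone()[0])  # 0000012345
```

With INTEGER keys, a join against source-system strings like `'0000012345'` would silently match nothing.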
Now write the CREATE TABLE statements as a single migration script, and add indexes on all foreign key columns for query performance.
Produced schema.sql with all tables plus indexes like:

```sql
CREATE INDEX idx_sales_orders_customer ON sales_orders(customer_id);
CREATE INDEX idx_deliveries_sales_order ON deliveries(sales_order_id);
CREATE INDEX idx_billing_sales_order ON billing_documents(sales_order_id);
-- etc.
```

I accepted this mostly as-is and saved it to backend/src/db/schema.sql.
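One gotcha worth noting when applying the migration: SQLite does not enforce foreign keys unless `PRAGMA foreign_keys = ON` is set per connection. A minimal loader sketch (`init_db` is an illustrative name, not my actual code):

```python
# Apply schema.sql to a fresh database with FK enforcement enabled.
import sqlite3

def init_db(db_path: str,
            schema_path: str = "backend/src/db/schema.sql") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite defaults to OFF
    with open(schema_path) as f:
        conn.executescript(f.read())
    return conn
```

Without the pragma, the `REFERENCES` clauses in the schema are documentation only and bad IDs insert silently.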
- SQLite over Neo4j — LLM SQL generation is more reliable than Cypher
- Readable column names — easier for the LLM to use when the schema is included in its prompt
- All IDs as TEXT — SAP IDs are zero-padded and alphanumeric, not integers
- FK indexes on all join columns — the AI proactively added these, which I kept
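The "readable column names" decision pays off at prompt-building time: the DDL can be pulled straight out of `sqlite_master` and spliced into the LLM's context. A hedged sketch (`schema_prompt` and the wrapper text are illustrative, not my actual prompt):

```python
# Build the schema portion of the NL-to-SQL prompt from the live database.
import sqlite3

def schema_prompt(conn: sqlite3.Connection) -> str:
    ddl = [r[0] for r in conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type IN ('table', 'index') AND sql IS NOT NULL")]
    return "You can query this SQLite schema:\n\n" + "\n\n".join(ddl)
```

Because the prompt is generated from the database itself, schema changes can never drift out of sync with what the LLM is told.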