24 changes: 18 additions & 6 deletions README.md
@@ -5,7 +5,7 @@ A web application that converts natural language queries to SQL using AI, built
## Features

- 🗣️ Natural language to SQL conversion using OpenAI or Anthropic
- 📁 Drag-and-drop file upload (.csv and .json)
- 📁 Drag-and-drop file upload (.csv, .json, and .jsonl)
- 📊 Interactive table results display
- 🔒 SQL injection protection
- ⚡ Fast development with Vite and uv
@@ -59,6 +59,7 @@ Use the provided script to start both services:
Press `Ctrl+C` to stop both services.

The script will:

- Check that `.env` exists in `app/server/`
- Start the backend on http://localhost:8000
- Start the frontend on http://localhost:5173
@@ -67,13 +68,15 @@ The script will:
## Manual Start (Alternative)

### Backend

```bash
cd app/server
# .env is loaded automatically by python-dotenv
uv run python server.py
```

### Frontend

```bash
cd app/client
npm run dev
@@ -83,7 +86,7 @@ npm run dev

1. **Upload Data**: Click "Upload Data" to open the modal
- Use sample data buttons for quick testing
- Or drag and drop your own .csv or .json files
- Or drag and drop your own .csv, .json, or .jsonl files
- Uploading a file with the same name will overwrite the existing table
2. **Query Your Data**: Type a natural language query like "Show me all users who signed up last week"
- Press `Cmd+Enter` (Mac) or `Ctrl+Enter` (Windows/Linux) to run the query
@@ -93,6 +96,7 @@ npm run dev
## Development

### Backend Commands

```bash
cd app/server
uv run python server.py # Start server with hot reload
@@ -103,6 +107,7 @@ uv sync --all-extras # Sync all extras
```

### Frontend Commands

```bash
cd app/client
npm run dev # Start dev server
@@ -135,7 +140,7 @@ npm run preview # Preview production build

## API Endpoints

- `POST /api/upload` - Upload CSV/JSON file
- `POST /api/upload` - Upload CSV/JSON/JSONL file
- `POST /api/query` - Process natural language query
- `GET /api/schema` - Get database schema
- `POST /api/insights` - Generate column insights
@@ -148,18 +153,21 @@ npm run preview # Preview production build
The application implements comprehensive SQL injection protection through multiple layers:

1. **Centralized Security Module** (`core/sql_security.py`):

- Identifier validation for table and column names
- Safe query execution with parameterized queries
- Proper escaping for identifiers using SQLite's square bracket notation
- Dangerous operation detection and blocking

2. **Input Validation**:

- All table and column names are validated against a whitelist pattern
- SQL keywords cannot be used as identifiers
- File names are sanitized before creating tables
- User queries are validated for dangerous operations

3. **Query Execution Safety**:

- Parameterized queries used wherever possible
- Identifiers (table/column names) are properly escaped
- Multiple statement execution is blocked
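The pattern behind these layers can be sketched with plain `sqlite3`: values are bound as parameters, while identifiers (which SQLite cannot parameterize) are validated against a whitelist pattern and wrapped in SQLite's square-bracket quoting. This is a minimal illustration only, not the real `core/sql_security.py` API; the regex and helper name are made up for the example, and the SQL-keyword check is omitted:

```python
import re
import sqlite3

# Illustrative whitelist pattern; the module's actual pattern may differ.
IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def escape_identifier(name: str) -> str:
    # Identifiers cannot be bound with "?", so validate them against a
    # whitelist pattern, then wrap in SQLite's square-bracket quoting.
    if not IDENTIFIER_RE.match(name):
        raise ValueError(f"Invalid identifier: {name!r}")
    return f"[{name}]"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))  # value via parameter binding

rows = conn.execute(f"SELECT name FROM {escape_identifier('users')}").fetchall()
print(rows)  # [('alice',)]
```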
@@ -174,6 +182,7 @@ The application implements comprehensive SQL injection protection through multiple
### Security Best Practices for Development

When adding new SQL functionality:

1. Always use the `sql_security` module functions
2. Never concatenate user input directly into SQL strings
3. Use `execute_query_safely()` for all database operations
@@ -183,29 +192,32 @@ When adding new SQL functionality:
### Testing Security

Run the comprehensive security tests:

```bash
cd app/server
uv run pytest tests/test_sql_injection.py -v
```


### Additional Security Features

- CORS configured for local development only
- File upload validation (CSV and JSON only)
- File upload validation (CSV, JSON, and JSONL only)
- Comprehensive error logging without exposing sensitive data
- Database operations are isolated with proper connection handling

## Troubleshooting

**Backend won't start:**

- Check Python version: `python --version` (requires 3.12+)
- Verify API keys are set: `echo $OPENAI_API_KEY`

**Frontend errors:**

- Clear node_modules: `rm -rf node_modules && npm install`
- Check Node version: `node --version` (requires 18+)

**CORS issues:**

- Ensure backend is running on port 8000
- Check vite.config.ts proxy settings
1 change: 1 addition & 0 deletions app/client/package-lock.json

6 changes: 6 additions & 0 deletions app/server/app/client/package-lock.json

14 changes: 14 additions & 0 deletions app/server/core/constants.py
@@ -0,0 +1,14 @@
"""
Configuration constants for the application.

This module contains reusable constants used across the application,
particularly for file processing and data transformation operations.
"""

# Delimiter used when flattening nested JSON objects into flat column names
# Example: {"user": {"name": "John"}} becomes {"user__name": "John"}
NESTED_FIELD_DELIMITER = "__"

# Delimiter used when creating column names for array indices
# Example: {"tags": ["python", "api"]} becomes {"tags_0": "python", "tags_1": "api"}
ARRAY_INDEX_DELIMITER = "_"
184 changes: 182 additions & 2 deletions app/server/core/file_processor.py
@@ -3,12 +3,13 @@
import sqlite3
import io
import re
from typing import Dict, Any, List
from typing import Dict, Any, List, Set
from .sql_security import (
    execute_query_safely,
    validate_identifier,
    SQLSecurityError
)
from .constants import NESTED_FIELD_DELIMITER, ARRAY_INDEX_DELIMITER

def sanitize_table_name(table_name: str) -> str:
    """
@@ -171,4 +172,183 @@ def convert_json_to_sqlite(json_content: bytes, table_name: str) -> Dict[str, Any]:
        }

    except Exception as e:
        raise Exception(f"Error converting JSON to SQLite: {str(e)}")

def flatten_json_record(obj: Any, parent_key: str = "") -> Dict[str, Any]:
    """
    Recursively flatten a nested JSON object into a flat dictionary.

    - Nested dictionaries are flattened using NESTED_FIELD_DELIMITER (e.g., "user__name")
    - Nested lists are flattened using ARRAY_INDEX_DELIMITER with index notation (e.g., "tags_0", "tags_1")
    - Primitive values (strings, numbers, booleans, None) are kept as-is

    Args:
        obj: The object to flatten (dict, list, or primitive value)
        parent_key: The parent key path (used for recursion)

    Returns:
        A flat dictionary with concatenated keys
    """
    items = {}

    if isinstance(obj, dict):
        # Handle nested dictionaries
        for key, value in obj.items():
            new_key = f"{parent_key}{NESTED_FIELD_DELIMITER}{key}" if parent_key else key
            # Recursively flatten
            flattened = flatten_json_record(value, new_key)
            items.update(flattened)

Bug: Delimiter collision causes silent data loss during flattening

The flatten_json_record function uses items.update() to merge flattened results, which silently overwrites values when key collisions occur. If a JSON record contains a field name that already includes the delimiter (like "user__name") alongside a nested structure that flattens to the same key (like {"user": {"name": "value"}}), the later value overwrites the earlier one without warning. Similarly, fields like "items_0" will collide with {"items": ["value"]}. This can cause silent data loss when processing JSONL files with field names containing __ or _N patterns.

    elif isinstance(obj, list):
        # Handle nested lists with index notation
        for idx, item in enumerate(obj):
            new_key = f"{parent_key}{ARRAY_INDEX_DELIMITER}{idx}"
            # Recursively flatten each list item
            flattened = flatten_json_record(item, new_key)
            items.update(flattened)

    else:
        # Base case: primitive value (string, number, boolean, None)
        items[parent_key] = obj


Bug: Flattening top-level primitives creates empty-string column name

The flatten_json_record function handles top-level primitive values (non-dict, non-list) by storing them with parent_key as the key. When called with a top-level primitive and empty parent_key (e.g., a JSONL line containing just 123 or "hello"), this creates a dictionary entry with an empty string key {"": value}. This empty column name could cause database issues or unexpected behavior. While JSONL typically contains objects, the code doesn't validate this assumption before flattening, allowing malformed JSONL files to produce problematic schemas.

    return items
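Taken standalone, the flattening scheme behaves like this. The snippet below is a minimal re-implementation for illustration (the real function is the one above, with the delimiters from constants.py); the last case reproduces the delimiter collision flagged in the review comment:

```python
# Minimal re-implementation for illustration; mirrors flatten_json_record
# and the delimiters defined in core/constants.py.
NESTED_FIELD_DELIMITER = "__"
ARRAY_INDEX_DELIMITER = "_"

def flatten(obj, parent_key=""):
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{NESTED_FIELD_DELIMITER}{key}" if parent_key else key
            items.update(flatten(value, new_key))
    elif isinstance(obj, list):
        for idx, item in enumerate(obj):
            items.update(flatten(item, f"{parent_key}{ARRAY_INDEX_DELIMITER}{idx}"))
    else:
        items[parent_key] = obj
    return items

print(flatten({"user": {"name": "John"}, "tags": ["python", "api"]}))
# {'user__name': 'John', 'tags_0': 'python', 'tags_1': 'api'}

# The collision from the review comment: a literal "user__name" field and a
# nested {"user": {"name": ...}} flatten to the same key, so one value is lost.
collided = flatten({"user__name": "literal", "user": {"name": "nested"}})
print(collided)  # {'user__name': 'nested'}
```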

def discover_jsonl_schema(jsonl_content: bytes) -> Set[str]:
    """
    Scan through the entire JSONL file to discover all possible field names.
    This handles schema evolution where different records may have different fields.

    Args:
        jsonl_content: The raw JSONL file content as bytes

    Returns:
        A set of all unique flattened field names found across all records

    Raises:
        ValueError: If no valid JSON records are found or if parsing fails
    """
    all_fields = set()
    lines = jsonl_content.decode('utf-8').strip().split('\n')
    valid_records = 0

    for line_num, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue

        try:
            record = json.loads(line)
            flattened = flatten_json_record(record)
            all_fields.update(flattened.keys())
            valid_records += 1
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON on line {line_num}: {str(e)}")

    if valid_records == 0:
        raise ValueError("No valid JSON records found in JSONL file")

    return all_fields
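The discovery pass has two behaviors worth noting: records with different shapes contribute to one unified schema, and a single malformed line fails the whole file. A standalone sketch mirroring that logic (flattening omitted for brevity, so only top-level keys are collected):

```python
import json

def discover_fields(jsonl_bytes: bytes) -> set:
    # Standalone sketch of discover_jsonl_schema; collects top-level keys only.
    fields = set()
    valid_records = 0
    for line_num, line in enumerate(jsonl_bytes.decode("utf-8").strip().split("\n"), 1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            # One bad line fails the whole file, matching the function above.
            raise ValueError(f"Invalid JSON on line {line_num}: {e}")
        fields.update(record.keys())
        valid_records += 1
    if valid_records == 0:
        raise ValueError("No valid JSON records found in JSONL file")
    return fields

# Heterogeneous records union into one schema.
data = b'{"id": 1, "name": "a"}\n{"id": 2, "email": "b@example.com"}\n'
print(sorted(discover_fields(data)))  # ['email', 'id', 'name']
```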

def convert_jsonl_to_sqlite(jsonl_content: bytes, table_name: str) -> Dict[str, Any]:
    """
    Convert JSONL (JSON Lines) file content to a SQLite table.

    JSONL files contain one JSON object per line. This function:
    1. Discovers all possible fields across all records (handles schema evolution)
    2. Flattens nested structures using configurable delimiters
    3. Creates a pandas DataFrame with all discovered columns
    4. Writes to the SQLite database

    Args:
        jsonl_content: The raw JSONL file content as bytes
        table_name: The desired name for the SQLite table

    Returns:
        Dictionary containing:
            - table_name: The sanitized table name
            - schema: Dictionary mapping column names to SQLite types
            - row_count: Number of rows inserted
            - sample_data: List of sample records (up to 5)

    Raises:
        Exception: If parsing or database operations fail
    """
    try:
        # Sanitize table name
        table_name = sanitize_table_name(table_name)

        # First pass: discover all possible fields across all records
        all_fields = discover_jsonl_schema(jsonl_content)

        # Second pass: parse and flatten all records
        records = []
        lines = jsonl_content.decode('utf-8').strip().split('\n')

        for line in lines:
            line = line.strip()
            if not line:
                continue

            record = json.loads(line)
            flattened = flatten_json_record(record)

            # Ensure all discovered fields are present (fill missing with None)
            complete_record = {field: flattened.get(field) for field in all_fields}
            records.append(complete_record)

        if not records:
            raise ValueError("No valid records found in JSONL file")

        # Convert to pandas DataFrame
        df = pd.DataFrame(records)

        # Clean column names (lowercase, replace spaces/dashes with underscores)
        df.columns = [col.lower().replace(' ', '_').replace('-', '_') for col in df.columns]

        # Connect to SQLite database
        conn = sqlite3.connect("db/database.db")

        # Write DataFrame to SQLite
        df.to_sql(table_name, conn, if_exists='replace', index=False)

        # Get schema information using safe query execution
        cursor_info = execute_query_safely(
            conn,
            "PRAGMA table_info({table})",
            identifier_params={'table': table_name}
        )
        columns_info = cursor_info.fetchall()

        schema = {}
        for col in columns_info:
            schema[col[1]] = col[2]  # column_name: data_type

        # Get sample data using safe query execution
        cursor_sample = execute_query_safely(
            conn,
            "SELECT * FROM {table} LIMIT 5",
            identifier_params={'table': table_name}
        )
        sample_rows = cursor_sample.fetchall()
        column_names = [col[1] for col in columns_info]
        sample_data = [dict(zip(column_names, row)) for row in sample_rows]

        # Get row count using safe query execution
        cursor_count = execute_query_safely(
            conn,
            "SELECT COUNT(*) FROM {table}",
            identifier_params={'table': table_name}
        )
        row_count = cursor_count.fetchone()[0]

        conn.close()

        return {
            'table_name': table_name,
            'schema': schema,
            'row_count': row_count,
            'sample_data': sample_data
        }

    except Exception as e:
        raise Exception(f"Error converting JSONL to SQLite: {str(e)}")
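End to end, the conversion reduces to: flatten each line, take the union of columns, pad missing fields with None, and write rows. A stdlib-only sketch of that shape, using an in-memory database and a hypothetical "demo" table (the real function uses pandas and db/database.db):

```python
import json
import sqlite3

# Hypothetical input; the flatten helper mirrors the constants.py delimiter scheme.
jsonl = b'{"id": 1, "user": {"name": "Ann"}}\n{"id": 2, "tags": ["x"]}\n'

def flatten(obj, parent_key=""):
    # "__" joins nested dict keys, "_<idx>" indexes list elements.
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{parent_key}__{key}" if parent_key else key))
    elif isinstance(obj, list):
        for idx, value in enumerate(obj):
            items.update(flatten(value, f"{parent_key}_{idx}"))
    else:
        items[parent_key] = obj
    return items

rows = [flatten(json.loads(line)) for line in jsonl.decode().splitlines() if line.strip()]
columns = sorted({col for row in rows for col in row})              # union of all fields
records = [tuple(row.get(col) for col in columns) for row in rows]  # pad missing with None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (%s)" % ", ".join(f"[{c}]" for c in columns))
conn.executemany("INSERT INTO demo VALUES (%s)" % ", ".join("?" for _ in columns), records)

print(columns)  # ['id', 'tags_0', 'user__name']
print(conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0])  # 2
```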