diff --git a/README.md b/README.md index d05df9b..4414206 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ A web application that converts natural language queries to SQL using AI, built ## Features - 🗣️ Natural language to SQL conversion using OpenAI or Anthropic -- 📁 Drag-and-drop file upload (.csv and .json) +- 📁 Drag-and-drop file upload (.csv, .json, and .jsonl) - 📊 Interactive table results display - 🔒 SQL injection protection - ⚡ Fast development with Vite and uv @@ -59,6 +59,7 @@ Use the provided script to start both services: Press `Ctrl+C` to stop both services. The script will: + - Check that `.env` exists in `app/server/` - Start the backend on http://localhost:8000 - Start the frontend on http://localhost:5173 @@ -67,6 +68,7 @@ The script will: ## Manual Start (Alternative) ### Backend + ```bash cd app/server # .env is loaded automatically by python-dotenv @@ -74,6 +76,7 @@ uv run python server.py ``` ### Frontend + ```bash cd app/client npm run dev @@ -83,7 +86,7 @@ npm run dev 1. **Upload Data**: Click "Upload Data" to open the modal - Use sample data buttons for quick testing - - Or drag and drop your own .csv or .json files + - Or drag and drop your own .csv, .json, or .jsonl files - Uploading a file with the same name will overwrite the existing table 2. 
**Query Your Data**: Type a natural language query like "Show me all users who signed up last week" - Press `Cmd+Enter` (Mac) or `Ctrl+Enter` (Windows/Linux) to run the query @@ -93,6 +96,7 @@ npm run dev ## Development ### Backend Commands + ```bash cd app/server uv run python server.py # Start server with hot reload @@ -103,6 +107,7 @@ uv sync --all-extras # Sync all extras ``` ### Frontend Commands + ```bash cd app/client npm run dev # Start dev server @@ -135,7 +140,7 @@ npm run preview # Preview production build ## API Endpoints -- `POST /api/upload` - Upload CSV/JSON file +- `POST /api/upload` - Upload CSV/JSON/JSONL file - `POST /api/query` - Process natural language query - `GET /api/schema` - Get database schema - `POST /api/insights` - Generate column insights @@ -148,18 +153,21 @@ npm run preview # Preview production build The application implements comprehensive SQL injection protection through multiple layers: 1. **Centralized Security Module** (`core/sql_security.py`): + - Identifier validation for table and column names - Safe query execution with parameterized queries - Proper escaping for identifiers using SQLite's square bracket notation - Dangerous operation detection and blocking 2. **Input Validation**: + - All table and column names are validated against a whitelist pattern - SQL keywords cannot be used as identifiers - File names are sanitized before creating tables - User queries are validated for dangerous operations 3. **Query Execution Safety**: + - Parameterized queries used wherever possible - Identifiers (table/column names) are properly escaped - Multiple statement execution is blocked @@ -174,6 +182,7 @@ The application implements comprehensive SQL injection protection through multip ### Security Best Practices for Development When adding new SQL functionality: + 1. Always use the `sql_security` module functions 2. Never concatenate user input directly into SQL strings 3. 
Use `execute_query_safely()` for all database operations @@ -183,29 +192,32 @@ When adding new SQL functionality: ### Testing Security Run the comprehensive security tests: + ```bash cd app/server uv run pytest tests/test_sql_injection.py -v ``` - ### Additional Security Features - CORS configured for local development only -- File upload validation (CSV and JSON only) +- File upload validation (CSV, JSON, and JSONL only) - Comprehensive error logging without exposing sensitive data - Database operations are isolated with proper connection handling ## Troubleshooting **Backend won't start:** + - Check Python version: `python --version` (requires 3.12+) - Verify API keys are set: `echo $OPENAI_API_KEY` **Frontend errors:** + - Clear node_modules: `rm -rf node_modules && npm install` - Check Node version: `node --version` (requires 18+) **CORS issues:** + - Ensure backend is running on port 8000 -- Check vite.config.ts proxy settings \ No newline at end of file +- Check vite.config.ts proxy settings diff --git a/app/client/package-lock.json b/app/client/package-lock.json index e14905a..424f598 100644 --- a/app/client/package-lock.json +++ b/app/client/package-lock.json @@ -827,6 +827,7 @@ "integrity": "sha512-M7BAV6Rlcy5u+m6oPhAPFgJTzAioX/6B0DxyvDlo9l8+T3nLKbrczg2WLUyzd45L8RqfUMyGPzekbMvX2Ldkwg==", "dev": true, "license": "MIT", + "peer": true, "engines": { "node": ">=12" }, diff --git a/app/server/app/client/package-lock.json b/app/server/app/client/package-lock.json new file mode 100644 index 0000000..ac6f59a --- /dev/null +++ b/app/server/app/client/package-lock.json @@ -0,0 +1,6 @@ +{ + "name": "client", + "lockfileVersion": 3, + "requires": true, + "packages": {} +} diff --git a/app/server/core/constants.py b/app/server/core/constants.py new file mode 100644 index 0000000..8e352fa --- /dev/null +++ b/app/server/core/constants.py @@ -0,0 +1,14 @@ +""" +Configuration constants for the application. 
+ +This module contains reusable constants used across the application, +particularly for file processing and data transformation operations. +""" + +# Delimiter used when flattening nested JSON objects into flat column names +# Example: {"user": {"name": "John"}} becomes {"user__name": "John"} +NESTED_FIELD_DELIMITER = "__" + +# Delimiter used when creating column names for array indices +# Example: {"tags": ["python", "api"]} becomes {"tags_0": "python", "tags_1": "api"} +ARRAY_INDEX_DELIMITER = "_" diff --git a/app/server/core/file_processor.py b/app/server/core/file_processor.py index cc7c525..c1564b0 100644 --- a/app/server/core/file_processor.py +++ b/app/server/core/file_processor.py @@ -3,12 +3,13 @@ import sqlite3 import io import re -from typing import Dict, Any, List +from typing import Dict, Any, List, Set from .sql_security import ( execute_query_safely, validate_identifier, SQLSecurityError ) +from .constants import NESTED_FIELD_DELIMITER, ARRAY_INDEX_DELIMITER def sanitize_table_name(table_name: str) -> str: """ @@ -171,4 +172,183 @@ def convert_json_to_sqlite(json_content: bytes, table_name: str) -> Dict[str, An } except Exception as e: - raise Exception(f"Error converting JSON to SQLite: {str(e)}") \ No newline at end of file + raise Exception(f"Error converting JSON to SQLite: {str(e)}") + +def flatten_json_record(obj: Any, parent_key: str = "") -> Dict[str, Any]: + """ + Recursively flatten a nested JSON object into a flat dictionary. 
+ + - Nested dictionaries are flattened using NESTED_FIELD_DELIMITER (e.g., "user__name") + - Nested lists are flattened using ARRAY_INDEX_DELIMITER with index notation (e.g., "tags_0", "tags_1") + - Primitive values (strings, numbers, booleans, None) are kept as-is + + Args: + obj: The object to flatten (dict, list, or primitive value) + parent_key: The parent key path (used for recursion) + + Returns: + A flat dictionary with concatenated keys + """ + items = {} + + if isinstance(obj, dict): + # Handle nested dictionaries + for key, value in obj.items(): + new_key = f"{parent_key}{NESTED_FIELD_DELIMITER}{key}" if parent_key else key + # Recursively flatten + flattened = flatten_json_record(value, new_key) + items.update(flattened) + + elif isinstance(obj, list): + # Handle nested lists with index notation + for idx, item in enumerate(obj): + new_key = f"{parent_key}{ARRAY_INDEX_DELIMITER}{idx}" + # Recursively flatten each list item + flattened = flatten_json_record(item, new_key) + items.update(flattened) + + else: + # Base case: primitive value (string, number, boolean, None) + items[parent_key] = obj + + return items + +def discover_jsonl_schema(jsonl_content: bytes) -> Set[str]: + """ + Scan through entire JSONL file to discover all possible field names. + This handles schema evolution where different records may have different fields. 
+ + Args: + jsonl_content: The raw JSONL file content as bytes + + Returns: + A set of all unique flattened field names found across all records + + Raises: + ValueError: If no valid JSON records are found or if parsing fails + """ + all_fields = set() + lines = jsonl_content.decode('utf-8').strip().split('\n') + valid_records = 0 + + for line_num, line in enumerate(lines, 1): + line = line.strip() + if not line: + continue + + try: + record = json.loads(line) + flattened = flatten_json_record(record) + all_fields.update(flattened.keys()) + valid_records += 1 + except json.JSONDecodeError as e: + raise ValueError(f"Invalid JSON on line {line_num}: {str(e)}") + + if valid_records == 0: + raise ValueError("No valid JSON records found in JSONL file") + + return all_fields + +def convert_jsonl_to_sqlite(jsonl_content: bytes, table_name: str) -> Dict[str, Any]: + """ + Convert JSONL (JSON Lines) file content to SQLite table. + + JSONL files contain one JSON object per line. This function: + 1. Discovers all possible fields across all records (handles schema evolution) + 2. Flattens nested structures using configurable delimiters + 3. Creates a pandas DataFrame with all discovered columns + 4. 
Writes to SQLite database + + Args: + jsonl_content: The raw JSONL file content as bytes + table_name: The desired name for the SQLite table + + Returns: + Dictionary containing: + - table_name: The sanitized table name + - schema: Dictionary mapping column names to SQLite types + - row_count: Number of rows inserted + - sample_data: List of sample records (up to 5) + + Raises: + Exception: If parsing or database operations fail + """ + try: + # Sanitize table name + table_name = sanitize_table_name(table_name) + + # First pass: Discover all possible fields across all records + all_fields = discover_jsonl_schema(jsonl_content) + + # Second pass: Parse and flatten all records + records = [] + lines = jsonl_content.decode('utf-8').strip().split('\n') + + for line in lines: + line = line.strip() + if not line: + continue + + record = json.loads(line) + flattened = flatten_json_record(record) + + # Ensure all discovered fields are present (fill missing with None) + complete_record = {field: flattened.get(field) for field in all_fields} + records.append(complete_record) + + if not records: + raise ValueError("No valid records found in JSONL file") + + # Convert to pandas DataFrame + df = pd.DataFrame(records) + + # Clean column names (lowercase, replace spaces/dashes with underscores) + df.columns = [col.lower().replace(' ', '_').replace('-', '_') for col in df.columns] + + # Connect to SQLite database + conn = sqlite3.connect("db/database.db") + + # Write DataFrame to SQLite + df.to_sql(table_name, conn, if_exists='replace', index=False) + + # Get schema information using safe query execution + cursor_info = execute_query_safely( + conn, + "PRAGMA table_info({table})", + identifier_params={'table': table_name} + ) + columns_info = cursor_info.fetchall() + + schema = {} + for col in columns_info: + schema[col[1]] = col[2] # column_name: data_type + + # Get sample data using safe query execution + cursor_sample = execute_query_safely( + conn, + "SELECT * FROM {table} 
LIMIT 5", + identifier_params={'table': table_name} + ) + sample_rows = cursor_sample.fetchall() + column_names = [col[1] for col in columns_info] + sample_data = [dict(zip(column_names, row)) for row in sample_rows] + + # Get row count using safe query execution + cursor_count = execute_query_safely( + conn, + "SELECT COUNT(*) FROM {table}", + identifier_params={'table': table_name} + ) + row_count = cursor_count.fetchone()[0] + + conn.close() + + return { + 'table_name': table_name, + 'schema': schema, + 'row_count': row_count, + 'sample_data': sample_data + } + + except Exception as e: + raise Exception(f"Error converting JSONL to SQLite: {str(e)}") \ No newline at end of file diff --git a/app/server/server.py b/app/server/server.py index 6db2f67..d9c937a 100644 --- a/app/server/server.py +++ b/app/server/server.py @@ -37,7 +37,7 @@ ) # Import core modules (to be implemented) -from core.file_processor import convert_csv_to_sqlite, convert_json_to_sqlite +from core.file_processor import convert_csv_to_sqlite, convert_json_to_sqlite, convert_jsonl_to_sqlite from core.llm_processor import generate_sql from core.sql_processor import execute_sql_safely, get_database_schema from core.insights import generate_insights @@ -71,21 +71,23 @@ @app.post("/api/upload", response_model=FileUploadResponse) async def upload_file(file: UploadFile = File(...)) -> FileUploadResponse: - """Upload and convert .json or .csv file to SQLite table""" + """Upload and convert .json, .jsonl or .csv file to SQLite table""" try: # Validate file type - if not file.filename.endswith(('.csv', '.json')): - raise HTTPException(400, "Only .csv and .json files are supported") - + if not file.filename.endswith(('.csv', '.json', '.jsonl')): + raise HTTPException(400, "Only .csv, .json, and .jsonl files are supported") + # Generate table name from filename table_name = file.filename.rsplit('.', 1)[0].lower().replace(' ', '_') - + # Read file content content = await file.read() - + # Convert to SQLite 
based on file type if file.filename.endswith('.csv'): result = convert_csv_to_sqlite(content, table_name) + elif file.filename.endswith('.jsonl'): + result = convert_jsonl_to_sqlite(content, table_name) else: result = convert_json_to_sqlite(content, table_name) diff --git a/app/server/tests/assets/test_events.jsonl b/app/server/tests/assets/test_events.jsonl new file mode 100644 index 0000000..aeb3907 --- /dev/null +++ b/app/server/tests/assets/test_events.jsonl @@ -0,0 +1,3 @@ +{"id": 1, "event": "login", "user": "john", "timestamp": "2024-01-01T10:00:00"} +{"id": 2, "event": "logout", "user": "jane", "timestamp": "2024-01-01T11:00:00"} +{"id": 3, "event": "purchase", "user": "john", "timestamp": "2024-01-01T12:00:00", "amount": 99.99} diff --git a/app/server/tests/assets/test_nested.jsonl b/app/server/tests/assets/test_nested.jsonl new file mode 100644 index 0000000..f5ddd9c --- /dev/null +++ b/app/server/tests/assets/test_nested.jsonl @@ -0,0 +1,3 @@ +{"id": 1, "user": {"name": "John Doe", "email": "john@example.com", "address": {"city": "NYC", "zip": "10001"}}, "tags": ["vip", "early-adopter"]} +{"id": 2, "user": {"name": "Jane Smith", "email": "jane@example.com", "address": {"city": "LA", "zip": "90001"}}, "tags": ["new"]} +{"id": 3, "user": {"name": "Bob Johnson", "email": "bob@example.com"}, "tags": ["standard", "verified", "active"]} diff --git a/app/server/tests/core/test_file_processor.py b/app/server/tests/core/test_file_processor.py index 81ce7b6..c6fc272 100644 --- a/app/server/tests/core/test_file_processor.py +++ b/app/server/tests/core/test_file_processor.py @@ -6,7 +6,12 @@ import io from pathlib import Path from unittest.mock import patch -from core.file_processor import convert_csv_to_sqlite, convert_json_to_sqlite +from core.file_processor import ( + convert_csv_to_sqlite, + convert_json_to_sqlite, + convert_jsonl_to_sqlite, + flatten_json_record +) @pytest.fixture @@ -157,8 +162,167 @@ def test_convert_json_to_sqlite_empty_array(self): # Test 
with empty JSON array json_data = b'[]' table_name = "test_table" - + with pytest.raises(Exception) as exc_info: convert_json_to_sqlite(json_data, table_name) - - assert "JSON array is empty" in str(exc_info.value) \ No newline at end of file + + assert "JSON array is empty" in str(exc_info.value) + + # JSONL Tests + + def test_flatten_json_record_simple_nested_object(self): + # Test simple nested object + obj = {"user": {"name": "John", "age": 30}} + result = flatten_json_record(obj) + assert result == {"user__name": "John", "user__age": 30} + + def test_flatten_json_record_nested_array(self): + # Test nested array + obj = {"tags": ["python", "api"]} + result = flatten_json_record(obj) + assert result == {"tags_0": "python", "tags_1": "api"} + + def test_flatten_json_record_deeply_nested(self): + # Test deeply nested structure + obj = {"a": {"b": {"c": 1}}} + result = flatten_json_record(obj) + assert result == {"a__b__c": 1} + + def test_flatten_json_record_mixed_nesting(self): + # Test mixed nesting: objects containing arrays + obj = {"user": {"tags": ["admin", "user"]}} + result = flatten_json_record(obj) + assert result == {"user__tags_0": "admin", "user__tags_1": "user"} + + def test_flatten_json_record_empty_objects(self): + # Test empty object + obj = {} + result = flatten_json_record(obj) + assert result == {} + + def test_flatten_json_record_null_values(self): + # Test null values + obj = {"name": "John", "age": None} + result = flatten_json_record(obj) + assert result == {"name": "John", "age": None} + + def test_convert_jsonl_to_sqlite_success(self, test_db, test_assets_dir): + # Load real JSONL file + jsonl_file = test_assets_dir / "test_events.jsonl" + with open(jsonl_file, 'rb') as f: + jsonl_data = f.read() + + table_name = "events" + result = convert_jsonl_to_sqlite(jsonl_data, table_name) + + # Verify return structure + assert result['table_name'] == table_name + assert 'schema' in result + assert 'row_count' in result + assert 'sample_data' in 
result + + # Test the returned data + assert result['row_count'] == 3 # 3 events in test file + assert len(result['sample_data']) == 3 + + # Verify schema has expected columns (including 'amount' from line 3) + assert 'id' in result['schema'] + assert 'event' in result['schema'] + assert 'user' in result['schema'] + assert 'timestamp' in result['schema'] + assert 'amount' in result['schema'] # Schema evolution: only in record 3 + + # Verify sample data structure and content + login_data = next((item for item in result['sample_data'] if item['event'] == 'login'), None) + assert login_data is not None + assert login_data['id'] == 1 + assert login_data['user'] == 'john' + assert login_data['amount'] is None # This record doesn't have amount + + # Verify record with amount + purchase_data = next((item for item in result['sample_data'] if item['event'] == 'purchase'), None) + assert purchase_data is not None + assert purchase_data['amount'] == 99.99 + + def test_convert_jsonl_to_sqlite_with_nested_structures(self, test_db, test_assets_dir): + # Load JSONL file with nested structures + jsonl_file = test_assets_dir / "test_nested.jsonl" + with open(jsonl_file, 'rb') as f: + jsonl_data = f.read() + + table_name = "nested_data" + result = convert_jsonl_to_sqlite(jsonl_data, table_name) + + # Verify return structure + assert result['table_name'] == table_name + assert result['row_count'] == 3 + + # Verify flattened columns exist + assert 'user__name' in result['schema'] + assert 'user__email' in result['schema'] + assert 'user__address__city' in result['schema'] + assert 'user__address__zip' in result['schema'] + + # Verify array columns (tags_0, tags_1, tags_2 for max 3 elements) + assert 'tags_0' in result['schema'] + assert 'tags_1' in result['schema'] + assert 'tags_2' in result['schema'] + + # Verify data integrity after flattening + john_data = next((item for item in result['sample_data'] if item['user__name'] == 'John Doe'), None) + assert john_data is not None + 
assert john_data['user__email'] == 'john@example.com' + assert john_data['user__address__city'] == 'NYC' + assert john_data['user__address__zip'] == '10001' + assert john_data['tags_0'] == 'vip' + assert john_data['tags_1'] == 'early-adopter' + + # Verify record without address.city has None + bob_data = next((item for item in result['sample_data'] if item['user__name'] == 'Bob Johnson'), None) + assert bob_data is not None + assert bob_data['user__address__city'] is None + assert bob_data['tags_0'] == 'standard' + assert bob_data['tags_1'] == 'verified' + assert bob_data['tags_2'] == 'active' + + def test_convert_jsonl_to_sqlite_schema_evolution(self, test_db, test_assets_dir): + # Load test_events.jsonl which has schema evolution + jsonl_file = test_assets_dir / "test_events.jsonl" + with open(jsonl_file, 'rb') as f: + jsonl_data = f.read() + + table_name = "events_evolution" + result = convert_jsonl_to_sqlite(jsonl_data, table_name) + + # Verify all records have 'amount' column (with None for records 1 and 2) + assert 'amount' in result['schema'] + + # Verify the schema was discovered from all lines, not just the first + for record in result['sample_data']: + assert 'amount' in record + if record['event'] in ['login', 'logout']: + assert record['amount'] is None + elif record['event'] == 'purchase': + assert record['amount'] == 99.99 + + def test_convert_jsonl_to_sqlite_invalid_jsonl(self): + # Test with invalid JSON line + jsonl_data = b'{"id": 1, "name": "John"}\ninvalid json line\n{"id": 2, "name": "Jane"}' + table_name = "test_table" + + with pytest.raises(Exception) as exc_info: + convert_jsonl_to_sqlite(jsonl_data, table_name) + + assert "Error converting JSONL to SQLite" in str(exc_info.value) + assert "Invalid JSON on line 2" in str(exc_info.value) + + def test_convert_jsonl_to_sqlite_empty_file(self): + # Test with empty JSONL file + jsonl_data = b'' + table_name = "test_table" + + with pytest.raises(Exception) as exc_info: + 
convert_jsonl_to_sqlite(jsonl_data, table_name) + + assert "Error converting JSONL to SQLite" in str(exc_info.value) + assert "No valid JSON records found" in str(exc_info.value) \ No newline at end of file diff --git a/issue-1.md b/issue-1.md new file mode 100644 index 0000000..d6c7609 --- /dev/null +++ b/issue-1.md @@ -0,0 +1,13 @@ +Title: JSONL Support + +Add support for uploading JSONL files. + +Be sure to look through the entire JSONL file first to get all possible fields. Use the std lib - no new libraries outside of what we have. + +One JSONL file will generate one new table, just like our current CSV and JSON uploads. + +Concatenate nested fields and any possible nested lists with `__`; store it as an updatable delimiter in a constants file. + +Use `_0` to denote list items (use the delimiter and specify the index). + +Update the UI to let users know they can upload JSONL files as well. Then, in the test dir inside the codebase, create a couple of JSONL files we can use to test uploads. diff --git a/specs/jsonl-file-upload-support.md b/specs/jsonl-file-upload-support.md new file mode 100644 index 0000000..630312a --- /dev/null +++ b/specs/jsonl-file-upload-support.md @@ -0,0 +1,289 @@ +# Feature: JSONL File Upload Support + +## Feature Description +Add support for uploading JSONL (JSON Lines) files to the Natural Language SQL Interface application. JSONL files contain one JSON object per line, making them efficient for streaming large datasets. This feature will parse JSONL files, flatten nested structures using configurable delimiters, scan the entire file to discover all possible fields (handling schema evolution), and create a single SQLite table just like existing CSV and JSON uploads.
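
The flattening behavior described above can be sketched as a minimal standalone example. The delimiter values mirror the `NESTED_FIELD_DELIMITER`/`ARRAY_INDEX_DELIMITER` constants introduced in this diff, but the sketch is illustrative — it is not the module's `flatten_json_record` itself:

```python
# Sketch of the flattening scheme: nested dicts join keys with "__",
# list items get "_<index>" suffixes, primitives pass through unchanged.
NESTED_FIELD_DELIMITER = "__"
ARRAY_INDEX_DELIMITER = "_"

def flatten(obj, parent_key=""):
    """Recursively flatten a JSON-like value into a single-level dict."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{NESTED_FIELD_DELIMITER}{key}" if parent_key else key
            items.update(flatten(value, new_key))
    elif isinstance(obj, list):
        for idx, item in enumerate(obj):
            items.update(flatten(item, f"{parent_key}{ARRAY_INDEX_DELIMITER}{idx}"))
    else:
        # Base case: primitive value (string, number, boolean, None)
        items[parent_key] = obj
    return items

record = {"user": {"name": "John", "address": {"city": "NYC"}}, "tags": ["vip", "new"]}
# flatten(record) == {"user__name": "John", "user__address__city": "NYC",
#                     "tags_0": "vip", "tags_1": "new"}
```

Note that column counts for arrays are driven by the longest array seen across the file, with shorter rows padded with `None` during the second pass.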
+ +## User Story +As a data analyst +I want to upload JSONL files containing potentially nested and evolving data structures +So that I can query line-delimited JSON data using natural language without manually transforming it first + +## Problem Statement +Currently, the application only supports CSV and JSON file uploads. However, JSONL (JSON Lines) is a widely-used format for streaming data, logs, and large datasets where each line is a valid JSON object. Users cannot analyze JSONL files without first converting them to CSV or JSON array format. Additionally, JSONL files often contain: +- Nested objects that need flattening for SQL table representation +- Nested arrays that require special handling with index notation +- Schema evolution where different lines may have different fields +- Large files that benefit from line-by-line processing + +## Solution Statement +Implement JSONL file upload support by: +1. Creating a new `convert_jsonl_to_sqlite()` function that processes JSONL files line-by-line +2. Performing a full file scan to discover all possible fields across all records (handling schema evolution) +3. Flattening nested objects using a configurable delimiter (default: `__`) stored in a constants file +4. Handling nested lists by concatenating with index notation using `_0`, `_1`, etc. format +5. Updating the server endpoint to accept `.jsonl` files +6. Updating the UI to indicate JSONL support to users +7. Creating comprehensive test JSONL files with nested structures, arrays, and schema variations +8. Ensuring security validation and proper error handling throughout + +## Relevant Files +Use these files to implement the feature: + +### Core Backend Files +- **`app/server/core/file_processor.py`** - Contains `convert_csv_to_sqlite()` and `convert_json_to_sqlite()` functions. Add new `convert_jsonl_to_sqlite()` function here following existing patterns. + +- **`app/server/server.py`** - Contains `/api/upload` endpoint (lines 72-109). 
Update to accept `.jsonl` file extension and route to new converter function. + +- **`app/server/core/sql_security.py`** - Provides security validation functions. Ensure JSONL processing uses proper identifier validation. + +### Frontend Files +- **`app/client/src/main.ts`** - Contains upload modal and file handling logic. Update UI text to mention JSONL support (line 8 mentions ".csv and .json"). + +### Testing Files +- **`app/server/tests/core/test_file_processor.py`** - Contains tests for CSV and JSON converters. Add comprehensive tests for JSONL converter. + +- **`app/server/tests/assets/`** - Directory containing test files (test_users.csv, test_products.json, etc.). Create new JSONL test files here. + +### New Files + +- **`app/server/core/constants.py`** - New file to store configurable constants including: + - `NESTED_FIELD_DELIMITER = "__"` - Delimiter for flattening nested objects + - `ARRAY_INDEX_DELIMITER = "_"` - Delimiter for array index notation + - Other reusable constants for the application + +## Implementation Plan + +### Phase 1: Foundation +Create the constants file and helper functions needed for JSONL processing: +- Create `app/server/core/constants.py` with delimiter constants +- Design the flattening algorithm for nested objects and arrays +- Plan the two-pass approach: first pass to discover schema, second pass to populate data + +### Phase 2: Core Implementation +Implement the JSONL converter function: +- Create `convert_jsonl_to_sqlite()` in `file_processor.py` +- Implement full file scanning to discover all fields across all records +- Implement nested object flattening using `__` delimiter +- Implement nested array handling with index notation (`_0`, `_1`, etc.) 
+- Ensure proper column name cleaning and validation +- Return consistent response structure matching existing converters + +### Phase 3: Integration +Integrate JSONL support throughout the application: +- Update server endpoint to accept `.jsonl` files +- Update frontend UI to indicate JSONL support +- Create comprehensive test files covering various JSONL scenarios +- Write thorough unit tests +- Update documentation as needed + +## Step by Step Tasks + +### Step 1: Create Constants File +- Create `app/server/core/constants.py` with the following constants: + - `NESTED_FIELD_DELIMITER = "__"` for flattening nested objects (e.g., `user__address__city`) + - `ARRAY_INDEX_DELIMITER = "_"` for array indices (e.g., `tags_0`, `tags_1`) +- Document the purpose of each constant with clear comments +- Import this file in `file_processor.py` for use in JSONL processing + +### Step 2: Design and Implement Flattening Logic +- Create helper function `flatten_json_record(obj: Dict[str, Any], parent_key: str = "") -> Dict[str, Any]` in `file_processor.py` +- Handle nested dictionaries by recursively flattening with `__` delimiter +- Handle nested lists by creating separate columns for each index with `_N` notation +- Handle primitive values (strings, numbers, booleans, null) as-is +- Ensure the function is well-documented and handles edge cases (empty objects, null values, mixed types) + +### Step 3: Implement Schema Discovery +- Create helper function `discover_jsonl_schema(jsonl_content: bytes) -> set` that: + - Reads through entire JSONL file line by line + - Parses each line as JSON + - Flattens each record to discover all possible field names + - Collects all unique field names in a set + - Handles malformed lines gracefully with error reporting +- This ensures we capture all fields even if schema evolves throughout the file + +### Step 4: Implement JSONL to SQLite Converter +- Create `convert_jsonl_to_sqlite(jsonl_content: bytes, table_name: str) -> Dict[str, Any]` in 
`file_processor.py` +- Sanitize table name using existing `sanitize_table_name()` function +- First pass: Call `discover_jsonl_schema()` to get all possible columns +- Second pass: Process each line, flatten it, and ensure all discovered columns are present (fill missing with None) +- Create pandas DataFrame from flattened records +- Clean column names (lowercase, replace spaces/dashes with underscores) +- Write DataFrame to SQLite using existing patterns +- Return schema, row count, and sample data matching existing converter format +- Include comprehensive error handling with descriptive messages + +### Step 5: Add Unit Tests for Flattening Logic +- Add test in `test_file_processor.py` for `flatten_json_record()`: + - Test simple nested object: `{"user": {"name": "John", "age": 30}}` → `{"user__name": "John", "user__age": 30}` + - Test nested array: `{"tags": ["python", "api"]}` → `{"tags_0": "python", "tags_1": "api"}` + - Test deeply nested structure: `{"a": {"b": {"c": 1}}}` → `{"a__b__c": 1}` + - Test mixed nesting: objects containing arrays containing objects + - Test edge cases: empty objects, empty arrays, null values + +### Step 6: Create JSONL Test Files +- Create `app/server/tests/assets/test_events.jsonl` with simple records: + ```jsonl + {"id": 1, "event": "login", "user": "john", "timestamp": "2024-01-01T10:00:00"} + {"id": 2, "event": "logout", "user": "jane", "timestamp": "2024-01-01T11:00:00"} + {"id": 3, "event": "purchase", "user": "john", "timestamp": "2024-01-01T12:00:00", "amount": 99.99} + ``` + Note: Line 3 has an extra field `amount` to test schema evolution + +- Create `app/server/tests/assets/test_nested.jsonl` with nested structures: + ```jsonl + {"id": 1, "user": {"name": "John Doe", "email": "john@example.com", "address": {"city": "NYC", "zip": "10001"}}, "tags": ["vip", "early-adopter"]} + {"id": 2, "user": {"name": "Jane Smith", "email": "jane@example.com", "address": {"city": "LA", "zip": "90001"}}, "tags": ["new"]} + {"id": 3, 
"user": {"name": "Bob Johnson", "email": "bob@example.com"}, "tags": ["standard", "verified", "active"]} + ``` + Note: Different nesting depths and array lengths to test robustness + +### Step 7: Add Unit Tests for JSONL Converter +- Add test `test_convert_jsonl_to_sqlite_success()` in `test_file_processor.py`: + - Load `test_events.jsonl` + - Verify table name, schema, row count (3 rows) + - Verify sample data contains expected records + - Verify schema includes all fields including `amount` from line 3 + +- Add test `test_convert_jsonl_to_sqlite_with_nested_structures()`: + - Load `test_nested.jsonl` + - Verify flattened columns exist: `user__name`, `user__email`, `user__address__city`, `user__address__zip` + - Verify array columns: `tags_0`, `tags_1`, `tags_2` + - Verify data integrity after flattening + +- Add test `test_convert_jsonl_to_sqlite_schema_evolution()`: + - Load `test_events.jsonl` + - Verify all records have `amount` column (with None for records 1 and 2) + - Verify the schema was discovered from all lines, not just the first + +- Add test `test_convert_jsonl_to_sqlite_invalid_jsonl()`: + - Test with invalid JSON line + - Verify appropriate error message + +- Add test `test_convert_jsonl_to_sqlite_empty_file()`: + - Test with empty JSONL file + - Verify appropriate error handling + +### Step 8: Update Server Endpoint +- Modify `/api/upload` endpoint in `app/server/server.py` (line 77-78): + - Change file extension validation from `('.csv', '.json')` to `('.csv', '.json', '.jsonl')` + - Add conditional routing for JSONL files (line 87-90): + ```python + if file.filename.endswith('.csv'): + result = convert_csv_to_sqlite(content, table_name) + elif file.filename.endswith('.jsonl'): + result = convert_jsonl_to_sqlite(content, table_name) + else: + result = convert_json_to_sqlite(content, table_name) + ``` +- Ensure import statement includes `convert_jsonl_to_sqlite` (line 40) + +### Step 9: Update Frontend UI +- Update `app/client/src/main.ts` to 
reflect JSONL support in user-facing text:
+  - Update README.md line 8: Change "Drag-and-drop file upload (.csv and .json)" to "Drag-and-drop file upload (.csv, .json, and .jsonl)"
+  - Update any help text or tooltips in the upload modal to mention JSONL support
+  - No changes needed to actual file handling logic (it already accepts any file type the backend supports)
+
+### Step 10: Integration Testing with Real JSONL Files
+- Manually test the feature end-to-end:
+  - Start the server and client
+  - Upload `test_events.jsonl` via the UI
+  - Verify the table is created successfully
+  - Query the table to verify data integrity
+  - Upload `test_nested.jsonl` via the UI
+  - Verify nested structures are properly flattened
+  - Verify array elements are accessible with index notation
+  - Test with malformed JSONL to verify error handling
+
+### Step 11: Run All Validation Commands
+- Execute all validation commands listed below to ensure:
+  - All tests pass with zero regressions
+  - The feature works end-to-end
+  - No security vulnerabilities are introduced
+  - Code follows existing patterns and conventions
+
+## Testing Strategy
+
+### Unit Tests
+- **Flattening Logic Tests**: Verify nested objects and arrays are correctly flattened with the proper delimiters
+- **Schema Discovery Tests**: Verify all fields are discovered across the entire JSONL file
+- **Converter Tests**: Verify JSONL files are correctly parsed and converted to SQLite tables
+- **Error Handling Tests**: Verify invalid JSONL, malformed lines, and empty files are handled gracefully
+
+### Integration Tests
+- **End-to-End Upload Tests**: Upload JSONL files through the API and verify table creation
+- **Query Tests**: Verify flattened columns can be queried using natural language
+- **Security Tests**: Verify JSONL processing respects SQL injection protections
+
+### Edge Cases
+- **Empty JSONL file**: Should return an appropriate error
+- **Single-line JSONL**: Should create a table with one row
+- **JSONL with schema evolution**: Fields appearing in later lines should be discovered
+- **JSONL with inconsistent nesting**: Different nesting depths across records
+- **JSONL with varying array lengths**: Arrays of different lengths should create columns for the maximum length
+- **JSONL with special characters in keys**: Should be sanitized properly
+- **JSONL with null values**: Should be stored as None in the database
+- **Very deeply nested structures**: Should flatten correctly without errors
+- **Large JSONL files**: Should process efficiently line by line
+- **Mixed data types in the same field**: Should handle type coercion appropriately
+- **Malformed JSON lines**: Should report which line failed and either continue or fail gracefully
+
+## Acceptance Criteria
+- [ ] Users can upload `.jsonl` files through the UI
+- [ ] JSONL files are processed line by line efficiently
+- [ ] All fields across all records are discovered (schema evolution support)
+- [ ] Nested objects are flattened using the `__` delimiter (e.g., `user__name`)
+- [ ] Nested arrays are flattened with index notation (e.g., `tags_0`, `tags_1`)
+- [ ] Delimiter constants are stored in `app/server/core/constants.py` and are easily updatable
+- [ ] One JSONL file creates one table, just like CSV and JSON
+- [ ] Upload response includes table name, schema, row count, and sample data
+- [ ] UI clearly indicates JSONL is a supported format
+- [ ] Comprehensive unit tests cover all scenarios, including edge cases
+- [ ] Test JSONL files exist in the `tests/assets/` directory
+- [ ] All existing tests continue to pass (zero regressions)
+- [ ] Security validation is applied to table/column names
+- [ ] Error messages are clear and helpful for invalid JSONL files
+- [ ] Documentation in README.md reflects JSONL support
+
+## Validation Commands
+Execute every command to validate the feature works correctly with zero regressions.
+
+- `cd app/server && uv run pytest tests/core/test_file_processor.py -v` - Run file processor tests specifically to validate the JSONL converter
+- `cd app/server && uv run pytest -v` - Run all server tests to validate zero regressions
+- `cd app/server && uv run pytest tests/test_sql_injection.py -v` - Verify security tests still pass
+- `cd app/server && uv run python -c "from core.constants import NESTED_FIELD_DELIMITER, ARRAY_INDEX_DELIMITER; print(f'Delimiters: {NESTED_FIELD_DELIMITER}, {ARRAY_INDEX_DELIMITER}')"` - Verify constants are accessible
+- `./scripts/start.sh` - Start both server and client, then manually test JSONL upload through the UI
+
+## Notes
+
+### Implementation Considerations
+- Use Python's standard library `json` module (no new dependencies)
+- Follow existing patterns from `convert_csv_to_sqlite()` and `convert_json_to_sqlite()`
+- Reuse existing security validation from `sql_security.py`
+- Use a pandas DataFrame for consistency with the existing converters
+
+### Performance Considerations
+- JSONL files are processed line by line, so memory usage is bounded
+- The two-pass approach (schema discovery + data loading) may seem inefficient, but it is necessary for schema evolution
+- For very large files, consider adding a progress indicator in future iterations
+
+### Future Enhancements (Out of Scope)
+- Configurable cap on array length during index expansion (currently unbounded)
+- Option to skip schema discovery and use only the first record's schema
+- Streaming upload for very large JSONL files
+- Custom delimiter configuration per upload via the UI
+- Automatic type inference and constraints for columns
+
+### Schema Evolution Example
+If a JSONL file contains:
+```jsonl
+{"id": 1, "name": "Alice"}
+{"id": 2, "name": "Bob", "age": 30}
+```
+The resulting table should have columns: `id`, `name`, `age` (with None for Alice's age).
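The first-pass discovery described above could be sketched as follows. This is a minimal illustration, not the final implementation: the `discover_jsonl_schema()` name comes from this plan, but the body here is an assumption, and it ignores nesting (the real version would flatten each record before collecting keys).

```python
import json

def discover_jsonl_schema(content: bytes) -> list[str]:
    """First pass: union the keys of every record, in order of first
    appearance, so fields that only appear on later lines are found."""
    columns: dict[str, None] = {}  # dict keys preserve insertion order
    for line_number, raw in enumerate(content.splitlines(), start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON on line {line_number}: {exc}") from exc
        for key in record:
            columns.setdefault(key, None)
    return list(columns)

content = b'{"id": 1, "name": "Alice"}\n{"id": 2, "name": "Bob", "age": 30}'
print(discover_jsonl_schema(content))  # → ['id', 'name', 'age']
```

Raising on the first bad line keeps the sketch simple; the "Malformed JSON lines" edge case above leaves open whether the converter should instead skip bad lines and continue.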
+
+### Nested Structure Example
+If a JSONL file contains:
+```jsonl
+{"user": {"name": "Alice", "address": {"city": "NYC"}}, "tags": ["a", "b"]}
+```
+The resulting columns should be: `user__name`, `user__address__city`, `tags_0`, `tags_1`.
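The flattening behind this example might look like the sketch below. The `flatten_json_record()` name and the delimiter constants are taken from this plan; the recursive body is illustrative (lists nested directly inside lists are left as-is for brevity).

```python
NESTED_FIELD_DELIMITER = "__"  # per the plan, defined in core/constants.py
ARRAY_INDEX_DELIMITER = "_"

def flatten_json_record(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested objects with `__` and arrays with `_<index>`."""
    flat: dict = {}
    for key, value in record.items():
        name = f"{prefix}{NESTED_FIELD_DELIMITER}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_json_record(value, name))  # nested object
        elif isinstance(value, list):
            for i, item in enumerate(value):  # one column per array element
                indexed = f"{name}{ARRAY_INDEX_DELIMITER}{i}"
                if isinstance(item, dict):
                    flat.update(flatten_json_record(item, indexed))  # object in array
                else:
                    flat[indexed] = item  # scalar in array
        else:
            flat[name] = value  # scalar (including None)
    return flat

record = {"user": {"name": "Alice", "address": {"city": "NYC"}}, "tags": ["a", "b"]}
print(flatten_json_record(record))
# → {'user__name': 'Alice', 'user__address__city': 'NYC', 'tags_0': 'a', 'tags_1': 'b'}
```

Keeping the delimiters as module-level constants means the `user__name` / `tags_0` naming scheme can be changed in one place, matching the acceptance criterion above.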