diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..e26f8e5 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,69 @@ +# Embedding & Vector Search System + +AI-powered embedding and vector search system that processes images, PDFs, and text to generate embeddings, stores them in vector databases, and provides semantic search via a Flask API and React frontend. Includes GPT-4 powered document analysis for resume/job description matching and M&A document processing. + +## Architecture + +``` + ┌──────────────────────┐ + │ React Frontend │ + │ (Mantine + Vite) │ + └──────┬───────┬────────┘ + │ │ + REST API │ │ GraphQL + (Axios) │ │ (Apollo Client) + │ │ + ┌──────────────▼─┐ ┌─▼──────────────┐ + │ Flask API │ │ Weaviate │ + │ (clip_app.py) │ │ Vector DB │ + │ Port 5000 │ │ Port 8080 │ + └──────┬─────────┘ └────────────────┘ + │ + ┌─────────▼─────────┐ + │ CLIP Model │ + │ (ViT-L/14) │ + └───────────────────┘ + + ┌─────────────────┐ ┌──────────────────┐ ┌──────────────────┐ + │ embedding.py │ │ extract_pdf.py │ │ process.py │ + │ CLIP/MediaPipe │ │ PyMuPDF │ │ GPT-4 Analysis │ + │ → SQLite │ │ Text + Images │ │ Resume/JD/M&A │ + └─────────────────┘ └──────────────────┘ └──────────────────┘ + + ┌─────────────────┐ ┌──────────────────┐ + │ clickhouse.py │ │ use_weaviate.py │ + │ Vector Storage │ │ Vector Storage │ + └─────────────────┘ └──────────────────┘ +``` + +## Data Flow + +1. **Ingestion** -- Images and PDFs are processed by `embedding.py` or `extract_pdf.py` +2. **Embedding** -- CLIP (ViT-L/14) or MediaPipe generates vector embeddings +3. **Storage** -- Embeddings are stored in SQLite, ClickHouse, or Weaviate +4. **Search** -- The React frontend sends text queries to the Flask API, which generates a CLIP text embedding and queries Weaviate for nearest neighbors +5. 
**Display** -- Results are shown as image cards with similarity distances, album views, and "find similar" functionality + +## Components + +| Component | Description | Doc | +|-----------|-------------|-----| +| Flask API | REST API for embedding text and serving images | [backend-api.md](backend-api.md) | +| Embedding Generation | CLIP and MediaPipe embedding CLI | [embedding.md](embedding.md) | +| PDF Extraction | Text and image extraction from PDFs | [pdf-extraction.md](pdf-extraction.md) | +| Document Processing | GPT-4 document analysis (resume/JD matching) | [document-processing.md](document-processing.md) | +| Pose Detection | MediaPipe pose landmark detection | [pose-detection.md](pose-detection.md) | +| Vector Databases | ClickHouse, Weaviate, and SQLite integrations | [vector-databases.md](vector-databases.md) | +| React Frontend | Search UI with image gallery | [frontend.md](frontend.md) | +| Playground | Experimental multi-pass GPT processing | [playground.md](playground.md) | +| Setup | Installation and configuration | [setup.md](setup.md) | + +## Tech Stack + +**Backend:** Python, Flask, PyTorch, CLIP (ViT-L/14), MediaPipe, PyMuPDF, OpenAI API, Pydantic, SQLAlchemy, pandas + +**Frontend:** React 18, TypeScript, Vite, Mantine UI, Apollo Client, Axios, weaviate-ts-client + +**Databases:** Weaviate (vector search + GraphQL), ClickHouse (vector storage), SQLite (local embeddings), PostgreSQL (playground) + +**Infrastructure:** Docker (multi-stage Nginx build for frontend), CUDA auto-detection for GPU acceleration diff --git a/docs/backend-api.md b/docs/backend-api.md new file mode 100644 index 0000000..a87ad27 --- /dev/null +++ b/docs/backend-api.md @@ -0,0 +1,98 @@ +# Backend API (`clip_app.py`) + +Flask REST API that provides CLIP text embedding and image serving endpoints. Runs on port 5000 with CORS enabled. + +## Startup + +Loads the CLIP ViT-L/14 model on startup. Automatically uses CUDA if available, otherwise falls back to CPU. 
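A minimal sketch of that device auto-detection (an assumed equivalent, not the actual `clip_app.py` source):

```python
# Guarded device selection: prefer CUDA when PyTorch reports it, else CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # torch not installed: CPU-only environment
    device = "cpu"

# clip_app.py can then load the model on the chosen device, e.g.:
# model, preprocess = clip.load("ViT-L/14", device=device)
print(device)
```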
+ +```bash +python backend/clip_app.py +``` + +The server runs on `0.0.0.0:5000` in debug mode. + +## Endpoints + +### `POST /embed` + +Generates a CLIP text embedding. + +**Request body:** +```json +{ + "text": "a sleepy ridgeback dog" +} +``` + +**Response:** +```json +{ + "embedding": [0.123, -0.456, ...] +} +``` + +The embedding is a 768-dimensional float array (CLIP ViT-L/14 output). + +**Errors:** +- `400` -- No `text` field provided in request body + +### `GET /image` + +Returns a full-size image file. + +**Query parameters:** +- `fileName` (required) -- Image filename. The date folder is extracted from the first 8 characters (YYYYMMDD format). + +**Response:** JPEG image file + +**Path resolution:** `{MOUNT}/{YYYYMMDD}/{fileName}` + +**Errors:** +- `400` -- No `fileName` provided +- `404` -- File not found at resolved path + +### `GET /thumbnail` + +Returns a resized thumbnail (max 600x600 pixels). + +**Query parameters:** +- `fileName` (required) -- Same format as `/image` + +**Response:** JPEG image (resized to fit within 600x600 while maintaining aspect ratio) + +Thumbnails are generated on-the-fly using PIL. Not cached. + +**Errors:** +- `400` -- No `fileName` provided +- `404` -- File not found + +### `GET /list_files_by_date` + +Lists all files in a date-based directory. + +**Query parameters:** +- `fileName` (required) -- Any string with at least 8 characters. The first 8 characters are used as the date folder name (YYYYMMDD). + +**Response:** +```json +{ + "files": ["20230101_001.jpg", "20230101_002.jpg"] +} +``` + +**Errors:** +- `400` -- Invalid or missing `fileName` (less than 8 characters) +- `404` -- Directory not found + +## Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `MOUNT` | Base path for image files | `""` (empty string) | + +## Integration + +The frontend (`EmbeddingSearch.tsx`) uses this API in two ways: +1. 
`POST /embed` to convert search text into a vector, then queries Weaviate directly with the vector +2. `GET /image`, `GET /thumbnail`, and `GET /list_files_by_date` to display results diff --git a/docs/document-processing.md b/docs/document-processing.md new file mode 100644 index 0000000..74fd1e8 --- /dev/null +++ b/docs/document-processing.md @@ -0,0 +1,74 @@ +# Document Processing (`process.py`) + +GPT-4 powered document analysis tool. Supports two modes: single-file resume/job description matching, and batch directory processing with concurrent API calls. + +## Usage + +```bash +# Single file: resume against a job description +python backend/process.py --txtfile resume.txt --prompt resume_prompt.txt --job_description posting.txt + +# Batch: process all .txt files in a directory +python backend/process.py --dir ../outfolder --prompt resume_prompt.txt --job_description posting.txt +``` + +## CLI Options + +| Flag | Default | Description | +|------|---------|-------------| +| `--txtfile` | -- | Path to a single text file (resume) | +| `--dir` | `../outfolder` | Directory of text files to process | +| `--prompt` | `resume_prompt.txt` | Prompt template file | +| `--job_description` | `posting.txt` | Job description text file | +| `--csv` | `results.csv` | Output CSV file path | + +Either `--txtfile` or `--dir` must be provided. + +## Pipeline + +### Single File Mode (`--txtfile`) + +1. Reads the resume text file, prompt template, and job description +2. Substitutes `[resume_text]` and `[job_description]` placeholders in the prompt +3. Sends the assembled prompt to GPT-4 (`gpt-4-1106-preview`) with JSON response format +4. Saves the response as `{txtfile}.json` + +### Batch Directory Mode (`--dir`) + +1. Scans the directory for all `.txt` files +2. Processes each file using `ThreadPoolExecutor` with 20 workers +3. Each file goes through the single-file pipeline concurrently + +### Multi-Page Mode (internal) + +The `process_file_and_prompt_multi_pages` function: +1. 
Reads a text file and splits it at a page-separator marker, yielding one chunk per page. Newlines within each chunk are preserved as-is.

### Token Injection Mode (`--inject_tokens`)

Text is processed with additional normalization and tagging:

1. **Whitespace normalization** -- Newlines are replaced with spaces, and runs of multiple spaces are collapsed
2. **Segment extraction** -- Text is split at Roman numeral markers `(I)`, `(II)`, etc. and alphabetical markers `(a)`, `(b)`, etc.
3. **Sentence normalization** -- Segments are adjusted to fall between 120 and 360 characters:
   - Short segments (< 120 chars) are merged with the next segment
   - Long segments (> 360 chars) are split at the last comma or period before the 360-char limit
4. **Tag injection** -- Each normalized segment is wrapped in an injected tag

### Constants

| Constant | Value | Description |
|----------|-------|-------------|
| `MIN_CHARS` | 120 | Minimum characters for a text segment |
| `MAX_CHARS` | 360 | Maximum characters for a text segment |

## Image Extraction

Images are extracted from each page using PyMuPDF's `get_images(full=True)`. Each image is saved as a JPEG file named `{page_number}-{image_index}.jpg`.

diff --git a/docs/playground.md b/docs/playground.md new file mode 100644 index 0000000..26452b0 --- /dev/null +++ b/docs/playground.md @@ -0,0 +1,107 @@

# Playground (Experimental Scripts)

The `playground/` directory contains experimental scripts for multi-pass GPT document processing with PostgreSQL and Weaviate storage. These scripts implement a more advanced pipeline than the main `backend/process.py`, with multi-step analysis, PostgreSQL persistence, and structured data models.

## `playground/process.py`

Multi-pass GPT-4 document processing for financial/legal document analysis (e.g., credit agreements, lending documents).
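The pipeline's concurrent fan-out can be sketched as follows; the GPT-4 call is stubbed out as `extract_page_data`, and all names here are illustrative rather than taken from the script:

```python
# Step 1 fan-out: process pages concurrently, collecting structured results.
from concurrent.futures import ThreadPoolExecutor

def extract_page_data(page_text: str) -> dict:
    # Stand-in for the real GPT-4 request returning structured data per page.
    return {"chars": len(page_text)}

pages = ["first page text", "second page text"]
with ThreadPoolExecutor(max_workers=20) as pool:
    step1_results = list(pool.map(extract_page_data, pages))
# A second pass would then aggregate or refine the per-page results.
print(step1_results)  # → [{'chars': 15}, {'chars': 16}]
```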
### Architecture

Two-step pipeline with concurrent processing:

**Step 1:** Process document pages through GPT-4 to extract structured data
- Reads text files split by page markers

## `playground/postgres.py`

PostgreSQL persistence utilities for the playground pipeline (schema: `experiments`).

**`experiments.in_out_files`**

| Column | Type | Description |
|--------|------|-------------|
| `id` | SERIAL PRIMARY KEY | File ID |
| `document_id` | INTEGER (FK) | Reference to documents table |
| `file_type` | VARCHAR(50) | `in` or `out` |
| `file_content` | TEXT | Prompt input or GPT response |

### Functions

- `init_schema()` -- Creates schema and tables
- `process_and_insert_txt_to_db(filename)` -- Reads a tagged text file, extracts tags, stores them in `documents_tags`
- `insert_in_out_table(prompt_path, document_id)` -- Imports `.in.txt` and `.out.json` files
- `insert_metrics(id)` -- Imports `metrics.csv` into the database
- `insert_pass1_results(id)` / `insert_pass2_results(id)` -- Imports processing results

## `playground/df_psql.py`

Shared PostgreSQL utilities. Provides:
- SQLAlchemy engine setup from the `POSTGRES_DB_CREDENTIALS` environment variable
- Helper functions for reading/writing DataFrames to PostgreSQL

diff --git a/docs/pose-detection.md b/docs/pose-detection.md new file mode 100644 index 0000000..1a57492 --- /dev/null +++ b/docs/pose-detection.md @@ -0,0 +1,68 @@

# Pose Detection (`detect_pose_mediapipe.py`)

MediaPipe-based pose landmark detection tool. Detects up to 10 poses per image, generates annotated images, and saves landmark data as JSON.

## Usage

```bash
# Single image
python backend/detect_pose_mediapipe.py --image path/to/image.jpg

# Batch processing with glob pattern
python backend/detect_pose_mediapipe.py --pattern "./*.jpg"
```

## CLI Options

| Flag | Description |
|------|-------------|
| `--image` | Path to a single image file |
| `--pattern` | Glob pattern for batch processing (e.g., `./*.jpg`) |

One of `--image` or `--pattern` is required.
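The required-but-mutually-exclusive flags can be expressed with `argparse`; this is a sketch of the behavior described above, not the script's actual parser:

```python
# Exactly one of --image / --pattern must be given.
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--image", help="path to a single image file")
group.add_argument("--pattern", help="glob pattern for batch processing")

args = parser.parse_args(["--pattern", "./*.jpg"])
print(args.pattern)  # → ./*.jpg
```

Passing both flags, or neither, makes `parse_args` exit with a usage error.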
+ +## Model Configuration + +- Model: `pose_landmarker_lite.task` (must be present in the working directory) +- Running mode: `IMAGE` (single image, not video/live stream) +- Segmentation masks: enabled +- Maximum poses per image: 10 + +## Output + +For each image where landmarks are detected: + +### Annotated Image + +Saved as `{original_name}.pose.jpg`. The original image is overlaid with pose landmark connections drawn using MediaPipe's default pose landmark style. + +### JSON Landmarks + +Saved as `{original_name}.json`. Contains all detected landmark coordinates: + +```json +{ + "landmarks": [ + { + "x": 0.543, + "y": 0.312, + "z": -0.05, + "visibility": 0.98 + } + ] +} +``` + +Each pose has 33 landmarks (MediaPipe Pose standard). Multiple poses are flattened into a single `landmarks` array. + +- `x`, `y` -- Normalized coordinates (0.0 to 1.0 relative to image dimensions) +- `z` -- Depth estimate relative to the hip midpoint +- `visibility` -- Confidence that the landmark is visible (0.0 to 1.0) + +## Batch Processing + +When using `--pattern`, files are: +1. Matched with `glob.glob` +2. Filtered to exclude files containing `.mask.` or `.pose.` in the filename (avoids reprocessing outputs) +3. Sorted alphabetically +4. 
Processed sequentially diff --git a/docs/setup.md b/docs/setup.md new file mode 100644 index 0000000..99f50d1 --- /dev/null +++ b/docs/setup.md @@ -0,0 +1,135 @@ +# Setup & Configuration + +## Python Dependencies + +```bash +pip install -r requirements.txt +``` + +Core dependencies from `requirements.txt`: +- `pandas==2.1.3` +- `sqlalchemy==2.0.27` +- `python-dotenv==1.0.0` +- `openai==1.9.0` +- `psycopg2-binary==2.9.9` +- `pydantic==2.5.3` +- `more-itertools==10.2.0` +- `PyMuPDF==1.23.26` + +Additional dependencies not listed in `requirements.txt` (install separately): +- `torch` + `clip` (for CLIP embeddings) +- `mediapipe` (for MediaPipe embeddings and pose detection) +- `flask` + `flask-cors` (for the API server) +- `clickhouse-driver` (for ClickHouse integration) +- `weaviate-client` (for Weaviate integration) +- `numpy`, `Pillow`, `opencv-python` (for image processing) +- `scikit-learn`, `plotly` (for visualization) +- `transformers` (for classify.py) + +## Frontend Dependencies + +```bash +cd frontend +npm install +``` + +See [frontend.md](frontend.md) for the full dependency list. + +## Environment Variables + +### Backend (`backend/.env`) + +Copy `backend/.env.example` to `backend/.env` and configure: + +| Variable | Example | Used By | +|----------|---------|---------| +| `CH_HOST` | `localhost` | `clickhouse.py` | +| `CH_PORT` | `9000` | `clickhouse.py` | +| `CH_USER` | `default` | `clickhouse.py` | +| `CH_PASSWORD` | `foobar` | `clickhouse.py` | +| `CH_DATABASE` | `db` | `clickhouse.py` | +| `MOUNT` | `/mnt/data` | `clip_app.py` | +| `OPENAI_API_KEY` | `sk-...` | `process.py`, playground scripts | + +### Frontend (`frontend/.env`) + +Copy `frontend/.env.example` to `frontend/.env`: + +| Variable | Example | Description | +|----------|---------|-------------| +| `BACKEND` | `0.0.0.0` | Backend host address | + +Note: The frontend currently hardcodes `"localhost"` in `src/env.ts` rather than reading from the environment variable. 
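On the backend side, these settings are read from the environment (`python-dotenv` is listed in `requirements.txt`); a minimal sketch using the documented defaults:

```python
# Load backend/.env if python-dotenv is available, then read settings with
# the defaults documented above (MOUNT falls back to an empty string).
import os

try:
    from dotenv import load_dotenv
    load_dotenv("backend/.env")  # no-op when the file is absent
except ImportError:
    pass  # rely on the process environment instead

MOUNT = os.environ.get("MOUNT", "")
CH_HOST = os.environ.get("CH_HOST", "localhost")
```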
+ +### Playground + +The playground scripts require additional environment variables: + +| Variable | Format | Used By | +|----------|--------|---------| +| `POSTGRES_DB_CREDENTIALS` | PostgreSQL connection string | `playground/process.py`, `playground/df_psql.py` | +| `MODEL_NAME` | e.g., `gpt-4-turbo-preview` | `playground/process.py` | +| `WEAVIATE_CREDENTIALS` | JSON: `{"URL":"...","API_KEY":"...","HTTP_SCHEME":"http"}` | `playground/postgres.py` | +| `POSTGRES_CREDENTIALS` | JSON: `{"NAME":"...","USER":"...","PASSWORD":"...","HOST":"...","PORT":"5432"}` | `playground/postgres.py` | + +## Database Setup + +### Weaviate + +Weaviate must be running and accessible. Default expected at `localhost:8080`. + +The `FileEmbedding` class is created automatically by the Weaviate client when storing embeddings. No manual schema setup is required. + +### ClickHouse + +Create the vector storage table: + +```sql +CREATE TABLE vector_storage ( + embedding_vector Array(UInt8), + md5_hash FixedString(32), + file_name String +) ENGINE = MergeTree() +ORDER BY md5_hash; +``` + +### PostgreSQL (Playground) + +Initialize the schema using the playground utility: + +```python +from playground.postgres import init_schema +init_schema() # Creates experiments schema and tables +``` + +This creates: +- `experiments.documents` -- Document metadata +- `experiments.documents_tags` -- Tagged text segments +- `experiments.in_out_files` -- Prompt inputs/outputs + +The `playground/process.py` pipeline also creates additional tables: +- `experiments.prompts` -- Stored prompt templates +- `experiments.promptios` -- Prompt I/O logs with metrics +- `experiments.pass1_results` -- Step 1 processing results +- `experiments.pass2_results` -- Step 2 processing results + +### SQLite + +Created automatically by `embedding.py`. Default file: `embedding.sqlite`. No manual setup needed. + +## Docker Deployment (Frontend) + +```bash +cd frontend +docker build -t embedding-frontend . 
docker run -p 8080:80 embedding-frontend
```

This builds the React app and serves it via Nginx on port 80 (mapped to host port 8080). Note that host port 8080 is also the default Weaviate port; when both run on the same machine, map the frontend to a different host port (e.g., `-p 3000:80`).

## Running the System

1. Start Weaviate (port 8080)
2. Start the Flask API: `python backend/clip_app.py` (port 5000)
3. Start the frontend: `cd frontend && npm run dev`
4. Open the frontend in a browser and enter text queries to search for similar images

diff --git a/docs/vector-databases.md b/docs/vector-databases.md new file mode 100644 index 0000000..cf12344 --- /dev/null +++ b/docs/vector-databases.md @@ -0,0 +1,90 @@

# Vector Database Integrations

The system supports three vector storage backends: ClickHouse, Weaviate, and SQLite.

## ClickHouse (`clickhouse.py`)

Stores embedding vectors with file metadata using the `clickhouse-driver` Python client.

### Schema

```sql
CREATE TABLE vector_storage (
    embedding_vector Array(UInt8),
    md5_hash FixedString(32),
    file_name String
) ENGINE = MergeTree()
ORDER BY md5_hash;
```

### Operations

- `insert_into_db(client, embedding_vector, file_name)` -- Inserts an embedding vector, auto-computes MD5 hash from the file
- `get_md5_hash(file_name)` -- Generates MD5 hash for a file

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `CH_HOST` | `localhost` | ClickHouse server host |
| `CH_PORT` | `9000` | ClickHouse native protocol port |
| `CH_USER` | `default` | Database user |
| `CH_PASSWORD` | `""` | Database password |
| `CH_DATABASE` | `default` | Database name |

### Verification

```bash
clickhouse-client --user <user> --password <password> --database <database>
# then, at the clickhouse-client prompt:
SELECT * FROM vector_storage;
```

## Weaviate (`use_weaviate.py`)

Full-featured vector database integration using the `WeaviateEmbeddingStore` class.

### Class: `WeaviateEmbeddingStore`

Manages a `FileEmbedding` class in Weaviate.
+ +**Schema properties:** +- `fileName` (string) -- Name of the source file +- `md5Hash` (string) -- MD5 hash of the file +- `fileSize` (int) -- File size in bytes +- `modelName` (string) -- Name of the embedding model used + +**Methods:** + +| Method | Description | +|--------|-------------| +| `store_embedding(file_name, md5_hash, file_size, model_name, embedding)` | Store an embedding with metadata | +| `get_nearest_neighbors(embedding, max_results=5)` | Find nearest neighbors by vector similarity | +| `check_md5_exists(md5_hash, model_name)` | Check if an MD5/model combination exists | +| `check_multiple_md5_exists(md5_hashes, model_name)` | Batch existence check | +| `delete_embedding(md5_hash)` | Delete embeddings by MD5 hash | +| `delete_all_embeddings()` | Delete the entire `FileEmbedding` class | + +### Querying + +Weaviate supports both the Python client API and raw GraphQL: + +```graphql +{ + Get { + FileEmbedding { + fileName + md5Hash + fileSize + modelName + } + } +} +``` + +The frontend queries Weaviate directly via GraphQL through the Apollo Client and the `weaviate-ts-client`. + +## SQLite (via `embedding.py`) + +Local storage for embedding results. See [embedding.md](embedding.md) for the schema and usage details. + +The SQLite database (`embedding.sqlite` by default) stores embeddings as text-serialized float arrays alongside file metadata (filename, size, MD5 hash, computation time, method).
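The text-serialized storage can be illustrated with a small self-contained example; JSON is used here as an assumed serialization format (see [embedding.md](embedding.md) for the actual schema):

```python
# Round-trip a float vector through a TEXT column in SQLite.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (file_name TEXT, md5_hash TEXT, embedding TEXT)"
)
vector = [0.123, -0.456, 0.789]
conn.execute(
    "INSERT INTO embeddings VALUES (?, ?, ?)",
    ("20230101_001.jpg", "0" * 32, json.dumps(vector)),
)
(serialized,) = conn.execute("SELECT embedding FROM embeddings").fetchone()
print(json.loads(serialized))  # → [0.123, -0.456, 0.789]
```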