diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..e26f8e5
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,69 @@
+# Embedding & Vector Search System
+
+AI-powered embedding and vector search system that processes images, PDFs, and text to generate embeddings, stores them in vector databases, and provides semantic search via a Flask API and React frontend. Includes GPT-4 powered document analysis for resume/job description matching and M&A document processing.
+
+## Architecture
+
+```
+ ┌──────────────────────┐
+ │ React Frontend │
+ │ (Mantine + Vite) │
+ └──────┬───────┬────────┘
+ │ │
+ REST API │ │ GraphQL
+ (Axios) │ │ (Apollo Client)
+ │ │
+ ┌──────────────▼─┐ ┌─▼──────────────┐
+ │ Flask API │ │ Weaviate │
+ │ (clip_app.py) │ │ Vector DB │
+ │ Port 5000 │ │ Port 8080 │
+ └──────┬─────────┘ └────────────────┘
+ │
+ ┌─────────▼─────────┐
+ │ CLIP Model │
+ │ (ViT-L/14) │
+ └───────────────────┘
+
+ ┌─────────────────┐ ┌──────────────────┐ ┌──────────────────┐
+ │ embedding.py │ │ extract_pdf.py │ │ process.py │
+ │ CLIP/MediaPipe │ │ PyMuPDF │ │ GPT-4 Analysis │
+ │ → SQLite │ │ Text + Images │ │ Resume/JD/M&A │
+ └─────────────────┘ └──────────────────┘ └──────────────────┘
+
+ ┌─────────────────┐ ┌──────────────────┐
+ │ clickhouse.py │ │ use_weaviate.py │
+ │ Vector Storage │ │ Vector Storage │
+ └─────────────────┘ └──────────────────┘
+```
+
+## Data Flow
+
+1. **Ingestion** -- Images and PDFs are processed by `embedding.py` or `extract_pdf.py`
+2. **Embedding** -- CLIP (ViT-L/14) or MediaPipe generates vector embeddings
+3. **Storage** -- Embeddings are stored in SQLite, ClickHouse, or Weaviate
+4. **Search** -- The React frontend sends text queries to the Flask API, which generates a CLIP text embedding and queries Weaviate for nearest neighbors
+5. **Display** -- Results are shown as image cards with similarity distances, album views, and "find similar" functionality
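
Steps 2–4 can be sketched as a query builder: given the embedding returned by `POST /embed`, construct the GraphQL `nearVector` query sent to Weaviate. The class and field names follow the `FileEmbedding` schema described in [vector-databases.md](vector-databases.md); the exact query shape the frontend sends may differ.

```python
import json

def build_near_vector_query(embedding, limit=5, class_name="FileEmbedding"):
    # Serialize the vector as a JSON array literal for the GraphQL body.
    vector = json.dumps(embedding)
    return (
        "{ Get { %s(nearVector: {vector: %s}, limit: %d) "
        "{ fileName _additional { distance } } } }"
        % (class_name, vector, limit)
    )
```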
+
+## Components
+
+| Component | Description | Doc |
+|-----------|-------------|-----|
+| Flask API | REST API for embedding text and serving images | [backend-api.md](backend-api.md) |
+| Embedding Generation | CLIP and MediaPipe embedding CLI | [embedding.md](embedding.md) |
+| PDF Extraction | Text and image extraction from PDFs | [pdf-extraction.md](pdf-extraction.md) |
+| Document Processing | GPT-4 document analysis (resume/JD matching) | [document-processing.md](document-processing.md) |
+| Pose Detection | MediaPipe pose landmark detection | [pose-detection.md](pose-detection.md) |
+| Vector Databases | ClickHouse, Weaviate, and SQLite integrations | [vector-databases.md](vector-databases.md) |
+| React Frontend | Search UI with image gallery | [frontend.md](frontend.md) |
+| Playground | Experimental multi-pass GPT processing | [playground.md](playground.md) |
+| Setup | Installation and configuration | [setup.md](setup.md) |
+
+## Tech Stack
+
+**Backend:** Python, Flask, PyTorch, CLIP (ViT-L/14), MediaPipe, PyMuPDF, OpenAI API, Pydantic, SQLAlchemy, pandas
+
+**Frontend:** React 18, TypeScript, Vite, Mantine UI, Apollo Client, Axios, weaviate-ts-client
+
+**Databases:** Weaviate (vector search + GraphQL), ClickHouse (vector storage), SQLite (local embeddings), PostgreSQL (playground)
+
+**Infrastructure:** Docker (multi-stage Nginx build for frontend), CUDA auto-detection for GPU acceleration
diff --git a/docs/backend-api.md b/docs/backend-api.md
new file mode 100644
index 0000000..a87ad27
--- /dev/null
+++ b/docs/backend-api.md
@@ -0,0 +1,98 @@
+# Backend API (`clip_app.py`)
+
+Flask REST API that provides CLIP text embedding and image serving endpoints. Runs on port 5000 with CORS enabled.
+
+## Startup
+
+Loads the CLIP ViT-L/14 model on startup. Automatically uses CUDA if available, otherwise falls back to CPU.
+
+```bash
+python backend/clip_app.py
+```
+
+The server runs on `0.0.0.0:5000` in debug mode.
+
+## Endpoints
+
+### `POST /embed`
+
+Generates a CLIP text embedding.
+
+**Request body:**
+```json
+{
+ "text": "a sleepy ridgeback dog"
+}
+```
+
+**Response:**
+```json
+{
+ "embedding": [0.123, -0.456, ...]
+}
+```
+
+The embedding is a 768-dimensional float array (CLIP ViT-L/14 output).
+
+**Errors:**
+- `400` -- No `text` field provided in request body
+
+### `GET /image`
+
+Returns a full-size image file.
+
+**Query parameters:**
+- `fileName` (required) -- Image filename. The date folder is extracted from the first 8 characters (YYYYMMDD format).
+
+**Response:** JPEG image file
+
+**Path resolution:** `{MOUNT}/{YYYYMMDD}/{fileName}`
+
+**Errors:**
+- `400` -- No `fileName` provided
+- `404` -- File not found at resolved path
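
The path-resolution rule can be illustrated with a small helper. This is a sketch; the actual logic lives in `clip_app.py` and may differ in details such as error handling.

```python
import os

MOUNT = os.environ.get("MOUNT", "")  # base path, "" by default

def resolve_image_path(file_name):
    # The date folder is the first 8 characters of the filename (YYYYMMDD).
    if len(file_name) < 8:
        raise ValueError("fileName must start with a YYYYMMDD date")
    date_folder = file_name[:8]
    return os.path.join(MOUNT, date_folder, file_name)
```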
+
+### `GET /thumbnail`
+
+Returns a resized thumbnail (max 600x600 pixels).
+
+**Query parameters:**
+- `fileName` (required) -- Same format as `/image`
+
+**Response:** JPEG image (resized to fit within 600x600 while maintaining aspect ratio)
+
+Thumbnails are generated on the fly with PIL and are not cached.
+
+**Errors:**
+- `400` -- No `fileName` provided
+- `404` -- File not found
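
The resize rule can be expressed as a pure function that approximates what `Image.thumbnail((600, 600))` targets; PIL's own rounding may differ by a pixel.

```python
def thumbnail_size(width, height, max_side=600):
    # Scale factor that fits the image inside max_side x max_side;
    # never upscale (factor capped at 1.0), aspect ratio preserved.
    scale = min(max_side / width, max_side / height, 1.0)
    return round(width * scale), round(height * scale)
```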
+
+### `GET /list_files_by_date`
+
+Lists all files in a date-based directory.
+
+**Query parameters:**
+- `fileName` (required) -- Any string with at least 8 characters. The first 8 characters are used as the date folder name (YYYYMMDD).
+
+**Response:**
+```json
+{
+ "files": ["20230101_001.jpg", "20230101_002.jpg"]
+}
+```
+
+**Errors:**
+- `400` -- Invalid or missing `fileName` (less than 8 characters)
+- `404` -- Directory not found
+
+## Environment Variables
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `MOUNT` | Base path for image files | `""` (empty string) |
+
+## Integration
+
+The frontend (`EmbeddingSearch.tsx`) uses this API in two ways:
+1. `POST /embed` to convert search text into a vector, then queries Weaviate directly with the vector
+2. `GET /image`, `GET /thumbnail`, and `GET /list_files_by_date` to display results
diff --git a/docs/document-processing.md b/docs/document-processing.md
new file mode 100644
index 0000000..74fd1e8
--- /dev/null
+++ b/docs/document-processing.md
@@ -0,0 +1,74 @@
+# Document Processing (`process.py`)
+
+GPT-4 powered document analysis tool. Supports single-file resume/job description matching and batch directory processing with concurrent API calls, plus multi-page and token-injection processing modes.
+
+## Usage
+
+```bash
+# Single file: resume against a job description
+python backend/process.py --txtfile resume.txt --prompt resume_prompt.txt --job_description posting.txt
+
+# Batch: process all .txt files in a directory
+python backend/process.py --dir ../outfolder --prompt resume_prompt.txt --job_description posting.txt
+```
+
+## CLI Options
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--txtfile` | -- | Path to a single text file (resume) |
+| `--dir` | `../outfolder` | Directory of text files to process |
+| `--prompt` | `resume_prompt.txt` | Prompt template file |
+| `--job_description` | `posting.txt` | Job description text file |
+| `--csv` | `results.csv` | Output CSV file path |
+
+Either `--txtfile` or `--dir` must be provided.
+
+## Pipeline
+
+### Single File Mode (`--txtfile`)
+
+1. Reads the resume text file, prompt template, and job description
+2. Substitutes `[resume_text]` and `[job_description]` placeholders in the prompt
+3. Sends the assembled prompt to GPT-4 (`gpt-4-1106-preview`) with JSON response format
+4. Saves the response as `{txtfile}.json`
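
Steps 2–3 amount to simple placeholder substitution before the API call. A sketch (the function name here is illustrative, not the one in `process.py`):

```python
def assemble_prompt(template, resume_text, job_description):
    # Substitute the two placeholders the prompt template uses.
    return (template
            .replace("[resume_text]", resume_text)
            .replace("[job_description]", job_description))
```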
+
+### Batch Directory Mode (`--dir`)
+
+1. Scans the directory for all `.txt` files
+2. Processes each file using `ThreadPoolExecutor` with 20 workers
+3. Each file goes through the single-file pipeline concurrently
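
The batch mode above can be sketched as follows, with a stand-in `process_one` in place of the single-file GPT-4 pipeline:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_directory(dir_path, process_one, max_workers=20):
    # Mirror the 20-worker batch mode: apply process_one to every
    # .txt file in the directory concurrently.
    files = sorted(Path(dir_path).glob("*.txt"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, files))
```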
+
+### Multi-Page Mode (internal)
+
+The `process_file_and_prompt_multi_pages` function:
+1. Reads a text file and splits it into pages at a `` marker. Newlines are preserved as-is.
+
+### Token Injection Mode (`--inject_tokens`)
+
+Text is processed with additional normalization and tagging:
+
+1. **Whitespace normalization** -- Newlines replaced with spaces, multiple spaces collapsed
+2. **Segment extraction** -- Text is split at Roman numeral markers `(I)`, `(II)`, etc. and alphabetical markers `(a)`, `(b)`, etc.
+3. **Sentence normalization** -- Segments are adjusted to fall between 120 and 360 characters:
+ - Short segments (< 120 chars) are merged with the next segment
+ - Long segments (> 360 chars) are split at the last comma or period before the 360-char limit
+4. **Tag injection** -- Each normalized segment is tagged as `{text}`
+
+### Constants
+
+| Constant | Value | Description |
+|----------|-------|-------------|
+| `MIN_CHARS` | 120 | Minimum characters for a text segment |
+| `MAX_CHARS` | 360 | Maximum characters for a text segment |
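
A sketch of the merge/split normalization in step 3, using the two constants above (separator handling and edge cases in the real implementation may differ):

```python
MIN_CHARS, MAX_CHARS = 120, 360

def normalize_segments(segments):
    out, buf = [], ""
    for seg in segments:
        # Merge: accumulate until the buffer reaches MIN_CHARS.
        buf = (buf + " " + seg).strip() if buf else seg
        if len(buf) < MIN_CHARS:
            continue
        # Split: cut at the last comma or period before MAX_CHARS.
        while len(buf) > MAX_CHARS:
            cut = max(buf.rfind(",", 0, MAX_CHARS), buf.rfind(".", 0, MAX_CHARS))
            if cut == -1:
                cut = MAX_CHARS - 1  # no punctuation: hard split
            out.append(buf[:cut + 1].strip())
            buf = buf[cut + 1:].strip()
        if len(buf) >= MIN_CHARS:
            out.append(buf)
            buf = ""
    if buf:
        out.append(buf)  # trailing remainder may be shorter than MIN_CHARS
    return out
```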
+
+## Image Extraction
+
+Images are extracted from each page using PyMuPDF's `get_images(full=True)`. Each image is saved as a JPEG file named `{page_number}-{image_index}.jpg`.
diff --git a/docs/playground.md b/docs/playground.md
new file mode 100644
index 0000000..26452b0
--- /dev/null
+++ b/docs/playground.md
@@ -0,0 +1,107 @@
+# Playground (Experimental Scripts)
+
+The `playground/` directory contains experimental scripts for multi-pass GPT document processing with PostgreSQL and Weaviate storage. These scripts implement a more advanced pipeline than the main `backend/process.py`, with multi-step analysis, PostgreSQL persistence, and structured data models.
+
+## `playground/process.py`
+
+Multi-pass GPT-4 document processing for financial/legal document analysis (e.g., credit agreements, lending documents).
+
+### Architecture
+
+Two-step pipeline with concurrent processing:
+
+**Step 1:** Process document pages through GPT-4 to extract structured data
+- Reads text files split by a `` marker
+
+### Database Schema
+
+**`experiments.in_out_files`**
+| Column | Type | Description |
+|--------|------|-------------|
+| `id` | SERIAL PRIMARY KEY | File ID |
+| `document_id` | INTEGER (FK) | Reference to documents table |
+| `file_type` | VARCHAR(50) | `in` or `out` |
+| `file_content` | TEXT | Prompt input or GPT response |
+
+### Functions
+
+- `init_schema()` -- Creates schema and tables
+- `process_and_insert_txt_to_db(filename)` -- Reads a tagged text file, extracts tags, stores in `documents_tags`
+- `insert_in_out_table(prompt_path, document_id)` -- Imports `.in.txt` and `.out.json` files
+- `insert_metrics(id)` -- Imports `metrics.csv` into the database
+- `insert_pass1_results(id)` / `insert_pass2_results(id)` -- Imports processing results
+
+## `playground/df_psql.py`
+
+Shared PostgreSQL utilities. Provides:
+- SQLAlchemy engine setup from `POSTGRES_DB_CREDENTIALS` environment variable
+- Helper functions for reading/writing DataFrames to PostgreSQL
diff --git a/docs/pose-detection.md b/docs/pose-detection.md
new file mode 100644
index 0000000..1a57492
--- /dev/null
+++ b/docs/pose-detection.md
@@ -0,0 +1,68 @@
+# Pose Detection (`detect_pose_mediapipe.py`)
+
+MediaPipe-based pose landmark detection tool. Detects up to 10 poses per image, generates annotated images, and saves landmark data as JSON.
+
+## Usage
+
+```bash
+# Single image
+python backend/detect_pose_mediapipe.py --image path/to/image.jpg
+
+# Batch processing with glob pattern
+python backend/detect_pose_mediapipe.py --pattern "./*.jpg"
+```
+
+## CLI Options
+
+| Flag | Description |
+|------|-------------|
+| `--image` | Path to a single image file |
+| `--pattern` | Glob pattern for batch processing (e.g., `./*.jpg`) |
+
+One of `--image` or `--pattern` is required.
+
+## Model Configuration
+
+- Model: `pose_landmarker_lite.task` (must be present in the working directory)
+- Running mode: `IMAGE` (single image, not video/live stream)
+- Segmentation masks: enabled
+- Maximum poses per image: 10
+
+## Output
+
+For each image where landmarks are detected:
+
+### Annotated Image
+
+Saved as `{original_name}.pose.jpg`. The original image is overlaid with pose landmark connections drawn using MediaPipe's default pose landmark style.
+
+### JSON Landmarks
+
+Saved as `{original_name}.json`. Contains all detected landmark coordinates:
+
+```json
+{
+ "landmarks": [
+ {
+ "x": 0.543,
+ "y": 0.312,
+ "z": -0.05,
+ "visibility": 0.98
+ }
+ ]
+}
+```
+
+Each pose has 33 landmarks (MediaPipe Pose standard). Multiple poses are flattened into a single `landmarks` array.
+
+- `x`, `y` -- Normalized coordinates (0.0 to 1.0 relative to image dimensions)
+- `z` -- Depth estimate relative to the hip midpoint
+- `visibility` -- Confidence that the landmark is visible (0.0 to 1.0)
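
Because `x` and `y` are normalized, drawing or cropping requires scaling them back to pixel space:

```python
def to_pixel_coords(landmark, width, height):
    # Normalized x/y (0.0-1.0) scaled by the image dimensions.
    return round(landmark["x"] * width), round(landmark["y"] * height)
```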
+
+## Batch Processing
+
+When using `--pattern`, files are:
+1. Matched with `glob.glob`
+2. Filtered to exclude files containing `.mask.` or `.pose.` in the filename (avoids reprocessing outputs)
+3. Sorted alphabetically
+4. Processed sequentially
diff --git a/docs/setup.md b/docs/setup.md
new file mode 100644
index 0000000..99f50d1
--- /dev/null
+++ b/docs/setup.md
@@ -0,0 +1,135 @@
+# Setup & Configuration
+
+## Python Dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+Core dependencies from `requirements.txt`:
+- `pandas==2.1.3`
+- `sqlalchemy==2.0.27`
+- `python-dotenv==1.0.0`
+- `openai==1.9.0`
+- `psycopg2-binary==2.9.9`
+- `pydantic==2.5.3`
+- `more-itertools==10.2.0`
+- `PyMuPDF==1.23.26`
+
+Additional dependencies not listed in `requirements.txt` (install separately):
+- `torch` + `clip` (for CLIP embeddings)
+- `mediapipe` (for MediaPipe embeddings and pose detection)
+- `flask` + `flask-cors` (for the API server)
+- `clickhouse-driver` (for ClickHouse integration)
+- `weaviate-client` (for Weaviate integration)
+- `numpy`, `Pillow`, `opencv-python` (for image processing)
+- `scikit-learn`, `plotly` (for visualization)
+- `transformers` (for classify.py)
+
+## Frontend Dependencies
+
+```bash
+cd frontend
+npm install
+```
+
+See [frontend.md](frontend.md) for the full dependency list.
+
+## Environment Variables
+
+### Backend (`backend/.env`)
+
+Copy `backend/.env.example` to `backend/.env` and configure:
+
+| Variable | Example | Used By |
+|----------|---------|---------|
+| `CH_HOST` | `localhost` | `clickhouse.py` |
+| `CH_PORT` | `9000` | `clickhouse.py` |
+| `CH_USER` | `default` | `clickhouse.py` |
+| `CH_PASSWORD` | `foobar` | `clickhouse.py` |
+| `CH_DATABASE` | `db` | `clickhouse.py` |
+| `MOUNT` | `/mnt/data` | `clip_app.py` |
+| `OPENAI_API_KEY` | `sk-...` | `process.py`, playground scripts |
+
+### Frontend (`frontend/.env`)
+
+Copy `frontend/.env.example` to `frontend/.env`:
+
+| Variable | Example | Description |
+|----------|---------|-------------|
+| `BACKEND` | `0.0.0.0` | Backend host address |
+
+Note: The frontend currently hardcodes `"localhost"` in `src/env.ts` rather than reading from the environment variable.
+
+### Playground
+
+The playground scripts require additional environment variables:
+
+| Variable | Format | Used By |
+|----------|--------|---------|
+| `POSTGRES_DB_CREDENTIALS` | PostgreSQL connection string | `playground/process.py`, `playground/df_psql.py` |
+| `MODEL_NAME` | e.g., `gpt-4-turbo-preview` | `playground/process.py` |
+| `WEAVIATE_CREDENTIALS` | JSON: `{"URL":"...","API_KEY":"...","HTTP_SCHEME":"http"}` | `playground/postgres.py` |
+| `POSTGRES_CREDENTIALS` | JSON: `{"NAME":"...","USER":"...","PASSWORD":"...","HOST":"...","PORT":"5432"}` | `playground/postgres.py` |
+
+## Database Setup
+
+### Weaviate
+
+Weaviate must be running and accessible. Default expected at `localhost:8080`.
+
+The `FileEmbedding` class is created automatically by the Weaviate client when storing embeddings. No manual schema setup is required.
+
+### ClickHouse
+
+Create the vector storage table:
+
+```sql
+CREATE TABLE vector_storage (
+ embedding_vector Array(UInt8),
+ md5_hash FixedString(32),
+ file_name String
+) ENGINE = MergeTree()
+ORDER BY md5_hash;
+```
+
+### PostgreSQL (Playground)
+
+Initialize the schema using the playground utility:
+
+```python
+from playground.postgres import init_schema
+init_schema() # Creates experiments schema and tables
+```
+
+This creates:
+- `experiments.documents` -- Document metadata
+- `experiments.documents_tags` -- Tagged text segments
+- `experiments.in_out_files` -- Prompt inputs/outputs
+
+The `playground/process.py` pipeline also creates additional tables:
+- `experiments.prompts` -- Stored prompt templates
+- `experiments.promptios` -- Prompt I/O logs with metrics
+- `experiments.pass1_results` -- Step 1 processing results
+- `experiments.pass2_results` -- Step 2 processing results
+
+### SQLite
+
+Created automatically by `embedding.py`. Default file: `embedding.sqlite`. No manual setup needed.
+
+## Docker Deployment (Frontend)
+
+```bash
+cd frontend
+docker build -t embedding-frontend .
+docker run -p 8080:80 embedding-frontend
+```
+
+This builds the React app and serves it via Nginx on port 80 (mapped to host port 8080).
+
+## Running the System
+
+1. Start Weaviate (port 8080)
+2. Start the Flask API: `python backend/clip_app.py` (port 5000)
+3. Start the frontend: `cd frontend && npm run dev`
+4. Open the frontend in a browser and enter text queries to search for similar images
diff --git a/docs/vector-databases.md b/docs/vector-databases.md
new file mode 100644
index 0000000..cf12344
--- /dev/null
+++ b/docs/vector-databases.md
@@ -0,0 +1,90 @@
+# Vector Database Integrations
+
+The system supports three vector storage backends: ClickHouse, Weaviate, and SQLite.
+
+## ClickHouse (`clickhouse.py`)
+
+Stores embedding vectors with file metadata using the `clickhouse-driver` Python client.
+
+### Schema
+
+```sql
+CREATE TABLE vector_storage (
+ embedding_vector Array(UInt8),
+ md5_hash FixedString(32),
+ file_name String
+) ENGINE = MergeTree()
+ORDER BY md5_hash;
+```
+
+### Operations
+
+- `insert_into_db(client, embedding_vector, file_name)` -- Inserts an embedding vector, auto-computes MD5 hash from the file
+- `get_md5_hash(file_name)` -- Generates MD5 hash for a file
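
`get_md5_hash` presumably computes the file's MD5 hex digest; a sketch that matches the 32-character `FixedString(32)` column:

```python
import hashlib

def get_md5_hash(file_name):
    # Hex digest of the file contents -- read in chunks so large
    # image files never need to fit in memory.
    h = hashlib.md5()
    with open(file_name, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```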
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `CH_HOST` | `localhost` | ClickHouse server host |
+| `CH_PORT` | `9000` | ClickHouse native protocol port |
+| `CH_USER` | `default` | Database user |
+| `CH_PASSWORD` | `""` | Database password |
+| `CH_DATABASE` | `default` | Database name |
+
+### Verification
+
+```bash
+clickhouse-client --user <user> --password <password> --database <database>
+SELECT * FROM vector_storage;
+```
+
+## Weaviate (`use_weaviate.py`)
+
+Full-featured vector database integration using the `WeaviateEmbeddingStore` class.
+
+### Class: `WeaviateEmbeddingStore`
+
+Manages a `FileEmbedding` class in Weaviate.
+
+**Schema properties:**
+- `fileName` (string) -- Name of the source file
+- `md5Hash` (string) -- MD5 hash of the file
+- `fileSize` (int) -- File size in bytes
+- `modelName` (string) -- Name of the embedding model used
+
+**Methods:**
+
+| Method | Description |
+|--------|-------------|
+| `store_embedding(file_name, md5_hash, file_size, model_name, embedding)` | Store an embedding with metadata |
+| `get_nearest_neighbors(embedding, max_results=5)` | Find nearest neighbors by vector similarity |
+| `check_md5_exists(md5_hash, model_name)` | Check if an MD5/model combination exists |
+| `check_multiple_md5_exists(md5_hashes, model_name)` | Batch existence check |
+| `delete_embedding(md5_hash)` | Delete embeddings by MD5 hash |
+| `delete_all_embeddings()` | Delete the entire `FileEmbedding` class |
+
+### Querying
+
+Weaviate supports both the Python client API and raw GraphQL:
+
+```graphql
+{
+ Get {
+ FileEmbedding {
+ fileName
+ md5Hash
+ fileSize
+ modelName
+ }
+ }
+}
+```
+
+The frontend queries Weaviate directly via GraphQL through the Apollo Client and the `weaviate-ts-client`.
+
+## SQLite (via `embedding.py`)
+
+Local storage for embedding results. See [embedding.md](embedding.md) for the schema and usage details.
+
+The SQLite database (`embedding.sqlite` by default) stores embeddings as text-serialized float arrays alongside file metadata (filename, size, MD5 hash, computation time, method).
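
A minimal sketch of that storage pattern. Table and column names here are illustrative, and JSON stands in for whatever text serialization `embedding.py` actually uses; see [embedding.md](embedding.md) for the real schema.

```python
import json
import sqlite3

def store_embedding(db_path, file_name, embedding):
    # Store the vector as a text-serialized float array next to the filename.
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS embeddings (
        file_name TEXT PRIMARY KEY, embedding TEXT)""")
    con.execute("INSERT OR REPLACE INTO embeddings VALUES (?, ?)",
                (file_name, json.dumps(embedding)))
    con.commit()
    con.close()

def load_embedding(db_path, file_name):
    con = sqlite3.connect(db_path)
    row = con.execute("SELECT embedding FROM embeddings WHERE file_name = ?",
                      (file_name,)).fetchone()
    con.close()
    return json.loads(row[0]) if row else None
```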