69 changes: 69 additions & 0 deletions docs/README.md
# Embedding & Vector Search System

AI-powered embedding and vector search system that processes images, PDFs, and text to generate embeddings, stores them in vector databases, and provides semantic search via a Flask API and React frontend. Includes GPT-4 powered document analysis for resume/job description matching and M&A document processing.

## Architecture

```
┌──────────────────────┐
│ React Frontend │
│ (Mantine + Vite) │
└──────┬───────┬────────┘
│ │
REST API │ │ GraphQL
(Axios) │ │ (Apollo Client)
│ │
┌──────────────▼─┐ ┌─▼──────────────┐
│ Flask API │ │ Weaviate │
│ (clip_app.py) │ │ Vector DB │
│ Port 5000 │ │ Port 8080 │
└──────┬─────────┘ └────────────────┘
┌─────────▼─────────┐
│ CLIP Model │
│ (ViT-L/14) │
└───────────────────┘

┌─────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ embedding.py │ │ extract_pdf.py │ │ process.py │
│ CLIP/MediaPipe │ │ PyMuPDF │ │ GPT-4 Analysis │
│ → SQLite │ │ Text + Images │ │ Resume/JD/M&A │
└─────────────────┘ └──────────────────┘ └──────────────────┘

┌─────────────────┐ ┌──────────────────┐
│ clickhouse.py │ │ use_weaviate.py │
│ Vector Storage │ │ Vector Storage │
└─────────────────┘ └──────────────────┘
```

## Data Flow

1. **Ingestion** -- Images and PDFs are processed by `embedding.py` or `extract_pdf.py`
2. **Embedding** -- CLIP (ViT-L/14) or MediaPipe generates vector embeddings
3. **Storage** -- Embeddings are stored in SQLite, ClickHouse, or Weaviate
4. **Search** -- The React frontend sends text queries to the Flask API, which generates a CLIP text embedding and queries Weaviate for nearest neighbors
5. **Display** -- Results are shown as image cards with similarity distances, album views, and "find similar" functionality
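The nearest-neighbor step (4) can be sketched in isolation. This is a minimal illustration, not the actual Weaviate query path: the toy 4-dimensional vectors stand in for 768-dimensional CLIP embeddings, and the cosine-distance ranking mirrors the "similarity distances" surfaced in the UI.

```python
import numpy as np

def nearest_neighbors(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Rank stored vectors by cosine distance to the query (smaller = closer)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - v @ q  # cosine distance in [0, 2]
    order = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in order]

# Toy 4-dim "embeddings" standing in for 768-dim CLIP vectors
stored = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
hits = nearest_neighbors(np.array([1.0, 0.0, 0.0, 0.0]), stored, k=2)
```

The closest stored vector comes back first, paired with its distance.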

## Components

| Component | Description | Doc |
|-----------|-------------|-----|
| Flask API | REST API for embedding text and serving images | [backend-api.md](backend-api.md) |
| Embedding Generation | CLIP and MediaPipe embedding CLI | [embedding.md](embedding.md) |
| PDF Extraction | Text and image extraction from PDFs | [pdf-extraction.md](pdf-extraction.md) |
| Document Processing | GPT-4 document analysis (resume/JD matching) | [document-processing.md](document-processing.md) |
| Pose Detection | MediaPipe pose landmark detection | [pose-detection.md](pose-detection.md) |
| Vector Databases | ClickHouse, Weaviate, and SQLite integrations | [vector-databases.md](vector-databases.md) |
| React Frontend | Search UI with image gallery | [frontend.md](frontend.md) |
| Playground | Experimental multi-pass GPT processing | [playground.md](playground.md) |
| Setup | Installation and configuration | [setup.md](setup.md) |

## Tech Stack

**Backend:** Python, Flask, PyTorch, CLIP (ViT-L/14), MediaPipe, PyMuPDF, OpenAI API, Pydantic, SQLAlchemy, pandas

**Frontend:** React 18, TypeScript, Vite, Mantine UI, Apollo Client, Axios, weaviate-ts-client

**Databases:** Weaviate (vector search + GraphQL), ClickHouse (vector storage), SQLite (local embeddings), PostgreSQL (playground)

**Infrastructure:** Docker (multi-stage Nginx build for frontend), CUDA auto-detection for GPU acceleration
98 changes: 98 additions & 0 deletions docs/backend-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
# Backend API (`clip_app.py`)

Flask REST API that provides CLIP text embedding and image serving endpoints. Runs on port 5000 with CORS enabled.

## Startup

Loads the CLIP ViT-L/14 model on startup. Automatically uses CUDA if available, otherwise falls back to CPU.

```bash
python backend/clip_app.py
```

The server runs on `0.0.0.0:5000` in debug mode.

## Endpoints

### `POST /embed`

Generates a CLIP text embedding.

**Request body:**
```json
{
"text": "a sleepy ridgeback dog"
}
```

**Response:**
```json
{
"embedding": [0.123, -0.456, ...]
}
```

The embedding is a 768-dimensional float array (CLIP ViT-L/14 output).

**Errors:**
- `400` -- No `text` field provided in request body
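The endpoint's contract can be sketched as a minimal Flask handler. `embed_text` is a stub standing in for the real CLIP encoder, and the exact error payload is an assumption; only the status codes and the `embedding` key follow the description above.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def embed_text(text: str) -> list[float]:
    # Stub: the real server runs CLIP ViT-L/14 here (768-dim output)
    return [0.0] * 768

@app.route("/embed", methods=["POST"])
def embed():
    data = request.get_json(silent=True) or {}
    if "text" not in data:
        return jsonify({"error": "no text provided"}), 400  # illustrative message
    return jsonify({"embedding": embed_text(data["text"])})

# Exercise the contract with Flask's built-in test client
client = app.test_client()
ok = client.post("/embed", json={"text": "a sleepy ridgeback dog"})
bad = client.post("/embed", json={})
```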

### `GET /image`

Returns a full-size image file.

**Query parameters:**
- `fileName` (required) -- Image filename. The date folder is extracted from the first 8 characters (YYYYMMDD format).

**Response:** JPEG image file

**Path resolution:** `{MOUNT}/{YYYYMMDD}/{fileName}`

**Errors:**
- `400` -- No `fileName` provided
- `404` -- File not found at resolved path
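The path-resolution rule reduces to a one-liner; the `/mnt/photos` mount below is hypothetical.

```python
import os

def resolve_image_path(mount: str, file_name: str) -> str:
    """Build {MOUNT}/{YYYYMMDD}/{fileName} from the filename's leading date."""
    date_folder = file_name[:8]  # first 8 characters are the YYYYMMDD folder
    return os.path.join(mount, date_folder, file_name)

path = resolve_image_path("/mnt/photos", "20230101_001.jpg")
```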

### `GET /thumbnail`

Returns a resized thumbnail (max 600x600 pixels).

**Query parameters:**
- `fileName` (required) -- Same format as `/image`

**Response:** JPEG image (resized to fit within 600x600 while maintaining aspect ratio)

Thumbnails are generated on the fly with PIL and are not cached.

**Errors:**
- `400` -- No `fileName` provided
- `404` -- File not found
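The resizing behavior can be sketched with PIL's `thumbnail`, which shrinks in place while preserving aspect ratio; the synthetic 1200x900 image below stands in for a photo on disk.

```python
from io import BytesIO
from PIL import Image

def make_thumbnail(image: Image.Image, max_size: int = 600) -> bytes:
    """Shrink to fit within max_size x max_size and re-encode as JPEG."""
    image.thumbnail((max_size, max_size))  # in-place, keeps aspect ratio
    buf = BytesIO()
    image.save(buf, format="JPEG")
    return buf.getvalue()

# Synthetic image standing in for a file under {MOUNT}
thumb_bytes = make_thumbnail(Image.new("RGB", (1200, 900), "gray"))
thumb = Image.open(BytesIO(thumb_bytes))
```

A 1200x900 input comes out as 600x450: the longer edge is capped at 600 and the other edge scales with it.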

### `GET /list_files_by_date`

Lists all files in a date-based directory.

**Query parameters:**
- `fileName` (required) -- Any string with at least 8 characters. The first 8 characters are used as the date folder name (YYYYMMDD).

**Response:**
```json
{
"files": ["20230101_001.jpg", "20230101_002.jpg"]
}
```

**Errors:**
- `400` -- Invalid or missing `fileName` (fewer than 8 characters)
- `404` -- Directory not found

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `MOUNT` | Base path for image files | `""` (empty string) |

## Integration

The frontend (`EmbeddingSearch.tsx`) uses this API in two ways:
1. `POST /embed` to convert search text into a vector, then queries Weaviate directly with the vector
2. `GET /image`, `GET /thumbnail`, and `GET /list_files_by_date` to display results
74 changes: 74 additions & 0 deletions docs/document-processing.md
# Document Processing (`process.py`)

GPT-4 powered document analysis tool. Supports two modes: single-file resume/job description matching, and batch directory processing with concurrent API calls.

## Usage

```bash
# Single file: resume against a job description
python backend/process.py --txtfile resume.txt --prompt resume_prompt.txt --job_description posting.txt

# Batch: process all .txt files in a directory
python backend/process.py --dir ../outfolder --prompt resume_prompt.txt --job_description posting.txt
```

## CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--txtfile` | -- | Path to a single text file (resume) |
| `--dir` | `../outfolder` | Directory of text files to process |
| `--prompt` | `resume_prompt.txt` | Prompt template file |
| `--job_description` | `posting.txt` | Job description text file |
| `--csv` | `results.csv` | Output CSV file path |

Either `--txtfile` or `--dir` must be provided.

## Pipeline

### Single File Mode (`--txtfile`)

1. Reads the resume text file, prompt template, and job description
2. Substitutes `[resume_text]` and `[job_description]` placeholders in the prompt
3. Sends the assembled prompt to GPT-4 (`gpt-4-1106-preview`) with JSON response format
4. Saves the response as `{txtfile}.json`

### Batch Directory Mode (`--dir`)

1. Scans the directory for all `.txt` files
2. Processes each file using `ThreadPoolExecutor` with 20 workers
3. Each file goes through the single-file pipeline concurrently

### Multi-Page Mode (internal)

The `process_file_and_prompt_multi_pages` function:
1. Reads a text file and splits it by `<PAGE ` markers
2. Prepares a prompt for each page using a separate template (with `resume_jd.txt`)
3. Sends all pages concurrently via `ThreadPoolExecutor` with 8 workers
4. Collects responses into a DataFrame
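The page-splitting step (1) can be sketched as below; the exact marker syntax after `<PAGE ` is an assumption for illustration.

```python
def split_pages(text: str) -> list[str]:
    """Split extracted document text on '<PAGE ' markers, dropping empty chunks."""
    return [p for p in text.split("<PAGE ") if p.strip()]

pages = split_pages("<PAGE 1>first page text<PAGE 2>second page text")
```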

## Prompt Templates

Prompt files use simple placeholder substitution:
- `[resume_text]` -- Replaced with the resume/document text
- `[job_description]` -- Replaced with the job description text
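The substitution is plain string replacement; a minimal sketch with illustrative inputs:

```python
def assemble_prompt(template: str, resume_text: str, job_description: str) -> str:
    """Fill the template's two placeholders with the document contents."""
    return (template
            .replace("[resume_text]", resume_text)
            .replace("[job_description]", job_description))

prompt = assemble_prompt(
    "Compare [resume_text] against [job_description].",
    "10 years of Python",
    "Senior backend engineer",
)
```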

## Concurrent Processing

- Single file mode: sequential (one API call)
- Batch directory mode: 20 concurrent workers
- Multi-page mode: 8 concurrent workers

All API calls use `concurrent.futures.ThreadPoolExecutor`.
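The batch fan-out pattern looks roughly like this; `process_file` is a stand-in for the real per-file GPT-4 call.

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path: str) -> dict:
    # Stand-in for the real per-file GPT-4 request/response cycle
    return {"file": path, "status": "ok"}

paths = [f"resume_{i}.txt" for i in range(5)]

# Batch mode: fan the files out across a worker pool (20 workers in the tool)
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(process_file, paths))  # map preserves input order
```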

## API Configuration

- Model: `gpt-4-1106-preview`
- Response format: JSON object
- Errors are caught and returned as `{"error": "..."}` dictionaries

## Environment Variables

| Variable | Description |
|----------|-------------|
| `OPENAI_API_KEY` | OpenAI API key (required, loaded from `.env`) |
80 changes: 80 additions & 0 deletions docs/embedding.md
# Embedding Generation (`embedding.py`)

CLI tool to generate embeddings for images or text using CLIP (ViT-L/14) or MediaPipe, with SQLite storage and cosine similarity-based deduplication.

## Usage

```bash
# Text embedding (CLIP only)
python backend/embedding.py --text "a sleepy ridgeback dog" --method clip

# Image embedding with CLIP
python backend/embedding.py --image "./images/*.jpg" --method clip

# Image embedding with MediaPipe
python backend/embedding.py --image "path/to/image.jpg" --method mediapipe --mediapipe_model_path "mobilenet_v3_large.tflite"
```

## CLI Options

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `--text` | mutually exclusive with `--image` | -- | Text string to embed |
| `--image` | mutually exclusive with `--text` | -- | Image path or glob pattern |
| `--method` | yes | -- | `clip` or `mediapipe` |
| `--database` | no | `embedding.sqlite` | SQLite database file path |
| `--cosine_threshold` | no | `1.00` | Threshold for cosine similarity dedup |
| `--mediapipe_model_path` | no | `mobilenet_v3_large.tflite` | Path to MediaPipe TFLite model |

`--text` and `--image` are mutually exclusive (one is required).

## Embedding Methods

### CLIP (ViT-L/14)

- Produces 768-dimensional float vectors
- ~10ms per image with CUDA, ~3s without CUDA
- Supports both text and image embedding
- Auto-detects CUDA availability

### MediaPipe

- Uses the `ImageEmbedder` task with quantization enabled
- Requires a TFLite model file (e.g., `mobilenet_v3_large.tflite`)
- ~50ms per image
- Image embedding only (no text support)

## SQLite Storage Schema

Database table: `embeddings`

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER PRIMARY KEY AUTOINCREMENT | Auto-incrementing ID |
| `filename` | TEXT | Base filename of the image |
| `seconds` | REAL | Time taken to compute embedding |
| `size` | INTEGER | File size in bytes |
| `md5` | TEXT | MD5 hash of the file |
| `method` | TEXT | `clip` or `mediapipe` |
| `embedding` | TEXT | String representation of the embedding vector |

Text embeddings are not stored in the database.
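The schema above can be reproduced with a few lines of `sqlite3`; the row values here are illustrative, and the real tool writes to `embedding.sqlite` on disk rather than an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the tool defaults to embedding.sqlite
conn.execute("""
    CREATE TABLE IF NOT EXISTS embeddings (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        filename TEXT,
        seconds REAL,
        size INTEGER,
        md5 TEXT,
        method TEXT,
        embedding TEXT
    )
""")
# The vector is stored as its string representation
conn.execute(
    "INSERT INTO embeddings (filename, seconds, size, md5, method, embedding) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("20230101_001.jpg", 0.01, 123456, "d41d8cd98f00b204e9800998ecf8427e",
     "clip", "[0.123, -0.456]"),
)
row = conn.execute("SELECT filename, method, embedding FROM embeddings").fetchone()
```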

## Deduplication Logic

When processing images sequentially, the tool compares consecutive images:

1. **File size check** -- If the file size differs by more than 2% from the previous image, skip similarity check and insert directly
2. **Cosine similarity** -- If file sizes are similar, compute cosine similarity between the current and previous embedding vectors
3. **Threshold** -- If similarity exceeds `--cosine_threshold`, the image is moved to a `similar/` subdirectory instead of being stored

This approach detects near-duplicate images efficiently by first filtering on file size as a cheap pre-check.
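A minimal sketch of the dedup decision, assuming the 2% window is measured relative to the previous file's size:

```python
import numpy as np

def is_near_duplicate(prev_size: int, cur_size: int,
                      prev_vec: np.ndarray, cur_vec: np.ndarray,
                      threshold: float = 1.00) -> bool:
    """File-size pre-check, then cosine similarity against the previous image."""
    if abs(cur_size - prev_size) / prev_size > 0.02:
        return False  # sizes differ by >2%: insert directly, skip similarity
    sim = float(np.dot(prev_vec, cur_vec) /
                (np.linalg.norm(prev_vec) * np.linalg.norm(cur_vec)))
    return sim > threshold  # above threshold: move to similar/, don't store

a = np.array([1.0, 0.0, 0.0])
dup = is_near_duplicate(1000, 1005, a, a, threshold=0.99)       # same vector, similar size
distinct = is_near_duplicate(1000, 1500, a, a, threshold=0.99)  # size gap short-circuits
```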

## Image Path Resolution

The `--image` argument accepts:
- A single file path
- A glob pattern (e.g., `./images/*.jpg`)
- Paths with `~` (user home directory expansion)

Files are sorted alphabetically before processing.
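The resolution logic amounts to tilde expansion, globbing, and a sort; the throwaway directory below is only for demonstration.

```python
import glob
import os
import tempfile

def resolve_images(pattern: str) -> list[str]:
    """Expand ~ and glob patterns, then sort alphabetically before processing."""
    return sorted(glob.glob(os.path.expanduser(pattern)))

# Demo against a throwaway directory with a few dummy files
tmp = tempfile.mkdtemp()
for name in ("b.jpg", "a.jpg", "c.txt"):
    open(os.path.join(tmp, name), "w").close()
found = resolve_images(os.path.join(tmp, "*.jpg"))
```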