69 changes: 69 additions & 0 deletions docs/README.md
# Embedding & Vector Search System

AI-powered embedding and vector search system that processes images, PDFs, and text to generate embeddings, stores them in vector databases, and provides semantic search via a Flask API and React frontend. Includes GPT-4 powered document analysis for resume/job description matching and M&A document processing.

## Architecture

```
┌──────────────────────┐
│ React Frontend │
│ (Mantine + Vite) │
└──────┬───────┬────────┘
│ │
REST API │ │ GraphQL
(Axios) │ │ (Apollo Client)
│ │
┌──────────────▼─┐ ┌─▼──────────────┐
│ Flask API │ │ Weaviate │
│ (clip_app.py) │ │ Vector DB │
│ Port 5000 │ │ Port 8080 │
└──────┬─────────┘ └────────────────┘
┌─────────▼─────────┐
│ CLIP Model │
│ (ViT-L/14) │
└───────────────────┘

┌─────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ embedding.py │ │ extract_pdf.py │ │ process.py │
│ CLIP/MediaPipe │ │ PyMuPDF │ │ GPT-4 Analysis │
│ → SQLite │ │ Text + Images │ │ Resume/JD/M&A │
└─────────────────┘ └──────────────────┘ └──────────────────┘

┌─────────────────┐ ┌──────────────────┐
│ clickhouse.py │ │ use_weaviate.py │
│ Vector Storage │ │ Vector Storage │
└─────────────────┘ └──────────────────┘
```

## Data Flow

1. **Ingestion** -- Images and PDFs are processed by `embedding.py` or `extract_pdf.py`
2. **Embedding** -- CLIP (ViT-L/14) or MediaPipe generates vector embeddings
3. **Storage** -- Embeddings are stored in SQLite, ClickHouse, or Weaviate
4. **Search** -- The React frontend sends text queries to the Flask API, which generates a CLIP text embedding and queries Weaviate for nearest neighbors
5. **Display** -- Results are shown as image cards with similarity distances, album views, and "find similar" functionality
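The nearest-neighbor step (4) can be sketched in isolation. This is a minimal illustration, not the actual Weaviate query path: the toy 4-dimensional vectors stand in for 768-dimensional CLIP embeddings, and the cosine-distance ranking mirrors the "similarity distances" surfaced in the UI.

```python
import numpy as np

def nearest_neighbors(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> list[tuple[int, float]]:
    """Rank stored vectors by cosine distance to the query (smaller = closer)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - v @ q  # cosine distance in [0, 2]
    order = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in order]

# Toy 4-dim "embeddings" standing in for 768-dim CLIP vectors
stored = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
hits = nearest_neighbors(np.array([1.0, 0.0, 0.0, 0.0]), stored, k=2)
```

The closest stored vector comes back first, paired with its distance.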

## Components

| Component | Description | Doc |
|-----------|-------------|-----|
| Flask API | REST API for embedding text and serving images | [backend-api.md](backend-api.md) |
| Embedding Generation | CLIP and MediaPipe embedding CLI | [embedding.md](embedding.md) |
| PDF Extraction | Text and image extraction from PDFs | [pdf-extraction.md](pdf-extraction.md) |
| Document Processing | GPT-4 document analysis (resume/JD matching) | [document-processing.md](document-processing.md) |
| Pose Detection | MediaPipe pose landmark detection | [pose-detection.md](pose-detection.md) |
| Vector Databases | ClickHouse, Weaviate, and SQLite integrations | [vector-databases.md](vector-databases.md) |
| React Frontend | Search UI with image gallery | [frontend.md](frontend.md) |
| Playground | Experimental multi-pass GPT processing | [playground.md](playground.md) |
| Setup | Installation and configuration | [setup.md](setup.md) |

## Tech Stack

**Backend:** Python, Flask, PyTorch, CLIP (ViT-L/14), MediaPipe, PyMuPDF, OpenAI API, Pydantic, SQLAlchemy, pandas

**Frontend:** React 18, TypeScript, Vite, Mantine UI, Apollo Client, Axios, weaviate-ts-client

**Databases:** Weaviate (vector search + GraphQL), ClickHouse (vector storage), SQLite (local embeddings), PostgreSQL (playground)

**Infrastructure:** Docker (multi-stage Nginx build for frontend), CUDA auto-detection for GPU acceleration
98 changes: 98 additions & 0 deletions docs/backend-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
# Backend API (`clip_app.py`)

Flask REST API that provides CLIP text embedding and image serving endpoints. Runs on port 5000 with CORS enabled.

## Startup

Loads the CLIP ViT-L/14 model on startup. Automatically uses CUDA if available, otherwise falls back to CPU.

```bash
python backend/clip_app.py
```

The server runs on `0.0.0.0:5000` in debug mode.

## Endpoints

### `POST /embed`

Generates a CLIP text embedding.

**Request body:**
```json
{
"text": "a sleepy ridgeback dog"
}
```

**Response:**
```json
{
"embedding": [0.123, -0.456, ...]
}
```

The embedding is a 768-dimensional float array (CLIP ViT-L/14 output).

**Errors:**
- `400` -- No `text` field provided in request body
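The endpoint's contract can be sketched as a minimal Flask handler. `embed_text` is a stub standing in for the real CLIP encoder, and the exact error payload is an assumption; only the status codes and the `embedding` key follow the description above.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def embed_text(text: str) -> list[float]:
    # Stub: the real server runs CLIP ViT-L/14 here (768-dim output)
    return [0.0] * 768

@app.route("/embed", methods=["POST"])
def embed():
    data = request.get_json(silent=True) or {}
    if "text" not in data:
        return jsonify({"error": "no text provided"}), 400  # illustrative message
    return jsonify({"embedding": embed_text(data["text"])})

# Exercise the contract with Flask's built-in test client
client = app.test_client()
ok = client.post("/embed", json={"text": "a sleepy ridgeback dog"})
bad = client.post("/embed", json={})
```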

### `GET /image`

Returns a full-size image file.

**Query parameters:**
- `fileName` (required) -- Image filename. The date folder is extracted from the first 8 characters (YYYYMMDD format).

**Response:** JPEG image file

**Path resolution:** `{MOUNT}/{YYYYMMDD}/{fileName}`

**Errors:**
- `400` -- No `fileName` provided
- `404` -- File not found at resolved path
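The path-resolution rule reduces to a one-liner; the `/mnt/photos` mount below is hypothetical.

```python
import os

def resolve_image_path(mount: str, file_name: str) -> str:
    """Build {MOUNT}/{YYYYMMDD}/{fileName} from the filename's leading date."""
    date_folder = file_name[:8]  # first 8 characters are the YYYYMMDD folder
    return os.path.join(mount, date_folder, file_name)

path = resolve_image_path("/mnt/photos", "20230101_001.jpg")
```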

### `GET /thumbnail`

Returns a resized thumbnail (max 600x600 pixels).

**Query parameters:**
- `fileName` (required) -- Same format as `/image`

**Response:** JPEG image (resized to fit within 600x600 while maintaining aspect ratio)

Thumbnails are generated on the fly with PIL and are not cached.

**Errors:**
- `400` -- No `fileName` provided
- `404` -- File not found
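The resizing behavior can be sketched with PIL's `thumbnail`, which shrinks in place while preserving aspect ratio; the synthetic 1200x900 image below stands in for a photo on disk.

```python
from io import BytesIO
from PIL import Image

def make_thumbnail(image: Image.Image, max_size: int = 600) -> bytes:
    """Shrink to fit within max_size x max_size and re-encode as JPEG."""
    image.thumbnail((max_size, max_size))  # in-place, keeps aspect ratio
    buf = BytesIO()
    image.save(buf, format="JPEG")
    return buf.getvalue()

# Synthetic image standing in for a file under {MOUNT}
thumb_bytes = make_thumbnail(Image.new("RGB", (1200, 900), "gray"))
thumb = Image.open(BytesIO(thumb_bytes))
```

A 1200x900 input comes out as 600x450: the longer edge is capped at 600 and the other edge scales with it.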

### `GET /list_files_by_date`

Lists all files in a date-based directory.

**Query parameters:**
- `fileName` (required) -- Any string with at least 8 characters. The first 8 characters are used as the date folder name (YYYYMMDD).

**Response:**
```json
{
"files": ["20230101_001.jpg", "20230101_002.jpg"]
}
```

**Errors:**
- `400` -- Invalid or missing `fileName` (fewer than 8 characters)
- `404` -- Directory not found

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `MOUNT` | Base path for image files | `""` (empty string) |

## Integration

The frontend (`EmbeddingSearch.tsx`) uses this API in two ways:
1. `POST /embed` to convert search text into a vector, then queries Weaviate directly with the vector
2. `GET /image`, `GET /thumbnail`, and `GET /list_files_by_date` to display results
74 changes: 74 additions & 0 deletions docs/document-processing.md
# Document Processing (`process.py`)

GPT-4 powered document analysis tool. Supports two modes: single-file resume/job description matching, and batch directory processing with concurrent API calls.

## Usage

```bash
# Single file: resume against a job description
python backend/process.py --txtfile resume.txt --prompt resume_prompt.txt --job_description posting.txt

# Batch: process all .txt files in a directory
python backend/process.py --dir ../outfolder --prompt resume_prompt.txt --job_description posting.txt
```

## CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--txtfile` | -- | Path to a single text file (resume) |
| `--dir` | `../outfolder` | Directory of text files to process |
| `--prompt` | `resume_prompt.txt` | Prompt template file |
| `--job_description` | `posting.txt` | Job description text file |
| `--csv` | `results.csv` | Output CSV file path |

Either `--txtfile` or `--dir` must be provided.

## Pipeline

### Single File Mode (`--txtfile`)

1. Reads the resume text file, prompt template, and job description
2. Substitutes `[resume_text]` and `[job_description]` placeholders in the prompt
3. Sends the assembled prompt to GPT-4 (`gpt-4-1106-preview`) with JSON response format
4. Saves the response as `{txtfile}.json`

### Batch Directory Mode (`--dir`)

1. Scans the directory for all `.txt` files
2. Processes each file using `ThreadPoolExecutor` with 20 workers
3. Each file goes through the single-file pipeline concurrently

### Multi-Page Mode (internal)

The `process_file_and_prompt_multi_pages` function:
1. Reads a text file and splits it by `<PAGE ` markers
2. Prepares a prompt for each page using a separate template (with `resume_jd.txt`)
3. Sends all pages concurrently via `ThreadPoolExecutor` with 8 workers
4. Collects responses into a DataFrame
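The page-splitting step (1) can be sketched as below; the exact marker syntax after `<PAGE ` is an assumption for illustration.

```python
def split_pages(text: str) -> list[str]:
    """Split extracted document text on '<PAGE ' markers, dropping empty chunks."""
    return [p for p in text.split("<PAGE ") if p.strip()]

pages = split_pages("<PAGE 1>first page text<PAGE 2>second page text")
```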

## Prompt Templates

Prompt files use simple placeholder substitution:
- `[resume_text]` -- Replaced with the resume/document text
- `[job_description]` -- Replaced with the job description text
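The substitution is plain string replacement; a minimal sketch with illustrative inputs:

```python
def assemble_prompt(template: str, resume_text: str, job_description: str) -> str:
    """Fill the template's two placeholders with the document contents."""
    return (template
            .replace("[resume_text]", resume_text)
            .replace("[job_description]", job_description))

prompt = assemble_prompt(
    "Compare [resume_text] against [job_description].",
    "10 years of Python",
    "Senior backend engineer",
)
```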

## Concurrent Processing

- Single file mode: sequential (one API call)
- Batch directory mode: 20 concurrent workers
- Multi-page mode: 8 concurrent workers

All API calls use `concurrent.futures.ThreadPoolExecutor`.
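The batch fan-out pattern looks roughly like this; `process_file` is a stand-in for the real per-file GPT-4 call.

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path: str) -> dict:
    # Stand-in for the real per-file GPT-4 request/response cycle
    return {"file": path, "status": "ok"}

paths = [f"resume_{i}.txt" for i in range(5)]

# Batch mode: fan the files out across a worker pool (20 workers in the tool)
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(process_file, paths))  # map preserves input order
```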

## API Configuration

- Model: `gpt-4-1106-preview`
- Response format: JSON object
- Errors are caught and returned as `{"error": "..."}` dictionaries

## Environment Variables

| Variable | Description |
|----------|-------------|
| `OPENAI_API_KEY` | OpenAI API key (required, loaded from `.env`) |
80 changes: 80 additions & 0 deletions docs/embedding.md
# Embedding Generation (`embedding.py`)

CLI tool to generate embeddings for images or text using CLIP (ViT-L/14) or MediaPipe, with SQLite storage and cosine similarity-based deduplication.

## Usage

```bash
# Text embedding (CLIP only)
python backend/embedding.py --text "a sleepy ridgeback dog" --method clip

# Image embedding with CLIP
python backend/embedding.py --image "./images/*.jpg" --method clip

# Image embedding with MediaPipe
python backend/embedding.py --image "path/to/image.jpg" --method mediapipe --mediapipe_model_path "mobilenet_v3_large.tflite"
```

## CLI Options

| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `--text` | mutually exclusive with `--image` | -- | Text string to embed |
| `--image` | mutually exclusive with `--text` | -- | Image path or glob pattern |
| `--method` | yes | -- | `clip` or `mediapipe` |
| `--database` | no | `embedding.sqlite` | SQLite database file path |
| `--cosine_threshold` | no | `1.00` | Threshold for cosine similarity dedup |
| `--mediapipe_model_path` | no | `mobilenet_v3_large.tflite` | Path to MediaPipe TFLite model |

`--text` and `--image` are mutually exclusive (one is required).

## Embedding Methods

### CLIP (ViT-L/14)

- Produces 768-dimensional float vectors
- ~10ms per image with CUDA, ~3s without CUDA
- Supports both text and image embedding
- Auto-detects CUDA availability

### MediaPipe

- Uses the `ImageEmbedder` task with quantization enabled
- Requires a TFLite model file (e.g., `mobilenet_v3_large.tflite`)
- ~50ms per image
- Image embedding only (no text support)

## SQLite Storage Schema

Database table: `embeddings`

| Column | Type | Description |
|--------|------|-------------|
| `id` | INTEGER PRIMARY KEY AUTOINCREMENT | Auto-incrementing ID |
| `filename` | TEXT | Base filename of the image |
| `seconds` | REAL | Time taken to compute embedding |
| `size` | INTEGER | File size in bytes |
| `md5` | TEXT | MD5 hash of the file |
| `method` | TEXT | `clip` or `mediapipe` |
| `embedding` | TEXT | String representation of the embedding vector |

Text embeddings are not stored in the database.
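The schema above can be reproduced with a few lines of `sqlite3`; the row values here are illustrative, and the real tool writes to `embedding.sqlite` on disk rather than an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the tool defaults to embedding.sqlite
conn.execute("""
    CREATE TABLE IF NOT EXISTS embeddings (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        filename TEXT,
        seconds REAL,
        size INTEGER,
        md5 TEXT,
        method TEXT,
        embedding TEXT
    )
""")
# The vector is stored as its string representation
conn.execute(
    "INSERT INTO embeddings (filename, seconds, size, md5, method, embedding) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("20230101_001.jpg", 0.01, 123456, "d41d8cd98f00b204e9800998ecf8427e",
     "clip", "[0.123, -0.456]"),
)
row = conn.execute("SELECT filename, method, embedding FROM embeddings").fetchone()
```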

## Deduplication Logic

When processing images sequentially, the tool compares consecutive images:

1. **File size check** -- If the file size differs by more than 2% from the previous image, skip similarity check and insert directly
2. **Cosine similarity** -- If file sizes are similar, compute cosine similarity between the current and previous embedding vectors
3. **Threshold** -- If similarity exceeds `--cosine_threshold`, the image is moved to a `similar/` subdirectory instead of being stored

This approach detects near-duplicate images efficiently by first filtering on file size as a cheap pre-check.
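A minimal sketch of the dedup decision, assuming the 2% window is measured relative to the previous file's size:

```python
import numpy as np

def is_near_duplicate(prev_size: int, cur_size: int,
                      prev_vec: np.ndarray, cur_vec: np.ndarray,
                      threshold: float = 1.00) -> bool:
    """File-size pre-check, then cosine similarity against the previous image."""
    if abs(cur_size - prev_size) / prev_size > 0.02:
        return False  # sizes differ by >2%: insert directly, skip similarity
    sim = float(np.dot(prev_vec, cur_vec) /
                (np.linalg.norm(prev_vec) * np.linalg.norm(cur_vec)))
    return sim > threshold  # above threshold: move to similar/, don't store

a = np.array([1.0, 0.0, 0.0])
dup = is_near_duplicate(1000, 1005, a, a, threshold=0.99)       # same vector, similar size
distinct = is_near_duplicate(1000, 1500, a, a, threshold=0.99)  # size gap short-circuits
```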

## Image Path Resolution

The `--image` argument accepts:
- A single file path
- A glob pattern (e.g., `./images/*.jpg`)
- Paths with `~` (user home directory expansion)

Files are sorted alphabetically before processing.
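The resolution logic amounts to tilde expansion, globbing, and a sort; the throwaway directory below is only for demonstration.

```python
import glob
import os
import tempfile

def resolve_images(pattern: str) -> list[str]:
    """Expand ~ and glob patterns, then sort alphabetically before processing."""
    return sorted(glob.glob(os.path.expanduser(pattern)))

# Demo against a throwaway directory with a few dummy files
tmp = tempfile.mkdtemp()
for name in ("b.jpg", "a.jpg", "c.txt"):
    open(os.path.join(tmp, name), "w").close()
found = resolve_images(os.path.join(tmp, "*.jpg"))
```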